Introduction#

These tasks are meant to test retrieval-augmented generation (RAG) architectures on various datasets.

You can check an up-to-date list of retrieval tasks in the registry:

from langchain_benchmarks import registry

registry.filter(Type="RetrievalTask")
Name | Type | Dataset ID | Description
LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Earnings | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
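To work with one of these benchmarks, look the task up in the registry by name. A minimal sketch (the name must match an entry in the table above):

from langchain_benchmarks import registry

# Fetch a single retrieval task by its registry name.
task = registry["LangChain Docs Q&A"]

print(task.name)
print(task.dataset_id)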

Task resources#

In addition to a name, dataset_id, and description, each retrieval task provides a few helper functions you can use to configure your pipeline:

  • get_docs: callable - fetches the original Document objects from the cache. Each task may provide configurable parameters you can use to define how the original documents are fetched.

  • retriever_factories: Dict[str, callable] - define some configurable pipelines you can use to transform the documents, embed them, and add them to a vectorstore (or other retriever object) for downstream use. They use LangChain’s caching index API so you don’t have to re-index for every evaluation. For custom transformations, we ask that you provide a transformation_name to isolate the cache and vectorstore namespace. Currently (2023/11/21) these all use Chroma as a vectorstore, but you can swap this out for a different vectorstore.

  • chain_factories: Dict[str, callable] - define some off-the-shelf architectures you can configure and evaluate (see the usage sketch after this list).
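For example, here is a minimal sketch of wiring these helpers together for the LangChain Docs Q&A task. The factory keys ("basic", "default") and the call signatures are illustrative assumptions; inspect task.retriever_factories and task.chain_factories for the names and parameters each task actually exposes.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings

from langchain_benchmarks import registry

task = registry["LangChain Docs Q&A"]

# Fetch the original documents from the cache.
docs = list(task.get_docs())

# Build a retriever from one of the provided factories.
# The "basic" key and the embedding argument are assumptions for illustration.
retriever = task.retriever_factories["basic"](OpenAIEmbeddings())

# Configure an off-the-shelf architecture to evaluate.
# The "default" key and keyword arguments are likewise placeholders.
chain = task.chain_factories["default"](retriever=retriever, llm=ChatOpenAI(model="gpt-3.5-turbo-16k"))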

When evaluating, you don’t have to use any of these factory methods. You can instead define your own custom architecture or ETL pipeline before evaluating. The factories are meant to facilitate evaluations and comparisons of specific design decisions.
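As an illustration of a custom architecture, the sketch below builds a simple RAG chain with the LangChain Expression Language, reusing the retriever from the previous sketch (any retriever you have indexed yourself works the same way). The prompt and model choices are arbitrary.

from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# `retriever` is the retriever built in the previous sketch; any retriever works.
# The chain accepts the dataset's input schema ({"question": ...}) and returns a string answer.
custom_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    | StrOutputParser()
)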

Dataset schema#

Each task corresponds to a LangSmith dataset with the following schema:

Inputs:

  • question: str - the user question

Outputs:

  • answer: str - the expected answer
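For example, a single example in one of these datasets has roughly this shape (the values are made up for illustration):

example_inputs = {"question": "What is a retriever in LangChain?"}
example_outputs = {"answer": "A retriever is an interface that returns documents relevant to a query."}

When evaluating, your pipeline receives the inputs, and its response is compared against the reference answer in the outputs.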