Introduction#

These tasks are meant to test retrieval-augmented generation (RAG) architectures on various datasets.

You can check an up-to-date list of retrieval tasks in the registry:

from langchain_benchmarks import registry

registry.filter(Type="RetrievalTask")
Name | Type | Dataset ID | Description
LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Earnings | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
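To work with one of these benchmarks, look the task up in the registry by name. A minimal sketch (the name must match an entry in the table above):

from langchain_benchmarks import registry

# Fetch a single retrieval task by its registry name.
task = registry["LangChain Docs Q&A"]

print(task.name)
print(task.dataset_id)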

Task resources#

In addition to a name, dataset_id, and description, each retrieval task provides a few helper functions you can use to configure your pipeline:

  • get_docs: callable - fetches the original Document objects from the cache. Each task may provide configurable parameters you can use to define how the original documents are fetched.

  • retriever_factories: Dict[str, callable] - define some configurable pipelines you can use to transform the documents, embed them, and add them to a vectorstore (or other retriever object) for downstream use. They use LangChain’s caching index API so you don’t have to re-index for every evaluation. For custom transformations, we ask that you provide a transformation_name to isolate the cache and vectorstore namespace. Currently (2023/11/21) these all use Chroma as a vectorstore, but you can swap this out for a different vectorstore.

  • chain_factories: Dict[str, callable] - define some off-the-shelf architectures you can configure and evaluate (see the usage sketch after this list).
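For example, here is a minimal sketch of wiring these helpers together for the LangChain Docs Q&A task. The factory keys ("basic", "default") and the call signatures are illustrative assumptions; inspect task.retriever_factories and task.chain_factories for the names and parameters each task actually exposes.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings

from langchain_benchmarks import registry

task = registry["LangChain Docs Q&A"]

# Fetch the original documents from the cache.
docs = list(task.get_docs())

# Build a retriever from one of the provided factories.
# The "basic" key and the embedding argument are assumptions for illustration.
retriever = task.retriever_factories["basic"](OpenAIEmbeddings())

# Configure an off-the-shelf architecture to evaluate.
# The "default" key and keyword arguments are likewise placeholders.
chain = task.chain_factories["default"](retriever=retriever, llm=ChatOpenAI(model="gpt-3.5-turbo-16k"))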

When evaluating, you don’t have to use any of these factory methods. You can instead define your own custom architecture or ETL pipeline before evaluating. The factories are meant to facilitate evaluations and comparisons of specific design decisions.
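As an illustration of a custom architecture, the sketch below builds a simple RAG chain with the LangChain Expression Language, reusing the retriever from the previous sketch (any retriever you have indexed yourself works the same way). The prompt and model choices are arbitrary.

from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# `retriever` is the retriever built in the previous sketch; any retriever works.
# The chain accepts the dataset's input schema ({"question": ...}) and returns a string answer.
custom_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    | StrOutputParser()
)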

Dataset schema#

Each task corresponds to a LangSmith dataset with the following schema:

Inputs:

  • question: str - the user question

Outputs:

  • answer: str - the expected answer
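For example, a single example in one of these datasets has roughly this shape (the values are made up for illustration):

example_inputs = {"question": "What is a retriever in LangChain?"}
example_outputs = {"answer": "A retriever is an interface that returns documents relevant to a query."}

When evaluating, your pipeline receives the inputs, and its response is compared against the reference answer in the outputs.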