Evaluating RAG Architectures on Benchmark Tasks#

Introduction#

If you have ever wanted to compare different approaches to Q&A over your docs, this notebook will help you get started evaluating different configurations and common RAG architectures on benchmark tasks. The goal is to make it easy to experiment with different techniques, understand their tradeoffs, and make informed decisions for your specific use case.

What is RAG?#

LLMs have a knowledge cutoff. For them to accurately respond to user queries, they need access to relevant information. Retrieval Augmented Generation (RAG) (aka “give an LLM a search engine”) is a common design pattern to address this. The key components are:

  • Retriever: fetches information from a knowledge base, which can be a vector store, a database, or any other search engine.

  • Generator: synthesizes responses using a blend of learned knowledge and the retrieved information.

The overall quality of the system depends on both components.
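
As a rough sketch (illustrative only, not the benchmark code), the pattern looks something like this, where retriever and llm stand in for any LangChain retriever and chat model:

# Minimal sketch of the retrieve-then-generate pattern (illustrative only).
def rag_answer(question: str, retriever, llm) -> str:
    # Retriever: fetch the documents most relevant to the query
    docs = retriever.get_relevant_documents(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Generator: answer using the retrieved context plus the model's own knowledge
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content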

Benchmark Tasks and Datasets (As of 2023/11/21)#

The following datasets are currently available:

  • LangChain Docs Q&A - technical questions based on the LangChain python documentation

  • Semi-structured Earnings - questions and answers over financial PDFs containing tables and graphs

Each task comes with a labeled dataset of questions and answers, along with configurable factory functions that make it easy to customize chunking and indexing of the relevant source documents.

And with that, let’s get started!

Pre-requisites#

We will install quite a few prerequisites for this example since we are comparing many techniques and models.

We will be using LangSmith to capture the evaluation traces. You can make a free account at smith.langchain.com. Once you’ve done so, you can make an API key and set it below.

%pip install -U --quiet langchain langsmith langchainhub langchain_benchmarks
%pip install --quiet chromadb openai huggingface pandas langchain_experimental sentence_transformers pyarrow anthropic tiktoken
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your API key
os.environ["OPENAI_API_KEY"] = "sk-..."  # Your OpenAI API key
os.environ["ANTHROPIC_API_KEY"] = "sk-..."  # Your Anthropic API key
# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import uuid

# Generate a unique run ID for these experiments
run_uid = uuid.uuid4().hex[:6]

Review Q&A tasks#

The registry provides configurations to test out common architectures on curated datasets. Below is a list of the available tasks at the time of writing.

from langchain_benchmarks import clone_public_dataset, registry
registry.filter(Type="RetrievalTask")
Name: LangChain Docs Q&A
Type: RetrievalTask
Dataset ID: 452ccafc-18e1-4314-885b-edd735f17b9d
Description: Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).

Name: Semi-structured Reports
Type: RetrievalTask
Dataset ID: c47d9617-ab99-4d6e-a6e6-92b8daf85a7d
Description: Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).

langchain_docs = registry["LangChain Docs Q&A"]
langchain_docs
clone_public_dataset(langchain_docs.dataset_id, dataset_name=langchain_docs.name)

Basic Vector Retrieval#

For our first example, we will generate a single embedding for each document in the dataset, without any chunking, and then provide the resulting retriever to an LLM for inference.

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-base",
    model_kwargs={"device": 0},  # Comment out to use CPU
)

retriever_factory = langchain_docs.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
# Note that this does not apply any chunking to the docs,
# which means the documents can be of arbitrary length
retriever = retriever_factory(embeddings)
# Factory for creating a conversational retrieval QA chain

chain_factory = langchain_docs.architecture_factories["conversational-retrieval-qa"]
from langchain.chat_models import ChatAnthropic

# Example
llm = ChatAnthropic(model="claude-2", temperature=1)

chain_factory(retriever, llm=llm).invoke({"question": "what's lcel?"})
from functools import partial

from langsmith.client import Client

from langchain_benchmarks.rag import get_eval_config
client = Client()
RAG_EVALUATION = get_eval_config()

test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain simple-index {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "thenlper/gte-base",
        "llm": "claude-2",
    },
    verbose=True,
)
test_run.get_aggregate_feedback()

Comparing with other indexing strategies#

The index used above retrieves the raw documents based on a single vector per document. It doesn’t perform any additional chunking. You can try changing the chunking parameters when generating the index.

Customizing Chunking#

The simplest change you can make to the index is to configure how you split the documents.

from langchain.text_splitter import RecursiveCharacterTextSplitter


def transform_docs(docs):
    splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
    yield from splitter.split_documents(docs)


# Used for the cache
transformation_name = "recursive-text-cs4k-ol200"

retriever_factory = langchain_docs.retriever_factories["basic"]

chunked_retriever = retriever_factory(
    embeddings,
    transform_docs=transform_docs,
    transformation_name=transformation_name,
    search_kwargs={"k": 4},
)
chunked_results = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, chunked_retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain chunked {run_uid}",
    project_metadata={
        "index_method": "basic",
        "chunk_size": 4000,
        "chunk_overlap": 200,
        "embedding_model": "thenlper/gte-base",
        "llm": "claude-2",
    },
    verbose=True,
)
chunked_results.get_aggregate_feedback()

Parent Document Retriever#

This indexing technique chunks documents and generates 1 vector per chunk. At retrieval time, the K “most similar” chunks are fetched, then the full parent documents are returned for the LLM to reason over.

This ensures each chunk is surfaced in its full natural context. It can also improve initial retrieval quality, since the similarity scores are scoped to individual chunks.
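
Conceptually, this is similar to LangChain's ParentDocumentRetriever. The benchmark's factory (used below) builds the index for you; a minimal sketch of the idea, assuming an in-memory doc store and a Chroma vector store (not necessarily what the factory uses), would look roughly like:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Child chunks are embedded and searched; the full parent documents are returned.
demo_vectorstore = Chroma(collection_name="parent-doc-demo", embedding_function=embeddings)
demo_retriever = ParentDocumentRetriever(
    vectorstore=demo_vectorstore,
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
)
# demo_retriever.add_documents(docs)  # docs: the parent documents to index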

Let’s see if this technique is effective in our case.

retriever_factory = langchain_docs.retriever_factories["parent-doc"]

# Indexes the documents with the specified embeddings
parent_doc_retriever = retriever_factory(embeddings)
parent_doc_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, parent_doc_retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain parent-doc {run_uid}",
    project_metadata={
        "index_method": "parent-doc",
        "embedding_model": "thenlper/gte-base",
        "llm": "claude-2",
    },
    verbose=True,
)
parent_doc_test_run.get_aggregate_feedback()

HyDE#

HyDE (Hypothetical Document Embeddings) refers to the technique of using an LLM to generate example queries that may be used to retrieve a doc. By doing so, the resulting embeddings are automatically “more aligned” with the embeddings generated from the query. This comes with an additional indexing cost, since each document requires an additional LLM call while indexing.
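
The benchmark's "hyde" factory (used below) handles this at indexing time. Purely to illustrate the idea, generating and embedding a hypothetical query for one document might look like this (the prompt wording and model choice here are arbitrary):

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Ask an LLM what question a passage would answer, then embed that
# hypothetical query rather than (or alongside) the raw document text.
question_gen = (
    ChatPromptTemplate.from_template(
        "Write a question that the following passage answers:\n\n{passage}"
    )
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    | StrOutputParser()
)
# hypothetical_query = question_gen.invoke({"passage": doc.page_content})
# vector = embeddings.embed_query(hypothetical_query)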

retriever_factory = langchain_docs.retriever_factories["hyde"]

retriever = retriever_factory(embeddings)
hyde_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, retriever=retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    verbose=True,
    project_name=f"claude-2 qa-chain HyDE {run_uid}",
    project_metadata={
        "index_method": "HyDE",
        "embedding_model": "thenlper/gte-base",
        "llm": "claude-2",
    },
)
hyde_test_run.get_aggregate_feedback()

Comparing Embeddings#

We’ve been using off-the-shelf GTE-Base embeddings so far to retrieve the docs, but you may get better results with other embeddings. You could even try fine-tuning embeddings on your own documentation and evaluating them here.

Let’s compare our results so far to OpenAI’s embeddings.

from langchain.embeddings.openai import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings()
openai_retriever = langchain_docs.retriever_factories["basic"](openai_embeddings)
openai_embeddings_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, openai_retriever),
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain oai-emb basic {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "openai/text-embedding-ada-002",
        "llm": "claude-2",
    },
    verbose=True,
)
openai_embeddings_test_run.get_aggregate_feedback()

Comparing Models#

We used Anthropic’s Claude-2 model in our previous tests, but let’s try some other models.

You can swap in any LangChain LLM within the response generator below. We’ll try a long-context llama 2 model first (using Ollama).

from langchain.chat_models import ChatOllama

# A llama2-based model with a 128k context window (in theory).
# In practice, we will see how well it actually leverages that context.
ollama = ChatOllama(model="yarn-llama2:7b-128k")
# We'll go back to the GTE embeddings for now

retriever_factory = langchain_docs.retriever_factories["basic"]
retriever = retriever_factory(embeddings)
ollama_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, llm=ollama, retriever=retriever),
    evaluation=RAG_EVALUATION,
    project_name=f"yarn-llama2:7b-128k qa-chain basic {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "thenlper/gte-base",
        "llm": "ollama/yarn-llama2:7b-128k",
    },
    verbose=True,
)

Changing the prompt in the response generator#

The default prompt was tested primarily on OpenAI’s gpt-3.5 model. When switching models, you may get better results if you modify the prompt. Let’s try a simple one.

from langchain import hub
from langchain.schema.output_parser import StrOutputParser
prompt = hub.pull("wfh/rag-simple")
generator = prompt | ChatAnthropic(model="claude-2", temperature=1) | StrOutputParser()
new_chain = chain_factory(response_generator=generator, retriever=openai_retriever)
claude_simple_prompt_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(
        chain_factory, response_generator=generator, retriever=retriever, llm=llm
    ),
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain basic rag-simple {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "thenlper/gte-base",
        "prompt": "wfh/rag-simple",
        "llm": "claude-2",
    },
    verbose=True,
)

Testing Agents#

Agents use an LLM to decide actions and generate responses. There are two obvious ways they could potentially succeed where the approaches above fail:

  • The above chains do not “rephrase” the user query. An agent’s rephrased question may retrieve more relevant documents.

  • The above chains must respond based on a single retrieval step. Agents can iteratively query the retriever or subdivide the query into parts to synthesize at the end. Our dataset has a number of questions that require information from different documents; if the agent can gather that information over multiple retrieval calls, it may be more likely to answer them correctly.

Let’s evaluate to see whether the “plausible” advantages above are worth the tradeoffs. We will use the basic retriever as a tool for the agents.

from typing import List, Tuple

from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.pydantic_v1 import BaseModel, Field
from langchain.schema.messages import AIMessage, HumanMessage
from langchain.tools import tool
from langchain.tools.render import format_tool_to_openai_function

# The search tool's docstring below tells the model how to best use the retriever.


@tool
def search(query, callbacks=None):
    """Search the LangChain docs with the retriever."""
    return retriever.get_relevant_documents(query, callbacks=callbacks)


tools = [search]

llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
assistant_system_message = """You are a helpful assistant tasked with answering technical questions about LangChain. \
Use tools (only if necessary) to best answer the user's questions. Do not make up information if you cannot find the answer using your tools."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", assistant_system_message),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

llm_with_tools = llm.bind(functions=[format_tool_to_openai_function(t) for t in tools])


def _format_chat_history(chat_history: List[Tuple[str, str]]):
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer


agent = (
    {
        "input": lambda x: x["input"],
        "chat_history": lambda x: _format_chat_history(x["chat_history"]),
        "agent_scratchpad": lambda x: format_to_openai_functions(
            x["intermediate_steps"]
        ),
    }
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)


class AgentInput(BaseModel):
    input: str
    chat_history: List[Tuple[str, str]] = Field(..., extra={"widget": {"type": "chat"}})


agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False).with_types(
    input_type=AgentInput
)


class ChainInput(BaseModel):
    question: str


def mapper(input: dict):
    return {"input": input["question"], "chat_history": []}


agent_executor = (mapper | agent_executor | (lambda x: x["output"])).with_types(
    input_type=ChainInput
)
oai_functions_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=agent_executor,
    evaluation=RAG_EVALUATION,
    project_name=f"oai-functions basic rag-simple {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "thenlper/gte-base",
        "llm": "gpt-4-1106-preview",
        "architecture": "oai-functions-agent",
    },
    verbose=True,
)

Assistant#

OpenAI provides a hosted agent service through their Assistants API.

You can connect your LangChain retriever to the OpenAI Assistants API as a tool and evaluate its performance. Let’s test it below:

import json

from langchain.agents import AgentExecutor
from langchain.tools import tool
from langchain_experimental.openai_assistant import OpenAIAssistantRunnable


@tool
def search(query, callbacks=None) -> str:
    """Search the LangChain docs with the retriever."""
    docs = retriever.get_relevant_documents(query, callbacks=callbacks)
    return json.dumps([doc.dict() for doc in docs])


tools = [search]

agent = OpenAIAssistantRunnable.create_assistant(
    name="langchain docs assistant",
    instructions="You are a helpful assistant tasked with answering technical questions about LangChain.",
    tools=tools,
    model="gpt-4-1106-preview",
    as_agent=True,
)


assistant_executor = (
    (lambda x: {"content": x["question"]})
    | AgentExecutor(agent=agent, tools=tools)
    | (lambda x: x["output"])
)
assistant_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=assistant_executor,
    evaluation=RAG_EVALUATION,
    project_name=f"oai-assistant basic rag-simple {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "thenlper/gte-base",
        "llm": "gpt-4-1106-preview",
        "architecture": "oai-assistant",
    },
    verbose=True,
)
assistant_test_run.get_aggregate_feedback()
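
Finally, if you want a side-by-side view of the experiments above, you can stack the aggregate feedback into one table. This is a sketch that assumes get_aggregate_feedback() returns a pandas DataFrame; the run variables are the ones created earlier in this notebook.

import pandas as pd

# Label each experiment and concatenate its aggregate feedback for comparison.
runs = {
    "claude-2 simple-index": test_run,
    "claude-2 chunked": chunked_results,
    "claude-2 parent-doc": parent_doc_test_run,
    "claude-2 HyDE": hyde_test_run,
    "claude-2 oai-emb": openai_embeddings_test_run,
    "oai-assistant": assistant_test_run,
}
comparison = pd.concat(
    {name: run.get_aggregate_feedback() for name, run in runs.items()},
    names=["experiment"],
)
comparison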