Semi-structured RAG#
Let’s evaluate your architecture on a small semi-structured Q&A dataset. The dataset is composed of Q&A pairs over PDFs that contain tables.
Pre-requisites#
We will install quite a few prerequisites for this example since we are comparing various techniques and models.
%pip install -U langchain langsmith langchainhub langchain_benchmarks langchain_experimental
%pip install --quiet chromadb openai huggingface pandas "unstructured[all-docs]"
For this code to work, please configure LangSmith environment variables with your credentials.
import os
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..." # Your API key
# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Review Q&A Tasks#
The registry provides configurations to test out common architectures on curated datasets.
from langchain_benchmarks import clone_public_dataset, registry
registry = registry.filter(Type="RetrievalTask")
registry
| Name | Type | Dataset ID | Description |
|---|---|---|---|
| LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
task = registry["Semi-structured Reports"]
task
| Attribute | Value |
|---|---|
| Name | Semi-structured Reports |
| Type | RetrievalTask |
| Dataset ID | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d |
| Description | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Retriever Factories | basic, parent-doc, hyde |
| Architecture Factories | |
| get_docs | |
clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Semi-structured Reports already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962.
Now, index the documents#
You can access the raw file paths directly, or use unstructured to process the PDFs.
from langchain_benchmarks.rag.tasks.semi_structured_reports import get_file_names
# If you want to completely customize the document processing, you can use the files directly
file_names = list(get_file_names())
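For example, you can peek at the raw file paths before any processing (the exact paths will vary by machine):
# Inspect the raw PDF paths backing the task
for path in file_names:
    print(path)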
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="thenlper/gte-base",
model_kwargs={"device": 0}, # Comment out to use CPU
)
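As a quick sanity check, you can embed a test string and confirm the dimensionality; thenlper/gte-base produces 768-dimensional vectors:
# Optional sanity check on the embedding model
vector = embeddings.embed_query("operating expenses")
print(len(vector))  # 768 for thenlper/gte-base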
# Arguments to pass to partition_pdf
unstructured_config = {
# Unstructured first finds embedded image blocks
"extract_images_in_pdf": False,
# Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
# Titles are any sub-section of the document
"infer_table_structure": True,
# Post processing to aggregate text once we have the title
"chunking_strategy": "by_title",
    # Chunking params to aggregate text blocks:
    # hard-cap chunks at 4000 characters, start a new chunk after 3800 characters,
    # and combine sections under 2000 characters with neighboring text
    "max_characters": 4000,
    "new_after_n_chars": 3800,
    "combine_text_under_n_chars": 2000,
}
docs = list(task.get_docs(unstructured_config=unstructured_config))
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
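Per the comment above, these arguments are passed through to unstructured's partition_pdf. If you want to drive the processing yourself, a minimal sketch (using the same config and the file_names loaded earlier) could look like this:
from unstructured.partition.pdf import partition_pdf

# Partition a single PDF with the same configuration the task factory uses
elements = partition_pdf(filename=str(file_names[0]), **unstructured_config)
print(elements[0])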
retriever_factory = task.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
retriever = retriever_factory(embeddings, docs=docs)
Chroma/semi-structured-earnings-b_Chroma_HuggingFaceEmbeddings_raw
[]
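Before evaluating, it's worth spot-checking the retriever; the question below is just an illustration:
# Quick spot check with an illustrative question
retrieved = retriever.get_relevant_documents(
    "What were the operating expenses in Q3 2023?"
)
print(retrieved[0].page_content[:500])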
Time to evaluate#
We will compose our retriever with a simple chain that prompts Anthropic's Claude 2 to answer based on the retrieved documents.
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign
def create_chain(retriever):
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"Answer based solely on the retrieved documents below:\n\n<Documents>\n{docs}</Documents>",
),
("user", "{question}"),
]
)
llm = ChatAnthropic(model="claude-2")
    return (
        # Retrieve docs using the question (the first value in the input dict)
        RunnableAssign({"docs": (lambda x: next(iter(x.values()))) | retriever})
        | prompt
        | llm
        | StrOutputParser()
    )
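You can smoke-test the chain on a single question before running the full benchmark (the question here is illustrative):
# Try the chain end-to-end on one question
example_chain = create_chain(retriever)
print(example_chain.invoke({"question": "What were the operating expenses in Q3 2023?"}))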
from langsmith.client import Client
from langchain_benchmarks.rag import get_eval_config
client = Client()
RAG_EVALUATION = get_eval_config()
chain = create_chain(retriever)
test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=chain,
evaluation=RAG_EVALUATION,
verbose=True,
)
View the evaluation results for project 'cold-attachment-88' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/d8e512b7-b63d-4eb5-8d73-d95f7fa7ffc2?eval=true
View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
Eval quantiles:
| | inputs.question | feedback.embedding_cosine_distance | feedback.faithfulness | feedback.score_string:accuracy | error | execution_time |
|---|---|---|---|---|---|---|
| count | 5 | 5.000000 | 5.0 | 5.0 | 0 | 5.000000 |
| unique | 5 | NaN | NaN | NaN | 0 | NaN |
| top | Analyzing the operating expenses for Q3 2023, ... | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 0.137066 | 1.0 | 0.1 | NaN | 7.940625 |
| std | NaN | 0.011379 | 0.0 | 0.0 | NaN | 1.380190 |
| min | NaN | 0.123112 | 1.0 | 0.1 | NaN | 6.416387 |
| 25% | NaN | 0.129089 | 1.0 | 0.1 | NaN | 7.272528 |
| 50% | NaN | 0.137871 | 1.0 | 0.1 | NaN | 7.324673 |
| 75% | NaN | 0.143398 | 1.0 | 0.1 | NaN | 8.831243 |
| max | NaN | 0.151860 | 1.0 | 0.1 | NaN | 9.858293 |
Example: processing the docs#
RAG apps are only as good as the information they are able to retrieve. The baseline chain above scored a mean accuracy of just 0.1, so let’s try indexing summaries of the tables to improve the likelihood that they are retrieved whenever a user asks a relevant question.
We will use unstructured’s partition_pdf functionality and generate summaries using an LLM.
You can define your own indexing pipeline to see how it impacts the downstream performance.
from operator import itemgetter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign
# Prompt
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are summarizing semi-structured tables or text in a pdf.\n\n```document\n{doc}\n```",
),
("user", "Write a concise summary."),
]
)
# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")
def create_doc(x) -> Document:
return Document(
page_content=x["output"],
metadata=x["doc"].metadata,
)
summarize_chain = (
{"doc": lambda x: x}
| RunnableAssign({"prompt": prompt})
| {
"output": itemgetter("prompt") | model | StrOutputParser(),
"doc": itemgetter("doc"),
}
| create_doc
)
summaries = summarize_chain.batch(
[doc for doc in docs if doc.metadata["element_type"] == "table"]
)
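It's worth inspecting one of the generated summaries before indexing them:
# Peek at the first table summary
print(summaries[0].page_content)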
Index the documents and create the retriever. We will re-index the original documents along with the new table summaries.
# Indexes the documents with the specified embeddings
retriever_with_summaries = retriever_factory(
embeddings,
docs=docs + summaries,
# Specify a unique transformation name to avoid local cache collisions with other indices.
transformation_name="docs-with_summaries",
)
Evaluate#
We’ll evaluate the new chain on the same dataset.
chain_2 = create_chain(retriever_with_summaries)
test_run_with_summaries = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=chain_2,
evaluation=RAG_EVALUATION,
verbose=True,
)
View the evaluation results for project 'crazy-harmony-39' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/b69d796f-6ba4-4cde-822f-db363cf81f8f?eval=true
View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
Eval quantiles:
| | inputs.question | feedback.score_string:accuracy | feedback.faithfulness | feedback.embedding_cosine_distance | error | execution_time |
|---|---|---|---|---|---|---|
| count | 5 | 5.000000 | 5.0 | 5.000000 | 0 | 5.000000 |
| unique | 5 | NaN | NaN | NaN | 0 | NaN |
| top | Analyzing the operating expenses for Q3 2023, ... | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 0.720000 | 1.0 | 0.069363 | NaN | 8.659120 |
| std | NaN | 0.408656 | 0.0 | 0.023270 | NaN | 2.611724 |
| min | NaN | 0.100000 | 1.0 | 0.039593 | NaN | 6.283505 |
| 25% | NaN | 0.500000 | 1.0 | 0.050176 | NaN | 6.723136 |
| 50% | NaN | 1.000000 | 1.0 | 0.078912 | NaN | 7.441743 |
| 75% | NaN | 1.000000 | 1.0 | 0.084389 | NaN | 10.673265 |
| max | NaN | 1.000000 | 1.0 | 0.093747 | NaN | 12.173952 |
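Indexing the table summaries lifted the mean accuracy score from 0.1 to 0.72 on this dataset, while faithfulness held at 1.0, at the cost of slightly higher latency.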