{
"cells": [
{
"cell_type": "markdown",
"id": "60bb467d-861d-4b07-a48d-8e5aa177c969",
"metadata": {},
"source": [
"# Semi-structured RAG\n",
"\n",
"Let's evaluate your architecture on a small semi-structured Q&A dataset: question-answer pairs over PDFs that contain tables."
]
},
{
"cell_type": "markdown",
"id": "f49db759-7ce6-4ab7-a58f-7fc3a6a7c8ec",
"metadata": {},
"source": [
"## Pre-requisites\n",
"\n",
"We will install quite a few prerequisites for this example, since we are comparing various techniques and models."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9f44b59b",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U langchain langsmith langchainhub langchain_benchmarks langchain_experimental\n",
"%pip install --quiet chromadb openai huggingface pandas \"unstructured[all-docs]\""
]
},
{
"cell_type": "markdown",
"id": "0aae13f6-cd40-41e6-bd02-bd683e91cbff",
"metadata": {},
"source": [
"For this code to work, please configure LangSmith environment variables with your credentials."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "62b518cf-99fb-44be-8acb-ee0a8ba62272",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"\n",
"os.environ[\"LANGCHAIN_API_KEY\"] = \"sk-...\" # Your API key\n",
"\n",
"# Silence warnings from HuggingFace\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
]
},
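{
"cell_type": "markdown",
"id": "0c3b4c1e-5a52-4c2e-9f3a-1d2e3f4a5b6c",
"metadata": {},
"source": [
"The chains below also call Anthropic (`claude-2`) and OpenAI (`gpt-3.5-turbo-16k`), so those provider credentials need to be set as well. A minimal sketch, assuming you keep your keys in environment variables (the placeholder values are hypothetical):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e8f9a0b-1c2d-4e3f-8a9b-0c1d2e3f4a5b",
"metadata": {},
"outputs": [],
"source": [
"# Set the model-provider keys used later in this notebook (hypothetical placeholders).\n",
"os.environ[\"ANTHROPIC_API_KEY\"] = \"sk-...\"  # used by ChatAnthropic below\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"  # used by ChatOpenAI below"
]
},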
{
"cell_type": "markdown",
"id": "2e8a666d-8bf5-4bfd-8b20-8b7defdb8cd5",
"metadata": {},
"source": [
"## Review Q&A Tasks\n",
"\n",
"The registry provides configurations to test out common architectures on curated datasets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b39159d0-9ea1-414f-a9d8-4a7b22b3d2cc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_benchmarks import clone_public_dataset, registry"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3644d211-382e-41aa-b282-21b01d28fc35",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>Name</th><th>Type</th><th>Dataset ID</th><th>Description</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>LangChain Docs Q&A</td><td>RetrievalTask</td><td>452ccafc-18e1-4314-885b-edd735f17b9d</td><td>Questions and answers based on a snapshot of the LangChain python docs.\n",
"\n",
"The environment provides the documents and the retriever information.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer.\n",
"We also measure the faithfulness of the model's response relative to the retrieved documents (if any).</td></tr>\n",
"<tr><td>Semi-structured Reports</td><td>RetrievalTask</td><td>c47d9617-ab99-4d6e-a6e6-92b8daf85a7d</td><td>Questions and answers based on PDFs containing tables and charts.\n",
"\n",
"The task provides the raw documents as well as factory methods to easily index them\n",
"and create a retriever.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer.\n",
"We also measure the faithfulness of the model's response relative to the retrieved documents (if any).</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"Registry(tasks=[RetrievalTask(name='LangChain Docs Q&A', dataset_id='https://smith.langchain.com/public/452ccafc-18e1-4314-885b-edd735f17b9d/d', description=\"Questions and answers based on a snapshot of the LangChain python docs.\\n\\nThe environment provides the documents and the retriever information.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\nWe also measure the faithfulness of the model's response relative to the retrieved documents (if any).\\n\", retriever_factories={'basic': <function ...>, 'parent-doc': <function ...>, 'hyde': <function ...>}, architecture_factories={'conversational-retrieval-qa': <function ...>}, get_docs=<function ...>), RetrievalTask(name='Semi-structured Reports', dataset_id='https://smith.langchain.com/public/c47d9617-ab99-4d6e-a6e6-92b8daf85a7d/d', description=\"Questions and answers based on PDFs containing tables and charts.\\n\\nThe task provides the raw documents as well as factory methods to easily index them\\nand create a retriever.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\nWe also measure the faithfulness of the model's response relative to the retrieved documents (if any).\\n\", retriever_factories={'basic': <function ...>, 'parent-doc': <function ...>, 'hyde': <function ...>}, architecture_factories={}, get_docs=<function ...>)])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"registry = registry.filter(Type=\"RetrievalTask\")\n",
"registry"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "671282f8-c455-4390-b018-e53bbd833093",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"<tr><td>Name</td><td>Semi-structured Reports</td></tr>\n",
"<tr><td>Type</td><td>RetrievalTask</td></tr>\n",
"<tr><td>Dataset ID</td><td>c47d9617-ab99-4d6e-a6e6-92b8daf85a7d</td></tr>\n",
"<tr><td>Description</td><td>Questions and answers based on PDFs containing tables and charts.\n",
"\n",
"The task provides the raw documents as well as factory methods to easily index them\n",
"and create a retriever.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer.\n",
"We also measure the faithfulness of the model's response relative to the retrieved documents (if any).</td></tr>\n",
"<tr><td>Retriever Factories</td><td>basic, parent-doc, hyde</td></tr>\n",
"<tr><td>Architecture Factories</td><td></td></tr>\n",
"<tr><td>get_docs</td><td>...</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"RetrievalTask(name='Semi-structured Reports', dataset_id='https://smith.langchain.com/public/c47d9617-ab99-4d6e-a6e6-92b8daf85a7d/d', description=\"Questions and answers based on PDFs containing tables and charts.\\n\\nThe task provides the raw documents as well as factory methods to easily index them\\nand create a retriever.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\nWe also measure the faithfulness of the model's response relative to the retrieved documents (if any).\\n\", retriever_factories={'basic': <function ...>, 'parent-doc': <function ...>, 'hyde': <function ...>}, architecture_factories={}, get_docs=<function ...>)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"task = registry[\"Semi-structured Reports\"]\n",
"task"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "70369f67-deb4-467a-801a-6d38c3d0460d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset Semi-structured Reports already exists. Skipping.\n",
"You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962.\n"
]
}
],
"source": [
"clone_public_dataset(task.dataset_id, dataset_name=task.name)"
]
},
{
"cell_type": "markdown",
"id": "4b4fafb2-63d0-40b4-b803-0095c5b22ca6",
"metadata": {},
"source": [
"### Now, index the documents\n",
"\n",
"You can access the raw file paths directly, or use Unstructured to process the PDFs."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c9657b27-4e10-4ba5-ab20-1f05f22fdbd4",
"metadata": {},
"outputs": [],
"source": [
"from langchain_benchmarks.rag.tasks.semi_structured_reports import get_file_names\n",
"\n",
"# If you want to completely customize the document processing, you can use the files directly\n",
"file_names = list(get_file_names())"
]
},
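{
"cell_type": "markdown",
"id": "2a1b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
"metadata": {},
"source": [
"If you want to see what fully custom processing might look like, here is a minimal sketch that calls unstructured's `partition_pdf` directly on the first file. The kwargs shown are illustrative, not the benchmark's defaults:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f5e4d3c-2b1a-4098-8765-43210fedcba9",
"metadata": {},
"outputs": [],
"source": [
"# A sketch of custom processing: call unstructured directly on one file.\n",
"# `infer_table_structure=True` asks the layout model to recover table cells.\n",
"from unstructured.partition.pdf import partition_pdf\n",
"\n",
"elements = partition_pdf(\n",
"    filename=str(file_names[0]),\n",
"    infer_table_structure=True,\n",
")\n",
"elements[:3]"
]
},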
{
"cell_type": "code",
"execution_count": 8,
"id": "f5ad4b23-fbd4-4ebc-b5a5-d3d05efd0b9c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']\n",
"- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b8a52c9983274c21a713ac8742e9c99b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"  0%|          | 0/26 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"\n",
"embeddings = HuggingFaceEmbeddings(\n",
"    model_name=\"thenlper/gte-base\",\n",
"    model_kwargs={\"device\": 0},  # Comment out to use CPU\n",
")\n",
"\n",
"# Arguments to pass to partition_pdf\n",
"unstructured_config = {\n",
"    # Unstructured first finds embedded image blocks\n",
"    \"extract_images_in_pdf\": False,\n",
"    # Use the layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
"    # Titles are any sub-section of the document\n",
"    \"infer_table_structure\": True,\n",
"    # Post-process to aggregate text once we have the titles\n",
"    \"chunking_strategy\": \"by_title\",\n",
"    # Chunking params to aggregate text blocks:\n",
"    # cap chunks at 4000 chars, start a new chunk after 3800 chars,\n",
"    # and combine sections smaller than 2000 chars with their neighbors\n",
"    \"max_characters\": 4000,\n",
"    \"new_after_n_chars\": 3800,\n",
"    \"combine_text_under_n_chars\": 2000,\n",
"}\n",
"docs = list(task.get_docs(unstructured_config=unstructured_config))"
]
},
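{
"cell_type": "markdown",
"id": "9a8b7c6d-5e4f-4321-8abc-def012345678",
"metadata": {},
"source": [
"Each parsed chunk records its origin in `metadata[\"element_type\"]` (the summarization step later relies on this). A quick sanity check of the split between tables and text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13579bdf-2468-4ace-9bdf-013579bdf246",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Tally how many chunks came from tables vs. plain text.\n",
"Counter(doc.metadata[\"element_type\"] for doc in docs)"
]
},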
{
"cell_type": "code",
"execution_count": 42,
"id": "fe0a05ad-5b57-40b0-aac4-e2d9cd9e6b4b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Chroma/semi-structured-earnings-b_Chroma_HuggingFaceEmbeddings_raw\n",
"[]\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "06a11e1a4d50416596d9dd953fdabafa",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"  0%|          | 0/26 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"retriever_factory = task.retriever_factories[\"basic\"]\n",
"# Indexes the documents with the specified embeddings\n",
"retriever = retriever_factory(embeddings, docs=docs)"
]
},
{
"cell_type": "markdown",
"id": "57efac89-12f9-47e3-b60f-65d9279ebc1e",
"metadata": {},
"source": [
"### Time to evaluate\n",
"\n",
"We will compose our retriever with a simple response-generation chain built on Anthropic's claude-2."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8d1bc360-d822-43a8-b6b7-ff66dc27caf4",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatAnthropic\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable.passthrough import RunnableAssign\n",
"\n",
"\n",
"def create_chain(retriever):\n",
"    prompt = ChatPromptTemplate.from_messages(\n",
"        [\n",
"            (\n",
"                \"system\",\n",
"                \"Answer based solely on the retrieved documents below:\\n\\n<Documents>\\n{docs}</Documents>\",\n",
"            ),\n",
"            (\"user\", \"{question}\"),\n",
"        ]\n",
"    )\n",
"    llm = ChatAnthropic(model=\"claude-2\")\n",
"    return (\n",
"        RunnableAssign({\"docs\": (lambda x: next(iter(x.values()))) | retriever})\n",
"        | prompt\n",
"        | llm\n",
"        | StrOutputParser()\n",
"    )"
]
},
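{
"cell_type": "markdown",
"id": "fedcba98-7654-4321-8fed-cba987654321",
"metadata": {},
"source": [
"Before running the full benchmark, it can help to smoke-test the chain on a single question. The question below is an illustrative example in the spirit of the dataset; any question about the underlying reports will do:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f1e2d3c-4b5a-4697-8877-665544332211",
"metadata": {},
"outputs": [],
"source": [
"# Smoke-test the chain on a single (illustrative) question before evaluating.\n",
"chain = create_chain(retriever)\n",
"chain.invoke({\"question\": \"What were the operating expenses in Q3 2023?\"})"
]
},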
{
"cell_type": "code",
"execution_count": 10,
"id": "935cacd9-e841-4c76-ac16-f3f0cf18df62",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'cold-attachment-88' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/d8e512b7-b63d-4eb5-8d73-d95f7fa7ffc2?eval=true\n",
"\n",
"View all tests for Dataset Semi-structured Reports at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962\n",
"[------------------------------------------------->] 5/5\n",
" Eval quantiles:\n",
" inputs.question \\\n",
"count 5 \n",
"unique 5 \n",
"top Analyzing the operating expenses for Q3 2023, ... \n",
"freq 1 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"\n",
" feedback.embedding_cosine_distance feedback.faithfulness \\\n",
"count 5.000000 5.0 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.137066 1.0 \n",
"std 0.011379 0.0 \n",
"min 0.123112 1.0 \n",
"25% 0.129089 1.0 \n",
"50% 0.137871 1.0 \n",
"75% 0.143398 1.0 \n",
"max 0.151860 1.0 \n",
"\n",
" feedback.score_string:accuracy error execution_time \n",
"count 5.0 0 5.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.1 NaN 7.940625 \n",
"std 0.0 NaN 1.380190 \n",
"min 0.1 NaN 6.416387 \n",
"25% 0.1 NaN 7.272528 \n",
"50% 0.1 NaN 7.324673 \n",
"75% 0.1 NaN 8.831243 \n",
"max 0.1 NaN 9.858293 \n"
]
}
],
"source": [
"from langsmith.client import Client\n",
"\n",
"from langchain_benchmarks.rag import get_eval_config\n",
"\n",
"client = Client()\n",
"RAG_EVALUATION = get_eval_config()\n",
"chain = create_chain(retriever)\n",
"test_run = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=chain,\n",
" evaluation=RAG_EVALUATION,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3b54ef6c-0194-410a-aae9-f30c2097548a",
"metadata": {},
"source": [
"## Example processing the docs\n",
"\n",
"RAG apps are only as good as the information they are able to retrieve. Let's try indexing summaries of the tables to\n",
"improve the likelihood that they are retrieved whenever a user asks a relevant question.\n",
"\n",
"We will use unstructured's `partition_pdf` functionality and generate the summaries using an LLM.\n",
"\n",
"You can define your own indexing pipeline to see how it impacts the downstream performance."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3378eddb-0a8d-4179-8e9c-54343469eef6",
"metadata": {},
"outputs": [],
"source": [
"from operator import itemgetter\n",
"\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.document import Document\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable.passthrough import RunnableAssign\n",
"\n",
"# Prompt\n",
"prompt = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\n",
"            \"system\",\n",
"            \"You are summarizing semi-structured tables or text in a pdf.\\n\\n```document\\n{doc}\\n```\",\n",
"        ),\n",
"        (\"user\", \"Write a concise summary.\"),\n",
"    ]\n",
")\n",
"\n",
"# Summary chain\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-16k\")\n",
"\n",
"\n",
"def create_doc(x) -> Document:\n",
"    return Document(\n",
"        page_content=x[\"output\"],\n",
"        metadata=x[\"doc\"].metadata,\n",
"    )\n",
"\n",
"\n",
"summarize_chain = (\n",
"    {\"doc\": lambda x: x}\n",
"    | RunnableAssign({\"prompt\": prompt})\n",
"    | {\n",
"        \"output\": itemgetter(\"prompt\") | model | StrOutputParser(),\n",
"        \"doc\": itemgetter(\"doc\"),\n",
"    }\n",
"    | create_doc\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "07a2f070-3b5a-4de0-b3da-ddfb6e6f8c2b",
"metadata": {},
"outputs": [],
"source": [
"summaries = summarize_chain.batch(\n",
"    [doc for doc in docs if doc.metadata[\"element_type\"] == \"table\"]\n",
")"
]
},
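{
"cell_type": "markdown",
"id": "abcdef01-2345-4678-9abc-def012345678",
"metadata": {},
"source": [
"Each summary is a new `Document` that carries over the source table's metadata, so retrieval hits can be traced back to the original element. A quick look at the first one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11223344-5566-4788-99aa-bbccddeeff00",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the first generated summary and its provenance metadata.\n",
"print(summaries[0].page_content[:500])\n",
"summaries[0].metadata"
]
},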
{
"cell_type": "markdown",
"id": "22dc0bf8-fa50-4be3-8d23-04f6129548e0",
"metadata": {},
"source": [
"Index the documents and create the retriever. We will re-use the same embeddings and the \"basic\" retriever factory, passing a unique transformation name to avoid cache collisions with the index we built earlier."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "35a1ccf6-2c2f-46f2-838e-5a5bf89515f5",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "029921ed3d7c4f389c666583a7192144",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"  0%|          | 0/36 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Indexes the documents with the specified embeddings\n",
"retriever_with_summaries = retriever_factory(\n",
"    embeddings,\n",
"    docs=docs + summaries,\n",
"    # Specify a unique transformation name to avoid local cache collisions with other indices.\n",
"    transformation_name=\"docs-with_summaries\",\n",
")"
]
},
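{
"cell_type": "markdown",
"id": "99887766-5544-4332-8110-ffeeddccbbaa",
"metadata": {},
"source": [
"Before re-running the benchmark, you can spot-check what the augmented index retrieves. The query below is illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa11bb22-cc33-4d44-8e55-f66a77b88c99",
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the top documents for an (illustrative) table-oriented query.\n",
"retrieved = retriever_with_summaries.get_relevant_documents(\n",
"    \"What were the operating expenses in Q3 2023?\"\n",
")\n",
"for doc in retrieved[:3]:\n",
"    print(doc.page_content[:200], \"\\n---\")"
]
},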
{
"cell_type": "markdown",
"id": "3821e4b0-8e67-418a-840c-470fcde42df0",
"metadata": {},
"source": [
"### Evaluate\n",
"\n",
"We'll evaluate the new chain on the same dataset."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "e6f08c4c-a738-4449-9190-5a4f0b65b99a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'crazy-harmony-39' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/b69d796f-6ba4-4cde-822f-db363cf81f8f?eval=true\n",
"\n",
"View all tests for Dataset Semi-structured Reports at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962\n",
"[------------------------------------------------->] 5/5\n",
" Eval quantiles:\n",
" inputs.question \\\n",
"count 5 \n",
"unique 5 \n",
"top Analyzing the operating expenses for Q3 2023, ... \n",
"freq 1 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"\n",
" feedback.score_string:accuracy feedback.faithfulness \\\n",
"count 5.000000 5.0 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.720000 1.0 \n",
"std 0.408656 0.0 \n",
"min 0.100000 1.0 \n",
"25% 0.500000 1.0 \n",
"50% 1.000000 1.0 \n",
"75% 1.000000 1.0 \n",
"max 1.000000 1.0 \n",
"\n",
" feedback.embedding_cosine_distance error execution_time \n",
"count 5.000000 0 5.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.069363 NaN 8.659120 \n",
"std 0.023270 NaN 2.611724 \n",
"min 0.039593 NaN 6.283505 \n",
"25% 0.050176 NaN 6.723136 \n",
"50% 0.078912 NaN 7.441743 \n",
"75% 0.084389 NaN 10.673265 \n",
"max 0.093747 NaN 12.173952 \n"
]
}
],
"source": [
"chain_2 = create_chain(retriever_with_summaries)\n",
"\n",
"test_run_with_summaries = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=chain_2,\n",
" evaluation=RAG_EVALUATION,\n",
" verbose=True,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}