{
"cells": [
{
"cell_type": "markdown",
"id": "9fa3470d-9448-4792-9f65-6978fc58cf84",
"metadata": {},
"source": [
"# Multi-modal eval: Baseline\n",
"\n",
"`Multi-modal slide decks` is a public dataset that contains a dataset of question-answer pairs from slide decks with visual content.\n",
"\n",
"The question-answer pairs are derived from the visual content in the decks, testing the ability of RAG to perform visual reasoning.\n",
"\n",
"As a baseline, we evaluate this dataset using text-based RAG pipeline, below.\n",
"\n",
"This will not reason about visual content and will simply load the text from the slides. \n",
"\n",
"## Pre-requisites"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47220461-d4e9-4f1d-9c57-672ca947ca0d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# %pip install -U langchain langsmith langchain_benchmarks\n",
"# %pip install --quiet chromadb openai pypdf pandas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "196de967-6de6-40da-aa75-e836923ab5e3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"\n",
"env_vars = [\"LANGCHAIN_API_KEY\", \"OPENAI_API_KEY\"]\n",
"for var in env_vars:\n",
" if var not in os.environ:\n",
" os.environ[var] = getpass.getpass(prompt=f\"Enter your {var}: \")"
]
},
{
"cell_type": "markdown",
"id": "10da8e11-6288-4131-bd60-d5aa86928acc",
"metadata": {},
"source": [
"## Dataset\n",
"\n",
"We can browse the available LangChain benchmark datasets for retrieval."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2ff97905-14a6-413c-99be-58b7a9c8d4c1",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"Name | Type | Dataset ID | Description |
\n",
"\n",
"\n",
"LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs.\n",
"\n",
"The environment provides the documents and the retriever information.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer.\n",
"We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
\n",
"Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts.\n",
"\n",
"The task provides the raw documents as well as factory methods to easily index them\n",
"and create a retriever.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer.\n",
"We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
\n",
"Multi-modal slide decks | RetrievalTask | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 | This public dataset is a work-in-progress and will be extended over time.\n",
" \n",
"Questions and answers based on slide decks containing visual tables and charts.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer. |
\n",
"\n",
"
"
],
"text/plain": [
"Registry(tasks=[RetrievalTask(name='LangChain Docs Q&A', dataset_id='https://smith.langchain.com/public/452ccafc-18e1-4314-885b-edd735f17b9d/d', description=\"Questions and answers based on a snapshot of the LangChain python docs.\\n\\nThe environment provides the documents and the retriever information.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\nWe also measure the faithfulness of the model's response relative to the retrieved documents (if any).\\n\", get_docs=, retriever_factories={'basic': , 'parent-doc': , 'hyde': }, architecture_factories={'conversational-retrieval-qa': }), RetrievalTask(name='Semi-structured Reports', dataset_id='https://smith.langchain.com/public/c47d9617-ab99-4d6e-a6e6-92b8daf85a7d/d', description=\"Questions and answers based on PDFs containing tables and charts.\\n\\nThe task provides the raw documents as well as factory methods to easily index them\\nand create a retriever.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\nWe also measure the faithfulness of the model's response relative to the retrieved documents (if any).\\n\", get_docs=, retriever_factories={'basic': , 'parent-doc': , 'hyde': }, architecture_factories={}), RetrievalTask(name='Multi-modal slide decks', dataset_id='https://smith.langchain.com/public/40afc8e7-9d7e-44ed-8971-2cae1eb59731/d', description='This public dataset is a work-in-progress and will be extended over time.\\n \\nQuestions and answers based on slide decks containing visual tables and charts.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\n', get_docs={}, retriever_factories={}, architecture_factories={})])"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_benchmarks import clone_public_dataset, registry\n",
"\n",
"registry = registry.filter(Type=\"RetrievalTask\")\n",
"registry"
]
},
{
"cell_type": "markdown",
"id": "2fb7dc3d-28f1-4c28-b0d0-3784d04b81ce",
"metadata": {},
"source": [
"`Multi-modal slide decks` is the relevant dataset for our task."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "219a4141-4a5f-48e4-ae05-5a824e2193fd",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"Name | Multi-modal slide decks |
\n",
"Type | RetrievalTask |
\n",
"Dataset ID | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 |
\n",
"Description | This public dataset is a work-in-progress and will be extended over time.\n",
" \n",
"Questions and answers based on slide decks containing visual tables and charts.\n",
"\n",
"Each example is composed of a question and reference answer.\n",
"\n",
"Success is measured based on the accuracy of the answer relative to the reference answer. |
\n",
"Retriever Factories | |
\n",
"Architecture Factories | |
\n",
"get_docs | {} |
\n",
"\n",
"
"
],
"text/plain": [
"RetrievalTask(name='Multi-modal slide decks', dataset_id='https://smith.langchain.com/public/40afc8e7-9d7e-44ed-8971-2cae1eb59731/d', description='This public dataset is a work-in-progress and will be extended over time.\\n \\nQuestions and answers based on slide decks containing visual tables and charts.\\n\\nEach example is composed of a question and reference answer.\\n\\nSuccess is measured based on the accuracy of the answer relative to the reference answer.\\n', get_docs={}, retriever_factories={}, architecture_factories={})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"task = registry[\"Multi-modal slide decks\"]\n",
"task"
]
},
{
"cell_type": "markdown",
"id": "2d6569b5-e79a-41b7-9745-c2f8a1dd704e",
"metadata": {},
"source": [
"Clone the dataset so that it's available in our LangSmith datasets."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d2caa086-9549-4c74-bba9-ba80d5a7b218",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset Multi-modal slide decks already exists. Skipping.\n",
"You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306.\n"
]
}
],
"source": [
"clone_public_dataset(task.dataset_id, dataset_name=task.name)"
]
},
{
"cell_type": "markdown",
"id": "bf350917-a1e5-46f4-81cd-c1678ab9220f",
"metadata": {},
"source": [
"Fetch the associated PDFs from remote cache for the dataset so that we can perform ingestion."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "99ce6afb-2317-4bc1-9faf-4f828095ad91",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_benchmarks.rag.tasks.multi_modal_slide_decks import get_file_names\n",
"\n",
"file_names = list(get_file_names()) # PosixPath"
]
},
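{
"cell_type": "markdown",
"id": "f3b9a1c2-0001-4d2e-9a6b-111111111111",
"metadata": {},
"source": [
"Optionally, confirm that the PDFs were fetched before ingesting them. The check below is a minimal sketch; it assumes `get_file_names` returns `PosixPath` objects that point at files in the local cache."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b9a1c2-0002-4d2e-9a6b-111111111111",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm each fetched PDF exists on disk\n",
"for f in file_names:\n",
"    print(f.name, \"exists\" if f.exists() else \"MISSING\")"
]
},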
{
"cell_type": "markdown",
"id": "848a4cdb-6c08-4c01-81ce-16ab83a7fdff",
"metadata": {},
"source": [
"## Load\n",
"\n",
"Load and split the files for indexing."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6ce85810-98a7-406e-b44e-ce860ac35986",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 98 text elements in DDOG_Q3_earnings_deck.pdf\n"
]
}
],
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"\n",
"def load_and_split(file):\n",
" \"\"\"\n",
" Load and split PDF files\n",
" :param file: PosixPath path for pdf\n",
" :return: A list of text chunks\n",
" \"\"\"\n",
"\n",
" loader = PyPDFLoader(str(file))\n",
" pdf_pages = loader.load()\n",
"\n",
" text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
" chunk_size=100, chunk_overlap=50\n",
" )\n",
"\n",
" # Get chunks\n",
" docs = text_splitter.split_documents(pdf_pages)\n",
" texts = [d.page_content for d in docs]\n",
" print(f\"There are {len(texts)} text elements in {file.name}\")\n",
" return texts\n",
"\n",
"\n",
"texts = []\n",
"for fi in file_names:\n",
" texts.extend(load_and_split(fi))"
]
},
{
"cell_type": "markdown",
"id": "eb01925d-b7d1-47a1-bd90-805178d3c4a9",
"metadata": {},
"source": [
"## Index\n",
"\n",
"Embed (OpenAIEmbeddings) and store splits in a vectorstore (Chroma)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ceb31f71-45fb-4b12-bc1c-31981de334bb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.vectorstores import Chroma\n",
"\n",
"vectorstore_baseline = Chroma.from_texts(\n",
" texts=texts, collection_name=\"baseline-multi-modal\", embedding=OpenAIEmbeddings()\n",
")\n",
"\n",
"retriever_baseline = vectorstore_baseline.as_retriever()"
]
},
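{
"cell_type": "markdown",
"id": "f3b9a1c2-0003-4d2e-9a6b-111111111111",
"metadata": {},
"source": [
"As a quick, optional check of the index, we can look at the chunks retrieved for a sample question. The question below is illustrative only (it is not taken from the dataset); the retriever returns the chunks ranked by embedding similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b9a1c2-0004-4d2e-9a6b-111111111111",
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the chunks retrieved for an illustrative question\n",
"sample_question = \"How many customers does Datadog have?\"  # illustrative question\n",
"for d in retriever_baseline.get_relevant_documents(sample_question):\n",
"    print(d.page_content[:100], \"...\")"
]
},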
{
"cell_type": "markdown",
"id": "e6dcbb01-f480-456d-b972-c732eb26c393",
"metadata": {},
"source": [
"## RAG\n",
"\n",
"Create a pipeline for retrieval of relevant chunks based on semantic similarity to the input question.\n",
"\n",
"Pass the images to GPT-4 for answer synthesis."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ea233664-e527-42f1-a820-0c2271e16c20",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"\n",
"\n",
"def rag_chain(retriever):\n",
" \"\"\"\n",
" RAG pipeline for the indexed presentations\n",
" :param retriever: PosixPath path for pdf\n",
" \"\"\"\n",
"\n",
" # Prompt template\n",
" template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
" {context}\n",
" Question: {question}\n",
" \"\"\"\n",
" prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
" # LLM\n",
" model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"\n",
" # RAG pipeline\n",
" chain = (\n",
" {\n",
" \"context\": retriever | (lambda x: \"\\n\\n\".join([i.page_content for i in x])),\n",
" \"question\": RunnablePassthrough(),\n",
" }\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
" )\n",
" return chain\n",
"\n",
"\n",
"# Create RAG chain\n",
"chain = rag_chain(retriever_baseline)"
]
},
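{
"cell_type": "markdown",
"id": "f3b9a1c2-0005-4d2e-9a6b-111111111111",
"metadata": {},
"source": [
"Before running the full evaluation, the chain can be smoke-tested on a single question. The question below is illustrative rather than drawn from the dataset; the chain accepts a plain string because `RunnablePassthrough` forwards it as the `question` while the retriever fills in the `context`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b9a1c2-0006-4d2e-9a6b-111111111111",
"metadata": {},
"outputs": [],
"source": [
"# Optional smoke test on a single illustrative question\n",
"chain.invoke(\"How many customers does Datadog have?\")"
]
},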
{
"cell_type": "markdown",
"id": "95df1446-143d-4f4c-a15b-2a379266d8bf",
"metadata": {},
"source": [
"## Eval\n",
"\n",
"Run evaluation on our dataset:\n",
"\n",
"* `task.name` is the dataset of QA pairs that we cloned\n",
"* `eval_config` specifies the [LangSmith evaluator](https://docs.smith.langchain.com/evaluation/evaluator-implementations#correctness-qa-evaluation) for our dataset, which will use GPT-4 as a grader\n",
"* The grader will evaluate the chain-generated answer to each question relative to ground truth"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "479ce09d-642e-4b3b-9e4e-e9c2b7f0e9ca",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project '866f-baseline' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=30199d47-50d7-4c5c-a55a-e74157e05951\n",
"\n",
"View all tests for Dataset Multi-modal slide decks at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306\n",
"[------------------------------------------------->] 10/10"
]
},
{
"data": {
"text/html": [
"Experiment Results:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" output | \n",
" feedback.COT Contextual Accuracy | \n",
" error | \n",
" execution_time | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 10 | \n",
" 10.000000 | \n",
" 0 | \n",
" 10.000000 | \n",
"
\n",
" \n",
" unique | \n",
" 10 | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" Datadog has 20 total customers. | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 0.200000 | \n",
" NaN | \n",
" 4.674478 | \n",
"
\n",
" \n",
" std | \n",
" NaN | \n",
" 0.421637 | \n",
" NaN | \n",
" 0.864273 | \n",
"
\n",
" \n",
" min | \n",
" NaN | \n",
" 0.000000 | \n",
" NaN | \n",
" 3.307960 | \n",
"
\n",
" \n",
" 25% | \n",
" NaN | \n",
" 0.000000 | \n",
" NaN | \n",
" 4.113816 | \n",
"
\n",
" \n",
" 50% | \n",
" NaN | \n",
" 0.000000 | \n",
" NaN | \n",
" 4.700962 | \n",
"
\n",
" \n",
" 75% | \n",
" NaN | \n",
" 0.000000 | \n",
" NaN | \n",
" 5.018359 | \n",
"
\n",
" \n",
" max | \n",
" NaN | \n",
" 1.000000 | \n",
" NaN | \n",
" 6.188082 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" output feedback.COT Contextual Accuracy \\\n",
"count 10 10.000000 \n",
"unique 10 NaN \n",
"top Datadog has 20 total customers. NaN \n",
"freq 1 NaN \n",
"mean NaN 0.200000 \n",
"std NaN 0.421637 \n",
"min NaN 0.000000 \n",
"25% NaN 0.000000 \n",
"50% NaN 0.000000 \n",
"75% NaN 0.000000 \n",
"max NaN 1.000000 \n",
"\n",
" error execution_time \n",
"count 0 10.000000 \n",
"unique 0 NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean NaN 4.674478 \n",
"std NaN 0.864273 \n",
"min NaN 3.307960 \n",
"25% NaN 4.113816 \n",
"50% NaN 4.700962 \n",
"75% NaN 5.018359 \n",
"max NaN 6.188082 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import uuid\n",
"\n",
"from langchain.smith import RunEvalConfig\n",
"from langsmith.client import Client\n",
"\n",
"# Evaluator configuration\n",
"client = Client()\n",
"eval_config = RunEvalConfig(\n",
" evaluators=[\"cot_qa\"],\n",
")\n",
"\n",
"# Experiments\n",
"chain_map = {\n",
" \"baseline\": chain,\n",
"}\n",
"\n",
"# Run evaluation\n",
"run_id = uuid.uuid4().hex[:4]\n",
"test_runs = {}\n",
"for project_name, chain in chain_map.items():\n",
" test_runs[project_name] = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=lambda: (lambda x: x[\"Question\"]) | chain,\n",
" evaluation=eval_config,\n",
" verbose=True,\n",
" project_name=f\"{run_id}-{project_name}\",\n",
" project_metadata={\"chain\": project_name},\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}