## Introduction
These tasks test an LLM's ability to extract structured output from an unstructured source, such as emails, websites, or other text. Below is a list of the supported datasets.
```python
from langchain_benchmarks import registry

registry.filter(Type="ExtractionTask")
```
| Name | Type | Dataset ID | Description |
|---|---|---|---|
| Email Extraction | ExtractionTask | a1742786-bde5-4f51-a1d8-e148e5251ddb | A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals. |
| Chat Extraction | ExtractionTask | 00f4444c-9460-4a82-b87a-f50096f1cfef | A dataset meant to test the ability of an LLM to extract and infer structured information from a dialogue. The dialogue is between a user and a support engineer. Outputs should be structured as a JSON object and test both the ability of the LLM to correctly structure the information and its ability to perform simple classification tasks. |
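
Individual tasks can be pulled from the registry by name. The following is a minimal sketch; name-based indexing and the attribute names shown are assumed to match the installed version of `langchain_benchmarks`:

```python
from langchain_benchmarks import registry

# Look up a single extraction task by name.
task = registry["Email Extraction"]

# Each task carries the metadata shown in the table above.
print(task.name)
print(task.dataset_id)
print(task.description)
```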
## Task resources
In addition to the `dataset_id`, `name`, and `description`, each extraction task provides the following:

- `schema` - a Pydantic model defining the schema (or schemas) the model should extract (an illustrative sketch follows below)
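
The exact model differs per task, but as a purely illustrative sketch (the `Email` class and its fields below are hypothetical, not the actual task schema), an extraction schema looks roughly like this:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Email(BaseModel):
    """Hypothetical extraction schema, for illustration only."""

    sender: Optional[str] = Field(None, description="Name of the sender, if stated")
    topic: str = Field(..., description="High-level topic of the email")
    action_items: List[str] = Field(
        default_factory=list, description="Action items requested of the recipient"
    )
```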
## Dataset schema
Each task corresponds to a LangSmith dataset with the following schema:
Inputs:

- `input` (str): the input text

Outputs:

- `output` (str): the expected extraction result, as a JSON object
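
Concretely, a record in one of these datasets has roughly the following shape (the values here are made up for illustration and are not taken from the actual dataset):

```python
# Illustrative record shape only; not an actual example from the dataset.
example = {
    "input": "Hi team, please review the attached Q3 report and send feedback by Friday.",
    "output": {
        "topic": "Q3 report review",
        "action_items": ["Review the attached Q3 report", "Send feedback by Friday"],
    },
}
```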
## Evaluation
The extraction tasks also have an evaluation config, which defines default LangSmith evaluators to apply when benchmarking your architecture.
```python
from langchain.chat_models import ChatOpenAI

from langchain_benchmarks.extraction import get_eval_config

# The evaluation LLM grades the extraction outputs; a fixed seed makes grading more reproducible.
eval_llm = ChatOpenAI(model="gpt-4", model_kwargs={"seed": 42})
eval_config = get_eval_config(eval_llm)
```
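
With the eval config in hand, a benchmark run looks roughly like the sketch below. It assumes a LangSmith API key is configured, that `clone_public_dataset` and `run_on_dataset` are available in your installed versions of `langchain_benchmarks` and `langchain`, and that `chain_factory` is a stand-in for your own extraction chain:

```python
from langsmith.client import Client

from langchain.smith import run_on_dataset
from langchain_benchmarks import clone_public_dataset, registry

task = registry["Email Extraction"]

# Copy the public dataset into your own LangSmith account before benchmarking.
clone_public_dataset(task.dataset_id, dataset_name=task.name)


def chain_factory():
    # Placeholder: construct and return your extraction chain here,
    # e.g. an LLM bound to task.schema via function calling.
    ...


client = Client()
run_on_dataset(
    client=client,
    dataset_name=task.name,
    llm_or_chain_factory=chain_factory,
    evaluation=eval_config,  # the default extraction evaluators from above
)
```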