Introduction#

These tasks refer to an LLM’s ability to extract structured output from an unstructured source, such as emails, websites, or other text. Below is a list of supported datasets.

from langchain_benchmarks import registry

registry.filter(Type="ExtractionTask")
  • Email Extraction (Type: ExtractionTask; Dataset ID: a1742786-bde5-4f51-a1d8-e148e5251ddb): A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals.

  • Chat Extraction (Type: ExtractionTask; Dataset ID: 00f4444c-9460-4a82-b87a-f50096f1cfef): A dataset meant to test the ability of an LLM to extract and infer structured information from a dialogue. The dialogue is between a user and a support engineer. Outputs should be structured as a JSON object and test both the ability of the LLM to correctly structure the information and its ability to perform simple classification tasks.
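
An individual task can be pulled from the registry by name; for example, using the Email Extraction task listed above:

from langchain_benchmarks import registry

# Fetch a single task object by its registry name
task = registry["Email Extraction"]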

Task resources#

In addition to the dataset_id, name, and description, each extraction task provides the following:

  • schema - a Pydantic BaseModel defining the schema (or schemas) the model should extract (see the example below)
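
The extraction schema can be inspected directly. A minimal sketch, assuming task is the Email Extraction task loaded above and that the schema is a Pydantic v1-style model class exposing .schema():

# Render the task's extraction schema as JSON Schema (assumes a Pydantic v1-style model)
task.schema.schema()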

Dataset schema#

Each task corresponds to a LangSmith dataset with the following schema:

Inputs:

  • input: str - the input text

Outputs:

  • output: str - the expected extraction result, as a JSON object
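
To browse concrete examples, the public dataset can be cloned into your own LangSmith account and read back with the LangSmith client. A minimal sketch, assuming a LangSmith API key is configured and task is the task object loaded above:

from langsmith import Client
from langchain_benchmarks import clone_public_dataset

# Copy the public dataset into your LangSmith account (skipped if it already exists)
clone_public_dataset(task.dataset_id, dataset_name=task.name)

client = Client()
example = next(client.list_examples(dataset_name=task.name))
print(example.inputs)   # {"input": "..."}
print(example.outputs)  # {"output": ...}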

Evaluation#

The extraction tasks also have an evaluation config, which defines default LangSmith evaluators to apply when benchmarking your architecture.

from langchain.chat_models import ChatOpenAI
from langchain_benchmarks.extraction import get_eval_config

eval_llm = ChatOpenAI(model="gpt-4", model_kwargs={"seed": 42})
eval_config = get_eval_config(eval_llm)
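
The resulting config can then be passed as the evaluation argument when running your architecture over the dataset. A minimal sketch, assuming task is the task object loaded above and extraction_chain is a placeholder for your own extraction chain or chain factory:

from langsmith.client import Client
from langchain.smith import run_on_dataset

client = Client()
test_run = run_on_dataset(
    client=client,
    dataset_name=task.name,                 # e.g. "Email Extraction"
    llm_or_chain_factory=extraction_chain,  # placeholder for your own chain or factory
    evaluation=eval_config,
)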