Email Extraction#

Let’s evaluate an LLM on its ability to extract structured information from email texts.

%pip install -U langchain langchain_benchmarks openai rapidfuzz
import os

# Get your API key from https://smith.langchain.com/settings
os.environ["LANGCHAIN_API_KEY"] = "sk-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.

task = registry["Email Extraction"]
task
Name Email Extraction
Type ExtractionTask
Dataset ID a1742786-bde5-4f51-a1d8-e148e5251ddb
Description A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals.
print(task.description)
A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail.

Some additional cleanup of the data was done by hand after the initial pass.

See https://github.com/jacoblee93/oss-model-extraction-evals.
    

Clone the dataset associated with this task

clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Email Extraction already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570.

Schema#

Each extraction task has an expected output schema defined in a Pydantic BaseModel object, which we can use to get a JSON schema object.

import pprint

pprint.pprint(task.schema.schema())
{'definitions': {'ToneEnum': {'description': 'The tone of the email.',
                              'enum': ['positive', 'negative'],
                              'title': 'ToneEnum',
                              'type': 'string'}},
 'description': 'Relevant information about an email.',
 'properties': {'action_items': {'description': 'A list of action items '
                                                'requested by the email',
                                 'items': {'type': 'string'},
                                 'title': 'Action Items',
                                 'type': 'array'},
                'sender': {'description': "The sender's name, if available",
                           'title': 'Sender',
                           'type': 'string'},
                'sender_address': {'description': "The sender's address, if "
                                                  'available',
                                   'title': 'Sender Address',
                                   'type': 'string'},
                'sender_phone_number': {'description': "The sender's phone "
                                                       'number, if available',
                                        'title': 'Sender Phone Number',
                                        'type': 'string'},
                'tone': {'allOf': [{'$ref': '#/definitions/ToneEnum'}],
                         'description': 'The tone of the email.'},
                'topic': {'description': 'High level description of what the '
                                         'email is about',
                          'title': 'Topic',
                          'type': 'string'}},
 'required': ['action_items', 'topic', 'tone'],
 'title': 'Email',
 'type': 'object'}
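
To make the schema concrete, here is a minimal sketch of a hand-rolled check for the fields listed above. It is not part of `langchain_benchmarks`; the helper name `validate_email` and its error messages are hypothetical, and a real project would more likely validate with the Pydantic model itself.

```python
# Hypothetical helper (not from langchain_benchmarks): checks a dict against
# the field names, required list, and tone enum printed in the schema above.
REQUIRED = ("action_items", "topic", "tone")
OPTIONAL_STRINGS = ("sender", "sender_address", "sender_phone_number")
TONES = {"positive", "negative"}


def validate_email(record: dict) -> list:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for key in REQUIRED:
        if key not in record:
            problems.append(f"missing required field: {key}")
    items = record.get("action_items", [])
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        problems.append("action_items must be a list of strings")
    if "topic" in record and not isinstance(record["topic"], str):
        problems.append("topic must be a string")
    if "tone" in record:
        tone = record["tone"]
        if not isinstance(tone, str) or tone not in TONES:
            problems.append(f"tone must be one of {sorted(TONES)}")
    for key in OPTIONAL_STRINGS:
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} must be a string when present")
    return problems


good = {"action_items": ["Buy an envelope"], "topic": "Request to send gold", "tone": "positive"}
bad = {"topic": "Request to send gold", "tone": "neutral"}
print(validate_email(good))
print(validate_email(bad))
```

Only `action_items`, `topic`, and `tone` are required, so a record that omits all sender fields still passes this check.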

Define an extraction chain#

Let’s build the extraction chain that we can use to get structured information from the emails.

from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0).bind_functions(
    functions=[task.schema],
    function_call=task.schema.schema()["title"],
)

output_parser = JsonOutputFunctionsParser()
extraction_chain = task.instructions | llm | output_parser | (lambda x: {"output": x})
extraction_chain.invoke(
    {
        "input": "Hello Dear MR. I want you to send me gold to get rich."
        " First buy an envelope. Then open it and put some gold inside. "
        "Then close it and finally mail it to my address at 12345 My Gold Way."
        " You can call me any time at 000-1212-1111."
    }
)
{'output': {'sender': 'Unknown',
  'sender_phone_number': '000-1212-1111',
  'sender_address': '12345 My Gold Way',
  'action_items': ['Buy an envelope',
   'Put gold inside',
   'Close the envelope',
   "Mail it to sender's address"],
  'topic': 'Request to send gold',
  'tone': 'positive'}}

Now it’s time to measure our chain’s effectiveness!

Evaluate#

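We run the chain over every example in the dataset. The evaluation config reports two kinds of feedback: a JSON edit distance against the reference output, and an accuracy score assigned by an LLM grader (GPT-4 here).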

from langsmith.client import Client

from langchain_benchmarks.extraction import get_eval_config
client = Client()
eval_llm = ChatOpenAI(model="gpt-4", model_kwargs={"seed": 42})
eval_config = get_eval_config(eval_llm)
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_metadata={
        "arch": "openai-functions",
    },
)
View the evaluation results for project 'monthly-look-12' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/177d564f-516d-4b65-bae0-37154b529470?eval=true

View all tests for Dataset Email Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570
[------------------------------------------------->] 42/42
 Eval quantiles:
                                             inputs.input  \
count                                                  42   
unique                                                 42   
top     ---  \n|\n\nEvery business faces its set of cu...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

                                           outputs.output  \
count                                                  42   
unique                                                 42   
top     {'sender': 'EMC Financial', 'sender_address': ...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

        feedback.json_edit_distance  feedback.score_string:accuracy error  \
count                     42.000000                       42.000000     0   
unique                          NaN                             NaN     0   
top                             NaN                             NaN   NaN   
freq                            NaN                             NaN   NaN   
mean                       0.566434                        0.485714   NaN   
std                        0.178473                        0.235374   NaN   
min                        0.190883                        0.100000   NaN   
25%                        0.441978                        0.300000   NaN   
50%                        0.581750                        0.300000   NaN   
75%                        0.687949                        0.700000   NaN   
max                        0.901852                        0.900000   NaN   

        execution_time  
count        42.000000  
unique             NaN  
top                NaN  
freq               NaN  
mean          3.527634  
std           0.518258  
min           2.579424  
25%           3.153659  
50%           3.525745  
75%           3.796416  
max           5.144408  

Compare to another LLM#

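Let’s compare to an Anthropic LLM. Instead of function calling, we prompt Claude 2 to respond with XML that follows an XSD version of the schema, then parse and flatten the result.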

from langchain.chat_models import ChatAnthropic
from langchain.output_parsers.xml import XMLOutputParser
from langchain.prompts import ChatPromptTemplate

# This is the schema the model will populate
xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="Email">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="sender" type="xs:string" minOccurs="0"/>
                <xs:element name="sender_phone_number" type="xs:string" minOccurs="0"/>
                <xs:element name="sender_address" type="xs:string" minOccurs="0"/>
                <xs:element name="action_items" type="ActionItemsType" minOccurs="1"/>
                <xs:element name="topic" type="xs:string" minOccurs="1"/>
                <xs:element name="tone" type="ToneEnumType" minOccurs="1"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:complexType name="ActionItemsType">
        <xs:sequence>
            <xs:element name="item" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:simpleType name="ToneEnumType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="positive"/>
            <xs:enumeration value="negative"/>
        </xs:restriction>
    </xs:simpleType>

</xs:schema>"""

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a data extraction bot. Always respond "
            "only with XML of the following schema:\n{xsd}",
        ),
        (
            "user",
            "Extract Email from the following Document:\n"
            "<Document>\n{input}\n</Document>\n"
            "RESPOND ONLY IN XML THEN STOP.",
        ),
    ]
).partial(xsd=xsd)

claude = ChatAnthropic(model="claude-2", temperature=1)


def convert_parsed_email(email_dict: dict) -> dict:
    """Convert the XML-parsed dictionary to a flattened dict."""
    if "Email" not in email_dict:
        return email_dict
    # Flatten the tags
    result = {k: v for item in email_dict["Email"] for k, v in item.items()}
    result["action_items"] = [
        item["item"] for item in (result.get("action_items") or [])
    ]
    return {"output": result}


claude_extraction_chain = prompt | claude | XMLOutputParser() | convert_parsed_email
result = claude_extraction_chain.invoke(
    {
        "input": "Hello Dear MR. I want you to send me gold to get rich."
        " First buy an envelope. Then open it and put some gold inside. "
        "Then close it and finally mail it to my address at 12345 My Gold Way."
        " You can call me any time at 000-1212-1111."
    }
)
result
{'output': {'sender': None,
  'sender_phone_number': '000-1212-1111',
  'sender_address': '12345 My Gold Way',
  'action_items': ['buy an envelope',
   'open it',
   'put some gold inside',
   'close it',
   'mail it to my address'],
  'topic': 'sending gold',
  'tone': 'negative'}}
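
To see the flattening step in isolation, here is a minimal standard-library sketch that turns an `<Email>` document into the same flat shape. It is a stand-in, not how `XMLOutputParser` works internally; the function name `flatten_email_xml` is hypothetical, and mapping empty tags to `None` is an assumption made to mirror the output above.

```python
import xml.etree.ElementTree as ET


def flatten_email_xml(xml_text: str) -> dict:
    """Parse an <Email> document and return a flat dict wrapped under 'output'."""
    root = ET.fromstring(xml_text)
    if root.tag != "Email":
        raise ValueError(f"expected <Email> root, got <{root.tag}>")
    result = {}
    for child in root:
        if child.tag == "action_items":
            # Collect the text of each nested <item> element.
            result["action_items"] = [
                (item.text or "").strip() for item in child.findall("item")
            ]
        else:
            text = (child.text or "").strip()
            result[child.tag] = text or None  # assumption: empty tags become None
    result.setdefault("action_items", [])
    return {"output": result}


sample = """<Email>
  <sender></sender>
  <sender_phone_number>000-1212-1111</sender_phone_number>
  <action_items>
    <item>buy an envelope</item>
    <item>mail it to my address</item>
  </action_items>
  <topic>sending gold</topic>
  <tone>negative</tone>
</Email>"""
flat = flatten_email_xml(sample)
print(flat)
```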
claude_test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=claude_extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_metadata={
        "arch": "claude-xml",
    },
)
View the evaluation results for project 'frosty-moon-4' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/81d41017-bcda-450d-8991-9bf744c7ebb8?eval=true

View all tests for Dataset Email Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570
[-------------------------------------->           ] 33/42
Chain failed for example 9a707fca-4ba7-4f7d-8912-b9fd71e9901e with inputs {'input': "---|---|---|---  \n  \nBook with Fall Sale Extras Through November 21! Savings! OBC! Visa Gift Card\n+ More  \n  \n---  \n|\n\n|  |  |  |  |  |  |  \n---  \n|  |  \n---  \n|  | SHOP THE FALL CRUISE SALE  \n---  \n|  |  \n---  \n  \n**Celebrity Cruises** Celebrity Cruises receive **Exclusive Pricing** with\nup to **$450 BONUS Savings per Stateroom** based on double\noccupancyand even more for extra guests! Enjoy **Exclusive Tips**\non 2024 sailings, up to**$2150 Onboard Credit** , and up to a **$1700 Visa\nGiftCard** on Galapagos sailings or up to a **$650 Visa Gift Card** on\nother departures. **Drinks** and **Wi-Fi** are All Included, too!   **See=\nThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Viking** Enjoy your favorite Viking voyages with up to=C2=A0 **$1200\nShipboard Credit** from Online Vacation Center when you book by Nov 21!\nPlus, select sailings get **Airfare** , **Stateroom Upgrades** , **Special\nFares** =C2=A0and only **$25 Deposits** on the world's #1 Cruise Line for\nOceans, Rivers & Expeditions! Guided Tours, Wi-Fi, Select Beverages, Meals &\nMore Included.   **SeeThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Royal Caribbean** Sail Away on Royal Caribbean withup to **$1000 BONUS\nOnboard Credit** and **Specialty Dining** exclusively from Online Vacation\nCenter!=C2=A0Plus, up to **30% SAVINGS** on all Cruises, **Kids Sail =\nFree** on select sailings and up to **$500 Savings on Airfare** on select\nAlaska and Europe sailings. **SeeThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Oceania Cruises** Choose Your Offer! Receive **Prepaid Gratuities** on\nselect sailings OR receive up to **$1000 Onboard Credit** on 30 Europe\nvoyages. Enjoy _simply_ MOREâ„¢ with **2 for 1** Cruise Fares, **Roundtrip\nAirfare** , Transfers & Taxes, **Unlimited Wi-Fi** , up to **$1600 Shore\nExcursion Credit** , Specialty Dining, Champagne, Wine, and more. 
Plus,\nreceive up to a **$1500 Visa Gift Card** from Online Vacation Center!\n**SeeThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Regent Seven Seas Cruises** Book your luxury cruise on Regent Seven Seas\nby Nov 21 and receive up to **$2000** in **Exclusive Savings** per Suite on\nall sailings through June 2026! Plus, enjoy **Bonus Savings =** worth up to\n**30%** on select 2024 sailings when you book by Nov 12. Receive up to a\n**$1400 Visa Gift Card** from us, and enjoy Regent standard inclusions like\n**Business Class Airfare** on intercontinental flights and **Airfare** on\ndomestic flights, **Shore Excursions** , **Gratuities** and More. **See This\nOffer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Azamara** Enjoy up to **$1500 Onboard Credit** , up to an=C2=A0 **$800\nVisa Gift Card** , **Stateroom Upgrades** and **20% Off Suites** onselect\nsailings, and More on Azamara during our Fall Sale! Plus up to a **$200\nBONUS Visa Gift Card** on our Exclusive Cruise Packages. Receive Azamara\nstandard inclusions like select **Beverages , **Gratuities** and More.   **See\nThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Norwegian Cruise Line** Enjoy up to **$1000 Onboard Credit** and\n**Gratuities** on 7+ night Balconies or higher during our Fall Sale! Plus\n**50% OFF** Cruise Fares and **Free at Sea:** Open Bar, Specialty Dining, =\nWi-Fi, Shore Excursion Credits and extra guests on select sailings. **See=\nThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Luxury Hotels** Whether your personal definition ofluxury is an urban\noasis or an opulent villa, a wine-country cottage or a Caribbean hammock,\nOnline Vacation Center has the perfect accommodations for your next\nvacation. Book now for **Exclusive Offers**  **Discounts** ,\n**Extra Nights** , **Resort Credits** , **Complimentary Amenities** and\nMore! 
**SeeThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Enrichment Journeys** Book an **Enrichment Journey** on Celebrity Cruises\nfor up to **$2150 Onboard Credit** , up to **$450 Off** per stateroom and up\nto a **$650 Visa Gift Card** with **Exclusive Tips** on 2024 sailings +\n**Drinks** and **Wi-Fi** All Included. Journeys include **Airfare**\n, 4-star+ **Hotel** Stays, **Transfers** , **Taxes** , select **Meals**\nand More. **SeeThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Princess Cruises** Enjoy up to **$1200 Onboard Credit** , up to **50% Off\nCruise Fares =** & **50% Off Deposits** during our Fall Sale! Choose =\nPrincess Plus to receive Included **Drinks, Crew Appreciation** & **Wi-Fi**\n_(over $950 in added value!)_ OR skip the frills for the lowest rate. **See\nThis Offer =E2=96=B8**  \n  \n|  |  \n---  \n  \n**Holland America Line** Get more on your Holland America cruise with up to\n**$1450 Onboard Credit** and **Gratuities** on select sailings, exclusively\nfrom us! Plus, **Have It All** with **Wi-Fi, Beverages, Specialty Dining**\nand **Shore Excursions** or skip the frills for a lower cruise fare. For a\nlimited time, enjoy **BONUS Shore Excursion** & **Air Credits** , $99\nDeposits and **Kids Sail Free** on select 2024 sailings. **SeeThis Offer\n=E2=96=B8**  \n  \n|  |  \n---  \n|  |  \n---  \n|  |  \n---  \n|\n\n### Hours of Operation\n\n**Monday=E2=80=93Friday** 9 am=E2=80=936 pm ET **Saturday** 10 am=E2=80=934\npm ET **Sunday** Closed  \n  \n---  \n|  |  \n---  \n  \n**Terms and Conditions** : New Bookings Only. Select Sailings Apply.\nRates, itinerary and any available amenities are by sail date and are\nsubject to change. **Repricing an existing reservation or requesting a\ncancel/rebook is not permitted for this promotion. This promotion is not\napplicable for reservations that used FCCs or utilized Lift & Shift program.\nCall to see what you qualify for (please note that any modifications may\nresult in a $100 per person change fee). 
Fall Sale**: Offer expires\n11/21/23. Airfare is included on select sailings from select gateways.\nAdditional gateways may be available for lowadd-ons. The identity of the air\ncarrier, which may include the carrier's code-share partner, will be\nassigned and disclosed at a later date. Purchases made onboard plane or in\nterminal not included. Onboard Credit isper stateroom on select sailings.\nPrices are per person, double occupancy.Prices and itineraries are based on\navailability and are subject to changewithout notice. Offer can be withdrawn\nat any time. All fares may be subject to fuel surcharges if imposed by\ncruise lines and airlines. Government taxes, air taxes, transfers, service\nfees and other ancillary charges are additional unless otherwise noted.\nAdditional terms, conditionsand restrictions apply; view individual offers\nfor more information. Online Vacation Center reserves the right to cancel\nthe Offer at any time, correct any errors, inaccuracies or omissions, and\nchange or update fares, fees and surcharges at any time without prior\nnotice. Online Vacation Center is a registered Seller of Travel with the\nStates of Florida (ST-32947), California (CST-2064227-40) and Washington (WA\nSOT 602250083). 110823CB  \n  \n|  |  \n---  \n  \n* * *\n\nThis message was sent to address: jacob@gmail.com  \n  \nMore Travel Deals \\- Sign Up \\- Forward to Friend \\- Unsubscribe \\- Privacy \\-\nDisclaimers  \n  \n(C) 2023 Dunhill Vacations Inc. - 2307 W. Broward Blvd, Ste 402 - Fort\nLauderdale, FL 33312  \n  \n---  \n\\----_NmP-64d90535a0e2740e-Part_1--\n\n"}
Error Type: ValueError, Message: Could not parse output: <Email>
  <sender></sender>
  <sender_phone_number></sender_phone_number>
  <sender_address></sender_address>
  <action_items>
    <item>Book Celebrity Cruises by Nov 21 for exclusive pricing, bonuses, and gifts</item>
    <item>Book Viking by Nov 21 for bonuses and special offers</item>
    <item>Book Royal Caribbean by Nov 21 for onboard credits, dining, and savings</item>
    <item>Book Oceania Cruises by Nov 21 for prepaid gratuities or onboard credits</item>
    <item>Book Regent Seven Seas by Nov 21 for exclusive savings and gift cards</item>
    <item>Book Azamara by Nov 21 for onboard credits, upgrades, and savings</item>
    <item>Book Norwegian Cruise Line for discounts, amenities, and savings</item>
    <item>Book luxury hotels for exclusive offers and discounts</item>
    <item>Book an Enrichment Journey on Celebrity Cruises for bonuses and inclusions</item>
    <item>Book Princess Cruises for discounts, amenities, and onboard credits</item>
    <item>Book Holland America Line for bonuses,
[------------------------------------------------->] 42/42
 Eval quantiles:
                                             inputs.input  \
count                                                  42   
unique                                                 42   
top     ---  \n|\n\nEvery business faces its set of cu...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

                                           outputs.output  \
count                                                  41   
unique                                                 41   
top     {'sender': 'Sam', 'sender_phone_number': '800....   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

        feedback.json_edit_distance  feedback.score_string:accuracy  \
count                     41.000000                       41.000000   
unique                          NaN                             NaN   
top                             NaN                             NaN   
freq                            NaN                             NaN   
mean                       0.382352                        0.565854   
std                        0.164442                        0.238338   
min                        0.107011                        0.100000   
25%                        0.252252                        0.300000   
50%                        0.375427                        0.700000   
75%                        0.532982                        0.700000   
max                        0.753704                        1.000000   

                                                    error  execution_time  
count                                                   1       42.000000  
unique                                                  1             NaN  
top     Could not parse output: <Email>\n  <sender></s...             NaN  
freq                                                    1             NaN  
mean                                                  NaN        9.082149  
std                                                   NaN        2.192165  
min                                                   NaN        6.203642  
25%                                                   NaN        7.807354  
50%                                                   NaN        8.497452  
75%                                                   NaN        9.632442  
max                                                   NaN       19.564479  
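
The single failure above came from a response that was cut off mid-element, so the XML could not be parsed. As a rough sketch of one way to surface that as data instead of an exception, the hypothetical helper below uses only the standard library. The `topic` value in the sample strings is made up for illustration; only the truncated `<item>` text comes from the log above.

```python
import xml.etree.ElementTree as ET


def safe_parse(xml_text: str) -> dict:
    """Return {'output': ...} on success or {'error': ...} when the XML is malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return {"error": f"Could not parse output: {exc}"}
    # Keep only the direct children as tag -> stripped text.
    return {"output": {child.tag: (child.text or "").strip() for child in root}}


# Illustrative inputs: the first mimics a response that stops mid-element.
truncated = (
    "<Email>\n  <topic>Fall cruise sale</topic>\n  <action_items>\n"
    "    <item>Book Holland America Line for bonuses,"
)
complete = "<Email><topic>Fall cruise sale</topic><tone>positive</tone></Email>"

print(safe_parse(truncated))
print(safe_parse(complete))
```

Whether to count such a row as a zero or to exclude it from the averages is a design choice; the run above excludes it, which is why the Claude feedback columns have a count of 41.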

Inspect#

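Here, we’ll take a closer look at the underlying results.

A few things to note:

  • For this run, Claude does better on average on both metrics: a mean LLM-graded accuracy of 0.57 versus 0.49 for the OpenAI chain, and a mean JSON edit distance of 0.38 versus 0.57 (for a distance, lower is better).

  • Claude’s averages cover 41 of the 42 examples, because one response could not be parsed as XML. Its mean execution time was also higher (about 9.1 versus 3.5).

  • Accuracy is low in absolute terms for both models - getting every extracted field exactly right can be difficult.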
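
For intuition about the `json_edit_distance` feedback, here is a minimal sketch of a normalized edit distance over canonicalized JSON strings. It is a stand-in under stated assumptions: the real evaluator relies on `rapidfuzz`, and its exact distance variant and normalization may differ. The names `levenshtein` and `json_distance` are hypothetical.

```python
import json


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def json_distance(prediction: dict, reference: dict) -> float:
    """Edit distance between canonical JSON strings, scaled to [0, 1]; 0.0 means identical."""
    a = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    b = json.dumps(reference, sort_keys=True, separators=(",", ":"))
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(json_distance({"tone": "positive"}, {"tone": "negative"}))
```

Sorting keys before comparing means two outputs that differ only in key order get a distance of 0.0, so the score reflects content differences and not serialization order.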

df = test_run.to_dataframe().join(claude_test_run.to_dataframe(), rsuffix="_claude")
df.head(5)
inputs.input outputs.output reference feedback.json_edit_distance feedback.score_string:accuracy error execution_time inputs.input_claude outputs.output_claude reference_claude feedback.json_edit_distance_claude feedback.score_string:accuracy_claude error_claude execution_time_claude
61c40266-b994-49a2-8768-d54704cee079 --- \n|\n\nEvery business faces its set of cu... {'sender': 'EMC Financial', 'sender_address': ... {'output': {'tone': 'positive', 'topic': 'Busi... 0.562112 0.7 None 4.358837 --- \n|\n\nEvery business faces its set of cu... {'sender': 'Sam', 'sender_phone_number': '800.... {'output': {'tone': 'positive', 'topic': 'Busi... 0.301242 0.7 None 10.501042
2dcfadff-51dc-458c-8af0-f47a795d0c9b Hello Jacob!\n\n \n\nHave you noticed thesurg... {'sender': 'Sam at EMC', 'action_items': ['Fil... {'output': {'tone': 'positive', 'topic': 'Gree... 0.505338 0.7 None 3.946547 Hello Jacob!\n\n \n\nHave you noticed thesurg... {'sender': 'Sam at EMC', 'sender_phone_number'... {'output': {'tone': 'positive', 'topic': 'Gree... 0.113879 0.7 None 8.511848
a9c481ba-9ca5-408c-8c9c-f29127a70f7b Hi there,\n\n | \n--- \n \nWe've updated ou... {'sender': 'Crunchbase Team', 'action_items': ... {'output': {'tone': 'positive', 'topic': 'Upda... 0.245283 0.9 None 3.972396 Hi there,\n\n | \n--- \n \nWe've updated ou... {'sender': None, 'sender_phone_number': None, ... {'output': {'tone': 'positive', 'topic': 'Upda... 0.343434 0.7 None 9.739630
98358188-6e36-42ef-9298-83acf8d9dd12 Consider all ways to give to \nSave the Redwo... {'sender': 'Tim Whalen', 'sender_address': 'Sa... {'output': {'tone': 'positive', 'topic': 'Dona... 0.280556 0.7 None 3.890567 Consider all ways to give to \nSave the Redwo... {'sender': None, 'sender_phone_number': None, ... {'output': {'tone': 'positive', 'topic': 'Dona... 0.255556 0.3 None 9.640687
0f29e857-fc08-45dd-b1ea-dde1e00c4a62 Some travelers plan ahead; others prefer a bit... {'sender': 'Dunhill Vacations Inc.', 'sender_a... {'output': {'tone': 'positive', 'topic': 'Trav... 0.552463 0.7 None 4.252478 Some travelers plan ahead; others prefer a bit... {'sender': 'Dunhill Vacations Inc.', 'sender_p... {'output': {'tone': 'positive', 'topic': 'Trav... 0.584582 0.3 None 6.803259
(
    df["feedback.json_edit_distance"].mean(),
    df["feedback.json_edit_distance_claude"].mean(),
)
(0.5664337704936568, 0.382351925386955)
(
    df["feedback.score_string:accuracy"].mean(),
    df["feedback.score_string:accuracy_claude"].mean(),
)
(0.48571428571428565, 0.5658536585365853)
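
Note that the two means above are taken over different row counts (42 for OpenAI, 41 for Claude), since the failed Claude example has no feedback. As a small sketch of a paired comparison restricted to examples that both chains scored, using made-up scores and hypothetical example ids:

```python
from statistics import mean

# Made-up scores keyed by example id; None marks a failed run with no feedback.
openai_scores = {"ex1": 0.7, "ex2": 0.3, "ex3": 0.9}
claude_scores = {"ex1": 0.7, "ex2": 0.7, "ex3": None}

# Keep only examples where both chains produced a score.
paired = [
    (openai_scores[k], claude_scores[k])
    for k in openai_scores
    if claude_scores.get(k) is not None
]

openai_mean = mean(o for o, _ in paired)
claude_mean = mean(c for _, c in paired)
claude_wins = sum(c > o for o, c in paired)

print(len(paired), openai_mean, claude_mean, claude_wins)
```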
# Rows for which OAI > Claude by at least 0.3 accuracy points, according to the LLM-based evaluator
oai_beats_claude = df[
    (df["feedback.score_string:accuracy"] - df["feedback.score_string:accuracy_claude"])
    >= 0.3
]

oai_beats_claude[["inputs.input", "outputs.output", "outputs.output_claude"]]
inputs.input outputs.output outputs.output_claude
98358188-6e36-42ef-9298-83acf8d9dd12 Consider all ways to give to \nSave the Redwo... {'sender': 'Tim Whalen', 'sender_address': 'Sa... {'sender': None, 'sender_phone_number': None, ...
0f29e857-fc08-45dd-b1ea-dde1e00c4a62 Some travelers plan ahead; others prefer a bit... {'sender': 'Dunhill Vacations Inc.', 'sender_a... {'sender': 'Dunhill Vacations Inc.', 'sender_p...
35414bbc-4d38-41ed-876f-2a6a067e66d5 --- \n \n|\n\nWe Passed the Stop Dangerous P... {'sender': 'Matt Haney', 'sender_address': '10... {'sender': 'Matt Haney', 'sender_phone_number'...
ff1b2ed6-26a7-4501-96aa-6e3e10eadc72 --- \n|\n\n# We Provide Unique Financing Opti... {'sender': 'info@championadvance.com', 'sender... {'sender': None, 'sender_phone_number': None, ...
# Rows for which Claude > OAI by at least 0.5 accuracy points, according to the LLM-based evaluator
claude_beats_oai = df[
    (df["feedback.score_string:accuracy_claude"] - df["feedback.score_string:accuracy"])
    >= 0.5
]

claude_beats_oai[["inputs.input", "outputs.output", "outputs.output_claude"]]
inputs.input outputs.output outputs.output_claude
02cfdfc4-c3dc-47e6-ad44-8e437ebf2dce ---|---|---|--- \n \n| \n--- \n **Limited ... {'action_items': [], 'topic': 'Limited Time Up... {'sender': 'Dunhill Vacations Inc.', 'sender_p...
198dc232-8f98-484a-a65e-048cfb517282 Hello Jacob,\n\n \n\nFor many small businesse... {'sender': 'Sam at EMC', 'action_items': ['Kic... {'sender': 'Sam at EMC', 'sender_phone_number'...
c222957f-cc7e-46af-9cca-1270f3fa5621 Hello Jacob,\n\n \n\nDo you know what Fortune... {'sender': 'Sam at EMC', 'action_items': ['qua... {'sender': 'Sam at EMC', 'sender_phone_number'...
119ef037-8744-4eb9-93df-64458278e4f8 --- \n| | QUALIFY NOW \n--- \n \n \nHell... {'sender': 'Sam at EMC', 'action_items': ['Che... {'sender': 'Sam at EMC id:2023-09-19-20:17:53:...