Running Locally

The LangChain benchmarks package works best with LangSmith, which offers free accounts and has its own documentation.

If you cannot create an account, you can still run these benchmarks locally.

Below is an example.

# Remove any LANGCHAIN_* environment variables to prove that we can run without LangSmith
import os

for key in list(os.environ):
    if key.startswith("LANGCHAIN_"):
        del os.environ[key]

from langchain_benchmarks import registry

task = registry["Multiverse Math"]
task
Name: Multiverse Math
Type: ToolUsageTask
Dataset ID: 594f9f60-30a0-49bf-b075-f44beabf546a
Description: An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math.
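
To make the altered arithmetic concrete, here is a minimal sketch of what such a tool could look like. The `multiply` function below is a hypothetical illustration based only on the 5*3*1.1 example in the description; it is not the benchmark's actual implementation.

```python
import math


def multiply(a: float, b: float) -> float:
    """Hypothetical 'multiverse' multiplication: the usual product scaled by 1.1."""
    return 1.1 * a * b


# The altered result differs from the ordinary product of 5 and 3.
result = multiply(5, 3)

# The operation is still commutative and associative, as the description states.
assert math.isclose(multiply(5, 3), multiply(3, 5))
assert math.isclose(multiply(multiply(2, 3), 4), multiply(2, multiply(3, 4)))
```

An agent that answers from memorized arithmetic instead of calling the tool would get the wrong result, which is what the task is designed to detect.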

Eval

Now let’s evaluate an agent. Nothing is saved to LangSmith, so write the test results to your file system if you want to use them later.

import uuid

from langchain_benchmarks.tool_usage import agents, get_eval_config
from langchain_benchmarks.utils import run_without_langsmith

experiment_uuid = uuid.uuid4().hex[:4]


models = ["gpt-3.5-turbo-1106"]

for model in models:
    print()
    eval_config = get_eval_config(output_evaluation="qa_math")
    agent_factory = agents.OpenAIAgentFactory(task, model=model)
    test_run = run_without_langsmith(
        # This will clone the dataset locally if not already there
        path_or_token_id=task.dataset_id,
        llm_or_chain_factory=agent_factory,
        evaluation=eval_config,
        verbose=True,
    )

Experiment Results:

| | inputs.question | outputs.input | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|
| count | 10 | 10 | 10 | 10 | 10.0 | 10.0 | 10.0 | 0 | 10.000000 |
| unique | 10 | 10 | 1 | 1 | NaN | NaN | NaN | 0 | NaN |
| top | multiply the result of (log of 100 to base 10)... | multiply the result of (log of 100 to base 10)... | | [] | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | 1 | 10 | 10 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.453172 |
| std | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.496547 |
| min | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.763208 |
| 25% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.963885 |
| 50% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.593439 |
| 75% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.870549 |
| max | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.957470 |

# You can interact with the object directly or as a flattened dataframe
df = test_run.to_dataframe()
df.head()

| | inputs.question | outputs.input | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|
| 27c44572-6c67-4129-a95a-fe1509c350be | multiply the result of (log of 100 to base 10)... | multiply the result of (log of 100 to base 10)... | | [] | 0 | 0.0 | 0 | None | 0.763208 |
| 2a20a13d-050e-4a16-84ff-22d9582f1449 | after calculating the sin of 1.5 radians, divi... | after calculating the sin of 1.5 radians, divi... | | [] | 0 | 0.0 | 0 | None | 1.413695 |
| 67867526-791a-452f-b534-ef2c1f5efd20 | ecoli divides every 20 minutes. How many cells... | ecoli divides every 20 minutes. How many cells... | | [] | 0 | 0.0 | 0 | None | 1.773183 |
| 4ac33c1a-62f0-4da4-9455-07b582f6ff52 | calculate 101 to the power of 0.5 to 4 digits ... | calculate 101 to the power of 0.5 to 4 digits ... | | [] | 0 | 0.0 | 0 | None | 1.819677 |
| 2e82a924-8382-425e-8738-daa2d912e9fe | convert 15 degrees to radians | convert 15 degrees to radians | | [] | 0 | 0.0 | 0 | None | 1.957470 |

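
Because the flattened dataframe is an ordinary pandas object, the feedback columns can be aggregated directly. The sketch below uses a small stand-in dataframe: the column names are copied from the output above, but the values are invented for illustration and are not benchmark results.

```python
import pandas as pd

# Stand-in for test_run.to_dataframe(); the values here are made up.
df = pd.DataFrame(
    {
        "inputs.question": ["q1", "q2", "q3", "q4"],
        "feedback.correctness": [1, 0, 1, 1],
        "feedback.Intermediate steps correctness": [1, 0, 0, 1],
        "execution_time": [0.8, 1.4, 1.8, 2.0],
    }
)

# Select every feedback column and average it to get one score per metric.
feedback_cols = [col for col in df.columns if col.startswith("feedback.")]
summary = df[feedback_cols].mean()
print(summary)
```

Selecting columns by the `feedback.` prefix keeps the summary working if the evaluation config adds or removes metrics.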
df.to_csv("output.csv", index=False)
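
When several models are evaluated, a single `output.csv` is overwritten on each save. One possible approach, sketched below, is to build the file name from the model name and the short `experiment_uuid`, then reload the file to confirm it round-trips. The stand-in dataframe, the temporary directory, and the file-naming scheme are assumptions made for this example, not part of the benchmarks package.

```python
import tempfile
import uuid
from pathlib import Path

import pandas as pd

# Stand-in for test_run.to_dataframe(); the values here are made up.
df = pd.DataFrame(
    {
        "inputs.question": ["q1", "q2"],
        "feedback.correctness": [1, 0],
        "execution_time": [0.76, 1.41],
    }
)

model = "gpt-3.5-turbo-1106"
experiment_uuid = uuid.uuid4().hex[:4]

# A temporary directory keeps the sketch self-contained; use your own path in practice.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / f"{model}-{experiment_uuid}.csv"
df.to_csv(out_path, index=False)

# Reload the saved results for later analysis.
reloaded = pd.read_csv(out_path)
print(reloaded.head())
```

Note that `index=False` drops the dataframe index, which in the output above appears to hold the example IDs; omit it if you need to keep them.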