Running Locally

The LangChain benchmarks package is best used with LangSmith, which offers free accounts and has its own documentation.

If you cannot create an account, you can still run the benchmarks locally.

The example below removes every `LANGCHAIN_`-prefixed environment variable and then loads a task from the registry.

# Prove that we can run without LangSmith by removing every
# LANGCHAIN_* environment variable before importing the package.
import os

for key in list(os.environ):
    if key.startswith("LANGCHAIN_"):
        del os.environ[key]

from langchain_benchmarks import registry

task = registry["Multiverse Math"]
task

Name	Multiverse Math
Type	ToolUsageTask
Dataset ID	594f9f60-30a0-49bf-b075-f44beabf546a
Description	An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math.
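
The environment cleanup at the start of the example above is permanent for the running process. If you would rather restore the variables afterwards, a context manager is one option. The sketch below is only an illustration: `without_env_prefix` is a hypothetical helper name and is not part of `langchain_benchmarks`.

```python
import os
from contextlib import contextmanager


@contextmanager
def without_env_prefix(prefix):
    """Temporarily remove every environment variable whose name starts with `prefix`.

    Hypothetical helper for illustration; not part of langchain_benchmarks.
    """
    # Pop the matching variables and remember their values.
    removed = {
        key: os.environ.pop(key)
        for key in list(os.environ)
        if key.startswith(prefix)
    }
    try:
        yield removed
    finally:
        # Put the original values back, even if the body raised.
        os.environ.update(removed)
```

Code run inside `with without_env_prefix("LANGCHAIN_"):` sees no `LANGCHAIN_` variables, and they are restored when the block exits.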

Eval

Let’s evaluate an agent now. Nothing will be saved to LangSmith, so save the test results to your file system if you want to use them later.

import uuid

from langchain_benchmarks.tool_usage import agents, get_eval_config
from langchain_benchmarks.utils import run_without_langsmith

experiment_uuid = uuid.uuid4().hex[:4]


models = ["gpt-3.5-turbo-1106"]

for model in models:
    print()
    eval_config = get_eval_config(output_evaluation="qa_math")
    agent_factory = agents.OpenAIAgentFactory(task, model=model)
    test_run = run_without_langsmith(
        # This will clone the dataset locally if not already there
        path_or_token_id=task.dataset_id,
        llm_or_chain_factory=agent_factory,
        evaluation=eval_config,
        verbose=True,
    )

Experiment Results:

	inputs.question	outputs.input	outputs.output	outputs.intermediate_steps	feedback.Intermediate steps correctness	feedback.# steps / # expected steps	feedback.correctness	error	execution_time
count	10	10	10	10	10.0	10.0	10.0	0	10.000000
unique	10	10	1	1	NaN	NaN	NaN	0	NaN
top	multiply the result of (log of 100 to base 10)...	multiply the result of (log of 100 to base 10)...		[]	NaN	NaN	NaN	NaN	NaN
freq	1	1	10	10	NaN	NaN	NaN	NaN	NaN
mean	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	1.453172
std	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	0.496547
min	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	0.763208
25%	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	0.963885
50%	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	1.593439
75%	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	1.870549
max	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	1.957470

# You can interact with the object directly or as a flattened dataframe
df = test_run.to_dataframe()
df.head()

	inputs.question	outputs.input	outputs.intermediate_steps	error	execution_time
27c44572-6c67-4129-a95a-fe1509c350be	multiply the result of (log of 100 to base 10)...	multiply the result of (log of 100 to base 10)...	[]	None	0.763208
2a20a13d-050e-4a16-84ff-22d9582f1449	after calculating the sin of 1.5 radians, divi...	after calculating the sin of 1.5 radians, divi...	[]	None	1.413695
67867526-791a-452f-b534-ef2c1f5efd20	ecoli divides every 20 minutes. How many cells...	ecoli divides every 20 minutes. How many cells...	[]	None	1.773183
4ac33c1a-62f0-4da4-9455-07b582f6ff52	calculate 101 to the power of 0.5 to 4 digits ...	calculate 101 to the power of 0.5 to 4 digits ...	[]	None	1.819677
2e82a924-8382-425e-8738-daa2d912e9fe	convert 15 degrees to radians	convert 15 degrees to radians	[]	None	1.957470

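# Note: index=False omits the DataFrame index (the example IDs shown above).
# Drop the argument if you want to keep them in the CSV.
df.to_csv("output.csv", index=False)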
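
Because nothing is stored in LangSmith, the saved CSV is the only record of the run. As a rough sketch of how it might be read back later, the function below summarizes the `execution_time` column using only the standard library. It assumes the file has an `execution_time` header as in the table above; `summarize_execution_time` is a hypothetical name, not something provided by `langchain_benchmarks`.

```python
import csv
import statistics


def summarize_execution_time(lines):
    """Summarize the execution_time column of a results CSV.

    `lines` is any iterable of CSV text lines, such as an open file.
    Hypothetical helper for illustration; not part of langchain_benchmarks.
    """
    times = [
        float(row["execution_time"])
        for row in csv.DictReader(lines)
        if row.get("execution_time")  # skip rows with a blank or missing value
    ]
    if not times:
        return {"count": 0, "mean": None, "max": None}
    return {
        "count": len(times),
        "mean": statistics.mean(times),
        "max": max(times),
    }
```

To use it on the file written above, open it with `open("output.csv", newline="")` and pass the file object to `summarize_execution_time`.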
