Running Locally
The LangChain benchmarks package is best used with LangSmith. You can create a free account here and read the docs here.
If you cannot create an account, you can still run the benchmarks locally without one.
The example below walks through a full run.
# Remove every LANGCHAIN_* environment variable to show that
# the benchmarks run without any LangSmith configuration
import os
_ = [
    os.environ.pop(key)
    for key in list(os.environ.keys())
    if key.startswith("LANGCHAIN_")
]
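If you want to restore your LangSmith settings after a local run, a variation on the snippet above can keep the removed values. This is a minimal sketch, not part of the package: `clear_prefixed_env` and `restore_env` are hypothetical helper names introduced here for illustration.

```python
import os


def clear_prefixed_env(prefix: str = "LANGCHAIN_") -> dict:
    """Remove every environment variable whose name starts with `prefix`.

    Returns the removed name/value pairs so they can be restored later.
    (Hypothetical helper, not part of langchain_benchmarks.)
    """
    removed = {}
    for key in list(os.environ):
        if key.startswith(prefix):
            removed[key] = os.environ.pop(key)
    return removed


def restore_env(saved: dict) -> None:
    """Put previously removed variables back into the environment."""
    os.environ.update(saved)
```

Calling `saved = clear_prefixed_env()` before the benchmark and `restore_env(saved)` afterwards leaves the rest of the session unchanged.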
from langchain_benchmarks import registry
task = registry["Multiverse Math"]
task
Name | Multiverse Math |
Type | ToolUsageTask |
Dataset ID | 594f9f60-30a0-49bf-b075-f44beabf546a |
Description | An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. |
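To make the description concrete, here is a toy version of the altered multiplication it mentions (5*3 re-interpreted as 5*3*1.1). It is only a sketch based on that one sentence: the function name `multiply` is an assumption for illustration, and the task's actual tool definitions are not shown on this page.

```python
def multiply(a: float, b: float) -> float:
    """Toy 'multiverse' multiplication: the ordinary product scaled by 1.1."""
    return 1.1 * a * b


# An agent relying on ordinary arithmetic would answer 15 for 5 * 3.
ordinary = 5 * 3

# An agent that calls the tool gets the altered result instead.
altered = multiply(5, 3)

print(f"ordinary: {ordinary}, multiverse: {altered}")
```

Because the extra factor is applied symmetrically, swapping the arguments gives the same result, which matches the claim that the altered operations stay commutative.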
Eval
Let’s evaluate an agent now. Nothing will be saved to LangSmith, so be sure to save the test results to your file system if you want to use them later.
import uuid
from langchain_benchmarks.tool_usage import agents, get_eval_config
from langchain_benchmarks.utils import run_without_langsmith
experiment_uuid = uuid.uuid4().hex[:4]
models = ["gpt-3.5-turbo-1106"]
for model in models:
    print()
    eval_config = get_eval_config(output_evaluation="qa_math")
    agent_factory = agents.OpenAIAgentFactory(task, model=model)
    test_run = run_without_langsmith(
        # This will clone the dataset locally if not already there
        path_or_token_id=task.dataset_id,
        llm_or_chain_factory=agent_factory,
        evaluation=eval_config,
        verbose=True,
    )
Experiment Results:
| | inputs.question | outputs.input | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time |
---|---|---|---|---|---|---|---|---|---|
count | 10 | 10 | 10 | 10 | 10.0 | 10.0 | 10.0 | 0 | 10.000000 |
unique | 10 | 10 | 1 | 1 | NaN | NaN | NaN | 0 | NaN |
top | multiply the result of (log of 100 to base 10)... | multiply the result of (log of 100 to base 10)... | | [] | NaN | NaN | NaN | NaN | NaN |
freq | 1 | 1 | 10 | 10 | NaN | NaN | NaN | NaN | NaN |
mean | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.453172 |
std | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.496547 |
min | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.763208 |
25% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.963885 |
50% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.593439 |
75% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.870549 |
max | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.957470 |
# You can interact with the object directly or as a flattened dataframe
df = test_run.to_dataframe()
df.head()
| | inputs.question | outputs.input | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time |
---|---|---|---|---|---|---|---|---|---|
27c44572-6c67-4129-a95a-fe1509c350be | multiply the result of (log of 100 to base 10)... | multiply the result of (log of 100 to base 10)... | | [] | 0 | 0.0 | 0 | None | 0.763208 |
2a20a13d-050e-4a16-84ff-22d9582f1449 | after calculating the sin of 1.5 radians, divi... | after calculating the sin of 1.5 radians, divi... | | [] | 0 | 0.0 | 0 | None | 1.413695 |
67867526-791a-452f-b534-ef2c1f5efd20 | ecoli divides every 20 minutes. How many cells... | ecoli divides every 20 minutes. How many cells... | | [] | 0 | 0.0 | 0 | None | 1.773183 |
4ac33c1a-62f0-4da4-9455-07b582f6ff52 | calculate 101 to the power of 0.5 to 4 digits ... | calculate 101 to the power of 0.5 to 4 digits ... | | [] | 0 | 0.0 | 0 | None | 1.819677 |
2e82a924-8382-425e-8738-daa2d912e9fe | convert 15 degrees to radians | convert 15 degrees to radians | | [] | 0 | 0.0 | 0 | None | 1.957470 |
df.to_csv("output.csv", index=False)
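Since nothing is stored in LangSmith, the saved CSV is the only record of the run. The following sketch shows one way to read it back later and average a feedback column using only the standard library. The helper name `mean_of_column` is hypothetical, and the column name in the usage note is taken from the table above; adjust it if your output differs.

```python
import csv


def mean_of_column(path: str, column: str) -> float:
    """Average a numeric column of a CSV file, skipping blank cells.

    (Hypothetical helper for illustration, not part of langchain_benchmarks.)
    """
    values = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            cell = row.get(column)
            # Skip cells that are missing or empty rather than failing on them
            if cell is None or cell == "":
                continue
            values.append(float(cell))
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return sum(values) / len(values)
```

For example, `mean_of_column("output.csv", "feedback.correctness")` would report the average correctness score of the saved run. If pandas is already installed, `pandas.read_csv` followed by `.mean()` on the column achieves the same thing.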