Introduction#
Tool Usage tasks are designed to evaluate how well an agent can use tools to accomplish an objective.
Each task defines an environment in which the agent operates. The environment consists of a set of tools and a way to read the state of the environment (more on that below).
The tasks allow you to stress test the agent in different ways:
Can the agent use a single tool effectively?
Can the agent use more than 10 tools effectively?
Can the agent correctly incorporate information returned by the tool (and ignore internal knowledge)?
To help in this evaluation, each task is associated with a LangSmith dataset that includes input/output examples of varying difficulties.
Schema#
To make it possible to evaluate different agent implementations, we use a standardized schema. We’ll illustrate it with the following example taken from a tool usage task.
Dataset#
Each task corresponds to a LangSmith dataset with the following schema:
Inputs:
name | type | meaning |
---|---|---|
question | str | the user question |
Outputs:
name | type | meaning |
---|---|---|
reference | str | the expected answer |
expected_steps | List[str] | the list of tools that should be invoked |
order_matters | bool | whether the tools must be invoked in the specified order |
state | Optional[Any] | the state of the system after the agent has taken its actions |
Here’s an example that contains the following keys/values:
{
  "input": {"question": "weather in LA right now?"},
  "output": {
    "reference": "Sunny, Temperature: 75°F",
    "order_matters": true,
    "expected_steps": [
      "find_locations_by_name",
      "get_current_weather_for_location"
    ]
  }
}
Agent#
To work with the evaluators provided by LangChain Benchmarks (of course, you’re free to write your own evaluators!), an agent must accept question as an input and return:
{
    "output": "It's super sunny. Like 75F", // the output from the agent
    "intermediate_steps": [... "find_locations_by_name" ...], // list of the intermediate steps taken by the agent (see format in LangChain)
    "state": ..., // can be anything; the state of the environment after the agent has taken all of its actions (optional key)
}
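For concreteness, here is a hypothetical stub that returns a dictionary of this shape. The names and values are made up for illustration only; a real agent’s intermediate_steps contains the (AgentAction, observation) tuples produced by its run.
# Hypothetical stub, only to illustrate the expected output shape.
# A real agent's "intermediate_steps" holds (AgentAction, observation) tuples
# (e.g., from an AgentExecutor created with return_intermediate_steps=True).
def stub_agent(inputs: dict) -> dict:
    question = inputs["question"]  # e.g. "weather in LA right now?"
    return {
        "output": f"Answering {question!r}: it's sunny, about 75F.",
        "intermediate_steps": [],  # would normally list the steps taken
        # "state" is optional; include it if your environment exposes state.
    }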
Tasks#
You can check an up-to-date list of tool usage tasks in the registry:
from langchain_benchmarks import registry
registry.filter(Type="ToolUsageTask")
Name | Type | Dataset ID | Description |
---|---|---|---|
Tool Usage - Typewriter (1 tool) | ToolUsageTask | 59577193-8938-4ccf-92a7-e8a96bcf4f86 | Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. |
Tool Usage - Typewriter (26 tools) | ToolUsageTask | 128af05e-aa00-4e3b-a958-d166dd450581 | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |
Tool Usage - Relational Data | ToolUsageTask | 1d89f4b3-5f73-48cf-a127-2fdeb22f6d84 | Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently. |
Multiverse Math | ToolUsageTask | 47ed57bc-e852-4f84-a23e-cce4793864e9 | An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. This task is associated with 20 test examples. |
Let’s understand what a tool usage task is in a bit more detail.
task = registry["Tool Usage - Typewriter (26 tools)"]
task
Name | Tool Usage - Typewriter (26 tools) |
Type | ToolUsageTask |
Dataset ID | 128af05e-aa00-4e3b-a958-d166dd450581 |
Description | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |
Tool usage tasks are associated with an environment:
@dataclasses.dataclass(frozen=True)
class ToolUsageEnvironment:
    """An instance of an environment for tool usage."""

    tools: List[BaseTool]
    """The tools that can be used in the environment."""

    read_state: Optional[Callable[[], Any]] = None
    """A function that returns the current state of the environment."""
Here, we’ll dig into the typewriter task a bit to explain what the environment state represents.
The typewriter task has 26 tools, each of which prints a letter on a piece of virtual paper:
env = task.create_environment()
env.tools[:4]
[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62c9a0>),
StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62c5e0>),
StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62cae0>),
StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62cb80>)]
env.tools[0].invoke({}) # Invoke a()
env.tools[0].invoke({}) # invoke a()
env.tools[2].invoke({}) # invoke c()
'OK'
env.read_state() # Shows the content of the virtual paper
'aac'
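Note that this state belongs to that particular environment instance. Creating a new environment should start from a blank sheet of paper (a quick check, assuming the typewriter’s paper starts out empty):
fresh_env = task.create_environment()
fresh_env.read_state()  # expected to be '' – a fresh environment has a blank paper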
Create an Agent!#
Now that you know how the test environment works, let’s create an agent that we can test!
Because an agent interacts with the environment via tools and can change the state of the environment during the course of an agent run, what we actually want is the ability to create a fresh agent and a fresh environment for each test run.
We’ll do this using a factory. A factory is just a fancy computer-science name for an object that can create other objects. In this case, we’ll have an Agent Factory that we can call, and it’ll create a fresh agent for us on each call.
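Conceptually, a factory is nothing more than a callable that builds a fresh environment and a fresh agent every time it is invoked. Here is a hypothetical minimal sketch (build_agent_from_tools is a placeholder for whatever agent constructor you use; the StandardAgentFactory introduced next does this wiring for you):
# A hypothetical, minimal agent factory (for illustration only).
class MinimalAgentFactory:
    def __init__(self, task):
        self.task = task

    def __call__(self):
        # Every call builds a fresh environment, so state from a previous run
        # (e.g., the typewriter's paper) cannot leak into the next one.
        env = self.task.create_environment()
        return build_agent_from_tools(env.tools)  # placeholder for your agent constructor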
We’ll use the StandardAgentFactory, which under the hood creates a standard LangChain tool-calling agent. It can be used with any chat model that supports tool calling.
from langchain_anthropic.chat_models import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_benchmarks.tool_usage.agents import StandardAgentFactory
model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "{instructions}"),  # Populated from task.instructions automatically
        ("human", "{question}"),  # Each evaluation example is associated with a question
        ("placeholder", "{agent_scratchpad}"),  # Space for the agent to do work
    ]
)
agent_factory = StandardAgentFactory(task, model, prompt)
Here are the instructions for the task:
task.instructions
"Repeat the given string by using the provided tools. Do not write anything else or provide any explanations. For example, if the string is 'abc', you must invoke the tools 'a', 'b', and 'c' in that order. Please invoke the functions without any arguments."
Let’s test it out:
from langchain import globals
globals.set_verbose(True)
agent = agent_factory()
agent.invoke({"question": "abc"})
globals.set_verbose(False)
> Entering new AgentExecutor chain...
Invoking: `a` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]
OK
Invoking: `b` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]
OK
Invoking: `c` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]
OK[]
> Finished chain.
Benchmarking#
How does one evaluate an agent? Given a particular task and input, an agent uses tools to produce an output AND/OR change the state of the environment.
To evaluate an agent, we can check the following:
Did the agent use the expected tools?
Did the agent use the tools in the most effective way; e.g., was the order of tool invocation correct?
Did the environment end up in the correct final state after the agent used the tools? (e.g., does my calendar contain all the scheduled meetings?)
Did the agent output match the expected reference output?
Each task is associated with a standard evaluator that performs evaluation appropriate for the task; for example:
Compare the output to the reference using an LLM that grades the response.
Compare expected_steps to the list of tools in intermediate_steps – simple list equality (a naive sketch of this check is shown below).
Compare the state of the environment against the expected state (if present in both the dataset and the agent output).
Each task comes with its own task-specific evaluator:
eval_config = task.get_eval_config()
eval_config
RunEvalConfig(evaluators=[], custom_evaluators=[<langchain_benchmarks.tool_usage.evaluators.AgentTrajectoryEvaluator object at 0x7b3a9ea5b110>], batch_evaluators=None, reference_key=None, prediction_key=None, input_key=None, eval_llm=None)
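For intuition, the trajectory portion of that evaluator boils down to comparing the tool names the agent actually invoked against expected_steps. Below is a naive sketch of just that comparison (the built-in evaluator also handles order_matters, state comparison, and LLM grading of the final answer):
def tool_names_match(intermediate_steps, expected_steps, order_matters=True):
    """Naive trajectory check: did the agent call the expected tools?

    intermediate_steps is the list of (AgentAction, observation) tuples
    returned by the agent; only the tool names are compared here.
    """
    actual = [action.tool for action, _observation in intermediate_steps]
    if order_matters:
        return actual == expected_steps
    return sorted(actual) == sorted(expected_steps)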
Set up code to run against all tasks:
import datetime
import uuid
from langsmith.client import Client
from langchain_benchmarks import (
__version__,
clone_public_dataset,
model_registry,
registry,
)
from langchain_benchmarks.rate_limiting import RateLimiter
Create an experiment ID. We’ll use it to tag our runs, which we can later use to retrieve run data from LangSmith.
experiment_id = uuid.uuid4().hex[:]
Run evaluation against all tasks.
client = Client() # Launch langsmith client for cloning datasets
today = datetime.date.today().isoformat()
# You can use an optional rate limiter to rate limit your requests!
rate_limiter = RateLimiter(requests_per_second=1)
# Set up 2-tuples of (model name, model instance)
# You can update this list with any model that supports tool calling.
# See list here: https://python.langchain.com/docs/integrations/chat/
tests = [
    (
        "claude-3-haiku-20240307",
        ChatAnthropic(model="claude-3-haiku-20240307", temperature=0),
    )
]

for task in registry.tasks:
    if task.type != "ToolUsageTask":
        continue

    dataset_name = task.name + f" ({today})"
    clone_public_dataset(task.dataset_id, dataset_name=dataset_name)

    for model_name, model in tests:
        print()
        print(f"Benchmarking {task.name} with model: {model_name}")
        eval_config = task.get_eval_config()
        agent_factory = StandardAgentFactory(
            task, model, prompt, rate_limiter=rate_limiter
        )

        client.run_on_dataset(
            dataset_name=dataset_name,
            llm_or_chain_factory=agent_factory,
            evaluation=eval_config,
            verbose=False,
            project_name=f"{model_name}-{task.name}-{today}-{experiment_id}",
            concurrency_level=5,
            project_metadata={
                "model": model_name,
                "id": experiment_id,
                "task": task.name,
                "date": today,
                "langchain_benchmarks_version": __version__,
            },
        )
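Once the runs complete, you can use the experiment ID (embedded in the project names and metadata above) to pull run data back out of LangSmith. A rough sketch, assuming the project name constructed in the loop above (adjust the model and task names as needed, and filter the runs further if you only want the top-level traces):
# Rough sketch: list the runs of one benchmark project by its name.
# The name mirrors the f-string used in the loop above.
project_name = (
    f"claude-3-haiku-20240307-Tool Usage - Typewriter (26 tools)-{today}-{experiment_id}"
)

for run in client.list_runs(project_name=project_name):
    print(run.name, run.inputs, run.outputs)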
Advanced Usage#
The following sections demonstrate slightly more “advanced” usage if you want to completely customize the agent runtime in a way that is compatible with our test runner.
We’ll also apply an adapter to the agent that captures its inputs and outputs (e.g., adds information about the agent’s environment at the end of the run) so that we can evaluate it.
Custom Agent Factory#
If you want even more configurability beyond what the CustomRunnableAgentFactory provides, you can create your own AgentFactory using the following pattern.
The AgentExecutor should accept question as an input and include the fields output, intermediate_steps, and potentially state in its response – for this we will wrap the agent executor in an adapter (apply_agent_executor_adapter) that helps match the expected schema.
from typing import Optional

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

from langchain_benchmarks.rate_limiting import RateLimiter, with_rate_limit
from langchain_benchmarks.schema import ToolUsageTask
from langchain_benchmarks.tool_usage.agents import apply_agent_executor_adapter
class CustomAgentFactory:
    def __init__(
        self,
        task: ToolUsageTask,
        *,
        # It can be useful to add a rate limiter
        # which will limit the number of requests per second
        # when running evaluation.
        rate_limiter: Optional[RateLimiter] = None,
    ) -> None:
        self.task = task
        self.rate_limiter = rate_limiter

    def __call__(self):
        # This factory creates a new environment for every agent run.
        # The reason is that the environment may be associated with an
        # environment state (e.g., typewriter) which is changed by the
        # actions of the agent. At the end of the run, the environment
        # state will be read.
        env = self.task.create_environment()  # Create a new environment for every agent run!
        tools = env.tools
        model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.task.instructions),
                ("human", "{question}"),  # Populated from each dataset example
                ("placeholder", "{agent_scratchpad}"),
            ]
        )
        # This is the standard tool calling agent implementation.
        # Feel free to replace it with any other implementation you want!
        # https://python.langchain.com/docs/modules/agents/how_to/custom_agent/
        agent = create_tool_calling_agent(model, env.tools, prompt)

        if self.rate_limiter:
            agent = with_rate_limit(agent, self.rate_limiter)

        executor = AgentExecutor(
            agent=agent,
            tools=env.tools,
            handle_parsing_errors=True,
            return_intermediate_steps=True,
        )

        # Apply the adapter so that inputs and outputs match the dataset schema.
        # state_reader automatically adds the state of the environment at the end of the run.
        return apply_agent_executor_adapter(executor, state_reader=env.read_state)
task
Name | Tool Usage - Typewriter (26 tools) |
Type | ToolUsageTask |
Dataset ID | 128af05e-aa00-4e3b-a958-d166dd450581 |
Description | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |
custom_agent_factory = CustomAgentFactory(task)
agent = custom_agent_factory()
agent.invoke({"question": "abc"})
{'question': 'abc',
'output': [],
'intermediate_steps': [(ToolAgentAction(tool='a', tool_input={}, log='\nInvoking: `a` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_016f6CZwwFmdz2h8KbdGRVjj'),
'OK'),
(ToolAgentAction(tool='b', tool_input={}, log='\nInvoking: `b` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_01JvfeTpU3hEuS7PknFk5a8S'),
'OK'),
(ToolAgentAction(tool='c', tool_input={}, log='\nInvoking: `c` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_01NbBCY5Fg62RsyAAUd4n2g1'),
'OK')],
'state': 'abc'}