Typewriter: 26 Tools

This is a variation of the typewriter task in which the agent has access to 26 parameterless tools.

Each tool represents a letter of the alphabet (e.g., ‘a’, ‘b’, ‘c’).

The agent can use each tool to “print” the corresponding letter on a piece of virtual paper.

The objective for the agent is to “print” the user’s input on the paper exactly.


For this code to work, please configure LangSmith environment variables with your credentials.

import os

os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your LangSmith API key.
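If you would rather not hardcode the key in the notebook, one option is to prompt for it only when it is missing. The following is a minimal sketch, not part of langchain_benchmarks: the helper name `ensure_env` and its `env` and `prompt` parameters are hypothetical and exist only for illustration.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_env(
    name: str,
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return env[name], prompting for a value only when it is missing or empty.

    `ensure_env` is a hypothetical helper, not a LangSmith or LangChain API.
    """
    if not env.get(name):
        env[name] = prompt(f"Enter {name}: ")
    return env[name]


# Typical use in a notebook (prompts without echoing the input):
# ensure_env("LANGCHAIN_API_KEY")
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment or waiting for keyboard input.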
from langchain_benchmarks import registry
task = registry["Tool Usage - Typewriter (26 tools)"]
task
Name Tool Usage - Typewriter (26 tools)
Type ToolUsageTask
Dataset ID 128af05e-aa00-4e3b-a958-d166dd450581
Description Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument.

Let’s build an agent that we can use for evaluation.

The Environment

The environment consists of 26 tools and a virtual paper.

Each tool is responsible for printing a letter on the paper that corresponds to it.

env = task.create_environment()
env.tools[:5]
[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchema'>, func=<function _create_typing_func.<locals>.func at 0x75aa9defc180>),
 StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchema'>, func=<function _create_typing_func.<locals>.func at 0x75aa9defc220>),
 StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchema'>, func=<function _create_typing_func.<locals>.func at 0x75aa9defc2c0>),
 StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchema'>, func=<function _create_typing_func.<locals>.func at 0x75aa9defc360>),
 StructuredTool(name='e', description='e() -> str - Run to Type the letter "e".', args_schema=<class 'pydantic.v1.main.eSchema'>, func=<function _create_typing_func.<locals>.func at 0x75aa9defc400>)]
env.tools[0].invoke({})
'OK'
env.tools[3].invoke({})
'OK'
env.read_state()
'ad'
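To make the behaviour above concrete, here is a toy re-creation of the idea in plain Python. It is a sketch under stated assumptions, not the library's implementation: `Paper`, `make_typing_func`, and `create_toy_environment` are hypothetical names, and the real environment wraps each function in a StructuredTool rather than exposing bare callables.

```python
import string
from typing import Callable, Dict, Tuple


class Paper:
    """Hypothetical stand-in for the virtual paper: holds the text typed so far."""

    def __init__(self) -> None:
        self.text = ""


def make_typing_func(paper: Paper, letter: str) -> Callable[[], str]:
    """Return a parameterless function that types `letter` onto `paper`."""

    def func() -> str:
        paper.text += letter  # Append the letter to the shared paper.
        return "OK"  # Mirror the 'OK' observation shown in the outputs above.

    return func


def create_toy_environment() -> Tuple[Dict[str, Callable[[], str]], Paper]:
    """Build 26 parameterless tools, one per lowercase letter, sharing one paper."""
    paper = Paper()
    tools = {
        letter: make_typing_func(paper, letter)
        for letter in string.ascii_lowercase
    }
    return tools, paper


tools, paper = create_toy_environment()
tools["a"]()
tools["d"]()
print(paper.text)  # ad
```

Each tool is created by a factory function so that the letter is bound when the tool is built; all 26 tools share the same paper, which is why invoking 'a' and then 'd' leaves 'ad' as the state.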

Explore the task

For evaluation, we need an agent factory that will create a new instance of an agent executor for every evaluation run.

We’ll use the StandardAgentFactory. See the intro for more information about what it does and how to create a custom one.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI

from langchain_benchmarks.tool_usage.agents import StandardAgentFactory

model = ChatOpenAI(temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "{instructions}"),  # Populated from task.instructions automatically
        ("human", "{question}"),  # Populated from the test data
        (
            "placeholder",
            "{agent_scratchpad}",
        ),  # Workspace where the agent can do its work (e.g., call multiple tools)
    ]
)

agent_factory = StandardAgentFactory(task, model, prompt)
from langchain.globals import set_verbose  # Avoids shadowing the built-in globals()

set_verbose(True)

agent = agent_factory()
agent.invoke({"question": "abc"})
> Entering new AgentExecutor chain...

Invoking: `a` with `{}`


OK
Invoking: `b` with `{}`


OK
Invoking: `c` with `{}`


OKabcabcabc

> Finished chain.
{'question': 'abc',
 'output': 'abcabcabc',
 'intermediate_steps': [(ToolAgentAction(tool='a', tool_input={}, log='\nInvoking: `a` with `{}`\n\n\n', message_log=[AIMessageChunk(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'function': {'arguments': '{}', 'name': 'a'}, 'type': 'function'}, {'index': 1, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'function': {'arguments': '{}', 'name': 'b'}, 'type': 'function'}, {'index': 2, 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'function': {'arguments': '{}', 'name': 'c'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls'}, id='run-9a1af767-29e4-4759-ab28-5b29236e8f22', tool_calls=[{'name': 'a', 'args': {}, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI'}, {'name': 'b', 'args': {}, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW'}, {'name': 'c', 'args': {}, 'id': 'call_MRAOAgbi8vT445clqC8OybMR'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'index': 2}])], tool_call_id='call_OrpjShN5uNzw2Rsb1tWF6swI'),
   'OK'),
  (ToolAgentAction(tool='b', tool_input={}, log='\nInvoking: `b` with `{}`\n\n\n', message_log=[AIMessageChunk(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'function': {'arguments': '{}', 'name': 'a'}, 'type': 'function'}, {'index': 1, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'function': {'arguments': '{}', 'name': 'b'}, 'type': 'function'}, {'index': 2, 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'function': {'arguments': '{}', 'name': 'c'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls'}, id='run-9a1af767-29e4-4759-ab28-5b29236e8f22', tool_calls=[{'name': 'a', 'args': {}, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI'}, {'name': 'b', 'args': {}, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW'}, {'name': 'c', 'args': {}, 'id': 'call_MRAOAgbi8vT445clqC8OybMR'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'index': 2}])], tool_call_id='call_2XO5RNgt9FjGvTXztgD0tKqW'),
   'OK'),
  (ToolAgentAction(tool='c', tool_input={}, log='\nInvoking: `c` with `{}`\n\n\n', message_log=[AIMessageChunk(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'function': {'arguments': '{}', 'name': 'a'}, 'type': 'function'}, {'index': 1, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'function': {'arguments': '{}', 'name': 'b'}, 'type': 'function'}, {'index': 2, 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'function': {'arguments': '{}', 'name': 'c'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls'}, id='run-9a1af767-29e4-4759-ab28-5b29236e8f22', tool_calls=[{'name': 'a', 'args': {}, 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI'}, {'name': 'b', 'args': {}, 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW'}, {'name': 'c', 'args': {}, 'id': 'call_MRAOAgbi8vT445clqC8OybMR'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'call_OrpjShN5uNzw2Rsb1tWF6swI', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'call_2XO5RNgt9FjGvTXztgD0tKqW', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'call_MRAOAgbi8vT445clqC8OybMR', 'index': 2}])], tool_call_id='call_MRAOAgbi8vT445clqC8OybMR'),
   'OK')],
 'state': 'abc'}
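A result like the one above can be inspected by comparing the order of tool calls and the final state against the input string. The sketch below assumes only the dictionary shape shown in the output (`intermediate_steps` as a list of (action, observation) pairs whose actions expose a `tool` attribute, plus a `state` key). `FakeAction`, `tools_called`, and `check_run` are hypothetical names for illustration and are not the evaluators that langchain_benchmarks uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class FakeAction:
    """Hypothetical stand-in for an agent action; only `.tool` is read."""

    tool: str


def tools_called(intermediate_steps: Sequence[Tuple[Any, Any]]) -> List[str]:
    """Extract tool names, in call order, from (action, observation) pairs."""
    return [action.tool for action, _observation in intermediate_steps]


def check_run(question: str, result: Dict[str, Any]) -> Dict[str, bool]:
    """Compare a run against the string the agent was asked to type."""
    expected = list(question)  # 'abc' -> ['a', 'b', 'c']
    return {
        "trajectory_matches": tools_called(result["intermediate_steps"]) == expected,
        "state_matches": result["state"] == question,
    }


# A hand-built result with the same shape as the output above.
result = {
    "question": "abc",
    "output": "abcabcabc",
    "intermediate_steps": [
        (FakeAction("a"), "OK"),
        (FakeAction("b"), "OK"),
        (FakeAction("c"), "OK"),
    ],
    "state": "abc",
}
print(check_run("abc", result))
# {'trajectory_matches': True, 'state_matches': True}
```

Note that this check looks at the tool calls and the paper's state rather than the agent's text `output`, which in the run above ('abcabcabc') differs from what was actually typed ('abc').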

Benchmarking

See introduction and benchmark all for information on how to run benchmarks. This notebook is only here to explain and explore the task.