🤖Agent Evaluation Tutorial
This guide shows how to evaluate AI agent trajectories using gllm-evals. Agent evaluation is performed with the AgentEvaluator, which combines tool correctness assessment with generation quality evaluation and provides flexible configuration options. Results can also be monitored via Langfuse; for more details about Langfuse, check out Langfuse Experiment Tracker.
Prerequisites
Before you can start evaluating an AI agent, prepare the following:
Install the Required Libraries
pip install gllm-evals[deepeval,langchain]
Setup Environment and Configuration
# OpenAI API Key & Google API Key for evaluation models
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
Prepare Dataset
Input Fields
query
generated_response
expected_response
tools_called (optional)
expected_tools (optional)
agent_trajectory (optional)
expected_agent_trajectory (optional)
Example Dataset Structure (with Tool Calls)
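Below is a minimal sketch of a record with tool calls, using the input fields listed above. The record layout (a Python list of dicts) and the name/args/output keys inside each tool call are illustrative assumptions; adapt them to your dataset format.

# Illustrative layout (assumption): a list of records keyed by the input fields above.
dataset_with_tool_calls = [
    {
        "query": "What is the weather in Jakarta today?",
        "generated_response": "It is currently 31°C and partly cloudy in Jakarta.",
        "expected_response": "Today's weather in Jakarta is around 31°C and partly cloudy.",
        "tools_called": [
            {"name": "get_weather", "args": {"city": "Jakarta"}, "output": "31°C, partly cloudy"},
        ],
        "expected_tools": [
            {"name": "get_weather", "args": {"city": "Jakarta"}, "output": "31°C, partly cloudy"},
        ],
    },
]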
Example Dataset Structure (with Agent Trajectory)
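And a similar sketch for a record with trajectories. The message-style steps shown here are an assumption loosely modeled on LangChain's message format; use whatever trajectory representation your agent produces.

# Illustrative layout (assumption): trajectories as lists of message-style steps.
dataset_with_trajectory = [
    {
        "query": "What is the weather in Jakarta today?",
        "generated_response": "It is currently 31°C and partly cloudy in Jakarta.",
        "expected_response": "Today's weather in Jakarta is around 31°C and partly cloudy.",
        "agent_trajectory": [
            {"role": "user", "content": "What is the weather in Jakarta today?"},
            {"role": "assistant", "tool_calls": [{"name": "get_weather", "args": {"city": "Jakarta"}}]},
            {"role": "tool", "content": "31°C, partly cloudy"},
            {"role": "assistant", "content": "It is currently 31°C and partly cloudy in Jakarta."},
        ],
        "expected_agent_trajectory": [
            {"role": "user", "content": "What is the weather in Jakarta today?"},
            {"role": "assistant", "tool_calls": [{"name": "get_weather", "args": {"city": "Jakarta"}}]},
            {"role": "tool", "content": "31°C, partly cloudy"},
            {"role": "assistant", "content": "Today's weather in Jakarta is around 31°C and partly cloudy."},
        ],
    },
]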
Evaluating Agent
Configure the AgentEvaluator
By default, AgentEvaluator already constructs a DeepEvalToolCorrectnessMetric and a GEvalGenerationEvaluator for you.
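A minimal construction sketch is shown below. The AgentEvaluator class comes from this guide, but the import path and the constructor keywords used here are assumptions; check the gllm-evals API reference for the exact signature.

import os

# Import path is an assumption; adjust to the actual gllm-evals module layout.
from gllm_evals import AgentEvaluator

# With no metric overrides, AgentEvaluator constructs a DeepEvalToolCorrectnessMetric
# and a GEvalGenerationEvaluator by default.
evaluator = AgentEvaluator(
    model="gpt-4.1",                                 # assumed constructor keyword
    model_credentials=os.environ["OPENAI_API_KEY"],  # assumed constructor keyword
)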
Load and Prepare Your Dataset
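A loading sketch, assuming the dataset is stored as a JSON file shaped like the examples above. gllm-evals may provide its own dataset loader, so treat this plain-JSON approach as one possible option.

import json

# Assumption: the dataset is a JSON list of records with the input fields listed earlier.
with open("agent_eval_dataset.json") as f:
    dataset = json.load(f)

# Optional sanity check on the required fields.
required_fields = {"query", "generated_response", "expected_response"}
for record in dataset:
    missing = required_fields - record.keys()
    if missing:
        raise ValueError(f"Record is missing required fields: {missing}")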
Run the Evaluation
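A run sketch, continuing from the evaluator and dataset above. The evaluate method name and the shape of the results are assumptions; the actual API may differ (for example, it may be asynchronous).

# Method name and result shape are assumptions; consult the AgentEvaluator reference.
results = evaluator.evaluate(dataset)

# Inspect per-record scores and explanations.
for result in results:
    print(result)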
Using Evaluate Helper Function
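If you prefer a one-call workflow, a sketch of the evaluate helper is shown below. The helper is referenced by this guide, but its import path and keyword arguments here are assumptions.

# Import path and signature are assumptions; check the gllm-evals reference for the helper.
from gllm_evals import evaluate

results = evaluate(
    dataset=dataset,          # assumed keyword
    evaluators=[evaluator],   # assumed keyword
)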
Customizing Tool Correctness Parameters
DeepEvalToolCorrectnessMetric supports various parameters to configure the metric's behavior. The configurable parameters are listed below, followed by a configuration sketch:
threshold (float): passing threshold between 0 and 1 that classifies tool calls as good/bad. Defaults to 0.5
model (str): Model used for evaluation. This model is only used if available_tools is provided.
model_credentials (str): API key for the evaluation model
available_tools (list[dict], optional): list of tool schemas/definitions that the evaluated agent is allowed to call
strict_mode (bool): If True, scores are returned as 0 or 1. Defaults to False
should_exact_match (bool): If True, each tool call in the actual and reference must match exactly in tool name, arguments, and output. Defaults to False
should_consider_ordering (bool): If True, the ordering of the tool calls is considered in the evaluation. Defaults to False
evaluation_params (list[str]): The parts of a tool call to evaluate. Defaults to evaluating the tool call's input arguments (args) and output (output). These are only evaluated if the data is present.
include_reason (bool): Include an explanation in the scoring result. Defaults to True
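The sketch below wires several of these parameters into the metric and passes it to AgentEvaluator. The import paths and the metrics keyword used to override the default metric are assumptions; check the API reference for the exact wiring.

# Import paths are assumptions; adjust to the actual gllm-evals module layout.
from gllm_evals import AgentEvaluator, DeepEvalToolCorrectnessMetric

tool_correctness = DeepEvalToolCorrectnessMetric(
    threshold=0.7,                        # stricter passing threshold
    strict_mode=False,
    should_exact_match=False,
    should_consider_ordering=True,        # penalize out-of-order tool calls
    evaluation_params=["args", "output"],
    include_reason=True,
)

# Assumption: the customized metric replaces the default via a constructor keyword.
evaluator = AgentEvaluator(metrics=[tool_correctness])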
For more details about DeepEvalToolCorrectnessMetric, see here.
Using Available Tools for Tool Correctness
By default, the tool correctness metric evaluates whether the agent called the right tools by comparing them to the reference. Providing available_tools context, however, significantly improves evaluation accuracy by letting an LLM judge whether the tools the agent called are the best fit among those available to it.
Why Provide Available Tools?
Without available_tools, the evaluator can only assess if the called tools match the expected tools. With available_tools, the evaluator can also judge:
Whether the agent selected the most appropriate tool from the available options
Whether the agent missed better tool alternatives
Tool selection quality, with context-aware reasoning
Tool Schema
A tool schema is a dictionary that defines a tool available to the agent. Each tool schema should include at least a name, a description, and the parameters the tool accepts.
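For example, a minimal weather tool schema might look like the sketch below. The name, description, and parameters fields follow the requirement above; the JSON-Schema-style parameters layout is a common convention but is an assumption here.

# Minimal tool schema: name, description, and the parameters the tool accepts.
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Jakarta"},
        },
        "required": ["city"],
    },
}

available_tools = [get_weather_schema]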
How to use
To use tool schemas as available_tools, you only need to load the tool schemas and pass them to the available_tools parameter of DeepEvalToolCorrectnessMetric.
DeepEvalToolCorrectnessMetric computes a tool selection score against available_tools and a comparison score between the actual tool calls and the reference. The final result returned is the lower of the two scores.
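A sketch of feeding the schemas into the metric is shown below. available_tools, model, and model_credentials are documented parameters of the metric; the import path is an assumption.

import os

# Import path is an assumption; adjust to the actual gllm-evals module layout.
from gllm_evals import DeepEvalToolCorrectnessMetric

tool_correctness = DeepEvalToolCorrectnessMetric(
    model="gpt-4.1",                                 # only used because available_tools is set
    model_credentials=os.environ["OPENAI_API_KEY"],
    available_tools=available_tools,                 # tool schemas from the previous section
)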
Enabling the LangChain Agent Trajectory Evaluator
The trajectory accuracy metric evaluates the agent's full trajectory using LangChain's agentevals approach. It is disabled by default and only runs when a LangChainAgentTrajectoryAccuracyMetric is provided to the AgentEvaluator together with agent_trajectory data. Using LangChainAgentTrajectoryAccuracyMetric requires you to provide both agent_trajectory and expected_agent_trajectory.
The agent trajectory evaluator does not affect the final AgentEvaluator score; it is used purely to evaluate the trajectory.
Using LangChainAgentTrajectoryAccuracyMetric may be costly, as it compares the full trajectory against the reference trajectory, both of which are relatively long.
LangChainAgentTrajectoryAccuracyMetric can be configured via its constructor with the following parameters (see the sketch after this list):
model (str): Model used for evaluation. The currently recommended model for the agent trajectory evaluator is gpt-4.1
model_credentials (str): API key for the model provided
use_reference (bool): If True, the agent trajectory is compared against the expected agent trajectory. If False, the evaluation does not use the expected agent trajectory. Defaults to True
continuous (bool): If True, the score is returned as a float between 0 and 1. Defaults to False
use_reasoning (bool): If True, an explanation is included in the output
few_shot_examples (list[FewShotExample], optional): list of few-shot examples provided as context
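The sketch below configures the trajectory metric and attaches it to AgentEvaluator. The constructor parameters match the list above, while the import paths and the keyword used to attach the metric are assumptions.

import os

# Import paths are assumptions; adjust to the actual gllm-evals module layout.
from gllm_evals import AgentEvaluator, LangChainAgentTrajectoryAccuracyMetric

trajectory_metric = LangChainAgentTrajectoryAccuracyMetric(
    model="gpt-4.1",                                 # recommended model for this evaluator
    model_credentials=os.environ["OPENAI_API_KEY"],
    use_reference=True,                              # compare against expected_agent_trajectory
    continuous=True,                                 # return a float score between 0 and 1
    use_reasoning=True,                              # include an explanation in the output
)

# Assumption: the trajectory metric is attached via a constructor keyword. It does not
# affect the final AgentEvaluator score (see the note above).
evaluator = AgentEvaluator(trajectory_metric=trajectory_metric)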
For more details on LangChainAgentTrajectoryAccuracyMetric, please see here.