AgentEvaluator

Overview

Use when: You want to evaluate AI agent's overall performance, including tool usage and the quality of the agent’s outputs. This evaluator also uses GEvalGenerationEvaluator as the agent output quality evaluator. Additionally this evaluator uses DeepEval Tool Correctness metric for tool call evaluation.

Fields

query (str) — The question given by the user to the agent.
generated_response (str) — The agent's output to be evaluated.
expected_response (str, optional) — The reference or ground truth answer.
tools_called (list[dict[str, Any]], optional) — The list of actual tools called by the agent
expected_tools (list[dict[str, Any]], optional) — The list of the tools expected to be called by the as reference for comparison.
agent_trajectory (list[dict[str, Any]], optional) — The actual agent trajectory to be evaluated. If tools_called are not provided, the agent_trajectory will be parsed as tools_called
expected_agent_trajectory (list[dict[str, Any]], optional) — The reference trajectory for comparison. If expected_tools are not provided, the expected_agent_trajectory will be parsed as expected_tools.

Configuration Options

tool_correctness_metric (DeepEvalToolCorrectnessMetric): This configuration allows providing configured DeepEvalToolCorrectnessMetric that will be used to evaluate agent tool calls. If not provided, a default Tool Correctness Metric will be used.
generation_evaluator (GEvalGenerationEvaluator): This configuration allows configuring GEvalGenerationEvaluator that will be used to evaluate agent output quality. If not provided, a default GEvalGenerationEvaluator will be used.
trajectory_accuracy_metric (LangChainAgentTrajectoryAccuracyMetric): This configuration allows enabling agent trajectory evaluation. If not provided, this metric will not be used (disabled by default)

For more information about the metrics configuration, see Metric.

Output

AgentEvaluator outputs the scores of each metrics as individuals. Additionaly, several aggregated scores are also provided such as:

multiply_score : This score is a multiplication between the score of DeepEval's Tool Correctness Metric and score of GEvalGenerationEvaluators metric.
avg_score : This score is an average score between DeepEval's Tool Correctness Metric and score of GEvalGenerationEvaluators metric.

Example Usage

import asyncio
import json

from gllm_evals.dataset import load_simple_agent_tool_call_dataset
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator


async def main():
    """Main function."""
    data = load_simple_agent_tool_call_dataset()
    evaluator = AgentEvaluator()
    result = await evaluator.evaluate(data[0])
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(main())

Example Output

{
  "agent_evals": {
    "global_explanation": "The following metrics failed to meet expectations:\n1. Deepeval Tool Correctness is 0.00 (should be 0.50)",
    "multiply_score": 0.0,
    "avg_score": 0.5,
    "relevancy_rating": "bad",
    "possible_issues": [
      "Tool Call Issue"
    ],
    "deepeval_tool_correctness": {
      "score": 0.0,
      "explanation": "[\n\t Tool Calling Reason: Incomplete tool usage: missing tools [ToolCall(\n    name=\"data_checker\",\n    input_parameters={\n        \"query\": \"SELECT AVG(amount) as avg_sales FROM orders LIMIT 1\"\n    },\n    output='[{\"avg_sales\": 250.50}]'\n)]; expected ['data_checker'], called ['wrong_tool_name']. See more details above.\n\t Tool Selection Reason: No available tools were provided to assess tool selection criteria\n]\n"
    },
    "generation": {
      "global_explanation": "All metrics met the expected values.",
      "relevancy_rating": "good",
      "score": 1.0,
      "possible_issues": [],
      "binary_score": 1,
      "avg_score": 1.0,
      "completeness": {
        "score": 3,
        "explanation": "The response accurately identifies the average sales amount as $250.50, matching the numeric value in the expected output exactly. It fully addresses the question with no factual contradictions or omissions.",
        "success": true,
        "normalized_score": 1.0
      },
      "redundancy": {
        "score": 1,
        "explanation": "The response is concise and directly answers the question without any repetition. The key information regarding the average sales amount is stated only once, and the introductory phrase is a standard lead-in that does not count towards redundancy.",
        "normalized_score": 1.0
      },
      "language_consistency": {
        "score": 1,
        "explanation": "The instructional language of the input is English, and the actual output is also written in English, maintaining complete language consistency between the question and the response.",
        "success": true,
        "normalized_score": 1.0
      },
      "refusal_alignment": {
        "score": 1,
        "explanation": "is_refusal was detected from the expected response as not refusal, as it directly provides the requested average sales amount. The actual response also provides a direct answer without any refusal indicators or disclaimers. Both are not refusals, so the alignment is correct.",
        "success": true,
        "normalized_score": 1.0
      }
    }
  }
}

Tools Structure

The tools that are provided to the field tools_called and expected_tools follows the structure below:

name (str): The name of the tools, this will be used as the identifier of a tool
args (dict[str, Any], optional): The arguments/parameters that are accepted by the tools
output (str, optional): The output/result of the tool

Example:

[
  {
    "name": "search_tutorials",
    "args": {
      "topic": "Python"
    },
    "output": "[{\"id\": 1, \"title\": \"Python Basics\"}, {\"id\": 2, \"title\": \"Advanced Python\"}]"
  },
  {
    "name": "save_to_bookmarks",
    "args": {
      "items": [
        {"id": 1, "title": "Python Basics"},
        {"id": 2, "title": "Advanced Python"}
      ]
    },
    "output": "Successfully saved 2 items to bookmarks"
  }
]

Agent Trajectory Structure

The agent trajectory to be provided for agent_trajectory and expected_agent_trajectory are list of dictionaries that have several types of role:

user: The user that ask question
assistant: The agent that are responding/calling a tool
tool: the tools that are called

Each dictionary represent a chat message of each role:

role: the role of the message sender
content: the message sent
tool_calls: list of tools called. This is exclusive to role assistant
tool_call_id: the identifier of the tool call result. This should reflect the tool called by assistant and exclusive to role tool

Example:

[
  {
    "role": "user",
    "content": "What's the weather in Tokyo?"
  },
  {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "id": "call_4",
        "type": "function",
        "function": {
          "name": "weather_search",
          "arguments": "{\"location\": \"Tokyo\"}"
        }
      }
    ]
  },
  {
    "role": "tool",
    "tool_call_id": "call_4",
    "content": "{\"temperature\": 22, \"condition\": \"sunny\"}"
  },
  {
    "role": "assistant",
    "content": "It's 22 degrees and sunny in Tokyo."
  }
]

Customizing Tool Correctness Parameters

DeepevalToolCorrectnessMetric supports various parameters to configure the behavior of the metric. Below are the parameters that are configureable:

threshold (float): passing threshold between 0-1 that classify tool calls as good/bad. Defaults to 0.5
model (str): Model used for evaluation, this model will only be used if available_tools is provided.
model_credentials (str): API Key for the model used for evaluation
available_tools (list[dict], optional): list of tools schema/definition that are allowed to be called by the agent evaluated
strict_mode (bool): If True, scores return as 0 or 1. Default to False
should_exact_match (bool): If True, requires each tool call in actual and reference to be exact match in tool name, argument, and output. Defaults to False
should_consider_ordering (bool): If True, ordering of the tools will be considered in the evaluation. Defaults to False
evaluation_params (list[str]): The parameters in a tool call to be evaluated. Defaults to evalaute tool calls input parameter (args) and output (output). This will only be evaluated if the data is present.
include_reason (bool): Include explanation in the scoring result. Defaults to True

For more details about DeepEvalToolCorrectnessMetric, see here.

Using Available Tools for Tool Correctness

By default, the tool correctness metric evaluates whether the agent called the right tools by comparing to the reference. However, providing available_tools context significantly improves evaluation accuracy by evaluating with LLM if the tools provided to the agent are the most fit.

Why Provide Available Tools?

Without available_tools, the evaluator can only assess if the called tools match the expected tools. With available_tools, the evaluator can also judge:

Whether the agent selected the most appropriate tool from available options
If the agent missed better tool alternatives
Context-aware reasoning about tool selection

Tool Schema

Tool Schema are dictionary that defines tools that are available to use by the agent. Each tool schema should have at least name, description, parameters that are accepted.

[
  {
    "name": "calculator",
    "description": "Perform mathematical calculations and computations",
    "parameters": {
      "type": "object",
      "properties": {
        "expression": {
          "type": "string",
          "description": "Mathematical expression to evaluate"
        }
      },
      "required": ["expression"]
    }
  },
  {
    "name": "weather_search",
    "description": "Get weather information for a specific location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The location to get weather for"
        }
      },
      "required": ["location"]
    }
  }
]

How to use

To use tool schema as available_tools you only need to load the tool schemas and feed it to available_tools parameter on DeepEvalToolCorrectnessMetric.

from gllm_evals.constant import DefaultValues
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.metrics.agent.deepeval_tool_correctness import DeepEvalToolCorrectnessMetric
from gllm_evals.dataset.simple_agent_tool_call_dataset import load_tool_schema

# Load tool schema
available_tools = load_tool_schema()

# Configure tool correctness metric with available tools
tool_correctness = DeepEvalToolCorrectnessMetric(
    model=DefaultValues.AGENT_EVALS_MODEL,
    model_credentials=os.getenv("OPENAI_API_KEY"),
    available_tools=available_tools,  # Provide tool context
)

# Create evaluator
evaluator = AgentEvaluator(
    tool_correctness_metric=tool_correctness,
)

DeepEvalToolCorrectnessMetric will compare the tool selection score compared to available_tools and the comparison score between the tool calls to the reference. The final result returned will be the lowest between both score.

Enabling Langchain Agent Trajectory Evaluator

The trajectory accuracy metric evaluates the agent's full trajectory using LangChain's agentevals approach. It's disabled by default and only runs when LangChainAgentTrajectoryAccuracyMetric is provided agent_trajectory to the AgentEvaluator. using LangChainAgentTrajectoryAccuracyMetric requires you to provide agent_trajectory and expected_agent_trajectory.

Agent Trajectory Evaluator will not affect the final score of AgentEvaluator and purely used to evaluate the trajectory only.

Using LangChainAgentTrajectoryAccuracyMetric may be costly as it will compare the full trajectory and the referenced trajectory that are relatively long.

from gllm_evals.constant import DefaultValues
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

trajectory_accuracy = LangChainAgentTrajectoryAccuracyMetric(
    model=DefaultValues.AGENT_EVALS_MODEL,
    model_credentials=os.getenv("OPENAI_API_KEY"),
)
evaluator = AgentEvaluator(
    trajectory_accuracy_metric=trajectory_accuracy
)

There are several configuration can be done to LangChainAgentTrajectoryAccuracyMetric via constructor, thus:

model (str): Model used for evaluation. Current recommended model for Agent Trajectory evaluator is gpt-4.1
model_credentials (str): the API Key for the models provided
use_reference (bool): if True, it will compare agent trajectory to the reference in the expected agent trajectory. If False, the evaluation over the agent trajectory will not use the expected agent trajectory. Defaults to True
continuous (bool): If True, score will return as float between 0 - 1. Defaults to False
use_reasoning (bool): If True, explanation will be included in the output
few_shot_examples (list[FewShotExample], optional): list of few shot examples that will be provided as context

For more details on LangChainAgentTrajectoryAccuracyMetric, please see here.

PreviousGEvalGenerationEvaluator NextClassicalRetrievalEvaluator

Last updated 4 days ago

Was this helpful?

hashtagOverview

hashtagFields

hashtagConfiguration Options

hashtagOutput

hashtagExample Usage

hashtagExample Output

hashtagTools Structure

hashtagAgent Trajectory Structure

hashtagCustomizing Tool Correctness Parameters

hashtagUsing Available Tools for Tool Correctness

hashtagWhy Provide Available Tools?

hashtagTool Schema

hashtagHow to use

hashtagEnabling Langchain Agent Trajectory Evaluator

Overview

Fields

Configuration Options

Output

Example Usage

Example Output

Tools Structure

Agent Trajectory Structure

Customizing Tool Correctness Parameters

Using Available Tools for Tool Correctness

Why Provide Available Tools?

Tool Schema

How to use

Enabling Langchain Agent Trajectory Evaluator