# 🤖 AIP Evaluation Tutorial
This guide shows how to evaluate AI agent trajectories using gllm-evals. The AIP (Agent Integration Platform) uses the `AgentEvaluator` with LangChain AgentEvals trajectory accuracy metrics to assess agent performance. It supports both trajectory evaluation and generation quality assessment with rule-based aggregation, and results can be monitored in Langfuse.
## Prerequisites
Before you can start evaluating AIP agent trajectories, prepare the following:
### Required Parameters for AIP
- **Dataset Path** – Path to your CSV file containing agent trajectory data
- **Model Credentials** – OpenAI API key for the evaluation models
### Required Keys (Langfuse – Optional)
- `LANGFUSE_PUBLIC_KEY`
- `LANGFUSE_SECRET_KEY`
- `LANGFUSE_HOST`
## Install the Required Libraries

```bash
pip install gllm-evals langfuse
```

## Set Up Environment and Configuration
Create a `.env` file (or export these variables) with your credentials:

```bash
# OpenAI API Key for evaluation models
OPENAI_API_KEY=your_openai_api_key_here

# Langfuse (Optional - for experiment tracking)
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=your_langfuse_host_url
```

Then verify that the Langfuse client can authenticate:

```python
import os

from langfuse import get_client

langfuse = get_client()

if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")
```

## Prepare Your Dataset
### Input Fields
- `agent_trajectory`
- `expected_agent_trajectory` (optional)
- `query`
- `generated_response`
- `expected_response` (optional)
- `retrieved_context` (optional)
### Example Dataset Structure
```csv
query,generated_response,expected_response,agent_trajectory,expected_agent_trajectory
"What is the weather in San Francisco?","The weather in San Francisco is 75 degrees and partly cloudy.","It's 75 degrees and partly cloudy in San Francisco.","[{""role"":""user"",""content"":""What is the weather in San Francisco?""},{""role"":""assistant"",""content"":""The weather in San Francisco is 75 degrees and partly cloudy.""}]","[{""role"":""user"",""content"":""What's the weather like in San Francisco?""},{""role"":""assistant"",""content"":""It's 75 degrees and partly cloudy in San Francisco.""}]"
```
### Langfuse Mapping (Optional)

If you track experiments in Langfuse, define a mapping from dataset columns to Langfuse fields:

```python
langfuse_mapping = {
    "input": {
        "query": "query",
        "agent_trajectory": "agent_trajectory",
        "expected_agent_trajectory": "expected_agent_trajectory",
    },
    "expected_output": {
        "expected_response": "expected_response",
    },
    "metadata": {
        "generated_response": "generated_response",
        "retrieved_context": "retrieved_context",
    },
}
```

## Configure the AgentEvaluator
### Trajectory-Only Evaluation
```python
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,
    continuous=True,
)
```

### Combined Evaluation
```python
from gllm_evals.constant import DefaultValues
from gllm_evals.evaluator.trajectory_generation_evaluator import TrajectoryGenerationEvaluator

evaluator = TrajectoryGenerationEvaluator(
    # Model for agent trajectory evaluation (execution quality)
    agent_model=DefaultValues.AGENT_EVALS_MODEL,
    agent_model_credentials=os.getenv("OPENAI_API_KEY"),
    # Model for generation evaluation (response quality)
    generation_model=DefaultValues.MODEL,
    generation_model_credentials=os.getenv("OPENAI_API_KEY"),
)
```

### Custom Configuration Options
- `model`
- `prompt`
- `use_reference`
- `continuous`
- `choices`
- `use_reasoning`
- `few_shot_examples`
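As a rough sketch of how these options might combine (the exact signatures and value types here are assumptions, so check the gllm-evals API reference; `prompt`, `choices`, and `few_shot_examples` follow the same keyword-argument pattern):

```python
# Illustrative sketch only: the argument values below are assumptions.
evaluator = AgentEvaluator(
    model="gpt-4o-mini",                            # assumed model identifier
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,   # compare against expected_agent_trajectory
    continuous=True,      # assumed: continuous scores rather than pass/fail
    use_reasoning=True,   # assumed: include an explanation with each score
)
```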
## Load and Prepare Your Dataset
```python
from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset

dataset = load_simple_agent_dataset()
# or: dataset = load_dataset_from_csv("path/to/your/dataset.csv")
```

## Run the Evaluation
```python
import asyncio
import os

from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

async def main():
    dataset = load_simple_agent_dataset()
    evaluator = AgentEvaluator(
        model_credentials=os.getenv("OPENAI_API_KEY")
    )
    # Evaluate a single dataset item.
    result = await evaluator.evaluate(dataset[0])
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

## Advanced Evaluation with Langfuse Integration
```python
import asyncio
import os

from langfuse import get_client

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

async def main():
    dataset = load_simple_agent_dataset()

    async def generate_agent_response(item):
        # Pass the precomputed fields through; in a live setup this is
        # where you would call your agent instead.
        return {
            "query": item.get("query"),
            "generated_response": item.get("generated_response"),
            "agent_trajectory": item.get("agent_trajectory"),
            "expected_response": item.get("expected_response"),
            "expected_agent_trajectory": item.get("expected_agent_trajectory"),
        }

    results = await evaluate(
        data=dataset,
        inference_fn=generate_agent_response,
        evaluators=[
            AgentEvaluator(
                model_credentials=os.getenv("OPENAI_API_KEY"),
                use_reference=True,
            )
        ],
        experiment_tracker=LangfuseExperimentTracker(
            langfuse_client=get_client(),
            mapping=langfuse_mapping,  # defined in the mapping section above
        ),
    )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```

## Understanding Evaluation Results
### Trajectory-Only Results
```json
{
  "trajectory_accuracy": {
    "score": 1.0,
    "explanation": "The trajectory shows good progression..."
  }
}
```

### Combined Evaluation Results (using `TrajectoryGenerationEvaluator`)
```json
{
  "trajectory_accuracy": {"score": 1.0},
  "geval_generation_evals": {"score": 1, "relevancy_rating": "good"},
  "final_result": {"score": 1, "relevancy_rating": "good"}
}
```

## View Results in Langfuse (Optional)
1. Open your Langfuse project dashboard.
2. Explore the Datasets, Runs, Traces, and Sessions views to inspect evaluation results.
## Best Practices
- Ensure data completeness.
- Use reference trajectories for better accuracy.
- Validate the trajectory format (each message needs `role` and `content`); a validation sketch follows this list.
- Use Langfuse for tracking and visualization.
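One way to act on the format check is a small validation helper run before evaluation. A minimal sketch (the function name and error messages are illustrative, not part of gllm-evals):

```python
def validate_trajectory(trajectory: list) -> None:
    """Raise ValueError unless trajectory is a list of role/content messages."""
    if not isinstance(trajectory, list):
        raise ValueError("Trajectory must be a list of messages.")
    for i, message in enumerate(trajectory):
        if not isinstance(message, dict):
            raise ValueError(f"Message {i} is not a dict.")
        missing = {"role", "content"} - message.keys()
        if missing:
            raise ValueError(f"Message {i} is missing keys: {missing}")
```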
## Troubleshooting
| Issue | Solution |
| --- | --- |
| Invalid trajectory format | Make sure each trajectory is a list of dicts with `role` and `content` keys. |
| Missing fields | Add the required columns to your CSV. |
| Invalid credentials | Recheck your OpenAI and Langfuse keys. |
## How to Generate Expected Agent Trajectories
Creating high-quality expected agent trajectories is crucial for accurate evaluation. Here is a systematic approach to generating reference trajectories.
### Generate Referenceless Trajectories
Start by running your agent without reference trajectories to capture natural execution patterns. This establishes a baseline of how your agent behaves without prior bias, which you can later compare against curated reference trajectories.
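A minimal sketch of that baseline configuration, assuming `use_reference=False` disables the comparison against `expected_agent_trajectory`:

```python
# Referenceless baseline: judge each trajectory on its own merits.
evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=False,  # assumed: skip expected_agent_trajectory comparison
)
# Run it exactly as in "Run the Evaluation" above.
```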
### Run Initial Evaluation
After generating referenceless trajectories, run evaluations on your dataset to produce complete trajectory data. Each record should include the agent’s reasoning steps, tool calls, and responses. The goal is to collect a diverse set of examples that capture both effective and ineffective reasoning paths. This dataset forms the basis for your later quality review.
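For instance, you might loop over the dataset and persist each score and explanation next to its trajectory for the review step that follows. A sketch (the helper name and output file are illustrative; the `trajectory_accuracy` key follows the result format shown under Understanding Evaluation Results):

```python
import json

async def collect_results(dataset, evaluator):
    """Evaluate every item and keep the scores next to the trajectories."""
    records = []
    for item in dataset:
        result = await evaluator.evaluate(item)
        records.append({
            "query": item.get("query"),
            "agent_trajectory": item.get("agent_trajectory"),
            "trajectory_accuracy": result.get("trajectory_accuracy"),
        })
    # Persist for the manual review step below.
    with open("trajectory_review.json", "w") as f:
        json.dump(records, f, indent=2, default=str)
    return records
```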
### Manual Quality Assessment and Selection
Conduct a manual quality assessment with your data or evaluation team to determine which trajectories are suitable as reference trajectories.
Review each trajectory for:
- **Logical consistency** — Does each step follow a coherent reasoning process?
- **Step progression** — Does the agent take actions that progressively solve the query?
- **Goal achievement** — Does it reach a correct or reasonable final outcome?
- **Error handling** — Does it avoid irrelevant, repetitive, or failed actions?
- **Efficiency** — Is the reasoning concise and free of unnecessary steps?
Inspect the explanation generated by the `AgentEvaluator` to help score or label trajectory quality. Check for:

- The trajectory is efficient (minimal redundant steps).
- No tool errors or failed function calls.
- The reasoning chain is complete (no abrupt jumps or missing logic).
- The agent's final decision or response matches the task goal.
Combining manual review with evaluator-generated explanations provides a structured way to score or label trajectory quality.
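Building on the records collected above, a first pass might keep only high-scoring trajectories and queue their explanations for human review. A sketch (the threshold and helper name are illustrative):

```python
def select_candidates(records, threshold=0.9):
    """Pre-filter trajectories the evaluator scored highly for manual review."""
    candidates = []
    for record in records:
        accuracy = record.get("trajectory_accuracy") or {}
        if accuracy.get("score", 0) >= threshold:
            candidates.append(record)
    return candidates
```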
### Integration with Dataset
After selecting and validating the high-quality trajectories, integrate them back into your dataset. For each query, attach the corresponding validated trajectory under `expected_agent_trajectory`. This enriched dataset will serve as the foundation for future trajectory-based evaluations, enabling consistent benchmarking and comparison across agents.
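A sketch of that step with pandas, re-encoding each validated trajectory as JSON so the column matches the dataset format above (`attach_references` and the `approved` mapping are hypothetical names):

```python
import json

import pandas as pd

def attach_references(dataset_path: str, approved: dict, output_path: str) -> None:
    """Write a copy of the dataset with validated reference trajectories.

    approved maps a query string to its validated trajectory
    (a list of {"role": ..., "content": ...} dicts).
    """
    df = pd.read_csv(dataset_path)
    df["expected_agent_trajectory"] = df["query"].map(
        lambda q: json.dumps(approved[q]) if q in approved else None
    )
    df.to_csv(output_path, index=False)
```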
### Best Practices for Reference Trajectory Generation
- **Expert Review** — Always involve domain experts for final validation.
- **Diverse Coverage** — Include reasoning across multiple task types.
- **Regular Updates** — Refresh reference trajectories as the agent improves.
- **Documentation** — Record rationale and notes for accepted or rejected trajectories.