🤖 AIP Evaluation Tutorial
Prerequisites
Required Parameters for AIP
Required Keys (Langfuse – Optional)
If you want experiment tracking, you will also need LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST (see step 2).
2. Setup Environment and Configuration
Set the following environment variables, for example in a .env file at your project root:

# OpenAI API Key for evaluation models
OPENAI_API_KEY=your_openai_api_key_here

# Langfuse (Optional - for experiment tracking)
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=your_langfuse_host_url

If you use Langfuse, verify that the client can authenticate:

import os
from langfuse import get_client

langfuse = get_client()

if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")
3. Prepare Your Dataset
Input Fields
Each record provides: query, generated_response, expected_response, agent_trajectory, and expected_agent_trajectory; retrieved_context can additionally be supplied as metadata.
Example Dataset Structure
query,generated_response,expected_response,agent_trajectory,expected_agent_trajectory
"What is the weather in San Francisco?","The weather in San Francisco is 75 degrees and partly cloudy.","It's 75 degrees and partly cloudy in San Francisco.","[{""role"":""user"",""content"":""What is the weather in San Francisco?""},{""role"":""assistant"",""content"":""The weather in San Francisco is 75 degrees and partly cloudy.""}]","[{""role"":""user"",""content"":""What's the weather like in San Francisco?""},{""role"":""assistant"",""content"":""It's 75 degrees and partly cloudy in San Francisco.""}]"Langfuse Mapping (Optional)
Langfuse Mapping (Optional)

If you track experiments in Langfuse, this mapping tells the tracker which fields to record as input, expected output, and metadata:

langfuse_mapping = {
    "input": {
        "query": "query",
        "agent_trajectory": "agent_trajectory",
        "expected_agent_trajectory": "expected_agent_trajectory"
    },
    "expected_output": {
        "expected_response": "expected_response"
    },
    "metadata": {
        "generated_response": "generated_response",
        "retrieved_context": "retrieved_context"
    }
}
4. Configure the AgentEvaluator
Trajectory-Only Evaluation
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,
    continuous=True,
)
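To sanity-check this configuration, you can pass the evaluator to the same evaluate() helper used in the Langfuse example below (step 7). The sketch below assumes the experiment tracker argument is optional and that the bundled sample dataset items can simply be passed through by the inference function; both are assumptions, not documented guarantees.

import asyncio
import os

from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator


async def passthrough(item):
    # The sample dataset already contains responses and trajectories,
    # so no inference is needed here.
    return item


async def run():
    results = await evaluate(
        data=load_simple_agent_dataset(),
        inference_fn=passthrough,
        evaluators=[
            AgentEvaluator(
                model_credentials=os.getenv("OPENAI_API_KEY"),
                use_reference=True,
                continuous=True,
            )
        ],
    )
    print(results)


asyncio.run(run())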
Combined Evaluation

from gllm_evals.constant import DefaultValues
from gllm_evals.evaluator.trajectory_generation_evaluator import TrajectoryGenerationEvaluator

evaluator = TrajectoryGenerationEvaluator(
    # Model for agent trajectory evaluation (execution quality)
    agent_model=DefaultValues.AGENT_EVALS_MODEL,
    agent_model_credentials=os.getenv("OPENAI_API_KEY"),
    # Model for generation evaluation (response quality)
    generation_model=DefaultValues.MODEL,
    generation_model_credentials=os.getenv("OPENAI_API_KEY"),
)

Custom Configuration Options
7. Advanced Evaluation with Langfuse Integration
import asyncio
import os

from langfuse import get_client

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker


async def main():
    dataset = load_simple_agent_dataset()

    async def generate_agent_response(item):
        # The sample dataset already contains the agent's output, so this
        # inference function just forwards the relevant fields.
        return {
            "query": item.get("query"),
            "generated_response": item.get("generated_response"),
            "agent_trajectory": item.get("agent_trajectory"),
            "expected_response": item.get("expected_response"),
            "expected_agent_trajectory": item.get("expected_agent_trajectory"),
        }

    results = await evaluate(
        data=dataset,
        inference_fn=generate_agent_response,
        evaluators=[
            AgentEvaluator(
                model_credentials=os.getenv("OPENAI_API_KEY"),
                use_reference=True,
            )
        ],
        experiment_tracker=LangfuseExperimentTracker(
            langfuse_client=get_client(),
            mapping=langfuse_mapping,  # defined in step 3 (Langfuse Mapping)
        ),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
8. Understanding Evaluation Results
Trajectory-Only Results
{
    "trajectory_accuracy": {
        "score": 1.0,
        "explanation": "The trajectory shows good progression..."
    }
}

Combined Evaluation Results (using TrajectoryGenerationEvaluator)
{
    "trajectory_accuracy": {"score": 1.0},
    "geval_generation_evals": {"score": 1, "relevancy_rating": "good"},
    "final_result": {"score": 1, "relevancy_rating": "good"}
}
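Individual scores can be pulled out of these results directly, assuming the dictionary shape shown above. A small illustration:

# Example of reading a single evaluator result with the shape shown above.
result = {
    "trajectory_accuracy": {
        "score": 1.0,
        "explanation": "The trajectory shows good progression...",
    }
}

trajectory = result["trajectory_accuracy"]
if trajectory["score"] < 1.0:
    # Surface the judge's explanation when the trajectory is not perfect.
    print("Trajectory issue:", trajectory["explanation"])
else:
    print("Trajectory score:", trajectory["score"])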
Best Practices

Troubleshooting
Issue
Fix
How to Generate Expected Agent Trajectory
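The dataset example above treats the expected trajectory as a JSON-encoded list of chat messages, so one straightforward approach is to hand-author the ideal message sequence for each query. The snippet below only illustrates that format; how you obtain the ideal sequence (manual labeling, a trusted agent run, etc.) is up to you.

import json

# Hand-authored reference trajectory for one query, in the same
# role/content message format used by the dataset example above.
expected_agent_trajectory = json.dumps([
    {"role": "user", "content": "What's the weather like in San Francisco?"},
    {"role": "assistant", "content": "It's 75 degrees and partly cloudy in San Francisco."},
])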
Best Practices for Reference Trajectory Generation
Additional Resources