🤖 AIP Evaluation Tutorial

This guide shows how to evaluate AI agent trajectories using gllm-evals. The AIP (Agent Integration Platform) workflow uses the AgentEvaluator with the LangChain AgentEvals trajectory accuracy metric to assess agent performance, and it supports both trajectory-only evaluation and combined generation quality assessment with rule-based aggregation. Results can be monitored in Langfuse.

Prerequisites

Before you can start evaluating AIP agent trajectories, prepare the following:

Required Parameters for AIP

  • Dataset Path – Path to your CSV file containing agent trajectory data

  • Model Credentials – OpenAI API key for evaluation models

Required Keys (Langfuse – Optional)

  • LANGFUSE_PUBLIC_KEY

  • LANGFUSE_SECRET_KEY

  • LANGFUSE_HOST

Step 1: Install the Required Libraries

pip install gllm-evals langfuse

Step 2: Set Up Environment and Configuration

Create a .env file (or export these variables in your shell) with the following values:

# OpenAI API Key for evaluation models
OPENAI_API_KEY=your_openai_api_key_here

# Langfuse (Optional - for experiment tracking)
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=your_langfuse_host_url
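
If you keep these values in a .env file rather than exporting them in your shell, load them before initializing any clients. A small sketch, assuming the python-dotenv package is installed:

# Optional: load credentials from a local .env file (assumes python-dotenv is installed).
from dotenv import load_dotenv

load_dotenv()

Then verify that the Langfuse client can authenticate:
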
import os
from langfuse import get_client

langfuse = get_client()

if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Step 3: Prepare Your Dataset

Input Fields

  • agent_trajectory – the agent's actual execution trace, a JSON list of messages with role and content keys

  • expected_agent_trajectory (optional) – the reference trajectory the agent's execution is compared against

  • query, generated_response, expected_response (optional) – used for generation quality evaluation

  • retrieved_context (optional) – context retrieved by the agent, used by generation metrics when available

Example Dataset Structure

query,generated_response,expected_response,agent_trajectory,expected_agent_trajectory
"What is the weather in San Francisco?","The weather in San Francisco is 75 degrees and partly cloudy.","It's 75 degrees and partly cloudy in San Francisco.","[{""role"":""user"",""content"":""What is the weather in San Francisco?""},{""role"":""assistant"",""content"":""The weather in San Francisco is 75 degrees and partly cloudy.""}]","[{""role"":""user"",""content"":""What's the weather like in San Francisco?""},{""role"":""assistant"",""content"":""It's 75 degrees and partly cloudy in San Francisco.""}]"

Langfuse Mapping (Optional)

If you plan to track experiments in Langfuse, define how your dataset columns map to each Langfuse item's input, expected output, and metadata:

langfuse_mapping = {
    "input": {
        "query": "query",
        "agent_trajectory": "agent_trajectory",
        "expected_agent_trajectory": "expected_agent_trajectory"
    },
    "expected_output": {
        "expected_response": "expected_response"
    },
    "metadata": {
        "generated_response": "generated_response",
        "retrieved_context": "retrieved_context"
    }
}

Step 4: Configure the AgentEvaluator

Trajectory-Only Evaluation

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,  # compare the trajectory against expected_agent_trajectory
    continuous=True,     # score on a continuous scale rather than a discrete judgment
)

Combined Evaluation

from gllm_evals.constant import DefaultValues
from gllm_evals.evaluator.trajectory_generation_evaluator import TrajectoryGenerationEvaluator

evaluator = TrajectoryGenerationEvaluator(
    # Model for agent trajectory evaluation (execution quality)
    agent_model=DefaultValues.AGENT_EVALS_MODEL,
    agent_model_credentials=os.getenv("OPENAI_API_KEY"),
    # Model for generation evaluation (response quality)
    generation_model=DefaultValues.MODEL,
    generation_model_credentials=os.getenv("OPENAI_API_KEY"),
)

Custom Configuration Options

The AgentEvaluator also exposes the following options, which mirror the underlying LangChain AgentEvals judge configuration (a configuration sketch follows this list):

  • model – the judge model used for trajectory evaluation

  • prompt – a custom evaluation prompt for the judge

  • use_reference – whether to compare the trajectory against expected_agent_trajectory

  • continuous – return a score on a continuous scale instead of a discrete choice

  • choices – the allowed discrete score values when continuous scoring is disabled

  • use_reasoning – include an explanation alongside the score

  • few_shot_examples – example judgments to guide the evaluation model
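
A configuration sketch using several of these options; the model identifier and option values below are illustrative assumptions, not recommended defaults:

evaluator = AgentEvaluator(
    model="gpt-4o-mini",  # assumed identifier; use whichever judge model your deployment supports
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,       # compare against expected_agent_trajectory
    continuous=False,         # use discrete choices instead of a 0-1 scale
    choices=[0.0, 0.5, 1.0],  # allowed discrete scores (illustrative)
    use_reasoning=True,       # include an explanation with each score
)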

Step 5: Load and Prepare Your Dataset

from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset

# Load the bundled sample agent dataset
dataset = load_simple_agent_dataset()

# Or load your own CSV prepared in Step 3:
# dataset = load_dataset_from_csv("path/to/your/dataset.csv")

Step 6: Run the Evaluation

import asyncio
import os

from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

async def main():
    dataset = load_simple_agent_dataset()
    evaluator = AgentEvaluator(
        model_credentials=os.getenv("OPENAI_API_KEY")
    )
    result = await evaluator.evaluate(dataset[0])
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Step 7: Advanced Evaluation with Langfuse Integration

import asyncio, os
from langfuse import get_client
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

async def main():
    dataset = load_simple_agent_dataset()

    async def generate_agent_response(item):
        return {
            "query": item.get("query"),
            "generated_response": item.get("generated_response"),
            "agent_trajectory": item.get("agent_trajectory"),
            "expected_response": item.get("expected_response"),
            "expected_agent_trajectory": item.get("expected_agent_trajectory"),
        }

    results = await evaluate(
        data=dataset,
        inference_fn=generate_agent_response,
        evaluators=[AgentEvaluator(
            model_credentials=os.getenv("OPENAI_API_KEY"),
            use_reference=True
        )],
        experiment_tracker=LangfuseExperimentTracker(
            langfuse_client=get_client(),
            mapping=langfuse_mapping,  # langfuse_mapping from Step 3
        ),
    )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())

Step 8: Understanding Evaluation Results

Trajectory-Only Results

{
    "trajectory_accuracy": {
        "score": 1.0,
        "explanation": "The trajectory shows good progression..."
    }
}

Combined Evaluation Results (using TrajectoryGenerationEvaluator)

{
    "trajectory_accuracy": {"score": 1.0},
    "geval_generation_evals": {"score": 1, "relevancy_rating": "good"},
    "final_result": {"score": 1, "relevancy_rating": "good"}
}
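
Assuming the evaluator returns a plain dict shaped like the samples above (as in Step 6), individual scores can be read directly; the key names here follow the samples and may differ in your version:

# Hedged sketch: key names follow the sample output shown above.
trajectory_score = result["trajectory_accuracy"]["score"]
explanation = result["trajectory_accuracy"].get("explanation", "")
print(f"Trajectory accuracy: {trajectory_score} ({explanation})")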

Step 9: View Results in Langfuse (Optional)

  • Open Langfuse project dashboard

  • Browse Datasets → Runs → Traces → Sessions to inspect evaluation results

Best Practices

  1. Ensure data completeness

  2. Use reference trajectories for better accuracy

  3. Validate trajectory format (role, content)

  4. Use Langfuse for tracking and visualization

Troubleshooting

  • Invalid trajectory format – Ensure each trajectory is a JSON list of dicts with role and content keys.

  • Missing fields – Add the required columns (see Step 3) to your CSV.

  • Invalid credentials – Recheck your OpenAI API key and Langfuse keys.

How to Generate Expected Agent Trajectories

Creating high-quality expected agent trajectories is crucial for accurate evaluation. Here's a systematic approach to generating reference trajectories.

Step 1: Generate Referenceless Trajectories

Start by running your agent without reference trajectories to capture natural execution patterns. This establishes a baseline of how your agent behaves without prior bias, which you can later compare against curated reference trajectories.
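
A minimal sketch of this referenceless setup, reusing the AgentEvaluator from Step 4 (use_reference=False disables comparison against an expected trajectory):

import os

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Referenceless setup: the judge scores each trajectory on its own merits.
baseline_evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=False,  # no expected_agent_trajectory needed at this stage
)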

Step 2: Run Initial Evaluation

After generating referenceless trajectories, run evaluations on your dataset to produce complete trajectory data. Each record should include the agent’s reasoning steps, tool calls, and responses. The goal is to collect a diverse set of examples that capture both effective and ineffective reasoning paths. This dataset forms the basis for your later quality review.
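
One way to collect these records is to loop over the dataset, evaluate each item, and save the scores and explanations alongside the trajectories for later review. A sketch, assuming the sample dataset loader from Step 5 and the baseline_evaluator defined in the previous step:

import asyncio
import json

from gllm_evals.dataset.simple_agent_dataset import load_simple_agent_dataset

async def collect_baseline_results():
    dataset = load_simple_agent_dataset()
    records = []
    for item in dataset:
        result = await baseline_evaluator.evaluate(item)
        records.append({
            "query": item.get("query"),
            "agent_trajectory": item.get("agent_trajectory"),
            "evaluation": result,  # score and explanation for manual review
        })
    # Assumes the result is a JSON-serializable dict (see Step 8 for the expected shape).
    with open("baseline_trajectory_results.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

asyncio.run(collect_baseline_results())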

Step 3: Manual Quality Assessment and Selection

Conduct a manual quality assessment with your data or evaluation team to determine which trajectories are suitable as reference trajectories.

Review each trajectory for:

  • Logical consistency — Does each step follow a coherent reasoning process?

  • Step progression — Does the agent take actions that progressively solve the query?

  • Goal achievement — Does it reach a correct or reasonable final outcome?

  • Error handling — Does it avoid irrelevant, repetitive, or failed actions?

  • Efficiency — Is the reasoning concise and free of unnecessary steps?

Inspect the explanation generated by the AgentEvaluator for each trajectory and check that:

  • The trajectory is efficient (minimal redundant steps).

  • No tool errors or failed function calls.

  • The reasoning chain is complete (no abrupt jumps or missing logic).

  • The agent’s final decision or response matches the task goal.

Combining manual review with evaluator-generated explanations provides a structured way to score or label trajectory quality.
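
A quick way to shortlist candidates for this review, assuming the baseline_trajectory_results.json file produced in the earlier sketch and an evaluation field shaped like the trajectory-only result in Step 8 (both assumptions):

import json

with open("baseline_trajectory_results.json", encoding="utf-8") as f:
    records = json.load(f)

# Shortlist high-scoring trajectories for manual review.
candidates = [
    r for r in records
    if r["evaluation"].get("trajectory_accuracy", {}).get("score", 0) >= 1.0
]
print(f"{len(candidates)} candidate trajectories selected for review")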

Step 4: Integration with the Dataset

After selecting and validating the high-quality trajectories, integrate them back into your dataset. For each query, attach the corresponding validated trajectory under expected_agent_trajectory. This enriched dataset will serve as the foundation for future trajectory-based evaluations, enabling consistent benchmarking and comparison across agents.
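
A sketch of one way to fold the validated trajectories back into the CSV, assuming pandas and a mapping from each query to its approved trajectory (both are assumptions, not part of gllm-evals):

import json

import pandas as pd

# Hypothetical mapping produced during manual review: query -> approved trajectory.
approved_trajectories = {
    "What is the weather in San Francisco?": [
        {"role": "user", "content": "What is the weather in San Francisco?"},
        {"role": "assistant", "content": "It's 75 degrees and partly cloudy in San Francisco."},
    ],
}

df = pd.read_csv("path/to/your/dataset.csv")
df["expected_agent_trajectory"] = df["query"].map(
    lambda q: json.dumps(approved_trajectories[q]) if q in approved_trajectories else None
)
df.to_csv("path/to/your/dataset_with_references.csv", index=False)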

Best Practices for Reference Trajectory Generation

  1. Expert Review — Always involve domain experts for final validation.

  2. Diverse Coverage — Include reasoning across multiple task types.

  3. Regular Updates — Refresh reference trajectories as the agent improves.

  4. Documentation — Record rationale and notes for accepted or rejected trajectories.

Additional Resources