📈 Experiment Tracker

We provide BaseExperimentTracker and several ready-to-use experiment tracker classes. These can also be plugged into the evaluate() function to log evaluation results, making it easier to analyze and share outcomes.

Available Experiment Trackers


🪶 SimpleExperimentTracker

Use when: You want a lightweight, local tracker that logs results to CSV. It is great for quick tests, prototyping, or when you do not need a full UI.

Example usage:

from gllm_evals.experiment_tracker.simple_experiment_tracker import SimpleExperimentTracker

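# Create a local tracker; results are written as CSV files under output_dir.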
tracker = SimpleExperimentTracker(
    project_name="my_project", 
    output_dir="./my_experiments"
)
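# Log evaluation results to the local CSV output.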
tracker.log(...)

🌐 LangfuseExperimentTracker

Use when: You want a production-grade tracker integrated with Langfuse. It is great for detailed traces and spans, dataset and run management, and session- and dataset-level scoring.

Example usage:

New User Configuration

If you are new to Langfuse, you can follow these steps to set up and use the Langfuse Experiment Tracker:

  1. Create an Organization. Click New Organization on the Organizations page and enter the organization name. This gives you a top-level space to manage projects and members. Use a human-readable name, typically your company, team, or client name (e.g. client-XYZ).

  2. Manage members

    1. Invite teammates to the org/project with the roles you need (viewer/editor/admin).

    2. Important: Set yourself (or one trusted person) as the admin, so that person can invite and manage other project members in the organization.

  3. Create a Project. Experiments, datasets, traces, and API keys are project-scoped. Enter a project name to create the project, typically your project or application name (e.g. project-abc).

  4. Create API credentials. You can create API keys now, or later under Project → Settings → API Keys; generate the keys and copy:

    1. Public key

    2. Secret key

    3. Langfuse host

  5. Configure your environment. Most Langfuse clients (and our evaluate() integration) read these environment variables: LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST.

  6. Run an evaluation with Langfuse tracking enabled. With these credentials, you can now use the Langfuse Experiment Tracker, as sketched below.
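A minimal sketch of steps 5 and 6 follows. The environment variable names are the standard ones read by Langfuse clients; the import path and the constructor call for LangfuseExperimentTracker are assumptions (the import mirrors SimpleExperimentTracker's module layout), so treat this as an illustration and check the class itself for the exact signature.

import os

# Assumed import path, mirroring SimpleExperimentTracker's module layout.
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

# Credentials copied from Project → Settings → API Keys (step 4).
# These are the standard environment variable names read by Langfuse clients.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted Langfuse URL

# Assumed no-argument constructor; consult the class for the exact arguments
# (it may accept e.g. a project name or explicit credentials instead).
tracker = LangfuseExperimentTracker()

# Pass the tracker to evaluate() so datasets, runs, scores, and traces are logged to Langfuse.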


What happens automatically

  • Auto-dataset creation (when needed). If you pass a dataset that does not already exist in Langfuse, we automatically create it: by default, the expected_response column is used as the ground truth response, or columns are mapped according to the mapping dictionary you provide (a hypothetical sketch follows this list; for the real mapping example, you can visit this subsection). You’ll find the dataset under Project → Datasets in the left sidebar after a round of evaluation.

  • Experiment run logging. Your evaluation is logged, including runs, metrics/scores, and the underlying traces.

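As a purely hypothetical illustration of such a mapping dictionary (the key and value names here are invented; the actual format expected by evaluate() is shown in the mapping example referenced above):

# Hypothetical mapping from your dataset's column names to the fields used when
# the Langfuse dataset is auto-created. Names are illustrative only.
column_mapping = {
    "question": "input",
    "reference_answer": "expected_response",
}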

Where to see results (in Langfuse)

  • Datasets: the dataset created/linked by evaluate(). Path: Project → Datasets

  • Dataset runs: executions over a dataset with per-item outputs and evaluator scores. Path: Project → Datasets → select a dataset → Runs

  • Traces / Observations: drill into individual calls/spans, inputs/outputs, timings. Path: Project → Traces (and Observations)

  • Sessions: grouped traces per experiment; you can review, share, and even score sessions. Path: Project → Sessions.


🔁 Refresh Langfuse Experiment Tracker

To refresh the scores in Langfuse after you have updated them, you can use the following function:


📁 Export Langfuse Experiment Results to CSV

You can export the Langfuse experiment results with all the updated scores to CSV using the following function:
