📈 Experiment Tracker

We provide BaseExperimentTracker and several ready-to-use experiment tracker classes. These can also be plugged into the evaluate() function to log evaluation results, making it easier to analyze and share outcomes.

Available Experiment Trackers


🪶 SimpleExperimentTracker

Use when: You want a lightweight, local tracker that logs results to a CSV file. It is great for quick tests, prototyping, or when you do not need a full UI.

Example usage:

from gllm_evals.experiment_tracker.simple_experiment_tracker import SimpleExperimentTracker

tracker = SimpleExperimentTracker(
    project_name="my_project",
    output_dir="./my_experiments"
)

# Log evaluation results; they are written as a CSV file under output_dir.
tracker.log(...)

🌐 LangfuseExperimentTracker

Use when: You want a production-grade tracker integrated with Langfuse. It is great for detailed traces and spans, dataset and run management, and session- and dataset-level scoring.

Example usage:

New User Configuration

If you are new to Langfuse, follow these steps to set up the Langfuse Experiment Tracker:

  1. Open the Langfuse host. Go to https://langfuse.obrol.id/ and log in with your GDP Labs account. This website is managed by the BOSA team.

  2. Create an Organization. On the Organizations page, click New Organization and enter an organization name. This gives you a top-level space to manage projects and members. Use a human-readable company, team, or client name (e.g., glchat, catapa, client-XYZ).

  3. Manage members

    1. Invite teammates to the org/project with the roles you need (viewer/editor/admin).

    2. Important: Set yourself (or one trusted person) as the admin so that person can invite and manage other project members in the organization.

  4. Create a Project. Experiments, datasets, traces, and API keys are project-scoped. Enter a project name, typically your project or application name (e.g., glchat-beta), to create the project.

  5. Create API credentials. You can create API keys now or later under Project → Settings → API Keys, then generate the keys and copy:

    1. Public key

    2. Secret key

    3. Langfuse host

  6. Configure your environment. Most Langfuse clients (and our evaluate() integration) read these env vars:
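
For example, assuming the standard Langfuse environment variable names (they correspond to the keys from step 5), you can set them in your shell, a .env file, or directly in Python before running your evaluation:

import os

# Credentials from step 5; replace the placeholders with your project's keys.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://langfuse.obrol.id"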

  7. Run an evaluation with Langfuse tracking enabled. With these credentials, you can now use the Langfuse Experiment Tracker, for example as sketched below.
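
A minimal usage sketch, assuming the tracker lives alongside SimpleExperimentTracker and reads the LANGFUSE_* credentials from the environment; the exact module path and constructor arguments may differ:

# Assumed import path, mirroring the SimpleExperimentTracker module layout.
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

# Assumed constructor; credentials are picked up from the LANGFUSE_* env vars set above.
tracker = LangfuseExperimentTracker(project_name="my_project")

# Log results directly, or plug the tracker into evaluate() so runs, scores,
# and the underlying traces are recorded in Langfuse automatically.
tracker.log(...)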


What happens automatically

  • Auto-dataset creation (when needed). If you pass a dataset that does not already exist in Langfuse, we create it automatically: by default the expected_response column is used as the ground-truth response, or the dataset is built from the mapping dictionary you provide. To see a mapping example, visit this subsection. After a round of evaluation, you will find the dataset under Project → Datasets in the left sidebar.

  • Experiment run logging. Your evaluation is logged, including runs, metrics/scores, and the underlying traces.


Where to see results (in Langfuse)

  • Datasets: the dataset created/linked by evaluate(). Path: Project → Datasets

  • Dataset runs: executions over a dataset with per-item outputs and evaluator scores. Path: Project → Datasets → select a dataset → Runs

  • Traces / Observations: drill into individual calls/spans, inputs/outputs, and timings. Path: Project → Traces (and Observations)

  • Sessions: grouped traces per experiment; you can review, share, and even score sessions. Path: Project → Sessions

Service window (our hosted Langfuse)

The hosted Langfuse instance turns off automatically at 23:59 WIB. Contact the Evals or BOSA team for the Slack command to turn the Langfuse host on or off.


🔁 Refresh Langfuse Experiment Tracker

To refresh the scores in Langfuse after updating them, you can run the following function:
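
As a rough, hypothetical sketch (the actual helper name, location, and signature in gllm_evals may differ), assuming the tracker exposes a refresh method:

# Hypothetical example: the real refresh helper may be named or located differently.
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

tracker = LangfuseExperimentTracker(project_name="my_project")

# Re-sync the experiment's scores with Langfuse after they have been updated.
tracker.refresh()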


📁 Export Langfuse Experiment Results to CSV

You can export the Langfuse experiment results with all the updated scores to CSV using the following function:
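
A hypothetical sketch; the actual export helper in gllm_evals may have a different name and signature:

# Hypothetical example: the real export helper may be named or located differently.
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

tracker = LangfuseExperimentTracker(project_name="my_project")

# Write the experiment results, including the updated scores, to a local CSV file.
tracker.export_to_csv("./my_experiments/langfuse_results.csv")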
