📈 Experiment Tracker
We provide BaseExperimentTracker and several ready-to-use experiment tracker classes. These can also be plugged into the evaluate function to log and record evaluation results, making it easier to analyze and share outcomes.
Available Experiment Trackers
🪶 SimpleExperimentTracker
Use when: You want a lightweight, local tracker that logs results to CSV files. It is great for quick tests, prototyping, or when you do not need a full UI.
Example usage:
from gllm_evals.experiment_tracker.simple_experiment_tracker import SimpleExperimentTracker
tracker = SimpleExperimentTracker(
    project_name="my_project",
    output_dir="./my_experiments",  # results are written as CSV files in this directory
)
tracker.log(...)

🌐 LangfuseExperimentTracker
Use when: You want a production-grade tracker integrated with Langfuse. It is great for detailed traces and spans, dataset and run management, and session- and dataset-level scoring.
Example usage:
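A minimal sketch, mirroring the SimpleExperimentTracker example above. The module path and constructor parameters shown here are assumptions, so check the class signature in your installed version of gllm_evals:

```python
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import (
    LangfuseExperimentTracker,  # module path assumed by analogy with SimpleExperimentTracker
)

# Credentials come from your Langfuse project (Project → Settings → API Keys);
# the constructor parameter names below are illustrative assumptions.
tracker = LangfuseExperimentTracker(
    project_name="my_project",
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://langfuse.obrol.id",
)
tracker.log(...)
```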
New User Configuration
If you are new to Langfuse, follow these steps to start using the Langfuse Experiment Tracker:
1. Open the Langfuse host. Go to https://langfuse.obrol.id/ and log in with your GDP Labs account. This site is managed by the BOSA team.
2. Create an Organization. Click New Organization on the Organizations page and enter the organization name. This gives you a top-level space to manage projects and members. Use a human-readable company/team/client name (e.g., glchat, catapa, client-XYZ).
3. Manage members. Invite teammates to the organization/project with the roles you need (viewer/editor/admin).
   Important: Set yourself (or one trusted person) as an admin so that someone can always invite and manage other project members in the organization.
4. Create a Project. Experiments, datasets, traces, and API keys are project-scoped. Enter a project name to create the project, typically your project/application name (e.g., glchat-beta).
5. Create API credentials. You can create an API key now or later under Project → Settings → API Keys, then generate the keys and copy:
   - Public key
   - Secret key
   - Langfuse host
6. Configure your environment. Most Langfuse clients (and our evaluate() integration) read the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables.
7. Run an evaluation with Langfuse tracking enabled. With these credentials in place, you can now use the Langfuse Experiment Tracker, as sketched below.
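A minimal sketch of steps 6 and 7, assuming the standard Langfuse client environment variables and that evaluate() accepts a tracker via an experiment_tracker keyword; the import paths, constructor parameters, and keyword name are assumptions:

```python
import os

from gllm_evals import evaluate  # import path assumed
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import (
    LangfuseExperimentTracker,  # module path assumed
)

# Step 6: the standard environment variables read by Langfuse clients
# (you can also export these in your shell or a .env file).
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://langfuse.obrol.id"

# Step 7: with credentials in the environment, construct the tracker and pass it
# to evaluate(); the constructor parameters and the keyword name are assumptions.
tracker = LangfuseExperimentTracker(project_name="my_project")
results = evaluate(..., experiment_tracker=tracker)  # other evaluate() arguments elided
```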
What happens automatically
- Auto-dataset creation (when needed). If you pass a dataset that does not already exist in Langfuse, we create it automatically: either the expected_response column is used as the ground-truth response, or columns are mapped according to the mapping dictionary you provide (see the mapping example in this subsection, or the illustrative sketch below). After a round of evaluation, you will find the dataset under Project → Datasets in the left sidebar.
- Experiment run logging. Your evaluation is logged, including runs, metrics/scores, and the underlying traces.
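Purely for illustration, a mapping dictionary might look like the sketch below; the actual keys and format expected by the tracker are documented in the mapping subsection referenced above:

```python
# Hypothetical mapping from Langfuse dataset item fields to columns in your dataset;
# consult the mapping subsection for the real format.
mapping = {
    "input": "question",
    "expected_output": "expected_response",
}
```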
Where to see results (in Langfuse)
- Datasets: the dataset created or linked by evaluate(). Path: Project → Datasets
- Dataset runs: executions over a dataset, with per-item outputs and evaluator scores. Path: Project → Datasets → select a dataset → Runs
- Traces / Observations: drill into individual calls/spans, inputs/outputs, and timings. Path: Project → Traces (and Observations)
- Sessions: grouped traces per experiment; you can review, share, and even score sessions. Path: Project → Sessions
Service window (our hosted Langfuse)
The hosted Langfuse instance shuts down automatically at 23:59 WIB. Contact the Evals or BOSA team for the Slack command to turn the Langfuse host on or off.
🔁 Refresh Langfuse Experiment Tracker
To refresh the scores in Langfuse after updating them, you can run the following function:
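As a hypothetical sketch only; the method name refresh_scores is an assumption, so consult the LangfuseExperimentTracker API in your installed version for the actual helper:

```python
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import (
    LangfuseExperimentTracker,  # module path assumed
)

tracker = LangfuseExperimentTracker(project_name="my_project")

# Hypothetical method name: re-syncs the updated scores with the Langfuse host.
tracker.refresh_scores()
```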
📁 Export Langfuse Experiment Results to CSV
You can export the Langfuse experiment results with all the updated scores to CSV using the following function:
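As a hypothetical sketch only; the method name export_to_csv is an assumption, so consult the LangfuseExperimentTracker API in your installed version for the actual helper:

```python
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import (
    LangfuseExperimentTracker,  # module path assumed
)

tracker = LangfuseExperimentTracker(project_name="my_project")

# Hypothetical method name: writes the experiment results, including updated scores, to CSV.
tracker.export_to_csv("./my_experiments/langfuse_results.csv")
```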