🔄 Evaluate Helper Function

We provide a convenience helper function called evaluate that offers a streamlined way to run AI evaluations with minimal setup. It orchestrates the entire evaluation process, from data loading to result tracking, in a single function call. This helper only supports structured evaluation rules, where every record receives the same evaluation treatment.

Quick Start

1. Create a script called evaluate_example.py:

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Main function."""
    results = await evaluate(
        data=load_simple_qa_dataset(),
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())

2. Run the script:

python evaluate_example.py

3. The evaluate function produces a run summary similar to the following:

{
  "experiment_urls": {
    "run_url": "/path/to/experiments/experiment_results.csv",
    "leaderboard_url": "/path/to/experiments/leaderboard.csv",
  },
  "run_id": "default_simple_qa_data_55d8ad1d",
  "dataset_name": "simple_qa_data",
  "timestamp": "2026-01-31T10:34:05.930843",
  "num_samples": 4,
  "metadata": {
    "batch_size": 10,
    "evaluator_parameters": {
      "evaluator_0": {
        "name": "generation",
        "batch_status_check_interval": 30.0,
        "batch_max_iterations": 120,
        "run_parallel": True,
        "judge": None,
        "good_thresholds": {
          "completeness": (">=", 3),
          "redundancy": ("<=", 1),
          "groundedness": (">=", 3),
          "language_consistency": (">=", 1),
          "refusal_alignment": (">=", 1),
        },
        "bad_thresholds": {
          "completeness": ("<=", 1),
          "redundancy": (">=", 3),
          "groundedness": ("<=", 1),
          "language_consistency": ("<=", 0),
          "refusal_alignment": ("<=", 0),
        },
        "metric_0": {
          "evaluation_steps": [
            "Step 1. Understand the Question...",
            "Step 2. Identify Substantive Statements...",
            "Step 3. Normalize and Compare Meaning...",
            "Step 4. Detect Matches and Contradictions...",
            "Step 5. Apply Pragmatic Rules...",
            "- Critical Numeric Impact...",
            "Step 6. Output Requirements...",
          ],
          "batch_status_check_interval": 30.0,
          "batch_max_iterations": 120,
          "name": "completeness",
          "_evaluation_lock": None,
        },
        "metric_1": {
          "evaluation_steps": [...],
          "batch_status_check_interval": 30.0,
          "batch_max_iterations": 120,
          "name": "groundedness",
          "_evaluation_lock": None,
        },
        "metric_2": {
          "evaluation_steps": [...],
          "batch_status_check_interval": 30.0,
          "batch_max_iterations": 120,
          "name": "redundancy",
          "_evaluation_lock": None,
        },
        "metric_3": {
          "evaluation_steps": [...],
          "batch_status_check_interval": 30.0,
          "batch_max_iterations": 120,
          "name": "language_consistency",
          "_evaluation_lock": None,
        },
        "metric_4": {
          "evaluation_steps": [...],
          "batch_status_check_interval": 30.0,
          "batch_max_iterations": 120,
          "name": "refusal_alignment",
          "_evaluation_lock": None,
        },
      }
    },
    "dataset_name": "simple_qa_data",
  },
  "summary_result": {},
}

4. The results of the run are written to the path in experiment_urls/run_url, and a leaderboard covering each run you have done is written to experiment_urls/leaderboard_url.
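
If you want to inspect those files programmatically, here is a minimal sketch; it reuses the placeholder paths from the run summary above and assumes pandas is installed (both are illustrative assumptions, not part of gllm-evals):

import pandas as pd

# Paths taken from the run summary's experiment_urls field (placeholders here).
run_results = pd.read_csv("/path/to/experiments/experiment_results.csv")
leaderboard = pd.read_csv("/path/to/experiments/leaderboard.csv")

# Inspect the per-run results and the leaderboard entries.
print(run_results.head())
print(leaderboard.head())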


Function Signature

async def evaluate(
    data: str | BaseDataset, 
    inference_fn: Callable, 
    evaluators: list[BaseEvaluator | BaseMetric], 
    experiment_tracker: BaseExperimentTracker | None = None,
    batch_size: int = 10,
    allow_batch_evaluation: bool = False,
    summary_evaluators: list[SummaryEvaluatorCallable] | None = None,
    **kwargs: Any,
) -> list[list[EvaluationOutput]]

Parameters

  • data (str | BaseDataset): Dataset to be evaluated.

    • [RECOMMENDED] Can be a BaseDataset object (see Dataset section).

    • Can also be filled with a string:

      • hf/[dataset_name] -> load from HuggingFace Hub.

      • gs/[worksheet_name] -> load from Google Sheets spreadsheet.

      • langfuse/[dataset_name] -> load from Langfuse dataset.

      • [dataset_name] (no prefix) -> load from local path (*.csv, *.jsonl)

  • inference_fn (Callable): User-supplied callable (any custom function) that generates the responses to be evaluated. Any implementation is acceptable as long as it meets the requirements below.

    • Requirements:

      • Input parameter:

        • Your inference_fn must accept a parameter called row with a data type of dictionary.

          • The parameter name must be exactly row.

          • The dictionary will store the keys needed for the evaluation (e.g. query).

          • See the inference_fn Examples section below for a code example.

        • Your inference_fn may accept a parameter called attachments with a data type of dictionary.

          • The parameter name must be exactly attachments.

          • This dictionary maps attachment names (keys) to file bytes (values), which the function can process if needed.

      • Output (return):

        • inference_fn must return a dictionary containing the response or answer key needed for the evaluation.

        • The required evaluation keys are documented in the docstring of each evaluator class in gllm-evals. Example required keys can be seen here.

        • The evaluation keys must match those names exactly. For example, if an evaluator expects generated_response, return {"generated_response": "..."}.

        • You may include additional keys (e.g., retrieved_context) in the output if your inference_fn also produces them. Any required evaluation keys not returned by inference_fn are taken from the given dataset.

  • evaluators (list[BaseEvaluator | BaseMetric]): List of evaluators or metrics to apply. A custom evaluator or metric can also be provided when no existing one matches your needs, as long as it inherits from BaseEvaluator or BaseMetric.

  • experiment_tracker (BaseExperimentTracker | None, optional): Optional tracker for logging results. Defaults to SimpleExperimentTracker. For experiment tracker objects, see the Experiment Tracker section.

  • batch_size (int, optional): Number of samples to process in parallel. Defaults to 10.

  • allow_batch_evaluation (bool, optional): Enables batch processing mode for LLM API calls. When enabled, the runner passes entire batches to evaluators for optimized batch processing instead of processing items individually. Defaults to False (see the sketch after this parameter list).

  • summary_evaluators (list[SummaryEvaluatorCallable] | None, optional): List of user-supplied callable functions that compute aggregate metrics across all evaluation results. They are called after each batch with cumulative data to produce batch-level statistics. The expected input parameters and output of a summary evaluator are listed below; see also the Summary Evaluator Example at the end of this page.

    • Input Parameters:

      • evaluation_results (list[EvaluationOutput]): List of evaluation outputs from all processed batches (cumulative)

      • data (list[MetricInput]): List of input data rows from all processed batches (cumulative)

    • Output (return):

      • Should return a dictionary; its key-value pairs are appended to summary_result in the leaderboard and the run result.

  • **kwargs (Any): Additional configuration such as tags, metadata, or run_id (see the sketch below).
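
For illustration, here is a sketch that builds on the Quick Start script and sets the optional parameters explicitly; the run_id value is an arbitrary example of a **kwargs entry:

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Run evaluate with the optional parameters set explicitly."""
    results = await evaluate(
        data=load_simple_qa_dataset(),
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        batch_size=5,                  # process 5 samples in parallel
        allow_batch_evaluation=True,   # pass whole batches to the evaluators
        run_id="my_custom_run",        # extra configuration forwarded via **kwargs
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())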


Usage Example

Using data from Google Sheets
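
A minimal sketch of this usage, assuming a worksheet named my_eval_sheet (a placeholder) and that your Google Sheets credentials are already configured:

import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Evaluate a dataset loaded from a Google Sheets worksheet."""
    results = await evaluate(
        data="gs/my_eval_sheet",  # gs/[worksheet_name] loads from Google Sheets
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())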

Using Langfuse Experiment Tracker with custom mapping

The mapping tells the tracker which of your dataset's columns should be logged into Langfuse's canonical fields. This is useful when your dataset uses custom column names but you still want to import the dataset into Langfuse with a consistent structure.

The tracker expects three top-level buckets:

  • input: the input fields that are useful for the model (e.g., query, retrieved context).

  • expected_output: the target you want to compare against (e.g., reference answer/label/ground truth).

  • metadata: any extra attributes or information for each data row (e.g., topic, type).

Your mapping simply points each Langfuse field to the column name in your dataset.

Example Scenario

Your dataset has the columns question_id, user_question, answer, expected_response, and topic. You want to map them to Langfuse's fields as follows:

  • question_id → input.question_id

  • user_question → input.query

  • answer → input.generated_response

  • expected_response → expected_output.expected_response

  • topic → metadata.topic

Then, your mapping should be:
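
(Sketched here as a nested Python dictionary keyed by the three buckets above, with each Langfuse field pointing to the column name in your dataset; the exact structure the tracker accepts is an assumption, so check the Experiment Tracker section.)

mapping = {
    "input": {
        "question_id": "question_id",        # Langfuse field -> dataset column
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {
        "expected_response": "expected_response",
    },
    "metadata": {
        "topic": "topic",
    },
}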

The tracker will log them based on your mapping.


Here is a full example of how to insert a dataset from Google Sheets into a Langfuse dataset and use the Langfuse Experiment Tracker.
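
The sketch below shows how these pieces could fit together. The class name LangfuseExperimentTracker, its import path, and the mapping constructor argument are assumptions for illustration only; consult the Experiment Tracker section for the actual names.

import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn

# Hypothetical import path and class name; check the Experiment Tracker section for the real ones.
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker

# Langfuse field -> dataset column name, grouped into the three buckets described above.
mapping = {
    "input": {"question_id": "question_id", "query": "user_question", "generated_response": "answer"},
    "expected_output": {"expected_response": "expected_response"},
    "metadata": {"topic": "topic"},
}


async def main():
    """Load data from a Google Sheets worksheet and track the run in Langfuse."""
    tracker = LangfuseExperimentTracker(mapping=mapping)  # assumed constructor signature
    results = await evaluate(
        data="gs/my_eval_sheet",  # gs/[worksheet_name] loads from Google Sheets
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        experiment_tracker=tracker,
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())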

inference_fn Examples
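
A minimal sketch of an inference_fn that satisfies the requirements above; the model call is a placeholder, and generated_response is used as the example required evaluation key:

def inference_fn(row: dict, attachments: dict | None = None) -> dict:
    """Generate a response to be evaluated for a single dataset row.

    Args:
        row: Dictionary holding the keys needed for the evaluation (e.g. "query").
        attachments: Optional dictionary mapping attachment names to file bytes.

    Returns:
        Dictionary containing the evaluation key(s) expected by the evaluators,
        for example "generated_response".
    """
    query = row["query"]

    # Replace this placeholder with your own model or pipeline call.
    answer = f"Echo: {query}"

    return {"generated_response": answer}

If your pipeline also produces context, you can return additional keys such as retrieved_context alongside generated_response.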

Summary Evaluator Example
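
A minimal sketch of a summary evaluator that follows the contract described in the Parameters section; how scores are stored on each EvaluationOutput is an assumption here (a numeric score attribute), so adapt the access to the output your evaluators actually produce:

def pass_rate_summary(evaluation_results, data):
    """Compute simple aggregate statistics across all processed batches.

    Args:
        evaluation_results: Cumulative list of EvaluationOutput objects.
        data: Cumulative list of MetricInput rows.

    Returns:
        Dictionary whose key-value pairs are appended to summary_result
        in the leaderboard and run result.
    """
    total = len(evaluation_results)

    # Assumed: each EvaluationOutput exposes a numeric "score"; adjust as needed.
    scores = [getattr(result, "score", None) for result in evaluation_results]
    passed = sum(1 for score in scores if isinstance(score, (int, float)) and score >= 0.5)

    return {
        "num_evaluated": total,
        "pass_rate": passed / total if total else 0.0,
    }

It can then be passed to evaluate via summary_evaluators=[pass_rate_summary].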
