🔄 End-to-End Evaluation

To run an end-to-end evaluation, we provide a convenience function called evaluate. It offers a streamlined way to run AI evaluations with minimal setup, orchestrating the entire evaluation process, from data loading to result tracking, in a single function call.

Quick Start

1. Create a script called evaluate_example.py.

import asyncio
import os

from langfuse import get_client

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.evaluate import evaluate
from gllm_evals.utils.shared_functionality import inference_fn
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker
from gllm_evals import load_simple_qa_dataset


async def main():
    """Main function."""
    results = await evaluate(
        data=load_simple_qa_dataset(),
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        experiment_tracker=LangfuseExperimentTracker(langfuse_client=get_client()),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())

2. Run the script:

python evaluate_example.py

3. The script prints one list of evaluation results per dataset row, e.g.:

[
    [{
        'generation': {
            'global_explanation': 'All metrics met the expected values.',
            'relevancy_rating': 'good',
            'possible_issues': [],
            'score': 1,
            'completeness': {
                'score': 3,
                'explanation': 'The response correctly identifies Paris as the capital of France. While the actual output is more verbose than the expected output, it fully and accurately answers the question without any contradictions or omissions.',
                'success': True
            },
            'groundedness': {
                'score': 3,
                'explanation': 'The response accurately answers the question by stating that Paris is the capital of France, which is a fact directly supported by the provided retrieval context.',
                'success': True
            },
            'redundancy': {
                'score': 1,
                'explanation': 'The response consists of a single, direct sentence answering the input question. There is no repetition of any information, making it concise and to the point.'
            },
            'language_consistency': {
                'score': 1,
                'explanation': 'The instructional language of the input is English, and the actual output is also in English, making them consistent.',
                'success': True
            },
            'refusal_alignment': {
                'score': 1,
                'explanation': "The expected output is not a refusal as it directly answers the user's factual question. The actual output also provides a direct, factual answer. Since both the expected and actual responses are not refusals, their refusal status aligns.",
                'success': True
            }
        }
    }],
    [{
        'generation': {
            'global_explanation': 'All metrics met the expected values.',
            'relevancy_rating': 'good',
            'possible_issues': [],
            'score': 1,
            'completeness': {
                'score': 3,
                'explanation': 'The response accurately provides the correct answer to the mathematical question. While the formatting differs by including conversational text, the core numerical fact matches the expected output perfectly.',
                'success': True
            },
            'groundedness': {
                'score': 3,
                'explanation': 'The response directly answers the question, and the provided answer is explicitly stated and fully supported by the retrieval context.',
                'success': True
            },
            'redundancy': {
                'score': 1,
                'explanation': 'The response is concise and directly answers the question by presenting the key information only once. There is no repetition of words, phrases, or ideas.'
            },
            'language_consistency': {
                'score': 1,
                'explanation': 'The instructional language of the input is English, and the actual output is also written in English, maintaining language consistency.',
                'success': True
            },
            'refusal_alignment': {
                'score': 1,
                'explanation': 'The expected output directly answers the simple arithmetic question, so it is not a refusal. The actual output also provides a direct answer. Since both the expected and actual responses are not refusals, their statuses align correctly.',
                'success': True
            }
        }
    }],
    [{
        'generation': {
            'global_explanation': 'The following metrics failed to meet expectations:\n1. Completeness is 1 (should be 3)\n2. Groundedness is 1 (should be 3)',
            'relevancy_rating': 'bad',
            'possible_issues': ['Retrieval Issue', 'Generation Issue'],
            'score': 0,
            'completeness': {
                'score': 1,
                'explanation': 'The response provides a factually incorrect answer. It identifies Mercury as the largest planet, which is a critical contradiction of the correct answer, Jupiter.'
            },
            'groundedness': {
                'score': 1,
                'explanation': 'The response identifies Mercury as the largest planet, which is a direct contradiction of the provided context that explicitly states Jupiter is the largest planet in the Solar System.'
            },
            'redundancy': {
                'score': 1,
                'explanation': 'The response consists of a single word and therefore contains no repetition of information. It is concise and presents its point only once.'
            },
            'language_consistency': {
                'score': 1,
                'explanation': "The user's question and the model's actual output are both in English, demonstrating language consistency.",
                'success': True
            },
            'refusal_alignment': {
                'score': 1,
                'explanation': 'The expected output directly answers the factual question, so it is not a refusal. The actual output also attempts to answer the question directly, although it provides an incorrect fact. This is considered a knowledge failure, not a refusal. Since both the expected and actual responses are not refusals, their refusal statuses align.',
                'success': True
            }
        }
    }]
]

Function Signature

async def evaluate(
    data: str | BaseDataset, 
    inference_fn: Callable, 
    evaluators: list[BaseEvaluator | BaseMetric], 
    experiment_tracker: BaseExperimentTracker | None = None,
    batch_size: int = 10,
    **kwargs: Any,
) -> list[list[EvaluationOutput]]

Parameters

  • data (str | BaseDataset): Dataset to be evaluated.

    • [RECOMMENDED] Can be a BaseDataset object (see Dataset section).

    • Can also be given as a string:

      • hf/[dataset_name] -> load from HuggingFace Hub.

      • gs/[worksheet_name] -> load from Google Sheets spreadsheet.

      • langfuse/[dataset_name] -> load from Langfuse dataset.

      • [dataset_name] (no prefix) -> load from a local path (*.csv, *.jsonl).

  • inference_fn (Callable): A user-supplied callable (any custom function) that generates the responses to be evaluated. Any implementation works as long as it accepts the parameters described in the requirements below.

    • Requirements:

      • Input parameter:

        • Your inference_fn must accept a parameter called row with a data type of dictionary.

          • The parameter name must be exactly row.

          • The dictionary will store the keys needed for the evaluation (e.g. query).

          • See this code example for more details.

        • Your inference_fn may accept a parameter called attachments with a data type of dictionary.

          • The parameter name must be exactly attachments.

          • This dictionary maps attachment names (keys) to file bytes (values), which the function can process if needed.

      • Output (return):

        • inference_fn must return a dictionary containing the response or answer key needed for the evaluation.

        • The required evaluation keys are listed in the docstring of each evaluator class in gllm-evals. The example required keys can be seen here.

        • The evaluation keys must match those names exactly. For example, if an evaluator expects generated_response, return {"generated_response": "..."}.

        • You may include additional required evaluation keys (e.g., retrieved_context) in the output if your inference_fn also produces them. Any remaining required keys will be obtained from the given dataset.

  • evaluators (list[BaseEvaluator | BaseMetric]): List of evaluators or metrics to apply. A custom evaluator or metric can also be provided if no existing one fits, as long as it inherits from BaseEvaluator or BaseMetric.

  • experiment_tracker (BaseExperimentTracker | None, optional): Optional tracker for logging results. Defaults to SimpleExperimentTracker. For the available tracker objects, see the Experiment Tracker section.

  • batch_size (int, optional): Number of samples to process in parallel. Defaults to 10.

  • **kwargs (Any): Additional configuration like tags, metadata, or run_id.
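
As a rough sketch of how the optional parameters fit together (reusing the Quick Start imports and async main pattern; the file name, tags, metadata, and run_id values below are purely illustrative, and how the extra keyword arguments are consumed depends on your experiment tracker):

results = await evaluate(
    data="my_dataset.jsonl",  # no prefix -> loaded from a local path
    inference_fn=inference_fn,
    evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
    batch_size=5,                  # evaluate 5 samples in parallel
    tags=["nightly"],              # passed through via **kwargs
    metadata={"env": "staging"},   # passed through via **kwargs
    run_id="example-run-001",      # passed through via **kwargs
)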


Usage Example

Using data from Google Sheets
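
A minimal sketch reusing the Quick Start imports, where my_worksheet is a placeholder for your actual worksheet name:

results = await evaluate(
    data="gs/my_worksheet",  # gs/ prefix -> load the worksheet from your Google Sheets spreadsheet
    inference_fn=inference_fn,
    evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
)
print(results)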

Using Langfuse Experiment Tracker with custom mapping

The mapping tells the tracker which of your dataset's columns should be logged into Langfuse’s canonical fields. This is useful when your dataset uses custom column names but you still want to import the dataset into Langfuse with a consistent structure.

The tracker expects three top-level buckets:

  • input: the input fields that are useful for the model (e.g., query, retrieved context).

  • expected_output: the target you want to compare against (e.g., reference answer/label/ground truth).

  • metadata: any extra attributes or information for each data row (e.g., topic, type).

Your mapping simply points each Langfuse field to the column name in your dataset.

Example Scenario

Your dataset has the columns question_id, user_question, answer, expected_response, and topic. You want to map them to Langfuse’s fields as follows:

  • question_id → input.question_id

  • user_question → input.query

  • answer → input.generated_response

  • expected_response → expected_output.expected_response

  • topic → metadata.topic

Then, your mapping should be:
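
A minimal sketch of what that mapping could look like, assuming the tracker accepts a plain nested dictionary keyed by the three buckets above (the exact shape may differ in your version of gllm-evals):

mapping = {
    "input": {
        "question_id": "question_id",   # Langfuse field -> dataset column
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {
        "expected_response": "expected_response",
    },
    "metadata": {
        "topic": "topic",
    },
}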

The tracker will log them based on your mapping.


Here is a full example of how to insert a dataset from Google Sheets into a Langfuse dataset and use the Langfuse Experiment Tracker:
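
The sketch below shows one way the pieces could fit together; gs/my_worksheet is a placeholder worksheet name, and passing mapping as a constructor argument to LangfuseExperimentTracker is an assumption, so check the class docstring for the exact parameter:

import asyncio
import os

from langfuse import get_client

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.evaluate import evaluate
from gllm_evals.utils.shared_functionality import inference_fn
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker


async def main():
    """Evaluate a Google Sheets dataset and track the results in Langfuse."""
    # NOTE: the `mapping` keyword and its exact shape are assumptions for illustration.
    tracker = LangfuseExperimentTracker(
        langfuse_client=get_client(),
        mapping={
            "input": {
                "question_id": "question_id",
                "query": "user_question",
                "generated_response": "answer",
            },
            "expected_output": {"expected_response": "expected_response"},
            "metadata": {"topic": "topic"},
        },
    )
    results = await evaluate(
        data="gs/my_worksheet",  # placeholder worksheet name
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        experiment_tracker=tracker,
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())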

inference_fn Examples
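
A minimal sketch of an inference_fn that satisfies the requirements above is shown below. The model call is a placeholder and the generated_response key is an assumption; return whichever keys your evaluators actually require.

def my_inference_fn(row: dict, attachments: dict | None = None) -> dict:
    """Generate a response to be evaluated for a single dataset row.

    Args:
        row: Dictionary holding the keys needed for the evaluation (e.g. "query").
        attachments: Optional dictionary mapping attachment names to file bytes.

    Returns:
        Dictionary containing the evaluation key(s) expected by the evaluators.
    """
    query = row["query"]
    # Call your own model or pipeline here; this placeholder simply echoes the query.
    generated_response = f"You asked: {query}"
    return {"generated_response": generated_response}

If your pipeline is asynchronous, an async def variant may also work; check the evaluate docstring for the supported callable types.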
