🔄End-to-End Evaluation
To run an end-to-end evaluation, use the evaluate convenience function. It offers a streamlined way to run AI evaluations with minimal setup, orchestrating the entire evaluation process, from data loading to result tracking, in a single function call.
Quick Start
Create a script called evaluate_example.py.
import asyncio
import os
from langfuse import get_client
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.evaluate import evaluate
from gllm_evals.utils.shared_functionality import inference_fn
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker
from gllm_evals import load_simple_qa_dataset
async def main():
"""Main function."""
results = await evaluate(
data=load_simple_qa_dataset(),
inference_fn=inference_fn,
evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
experiment_tracker=LangfuseExperimentTracker(langfuse_client=get_client()),
)
print(results)
if __name__ == "__main__":
asyncio.run(main())Run the script:
python evaluate_example.py

The script runs inference for each input and evaluates the responses, e.g.:
[
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': "The expected output contains the substantive statement 'Paris' as the answer to the question. The actual output, 'The capital of France is Paris,' includes the information 'Paris' as the capital, matching the substantive statement in the expected output. All key information is present, although with additional phrasing, which does not affect the score."
},
'groundedness': {
'score': 3,
'explanation': "The response directly answers the question, stating that Paris is the capital of France. This information is clearly and explicitly supported by the context, which says, 'Paris is the capital and largest city of France.' There are no unsupported or extraneous statements."
},
'redundancy': {
'score': 1,
'explanation': 'The response clearly states the capital of France in a single, concise sentence without any redundancy or repeated information. Each key point is presented just once and the message is direct.'
}
}
}],
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': "The generated output exactly matches the substantive statement in the expected output. The answer '4' is correct and all required information is present."
},
'groundedness': {
'score': 3,
'explanation': "The response accurately answers the question, and the answer '4' is explicitly supported by the context which states '2+2 equals 4.' There is complete alignment between the response, context, and question intent, with no extraneous or unsupported information."
},
'redundancy': {
'score': 1,
'explanation': "The response provides the answer '4' to the question with no repetition or unnecessary elaboration. Each idea is presented only once, and the answer is concise and to the point without restating or paraphrasing the key point."
}
}
}],
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': 'The generated output matches the key substantive statement from the expected output, correctly identifying Jupiter as the largest planet in our solar system. Although the generated output is shorter, it fully captures the essential information required by the question.'
},
'groundedness': {
'score': 3,
'explanation': "The response correctly identifies Jupiter as the largest planet in our solar system, which is directly supported by the context stating, 'Jupiter is the fifth planet from the Sun and the largest in the Solar System.' There are no unsupported or extraneous statements."
},
'redundancy': {
'score': 1,
'explanation': "The response provides the correct answer, 'Jupiter,' with no restatement, repetition, or unnecessary elaboration. There is only one key idea, and it is presented directly and concisely."
}
}
}]
]

Congratulations! You have successfully run your first evaluation with the evaluate convenience function!
Function Signature
async def evaluate(
data: str | BaseDataset,
inference_fn: Callable,
evaluators: list[BaseEvaluator | BaseMetric],
experiment_tracker: BaseExperimentTracker | None = None,
batch_size: int = 10,
**kwargs: Any,
) -> list[list[EvaluationOutput]]

Parameters
data (str | BaseDataset): Dataset to be evaluated.
- Can be a BaseDataset object (see the Dataset section).
- Can also be a string (see the usage sketch after this parameter list):
  - hf/[dataset_name] -> load from the HuggingFace Hub.
  - gs/[worksheet_name] -> load from a Google Sheets spreadsheet.
  - langfuse/[dataset_name] -> load from a Langfuse dataset.
  - [dataset_name] (no prefix) -> load from a local path (*.csv, *.jsonl).
inference_fn (Callable): User-supplied callable (any custom function) that generates the responses to be evaluated. Any implementation works, as long as it meets the requirements below.
Requirements:
- Input parameter:
  - Your inference_fn must accept a parameter called row with a dictionary type. The parameter name must be exactly row.
  - The dictionary holds the keys needed for the evaluation (e.g. query).
  - See this code example for more details.
- Output (return):
  - inference_fn must return a dictionary containing the response or answer key needed for the evaluation.
  - The required evaluation keys are listed in the docstring of every evaluator class in gllm-evals. The example required keys can be seen here.
  - The evaluation keys must match those names exactly. For example, if an evaluator expects generated_response, return {"generated_response": "..."}.
  - You may include additional keys (e.g., retrieved_context) in the output, beyond the required evaluation keys, if your inference_fn also produces them. Any remaining required keys are taken from the given dataset.
  - See the inference_fn Examples section below for a complete sketch.
evaluators (list[BaseEvaluator | BaseMetric]): List of evaluators or metrics to apply. If no existing evaluator or metric fits your use case, you can provide a custom one, as long as it inherits from BaseEvaluator or BaseMetric.
experiment_tracker (BaseExperimentTracker | None, optional): Optional tracker for logging results. Defaults to SimpleExperimentTracker. For experiment tracker objects, see the Experiment Tracker section.
batch_size (int, optional): Number of samples to process in parallel. Defaults to 10.
**kwargs (Any): Additional configuration such as tags, metadata, or run_id.
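Putting these parameters together, here is a minimal sketch that loads a local CSV by path, reduces the batch size, and forwards extra configuration through **kwargs. The file name my_eval_set.csv and the tag and metadata values are hypothetical; everything else mirrors the Quick Start, and the experiment tracker is omitted so the default SimpleExperimentTracker is used.

```python
import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Evaluate a local dataset (hypothetical file name)."""
    results = await evaluate(
        data="my_eval_set.csv",  # no prefix -> loaded from a local path
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        batch_size=5,  # process 5 samples in parallel instead of the default 10
        # Extra configuration forwarded via **kwargs (values are illustrative):
        tags=["smoke-test"],
        metadata={"run_owner": "docs-example"},
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```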
Usage Example
Using data from Google Sheets
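The gs/ prefix points evaluate at a Google Sheets worksheet. The sketch below assumes a worksheet named my_eval_worksheet and that access to the spreadsheet has already been configured; the rest reuses the imports from the Quick Start.

```python
import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Evaluate data loaded from a Google Sheets worksheet (name is hypothetical)."""
    results = await evaluate(
        data="gs/my_eval_worksheet",  # gs/ prefix -> load from a Google Sheets spreadsheet
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```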
Using Langfuse Experiment Tracker with custom mapping
The mapping tells the tracker which of your dataset's columns should be logged into Langfuse's canonical fields. This is useful when your dataset uses custom column names but you still want to import the dataset into Langfuse with a consistent structure.
The tracker expects three top-level buckets:
- input: the input fields used by the model (e.g., query, retrieved context).
- expected_output: the target you want to compare against (e.g., reference answer / label / ground truth).
- metadata: any extra attributes or information for each data row (e.g., topic, type).
Your mapping simply points each Langfuse field to the column name in your dataset.
Example Scenario
Suppose your dataset has the columns question_id, user_question, answer, expected_response, and topic.
You want to map them to Langfuse’s fields as follows:
- question_id → input.question_id
- user_question → input.query
- answer → input.generated_response
- expected_response → expected_output.expected_response
- topic → metadata.topic
Then, your mapping should be:
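A minimal sketch of such a mapping, assuming it is expressed as a nested dictionary keyed by the three Langfuse buckets, with each Langfuse field pointing to the corresponding dataset column:

```python
# Assumed structure: Langfuse bucket -> {Langfuse field: dataset column}.
mapping = {
    "input": {
        "question_id": "question_id",
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {
        "expected_response": "expected_response",
    },
    "metadata": {
        "topic": "topic",
    },
}
```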
The tracker will log them based on your mapping.
Here is a full example of how to load a dataset from Google Sheets, insert it into a Langfuse dataset, and use the Langfuse Experiment Tracker:
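The sketch below shows one way the pieces could fit together. Two details are assumptions rather than confirmed API: that LangfuseExperimentTracker accepts the mapping through a mapping keyword argument, and that the worksheet is named my_eval_worksheet. Check the LangfuseExperimentTracker docstring for the exact parameter name; the remaining imports mirror the Quick Start.

```python
import asyncio
import os

from langfuse import get_client
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker
from gllm_evals.utils.shared_functionality import inference_fn

# Langfuse field -> dataset column (see the mapping example above).
mapping = {
    "input": {
        "question_id": "question_id",
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {"expected_response": "expected_response"},
    "metadata": {"topic": "topic"},
}


async def main():
    """Evaluate a Google Sheets worksheet and track the run in Langfuse."""
    # NOTE: the `mapping` keyword argument is an assumption for illustration;
    # verify the parameter name in the LangfuseExperimentTracker docstring.
    tracker = LangfuseExperimentTracker(langfuse_client=get_client(), mapping=mapping)
    results = await evaluate(
        data="gs/my_eval_worksheet",  # hypothetical worksheet name
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        experiment_tracker=tracker,
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```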
inference_fn Examples
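A minimal sketch of a custom inference_fn that satisfies the requirements above: it accepts a dictionary parameter named row, reads the query, and returns a dictionary whose keys match the evaluation keys (generated_response, plus an optional retrieved_context). It is shown as a plain synchronous function with placeholder generation logic; swap in your own model call or pipeline.

```python
from typing import Any


def my_inference_fn(row: dict[str, Any]) -> dict[str, Any]:
    """Generate a response for one dataset row.

    The parameter must be named `row`; the returned keys must match the
    evaluation keys expected by your evaluators (e.g. `generated_response`).
    """
    query = row["query"]

    # Placeholder generation logic; replace with your own model or pipeline.
    generated_response = f"This is a placeholder answer to: {query}"

    return {
        "generated_response": generated_response,
        # Optional extra key, if your pipeline also retrieves context:
        "retrieved_context": row.get("retrieved_context", ""),
    }
```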