🔄End-to-End Evaluation
To run an end-to-end evaluation, use the evaluate convenience function. It offers a streamlined way to run AI evaluations with minimal setup, orchestrating the entire evaluation process, from data loading to result tracking, in a single function call.
Quick Start
Create a script called evaluate_example.py.
import asyncio
import os
from langfuse import get_client
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.evaluate import evaluate
from gllm_evals.utils.shared_functionality import inference_fn
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker
from gllm_evals import load_simple_qa_dataset
async def main():
"""Main function."""
results = await evaluate(
data=load_simple_qa_dataset(),
inference_fn=inference_fn,
evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
experiment_tracker=LangfuseExperimentTracker(langfuse_client=get_client()),
)
print(results)
if __name__ == "__main__":
asyncio.run(main())Run the script:
python evaluate_example.py

The script runs inference for each input and evaluates the responses, e.g.:
[
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': "The expected output contains the substantive statement 'Paris' as the answer to the question. The actual output, 'The capital of France is Paris,' includes the information 'Paris' as the capital, matching the substantive statement in the expected output. All key information is present, although with additional phrasing, which does not affect the score."
},
'groundedness': {
'score': 3,
'explanation': "The response directly answers the question, stating that Paris is the capital of France. This information is clearly and explicitly supported by the context, which says, 'Paris is the capital and largest city of France.' There are no unsupported or extraneous statements."
},
'redundancy': {
'score': 1,
'explanation': 'The response clearly states the capital of France in a single, concise sentence without any redundancy or repeated information. Each key point is presented just once and the message is direct.'
}
}
}],
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': "The generated output exactly matches the substantive statement in the expected output. The answer '4' is correct and all required information is present."
},
'groundedness': {
'score': 3,
'explanation': "The response accurately answers the question, and the answer '4' is explicitly supported by the context which states '2+2 equals 4.' There is complete alignment between the response, context, and question intent, with no extraneous or unsupported information."
},
'redundancy': {
'score': 1,
'explanation': "The response provides the answer '4' to the question with no repetition or unnecessary elaboration. Each idea is presented only once, and the answer is concise and to the point without restating or paraphrasing the key point."
}
}
}],
[{
'geval_generation_evals': {
'relevancy_rating': 'good',
'possible_issues': [],
'score': 1,
'completeness': {
'score': 3,
'explanation': 'The generated output matches the key substantive statement from the expected output, correctly identifying Jupiter as the largest planet in our solar system. Although the generated output is shorter, it fully captures the essential information required by the question.'
},
'groundedness': {
'score': 3,
'explanation': "The response correctly identifies Jupiter as the largest planet in our solar system, which is directly supported by the context stating, 'Jupiter is the fifth planet from the Sun and the largest in the Solar System.' There are no unsupported or extraneous statements."
},
'redundancy': {
'score': 1,
'explanation': "The response provides the correct answer, 'Jupiter,' with no restatement, repetition, or unnecessary elaboration. There is only one key idea, and it is presented directly and concisely."
}
}
}]
]

Congratulations! You have successfully run your first evaluation with the evaluate convenience function!
Function Signature
async def evaluate(
data: str | BaseDataset,
inference_fn: Callable,
evaluators: list[BaseEvaluator | BaseMetric],
experiment_tracker: BaseExperimentTracker | None = None,
batch_size: int = 10,
**kwargs: Any,
) -> list[list[EvaluationOutput]]

Parameters
data (str | BaseDataset): Dataset to be evaluated.
- Can be a BaseDataset object (see the Dataset section).
- Can also be a string (see the usage sketch after this parameter list):
  - hf/[dataset_name] -> load from the HuggingFace Hub.
  - gs/[worksheet_name] -> load from a Google Sheets spreadsheet.
  - langfuse/[dataset_name] -> load from a Langfuse dataset.
  - [dataset_name] (no prefix) -> load from a local path (*.csv, *.jsonl).
inference_fn (Callable): User-supplied callable (any custom function) that generates the responses to be evaluated. Any implementation works, as long as it meets the requirements below.
Requirements:
- Input parameter:
  - Your inference_fn must accept a parameter called row with a dictionary type. The parameter name must be exactly row.
  - The dictionary holds the keys needed for the evaluation (e.g. query).
  - See this code example for more details.
- Output (return):
  - inference_fn must return a dictionary containing the response or answer key needed for the evaluation.
  - The required evaluation keys are listed in the docstring of every evaluator class in gllm-evals. The example required keys can be seen here.
  - The evaluation keys must match those names exactly. For example, if an evaluator expects generated_response, return {"generated_response": "..."}.
  - You may include additional keys (e.g., retrieved_context) in the output, beyond the required evaluation keys, if your inference_fn also produces them. Any remaining required keys are taken from the given dataset.
  - See the inference_fn Examples section below for a complete sketch.
evaluators (list[BaseEvaluator | BaseMetric]): List of evaluators or metrics to apply. If no existing evaluator or metric fits your use case, you can provide a custom one, as long as it inherits from BaseEvaluator or BaseMetric.
experiment_tracker (BaseExperimentTracker | None, optional): Optional tracker for logging results. Defaults to SimpleExperimentTracker. For experiment tracker objects, see the Experiment Tracker section.
batch_size (int, optional): Number of samples to process in parallel. Defaults to 10.
**kwargs (Any): Additional configuration such as tags, metadata, or run_id.
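Putting these parameters together, here is a minimal sketch that loads a local CSV by path, reduces the batch size, and forwards extra configuration through **kwargs. The file name my_eval_set.csv and the tag and metadata values are hypothetical; everything else mirrors the Quick Start, and the experiment tracker is omitted so the default SimpleExperimentTracker is used.

```python
import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Evaluate a local dataset (hypothetical file name)."""
    results = await evaluate(
        data="my_eval_set.csv",  # no prefix -> loaded from a local path
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        batch_size=5,  # process 5 samples in parallel instead of the default 10
        # Extra configuration forwarded via **kwargs (values are illustrative):
        tags=["smoke-test"],
        metadata={"run_owner": "docs-example"},
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```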
Usage Example
Using data from Google Sheets
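The gs/ prefix points evaluate at a Google Sheets worksheet. The sketch below assumes a worksheet named my_eval_worksheet and that access to the spreadsheet has already been configured; the rest reuses the imports from the Quick Start.

```python
import asyncio
import os

from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    """Evaluate data loaded from a Google Sheets worksheet (name is hypothetical)."""
    results = await evaluate(
        data="gs/my_eval_worksheet",  # gs/ prefix -> load from a Google Sheets spreadsheet
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```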
Using Langfuse Experiment Tracker with custom mapping
The mapping tells the tracker which of your dataset's columns should be logged into Langfuse's canonical fields. This is useful when your dataset uses custom column names but you still want to import the dataset into Langfuse with a consistent structure.
The tracker expects three top-level buckets:
- input: the input fields used by the model (e.g., query, retrieved context).
- expected_output: the target you want to compare against (e.g., reference answer / label / ground truth).
- metadata: any extra attributes or information for each data row (e.g., topic, type).
Your mapping simply points each Langfuse field to the column name in your dataset.
Example Scenario
Suppose your dataset has the columns question_id, user_question, answer, expected_response, and topic.
You want to map them to Langfuse’s fields as follows:
- question_id → input.question_id
- user_question → input.query
- answer → input.generated_response
- expected_response → expected_output.expected_response
- topic → metadata.topic
Then, your mapping should be:
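A minimal sketch of such a mapping, assuming it is expressed as a nested dictionary keyed by the three Langfuse buckets, with each Langfuse field pointing to the corresponding dataset column:

```python
# Assumed structure: Langfuse bucket -> {Langfuse field: dataset column}.
mapping = {
    "input": {
        "question_id": "question_id",
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {
        "expected_response": "expected_response",
    },
    "metadata": {
        "topic": "topic",
    },
}
```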
The tracker will log them based on your mapping.
Here is a full example of how to load a dataset from Google Sheets, insert it into a Langfuse dataset, and use the Langfuse Experiment Tracker:
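The sketch below shows one way the pieces could fit together. Two details are assumptions rather than confirmed API: that LangfuseExperimentTracker accepts the mapping through a mapping keyword argument, and that the worksheet is named my_eval_worksheet. Check the LangfuseExperimentTracker docstring for the exact parameter name; the remaining imports mirror the Quick Start.

```python
import asyncio
import os

from langfuse import get_client
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.experiment_tracker.langfuse_experiment_tracker import LangfuseExperimentTracker
from gllm_evals.utils.shared_functionality import inference_fn

# Langfuse field -> dataset column (see the mapping example above).
mapping = {
    "input": {
        "question_id": "question_id",
        "query": "user_question",
        "generated_response": "answer",
    },
    "expected_output": {"expected_response": "expected_response"},
    "metadata": {"topic": "topic"},
}


async def main():
    """Evaluate a Google Sheets worksheet and track the run in Langfuse."""
    # NOTE: the `mapping` keyword argument is an assumption for illustration;
    # verify the parameter name in the LangfuseExperimentTracker docstring.
    tracker = LangfuseExperimentTracker(langfuse_client=get_client(), mapping=mapping)
    results = await evaluate(
        data="gs/my_eval_worksheet",  # hypothetical worksheet name
        inference_fn=inference_fn,
        evaluators=[GEvalGenerationEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))],
        experiment_tracker=tracker,
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```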
inference_fn Examples
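A minimal sketch of a custom inference_fn that satisfies the requirements above: it accepts a dictionary parameter named row, reads the query, and returns a dictionary whose keys match the evaluation keys (generated_response, plus an optional retrieved_context). It is shown as a plain synchronous function with placeholder generation logic; swap in your own model call or pipeline.

```python
from typing import Any


def my_inference_fn(row: dict[str, Any]) -> dict[str, Any]:
    """Generate a response for one dataset row.

    The parameter must be named `row`; the returned keys must match the
    evaluation keys expected by your evaluators (e.g. `generated_response`).
    """
    query = row["query"]

    # Placeholder generation logic; replace with your own model or pipeline.
    generated_response = f"This is a placeholder answer to: {query}"

    return {
        "generated_response": generated_response,
        # Optional extra key, if your pipeline also retrieves context:
        "retrieved_context": row.get("retrieved_context", ""),
    }
```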