📈 Running an Evaluation
Introduction
This tutorial walks you step by step through installing the GenAI Evaluator SDK and running your first evaluation.
Installation
Option 1: Install with pip
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-evals"
Option 2: Install with poetry
Step 1: Configure authentication
poetry config http-basic.gen-ai-internal oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai-internal-publication oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai-publication oauth2accesstoken "$(gcloud auth print-access-token)"
Step 2: Add the package to your project
poetry add gllm-evals
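Whichever option you choose, you can confirm that the package is importable in your environment. The following is a minimal sketch using only the Python standard library; it assumes the distribution is installed under the name gllm-evals.

# Quick install check (assumes the installed distribution is named "gllm-evals").
from importlib.metadata import version

import gllm_evals  # noqa: F401  # fails loudly if the package is not importable

print("gllm-evals version:", version("gllm-evals"))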
Environment Setup
Set a valid language model credential as an environment variable.
In this example, we use an OpenAI API key, which you can create in the OpenAI Console.
export OPENAI_API_KEY="sk-..."
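If you want to confirm that the credential is visible to Python before running an evaluation, a standard-library check like the sketch below fails fast when OPENAI_API_KEY is missing or empty:

import os

# Fail fast if the credential was not exported in this shell.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the evaluation.")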
Running Your First Evaluation
In this tutorial, we will evaluate the output of a RAG (retrieval-augmented generation) pipeline.
Step 1: Create a script called eval.py
import asyncio
import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator
from gllm_evals.types import RAGData


async def main():
    # Create a generation evaluator backed by an OpenAI model.
    evaluator = GenerationEvaluator(
        model_id="openai/gpt-4.1",
        model_credentials=os.getenv("OPENAI_API_KEY")
    )

    # A single RAG example: the generated response deliberately contradicts
    # both the expected response and the retrieved context.
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )

    # Run the evaluation and print the result.
    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
Step 2: Run the script
python eval.py
Step 3: Review the result
The evaluator produces an evaluation result for the given input, for example:
{
  "generation": {
    "relevancy_rating": "bad",
    "possible_issues": ["Retrieval Issue", "Generation Issue"],
    "score": 0,
    "completeness": {
      "question": "What is the capital of France?",
      "expected_output_statements": ["Paris"],
      "generated_output_statements": ["New York"],
      "count": "0 of 1 substantive statements are matched",
      "score": 1,
      "explanation": "The expected substantive statement is 'Paris', correctly naming the capital of France. The generated output is 'New York', which does not match the expected answer nor provide any correct information in relation to the question. Thus, none of the substantive statements are matched, resulting in a score of 1."
    },
    "redundancy": {
      "generated_response": ["New York"],
      "analysis": ["There is only a single statement, 'New York', with no repetition of concepts, phrases, or rephrased content.", "The answer is incorrect, but only redundancy is being considered for this evaluation.", "No elaboration, restatement, or repeated ideas are present."],
      "score": 1,
      "explanation": "The generated_response consists of a single statement with no repetition or restatement of information. While it is factually incorrect, there is no redundancy according to the evaluation criteria."
    },
    "groundedness": {
      "expected_response": ["Paris"],
      "generated_response": ["New York"],
      "score": 3,
      "explanation": "The generated response 'New York' is not supported by the context, which clearly states that Paris is the capital of France. This is a critical factual mistake and constitutes a hallucination, so the score is 3."
    }
  }
}
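If you want to act on the result programmatically rather than just print it, you can read individual metrics from it. The sketch below could be appended inside main() in eval.py after the evaluate call; it assumes the result is a dict (or dict-like mapping) with the structure shown above. If your version returns a typed object instead, use attribute access.

    # Hypothetical post-processing, assuming `result` mirrors the printed structure.
    generation = result["generation"]
    print("Overall score:", generation["score"])
    print("Relevancy rating:", generation["relevancy_rating"])
    print("Possible issues:", ", ".join(generation["possible_issues"]))
    print("Groundedness explanation:", generation["groundedness"]["explanation"])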
Congratulations! You have successfully run your first evaluation.
Next Steps
You're now ready to start using our evaluators. We offer several prebuilt evaluators to get you started.
Looking for something else? Build your own custom evaluator here.
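Before building a custom evaluator, you can also get more out of the prebuilt one by scoring several examples in one run. The sketch below reuses only the GenerationEvaluator and RAGData classes from the tutorial script; the dataset is made up for illustration, and it assumes independent evaluate calls can safely run concurrently.

import asyncio
import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator
from gllm_evals.types import RAGData


async def evaluate_batch():
    # Reuse one evaluator instance for every example.
    evaluator = GenerationEvaluator(
        model_id="openai/gpt-4.1",
        model_credentials=os.getenv("OPENAI_API_KEY")
    )

    # A small illustrative dataset; replace with your own examples.
    dataset = [
        RAGData(
            query="What is the capital of France?",
            expected_response="Paris",
            generated_response="Paris",
            retrieved_context="Paris is the capital of France.",
        ),
        RAGData(
            query="What is the capital of Japan?",
            expected_response="Tokyo",
            generated_response="Kyoto",
            retrieved_context="Tokyo is the capital of Japan.",
        ),
    ]

    # Evaluate all examples concurrently and print each result.
    results = await asyncio.gather(*(evaluator.evaluate(item) for item in dataset))
    for i, result in enumerate(results, start=1):
        print(f"Example {i}: {result}")


if __name__ == "__main__":
    asyncio.run(evaluate_batch())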