gcloud CLI - required because gllm-evals is a private library hosted in a private Google Cloud repository
After installing, please run

gcloud auth login

to authorize gcloud to access the Cloud Platform with your Google user credentials.
Our internal gllm-evals package is hosted in a secure Google Cloud Artifact Registry.
You need to authenticate via gcloud CLI to access and download the package during installation.
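A typical setup uses pip's keyring integration so your gcloud credentials are picked up automatically. The commands below are a sketch of that usual pattern, not a verified install recipe for gllm-evals; REGION, PROJECT, and REPO are placeholders for your organization's actual Artifact Registry values.

```shell
# Install the keyring backend that lets pip authenticate against
# Artifact Registry using your active gcloud credentials.
pip install keyrings.google-artifact-registry-auth

# Point pip at the private index when installing the package.
# REGION, PROJECT, and REPO are placeholders -- substitute your
# team's actual repository coordinates.
pip install gllm-evals \
    --extra-index-url https://REGION-python.pkg.dev/PROJECT/REPO/simple/
```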
In this tutorial, we will evaluate RAG pipeline output.
1. Create a script called eval.py. By default, GEvalGenerationEvaluator uses Gemini 3 Pro from Google as its model. If you want to use your own model, pass it via the model parameter and provide the corresponding credentials.
2. Run the script.
3. The evaluator will produce an evaluation result for the given input, e.g.:
Congratulations! You have successfully run your first evaluation.
Recommendation
If you want to run an end-to-end evaluation, use the evaluate() convenience function instead of the step-by-step commands above.
It will automatically handle experiment tracking (via the Experiment Tracker) and integrate results into your existing Dataset, so you don't have to wire these pieces together manually.
Next Steps
You're now ready to start using our evaluators. We offer several prebuilt evaluators to get you started:
import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData


async def main():
    # Uses the default model; only the credentials need to be supplied.
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )
    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
python eval.py
{
"generation": {
"global_explanation": "The following metrics failed to meet expectations:\n1. Completeness is 1 (should be 3)\n2. Groundedness is 1 (should be 3)",
"relevancy_rating": "bad",
"score": 0.0,
"possible_issues": ["Retrieval Issue", "Generation Issue"],
"binary_score": 0,
"avg_score": 0.6,
"completeness": {
"score": 1,
"explanation": "The output provides a critical factual contradiction by stating that New York is the capital of France, whereas the expected answer is Paris.",
"normalized_score": 0.0,
},
"groundedness": {
"score": 1,
"explanation": "The output 'New York' is a direct contradiction of the retrieval context, which explicitly states that 'Paris is the capital of France.' Because the information provided is factually incorrect and not grounded in the context, it receives the lowest score.",
"normalized_score": 0.0,
},
"redundancy": {
"score": 1,
"explanation": "The response provides a single, direct answer without any repetition of words, phrases, or ideas. It is concise and contains no redundant statements or restatements.",
"normalized_score": 1.0,
},
"language_consistency": {
"score": 1,
"explanation": "The instructional language of the input is English, and the actual output is also written in English, maintaining language consistency regardless of the factual accuracy of the answer.",
"success": True,
"normalized_score": 1.0,
},
"refusal_alignment": {
"score": 1,
"explanation": "is_refusal was detected from the expected response, which directly provides the answer 'Paris' without refusal indicators. The actual output 'New York' is factually incorrect, but it also contains no refusal indicators and attempts to answer the core request. Since both the expected and actual responses are not refusals, they align correctly.",
"success": True,
"normalized_score": 1.0,
},
}
}
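Assuming result is a plain dict shaped like the output above (field names are taken from the sample; the aggregation rule is inferred from this one example rather than from library documentation), you can pull out the failing metrics and check the average yourself:

```python
# Sample result, abbreviated to the fields used below.
result = {
    "generation": {
        "avg_score": 0.6,
        "completeness": {"score": 1, "normalized_score": 0.0},
        "groundedness": {"score": 1, "normalized_score": 0.0},
        "redundancy": {"score": 1, "normalized_score": 1.0},
        "language_consistency": {"score": 1, "normalized_score": 1.0},
        "refusal_alignment": {"score": 1, "normalized_score": 1.0},
    }
}

generation = result["generation"]

# Collect the per-metric normalized scores (each sub-metric is a
# nested dict carrying a "normalized_score" key).
metrics = {
    name: value["normalized_score"]
    for name, value in generation.items()
    if isinstance(value, dict) and "normalized_score" in value
}

# Metrics that scored 0 are the ones flagged in global_explanation.
failed = sorted(name for name, score in metrics.items() if score == 0.0)
print(failed)  # ['completeness', 'groundedness']

# In this example, avg_score matches the mean of the normalized scores.
avg = sum(metrics.values()) / len(metrics)
print(avg)  # 0.6
```

In the sample above, avg_score (0.6) is exactly the mean of the five normalized sub-metric scores, which is consistent with this reading.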