📈 Running an Evaluation

Introduction

This tutorial walks you step by step through installing the GenAI Evaluator SDK and running your first evaluation.

Prerequisites

Before installing, make sure you have:

  1. gcloud CLI - required because gllm-evals is hosted in a private Google Cloud repository

After installing the gcloud CLI, run

gcloud auth login

to authorize gcloud to access Google Cloud with your user credentials.

Our internal gllm-evals package is hosted in a secure Google Cloud Artifact Registry. You need to authenticate via the gcloud CLI to access and download the package during installation.
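
To verify that authentication succeeded, you can print an access token. This is the same call the install commands below use to authenticate against the registry:

gcloud auth print-access-token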

Installation

Option 1: Install with pip

pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-evals"

Option 2: Install with poetry

Step 1: Configure authentication

poetry config http-basic.gen-ai-internal oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai-internal-publication oauth2accesstoken "$(gcloud auth print-access-token)"
poetry config http-basic.gen-ai-publication oauth2accesstoken "$(gcloud auth print-access-token)"

Step 2: Add to your project

poetry add gllm-evals
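
If poetry cannot resolve gllm-evals, your pyproject.toml may also need the package source declared. The command below is only an example: the source name matches the credentials configured above, and the repository URL is inferred from the pip index URL in Option 1; adjust both to your project's setup.

poetry source add --priority=supplemental gen-ai-internal https://glsdk.gdplabs.id/gen-ai-internal/simple/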

Environment Setup

Set a valid language model credential as an environment variable.

  • In this example, let's use an OpenAI API key.

Get an OpenAI API key from the OpenAI Console.

export OPENAI_API_KEY="sk-..."
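
To confirm the credential is visible to Python before running an evaluation, a minimal check like the one below works. It is only a sketch; it reads the same OPENAI_API_KEY variable exported above.

import os

# Fail fast if the API key was not exported in the current shell.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; run the export command above first.")
print("OPENAI_API_KEY is set.")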

Running Your First Evaluation

In this tutorial, we will evaluate RAG pipeline output.

Step 1: Create a script called eval.py

import asyncio
import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator
from gllm_evals.types import RAGData

async def main():
    # Configure the evaluator with the LLM judge and its credentials.
    evaluator = GenerationEvaluator(
        model_id="openai/gpt-4.1",
        model_credentials=os.getenv("OPENAI_API_KEY")
    )

    # A single RAG example: the user query, the expected (gold) answer,
    # the answer produced by the pipeline, and the retrieved context.
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )

    # Evaluate the example and print the structured result.
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Step 2: Run the script

python eval.py

Step 3: Review the result

The evaluator prints an evaluation result for the given input, e.g.:

{
    "generation": {
        "relevancy_rating": "bad",
        "possible_issues": ["Retrieval Issue", "Generation Issue"],
        "score": 0,
        "completeness": {
            "question": "What is the capital of France?",
            "expected_output_statements": ["Paris"],
            "generated_output_statements": ["New York"],
            "count": "0 of 1 substantive statements are matched",
            "score": 1,
            "explanation": "The expected substantive statement is 'Paris', correctly naming the capital of France. The generated output is 'New York', which does not match the expected answer nor provide any correct information in relation to the question. Thus, none of the substantive statements are matched, resulting in a score of 1."
        },
        "redundancy": {
            "generated_response": ["New York"],
            "analysis": ["There is only a single statement, 'New York', with no repetition of concepts, phrases, or rephrased content.", "The answer is incorrect, but only redundancy is being considered for this evaluation.", "No elaboration, restatement, or repeated ideas are present."],
            "score": 1,
            "explanation": "The generated_response consists of a single statement with no repetition or restatement of information. While it is factually incorrect, there is no redundancy according to the evaluation criteria."
        },
        "groundedness": {
            "expected_response": ["Paris"],
            "generated_response": ["New York"],
            "score": 3,
            "explanation": "The generated response 'New York' is not supported by the context, which clearly states that Paris is the capital of France. This is a critical factual mistake and constitutes a hallucination, so the score is 3."
        }
    }
}
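
Beyond printing the result, you will usually want to read scores programmatically or evaluate more than one example. The sketch below assumes the result behaves like the dictionary printed above and that evaluator.evaluate can be awaited concurrently; verify both against your version of gllm-evals.

import asyncio
import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator
from gllm_evals.types import RAGData

async def main():
    evaluator = GenerationEvaluator(
        model_id="openai/gpt-4.1",
        model_credentials=os.getenv("OPENAI_API_KEY")
    )

    # Two examples: one correct answer, one incorrect answer.
    dataset = [
        RAGData(
            query="What is the capital of France?",
            expected_response="Paris",
            generated_response="Paris",
            retrieved_context="Paris is the capital of France.",
        ),
        RAGData(
            query="What is the capital of France?",
            expected_response="Paris",
            generated_response="New York",
            retrieved_context="Paris is the capital of France.",
        ),
    ]

    # Run the evaluations concurrently in a single event loop.
    results = await asyncio.gather(*(evaluator.evaluate(d) for d in dataset))
    for data, result in zip(dataset, results):
        # Assumes the result is dict-like, matching the printed example above.
        print(data.generated_response, "->", result["generation"]["score"])

if __name__ == "__main__":
    asyncio.run(main())

Using asyncio.gather keeps all evaluations in one event loop, which is usually faster than awaiting them one at a time.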

Next Steps

You're now ready to start using our evaluators. We offer several prebuilt evaluators to get you started.

Looking for something else? Build your own custom evaluator here.
