gcloud CLI - required because gllm-evals is a private library hosted in a private Google Cloud repository
After installing, please run

gcloud auth login

to authorize gcloud to access the Cloud Platform with your Google user credentials.
Our internal gllm-evals package is hosted in a secure Google Cloud Artifact Registry.
You need to authenticate via gcloud CLI to access and download the package during installation.
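A typical setup uses pip's keyring integration so your gcloud credentials are picked up automatically. The commands below are a sketch of that usual pattern, not a verified install recipe for gllm-evals; REGION, PROJECT, and REPO are placeholders for your organization's actual Artifact Registry values.

```shell
# Install the keyring backend that lets pip authenticate against
# Artifact Registry using your active gcloud credentials.
pip install keyrings.google-artifact-registry-auth

# Point pip at the private index when installing the package.
# REGION, PROJECT, and REPO are placeholders -- substitute your
# team's actual repository coordinates.
pip install gllm-evals \
    --extra-index-url https://REGION-python.pkg.dev/PROJECT/REPO/simple/
```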
In this tutorial, we will evaluate RAG pipeline output.
1. Create a script called eval.py. By default, GEvalGenerationEvaluator uses Gemini 3 Pro from Google as its model. If you want to use your own model, pass it via the model parameter and provide the corresponding credentials.
2. Run the script.
3. The evaluator will produce an evaluation result for the given input, e.g.:
Congratulations! You have successfully run your first evaluation.
Recommendation
If you want to run an end-to-end evaluation, use the evaluate() convenience function instead of the step-by-step commands above.
It will automatically handle experiment tracking (via the Experiment Tracker) and integrate results into your existing Dataset, so you don't have to wire these pieces together manually.
Next Steps
You're now ready to start using our evaluators. We offer several prebuilt evaluators to get you started:
import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData


async def main():
    # Uses the default model; only the credentials need to be supplied.
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )
    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
python eval.py
{
"generation": {
"global_explanation": "The following metrics failed to meet expectations:\n1. Completeness is 1 (should be 3)\n2. Groundedness is 1 (should be 3)",
"relevancy_rating": "bad",
"score": 0.0,
"possible_issues": ["Retrieval Issue", "Generation Issue"],
"binary_score": 0,
"avg_score": 0.6,
"completeness": {
"score": 1,
"explanation": "The output provides a critical factual contradiction by stating that New York is the capital of France, whereas the expected answer is Paris.",
"normalized_score": 0.0,
},
"groundedness": {
"score": 1,
"explanation": "The output 'New York' is a direct contradiction of the retrieval context, which explicitly states that 'Paris is the capital of France.' Because the information provided is factually incorrect and not grounded in the context, it receives the lowest score.",
"normalized_score": 0.0,
},
"redundancy": {
"score": 1,
"explanation": "The response provides a single, direct answer without any repetition of words, phrases, or ideas. It is concise and contains no redundant statements or restatements.",
"normalized_score": 1.0,
},
"language_consistency": {
"score": 1,
"explanation": "The instructional language of the input is English, and the actual output is also written in English, maintaining language consistency regardless of the factual accuracy of the answer.",
"success": True,
"normalized_score": 1.0,
},
"refusal_alignment": {
"score": 1,
"explanation": "is_refusal was detected from the expected response, which directly provides the answer 'Paris' without refusal indicators. The actual output 'New York' is factually incorrect, but it also contains no refusal indicators and attempts to answer the core request. Since both the expected and actual responses are not refusals, they align correctly.",
"success": True,
"normalized_score": 1.0,
},
}
}
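Assuming result is a plain dict shaped like the output above (field names are taken from the sample; the aggregation rule is inferred from this one example rather than from library documentation), you can pull out the failing metrics and check the average yourself:

```python
# Sample result, abbreviated to the fields used below.
result = {
    "generation": {
        "avg_score": 0.6,
        "completeness": {"score": 1, "normalized_score": 0.0},
        "groundedness": {"score": 1, "normalized_score": 0.0},
        "redundancy": {"score": 1, "normalized_score": 1.0},
        "language_consistency": {"score": 1, "normalized_score": 1.0},
        "refusal_alignment": {"score": 1, "normalized_score": 1.0},
    }
}

generation = result["generation"]

# Collect the per-metric normalized scores (each sub-metric is a
# nested dict carrying a "normalized_score" key).
metrics = {
    name: value["normalized_score"]
    for name, value in generation.items()
    if isinstance(value, dict) and "normalized_score" in value
}

# Metrics that scored 0 are the ones flagged in global_explanation.
failed = sorted(name for name, score in metrics.items() if score == 0.0)
print(failed)  # ['completeness', 'groundedness']

# In this example, avg_score matches the mean of the normalized scores.
avg = sum(metrics.values()) / len(metrics)
print(avg)  # 0.6
```

In the sample above, avg_score (0.6) is exactly the mean of the five normalized sub-metric scores, which is consistent with this reading.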