🔢 Metric

Metrics are the core evaluation components in the gllm-evals framework. They define specific ways to measure the performance of language model generation, retrieval systems, and agent behaviors.

Metrics work in conjunction with evaluators to provide comprehensive evaluation capabilities. Evaluators can run multiple metrics in parallel or sequentially and combine their results into a single evaluation report (see the sketch after the example below).

Example Usage

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.metrics.generation.langchain_helpfulness import LangChainHelpfulnessMetric

async def main():
    # Configure the metric with the judge model and its API credentials.
    metric = LangChainHelpfulnessMetric(
        model="openai/gpt-4.1",
        credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Load a sample dataset and evaluate a single example.
    data = load_simple_qa_dataset()
    result = await metric.evaluate(data.dataset[0])
    print(result)

asyncio.run(main())
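
If you need to combine several metrics for one example, the following is a minimal sketch of the parallel pattern described above, using plain asyncio.gather over each metric's evaluate method. The run_metrics helper and the way results are keyed are illustrative assumptions, not part of the gllm-evals API; the evaluators shipped with the framework may combine results differently.

import asyncio

# Illustrative sketch only: run several metric instances concurrently over a
# single example and merge their results into one report dict. Each metric is
# assumed to expose the same async `evaluate(example)` method shown above.
async def run_metrics(metrics, example):
    results = await asyncio.gather(*(m.evaluate(example) for m in metrics))
    # Key results by class name purely for readability of the combined report.
    return {type(m).__name__: r for m, r in zip(metrics, results)}

# e.g. report = await run_metrics([metric], data.dataset[0])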

Available Metrics

Below are several examples of existing metrics. To view the full list of metrics, see the Metrics directory.
