🔢 Metric
Metrics are the core evaluation components in the gllm-evals framework. They define specific ways to measure and assess the quality of language model generation, retrieval systems, and agent behavior.
Metrics work in conjunction with evaluators: an evaluator can run multiple metrics in parallel or sequentially and combine their results into a single evaluation report.
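As a rough sketch of the parallel case, independent evaluate calls can be fanned out with asyncio.gather. The run_metrics_in_parallel helper and the way results are keyed by class name are illustrative assumptions, not part of the framework's evaluator API; only metric.evaluate(example) mirrors the usage shown in the example below.

import asyncio

async def run_metrics_in_parallel(metrics, example):
    # Hypothetical helper: evaluators in gllm-evals provide their own orchestration;
    # this only shows how independent metric.evaluate calls can run concurrently.
    results = await asyncio.gather(*(metric.evaluate(example) for metric in metrics))
    # Key each result by the metric's class name so the combined report stays readable.
    return {type(m).__name__: r for m, r in zip(metrics, results)}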
Example Usage
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.metrics.generation.langchain_helpfulness import LangChainHelpfulnessMetric

# Configure the metric with the judge model and its API credentials.
metric = LangChainHelpfulnessMetric(
    model="openai/gpt-4.1",
    credentials=os.getenv("OPENAI_API_KEY"),
)

# Load a sample dataset and evaluate a single example.
# `evaluate` is a coroutine, so run it inside an async function or an async-aware REPL.
data = load_simple_qa_dataset()
result = await metric.evaluate(data.dataset[0])
print(result)
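To score every entry rather than a single example, the same evaluate call can be applied across the dataset. A minimal sketch, assuming data.dataset is an ordinary sequence of examples (as the indexing above suggests):

results = []
for example in data.dataset:
    # Evaluate each example in turn; use asyncio.gather instead to run them concurrently.
    results.append(await metric.evaluate(example))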
Available Metrics
Below are several examples of existing metrics. To view the full list of metrics, see the Metrics directory.
Generation Evaluation Metrics
Agent Evaluation Metrics