📗 Custom Evaluator / Scorer Tutorial

In this guide, we’ll walk through how to build a custom evaluator for a specific use case. You can adapt the tutorial to match your project’s needs. In our example, the evaluator checks whether the summary of a customer’s complaint is accurate and faithful to the detailed description provided.

Step 0: Install the Required Libraries

To install gllm-evals, follow the installation section of the documentation (typically `pip install gllm-evals`, assuming the package is published under that name).

Step 1: Prepare Your Dataset

Before running the evaluation, we need a dataset that contains all the information required for scoring. For this example, we’ll use a customer-complaint dataset with the structure described below (a hypothetical sample row is sketched in code after the column list).

This data has 5 columns:

  • no: The row number.

  • detailed_description: The client’s full complaint description (this serves as the query).

  • detailed_case_gangguan: The summarized complaint generated from the detailed description (this is the response we will evaluate).

  • gt_detail_case_gangguan: The ground-truth summary. It is not used during the evaluation itself; it is the reference against which detailed_case_gangguan was judged to produce the ground-truth score below.

  • score_detail_case_gangguan: The ground-truth score representing how well detailed_case_gangguan matches gt_detail_case_gangguan. It is not used during the evaluation itself, but it will be used afterward to measure the evaluator’s accuracy as the alignment score.
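The original data table is not reproduced in this excerpt. As a minimal sketch, the dataset can be represented as a pandas DataFrame with the five columns above; the row below is hypothetical, not taken from the real data:

```python
import pandas as pd

# Hypothetical sample row -- the tutorial's real dataset is not reproduced here.
df = pd.DataFrame(
    [
        {
            "no": 1,
            "detailed_description": (
                "The customer reports that their home internet has been down "
                "since Monday morning and the modem indicator blinks red."
            ),
            "detailed_case_gangguan": "Internet outage since Monday; modem light red.",
            "gt_detail_case_gangguan": "Home internet down since Monday morning; red modem light.",
            "score_detail_case_gangguan": 1,  # ground-truth score, used later for the alignment check
        }
    ]
)
```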

Step 2: Create a Custom Metric

Before we can evaluate our dataset, we need to decide which metric to use. Because this case is unique and specialized, we will create a custom metric using DeepEval's GEval with custom evaluation steps. Before proceeding, check gllm-evals' existing metrics to decide whether to reuse one or create your own.

These are example custom evaluation steps for the dataset above:
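The original steps are not shown in this excerpt; a plausible set of evaluation steps for this summarization check could look like the following (the wording is illustrative, not the tutorial’s original):

```python
# Illustrative evaluation steps; adjust the wording to your own use case.
evaluation_steps = [
    "Check whether the summary in 'actual output' covers the key facts stated in 'input'.",
    "Penalize the summary if it contains information that contradicts or is absent from 'input'.",
    "Penalize the summary if it omits the core complaint described in 'input'.",
    "Do not penalize the summary for leaving out minor details; it should stay concise.",
]
```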

After that, we can create our custom metric:
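Below is a minimal sketch using DeepEval's GEval directly, built from the steps above; gllm-evals may expose its own wrapper around this, so treat the exact construction as an assumption:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A custom GEval metric driven by the evaluation steps defined above.
summary_correctness_metric = GEval(
    name="Summary Correctness",
    evaluation_steps=evaluation_steps,
    evaluation_params=[
        LLMTestCaseParams.INPUT,          # the detailed_description
        LLMTestCaseParams.ACTUAL_OUTPUT,  # the generated summary
    ],
)
```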

Step 3: Create a Custom Evaluator

Next, we create our custom evaluator using the metric we just defined:
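The gllm-evals evaluator interface is not shown in this excerpt, so the wrapper below is only a hedged sketch: the class name, method signature, and return shape are assumptions, and you should subclass whatever base evaluator gllm-evals actually provides:

```python
from deepeval.test_case import LLMTestCase

# Hypothetical wrapper class; the real gllm-evals base class and hooks may differ.
class SummaryCorrectnessEvaluator:
    def __init__(self, metric):
        self.metric = metric

    def evaluate(self, query: str, response: str) -> dict:
        # Build a DeepEval test case from the complaint and its summary, then score it.
        test_case = LLMTestCase(input=query, actual_output=response)
        self.metric.measure(test_case)
        return {"score": self.metric.score, "explanation": self.metric.reason}

evaluator = SummaryCorrectnessEvaluator(summary_correctness_metric)
```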

Step 4: Perform Evaluation

To run the evaluation, we process our data, convert it into QAData, and pass it to the custom evaluator. You can adapt this step to fit your project’s specific requirements.
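As a sketch of this step, reusing the DataFrame and evaluator from the snippets above (in gllm-evals proper, each row would be wrapped in QAData before being passed in; the exact constructor is not shown in this excerpt, so the plain keyword arguments below are an assumption):

```python
results = []
for _, row in df.iterrows():
    # The complaint description is the query; the generated summary is the response.
    outcome = evaluator.evaluate(
        query=row["detailed_description"],
        response=row["detailed_case_gangguan"],
    )
    results.append(
        {
            "no": row["no"],
            "score": outcome["score"],
            "explanation": outcome["explanation"],
        }
    )
```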

In addition to saving the evaluation's score and explanation, we can also compute a final alignment score to check the evaluator’s accuracy by comparing its output with the ground-truth score.
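One simple way to persist the results and compute the alignment score, assuming binary (0/1) ground-truth scores and GEval scores in [0, 1] (adjust the 0.5 threshold to your own scale):

```python
import pandas as pd

# Save each row's score and explanation for later inspection.
pd.DataFrame(results).to_csv("evaluation_results.csv", index=False)

# Alignment score: fraction of rows where the thresholded evaluator score
# matches the ground-truth score.
predicted = [1 if r["score"] >= 0.5 else 0 for r in results]
ground_truth = df["score_detail_case_gangguan"].tolist()
alignment_score = sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)
print(f"Alignment score: {alignment_score:.2%}")
```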

Below is the CSV output from the evaluation we have just run:

Conclusion

This cookbook provides a simple guide to evaluating a custom dataset using a custom metric and a custom evaluator. By following these steps, you can:

  • Monitor your QA system's performance

  • Automate the evaluation process once the alignment score is sufficiently high

  • Ensure reliable and high-quality QA responses in production
