📚 GLChat Evaluation Tutorial

In this guide, we will learn how to generate GLChat message responses and evaluate their performance on a QA dataset using gllm-evals.

The evaluation focuses on question-answering capabilities, with support for web search and PII handling. The dataset and experiment results can then be accessed in Langfuse for monitoring. To view more details on each component, you can click it in the sidebar of the Evaluation page.

Prerequisites

Before you can start evaluating a GLChat QA dataset, you need to prepare the following:

Required Parameters for GLChat

1. User ID (user_id)

The user_id is a unique identifier for the user who will be interacting with the specified GLChat application. This information is needed to create a conversation or message.

Where to get it:

  • From your existing user in GLChat: If you already have an existing user in your GLChat application, you can use it as the user_id.

  • You can also provide any user that has access to the application you want to test.

2. Chatbot ID (chatbot_id)

The chatbot_id identifies which chatbot or application configuration to use for the conversation. This information is needed to create a conversation or message.

3. [Optional] Model Name (model_name)

The model_name specifies which language model to use for generating the GLChat response. Use the model's display name as shown in the application / chatbot. If not specified, the response will be generated using the application's default model.

Required Keys

For Langfuse

We will need Langfuse credentials to trace, debug, and view the evaluation results for our GLChat QA system. If you do not have any Langfuse credentials yet, you can follow the New User Configuration guide to get them. The required keys are: LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST.

For GLChat

We will also need access to the GLChat credentials to generate the response. Please contact the GLChat team if you do not have them yet. The required keys are: GLCHAT_BASE_URL and GLCHAT_API_KEY.

Step 0: Install the Required Libraries

We need to install the required libraries for GLChat evaluation, including the GLChat SDK, gllm-evals, and Langfuse.
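
A minimal install sketch is shown below. The langfuse package is available on PyPI; the gllm-evals and GLChat SDK package names are assumptions and may be distributed through an internal package index, so adjust them to your environment.

```bash
# langfuse is on PyPI; gllm-evals and the GLChat SDK may come from an
# internal package index, so adjust the package names/index as needed.
pip install gllm-evals langfuse
pip install glchat-sdk  # hypothetical package name for the GLChat SDK
```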

Step 1: Set Up the Environment and Configuration

Prepare the environment variables for the evaluation script:
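
For example, a minimal setup can set the keys directly in the process environment (the values below are placeholders); in practice you may prefer a .env file or a secret manager.

```python
import os

# Langfuse credentials (placeholders -- replace with your own values)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://langfuse.obrol.id"

# GLChat credentials (placeholders -- replace with your own values)
os.environ["GLCHAT_BASE_URL"] = "https://your-glchat-instance.example.com"
os.environ["GLCHAT_API_KEY"] = "your-glchat-api-key"

# Evaluation parameters from the Prerequisites section
USER_ID = "your-user-id"
CHATBOT_ID = "your-chatbot-id"
MODEL_NAME = None  # optional; leave None to use the application's default model
```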

With the environment variables set, we can now initialize the Langfuse and GLChat clients.
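
A sketch of the client initialization is shown below. The Langfuse client reads its credentials from the environment variables above; the GLChat client class and constructor arguments are assumptions, so check the GLChat SDK documentation for the actual names.

```python
import os

from langfuse import Langfuse

# Hypothetical import -- the real module and class names come from the
# GLChat SDK (see the GLChat GitBook).
from glchat_sdk import GLChatClient

# Langfuse picks up LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and
# LANGFUSE_HOST from the environment.
langfuse = Langfuse()

glchat_client = GLChatClient(
    base_url=os.environ["GLCHAT_BASE_URL"],
    api_key=os.environ["GLCHAT_API_KEY"],
)
```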

Step 2: Prepare Your Dataset

Before we can evaluate our GLChat QA system, we need to prepare a dataset with all the information needed for evaluation.

For example purposes, you can download the following CSV file:

In this example dataset, we have the following key fields:

Input Fields:

  • question_id: Unique identifier for each query.

  • query: The question to ask.

  • enable_search?: Whether to enable web search functionality.

  • enable_pii?: Whether to enable PII processing.

Expected Output Fields:

  • expected_response: The expected answer.

Metadata Fields:

  • category: The question category, which groups each question by topic.
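
As a purely illustrative example of these fields (the values below are made up, and the real column names must match the header of the downloaded CSV), a single row could look like this:

```python
# Illustrative only -- use the columns and values from the actual CSV file.
example_row = {
    "question_id": "q-001",
    "query": "What is the capital of France?",
    "enable_search": True,
    "enable_pii": False,
    "expected_response": "Paris is the capital of France.",
    "category": "geography",
}
```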

We can then create a mapping that tells Langfuse which of your dataset's fields should be logged into Langfuse's canonical fields.

Langfuse's fields consist of:

  • input: the input fields that are useful for the model (e.g., query, retrieved context).

  • expected_output: the target you want to compare against (e.g., reference answer/label/ground truth).

  • metadata: any extra attributes or information for each data row (e.g., category, topic, type, additional notes).

Your mapping simply points each Langfuse field to the column name in your dataset.

Below is the mapping example based on the dataset above:
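
One plausible shape for this mapping is a dictionary from each Langfuse field to the dataset column name(s). The exact structure expected by gllm-evals is an assumption here, so confirm it against the gllm-evals documentation.

```python
# Column names must match the CSV header exactly; the mapping keys
# (input / expected_output / metadata) follow Langfuse's canonical fields.
langfuse_dataset_mapping = {
    "input": ["question_id", "query", "enable_search", "enable_pii"],
    "expected_output": "expected_response",
    "metadata": ["category"],
}
```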

Step 3: Instrument your GLChat Functions

Before we can evaluate our GLChat system, we need to create the functions needed to produce a GLChat response, such as conversation creation and message sending operations. Optionally, we can wrap them with Langfuse's @observe decorator to track and monitor the processes.

We'll start by creating a basic instrumentation setup for GLChat:
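
The sketch below wraps two hypothetical GLChat SDK calls (create_conversation and send_message) with Langfuse's @observe decorator. The SDK method names, arguments, and response shapes are assumptions, so adapt them to the real GLChat SDK API described in its GitBook.

```python
from langfuse import observe  # on older SDK versions: from langfuse.decorators import observe


@observe(name="glchat-create-conversation")
def create_conversation(user_id: str, chatbot_id: str) -> str:
    """Create a new GLChat conversation and return its ID.

    `glchat_client` is the client initialized in Step 1; the method name
    and response shape below are assumptions.
    """
    conversation = glchat_client.create_conversation(
        user_id=user_id,
        chatbot_id=chatbot_id,
    )
    return conversation["id"]


@observe(name="glchat-send-message")
def send_message(conversation_id: str, query: str, **options) -> dict:
    """Send a message to the conversation and return the raw GLChat response.

    `options` can carry per-row settings such as model_name, enable_search,
    or enable_pii (names assumed from the dataset fields above).
    """
    return glchat_client.send_message(
        conversation_id=conversation_id,
        message=query,
        **options,
    )
```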

Besides these, you may also need to create other supporting functions, such as one for parsing the GLChat response:
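
For instance, a small parser can extract the assistant's text from the raw response. The payload structure used here is an assumption, so adjust the keys to whatever your GLChat deployment actually returns.

```python
@observe(name="glchat-parse-response")
def parse_response(raw_response: dict) -> str:
    """Extract the assistant's text from a GLChat response (assumed structure)."""
    return raw_response.get("message", {}).get("content", "")
```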

You can find more details in the GLChat GitBook.

Step 4: Prepare your Inference Function

The inference function is the core component that takes each dataset row and generates a response to be evaluated. We'll create an example inference function called generate_response that handles conversation creation, message sending, and response parsing. To customize the Langfuse monitoring, such as setting a custom trace name, you can also add the @observe decorator here.
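
A sketch of such a function is shown below. It assumes the inference function receives one dataset row as a dict and returns a dict containing the generated response; the exact contract expected by evaluate() may differ, so align it with the gllm-evals documentation.

```python
@observe(name="glchat-generate-response")
def generate_response(row: dict) -> dict:
    """Generate a GLChat response for a single dataset row (assumed row keys)."""
    conversation_id = create_conversation(USER_ID, CHATBOT_ID)
    raw_response = send_message(
        conversation_id=conversation_id,
        query=row["query"],
        model_name=MODEL_NAME,
        enable_search=row.get("enable_search", False),
        enable_pii=row.get("enable_pii", False),
    )
    return {"generated_response": parse_response(raw_response)}
```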

Step 5: Perform end-to-end Evaluation

To run the end-to-end evaluation, we can use a convenience function in gllm-evals called evaluate. This function provides a streamlined way to run AI evaluations with minimal setup. It orchestrates the entire evaluation process, from data loading to result tracking, in a single function call.

We pass the path of the downloaded dataset, the generate_response function we just created, and the Langfuse dataset mapping to the evaluate() function. In this example, we will use GEvalGenerationEvaluator, which is suitable for evaluating the QA dataset. Since we want to use Langfuse, we will also add LangfuseExperimentTracker as the dedicated experiment tracker.
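
Putting it together might look like the sketch below. The import paths, keyword argument names, and dataset file name are assumptions; only the component names (evaluate, GEvalGenerationEvaluator, LangfuseExperimentTracker) come from gllm-evals itself, so check its reference for the authoritative signature.

```python
# Import paths and keyword arguments are assumptions -- consult the
# gllm-evals documentation for the exact evaluate() signature.
from gllm_evals import evaluate
from gllm_evals.evaluators import GEvalGenerationEvaluator
from gllm_evals.trackers import LangfuseExperimentTracker

results = evaluate(
    dataset="glchat_qa_dataset.csv",           # path to the downloaded CSV
    inference_fn=generate_response,            # the function from Step 4
    dataset_mapping=langfuse_dataset_mapping,  # the mapping from Step 2
    evaluators=[GEvalGenerationEvaluator()],
    experiment_tracker=LangfuseExperimentTracker(),
)
```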

To learn more about the evaluate() function, you can visit the following section.

Step 6: View Evaluation Results in Langfuse

After running the evaluation, the dataset and experiment results will automatically be logged to Langfuse. This step shows you how to navigate the Langfuse UI and interpret your evaluation results.

Accessing Your Langfuse Dashboard

  1. Navigate to your Langfuse project: Go to https://langfuse.obrol.id/.

  2. Select your organization and project: Choose your dedicated organization and project (or the one you've just created).

  3. Access the dashboard: You'll see various sections for analyzing your data.

View Dataset

To view the dataset you've just created, go to: Project → Datasets → select a dataset → Items. On this page, you can see all the data rows you have just evaluated, organized according to the provided Langfuse mapping. This dataset can also be reused for future evaluations.

To see more detail for each row, you can click one of the data items above.

View Dataset Runs

Dataset runs are executions over a dataset with per-item outputs; each dataset run represents an experiment. To view the dataset runs, go to: Project → Datasets → select a dataset → Runs. Here, you can view all the scores for each experiment, including LLM-as-a-judge score columns, both as aggregates and per-row values.

You can also click a specific dataset run to view the results for every data row in that experiment:

View Traces / Observations

Traces and observations let you drill into individual spans and view the inputs, outputs (our evaluation results), and metadata. You can go to: Project → Traces.

Below is the trace example:

View Sessions

Sessions group the traces for each experiment; you can review and annotate each trace within a session. You can access sessions via Project → Sessions.

Below is the session screenshot example:

Conclusion

This cookbook provides a simple guide to evaluating GLChat QA systems using Langfuse. By following these steps, you can:

  • Monitor your QA system's performance

  • Evaluate different models and configurations systematically

  • Track quality metrics and identify improvement opportunities

  • Ensure reliable and high-quality QA responses in production


Note: This is a simple guide to get you started with GLChat QA evaluation using Langfuse. For more comprehensive evaluation information and advanced techniques, please refer to the evaluation GitBook. For detailed information about generating GLChat responses and using the GLChat SDK, please consult the GLChat GitBook.
