How to Debug Accuracy Issues

Debugging accuracy is often more of an art than a science. Unlike functional errors that cause a crash, accuracy issues are subtle problems where the application runs successfully but provides incorrect, irrelevant, or low-quality answers. This section provides a systematic approach to diagnosing and fixing these issues.

Chapter 1: Understanding the Full Pipeline

To debug accuracy, you must first understand the entire flow of data. An issue in an early step can have a significant impact on the final answer. The GLChat pipeline consists of several major stages:

  1. Guardrails (Optional): An initial check that screens the user's query for malicious or otherwise disallowed content. This step can be configured from the admin dashboard.

  2. DPO (Document Processing Orchestrator) (Optional): If a user attaches a file (like a PDF or image), this stage is responsible for extracting the text and information from it. This is also configurable from the admin dashboard.

  3. Preprocessing: This is a multi-step stage that prepares the user's query for the main pipeline. It includes:

    • Retrieving chat history to provide context.

    • Processing attachments (either from DPO or directly, if the model supports it).

    • Anonymizing the query if PII masking is enabled.

    • Generating a standalone query: combining the user's latest message with the chat history so the query can be understood without the full conversation context.

    • Checking for cached responses to potentially skip the main pipeline.

    • Retrieving relevant information from long-term memory (optional).

  4. The Main Pipeline (e.g., Standard RAG): This is the core of the chatbot where the main logic, such as Retrieval-Augmented Generation (RAG), happens. This stage retrieves relevant knowledge and generates an answer.

  5. Postprocessing: After an answer is generated, this stage handles final tasks like saving the new message to the chat history and updating the long-term memory (optional).

Accuracy issues can be introduced at any of these stages. The key to debugging is to trace the data through this pipeline to find where it goes wrong.
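The stages above can be sketched as functions that pass a shared state dictionary down the line, with a snapshot taken after each stage. This is a minimal illustration of the tracing idea, not GLChat's actual API: the stage functions, key names other than those described above, and the sample data are all hypothetical.

```python
import copy

def run_pipeline(state, stages, debug=False):
    """Run each stage in order; when debug is on, snapshot the state after
    each stage so a bad answer can be traced to the stage that caused it."""
    snapshots = []
    for name, stage in stages:
        state = stage(state)
        if debug:
            snapshots.append((name, copy.deepcopy(state)))
    return state, snapshots

# Illustrative stage functions (placeholders, not real implementations).
def preprocessing(state):
    # Combine the chat history with the latest message into a standalone query.
    history = " ".join(state.get("history", []))
    state["standalone_query"] = (history + " " + state["query"]).strip()
    return state

def retrieval(state):
    # Placeholder retriever: a real pipeline would query the knowledge base.
    state["chunks"] = [c for c in state["knowledge"] if state["query"] in c]
    return state

def generation(state):
    state["answer"] = "Based on: " + "; ".join(state["chunks"])
    return state

stages = [("preprocessing", preprocessing),
          ("retrieval", retrieval),
          ("generation", generation)]
state = {"query": "refunds", "history": ["Hi"],
         "knowledge": ["refunds take 5 days", "shipping is free"]}
final, snaps = run_pipeline(state, stages, debug=True)
for name, snap in snaps:
    print(name, "->", sorted(snap.keys()))
```

Comparing consecutive snapshots shows exactly which stage introduced a wrong or missing value, which is the core of the debugging workflow described next.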

Chapter 2: The Systematic Approach to Accuracy Debugging

The primary tool for debugging accuracy is the verbose state log. The process is hypothesis-driven: form a hypothesis about where the problem lies, then use the logs to confirm or refute it.

  1. Start with a Bad Response. Identify a query that consistently produces a poor answer. This is your test case.

  2. Enable Verbose Logging. Set the DEBUG_STATE environment variable (export DEBUG_STATE=true) and rerun the pipeline with your test case.

  3. Trace the State and Ask Key Questions. Inspect the verbose logs and analyze the state at the end of each major stage. The goal is to follow the data and see where it deviates from what you expect.

    • Question 1: Was the context for the retriever correct?

      • State to check: Look at the standalone_query and transformed_retrieval_query keys after the Preprocessing stage.

      • Analysis: Does the standalone query accurately reflect the user's intent, including the context from the chat history? If the query sent to the retriever is misleading, the retrieved documents will be wrong. This points to an issue in the history processing or standalone query generation prompt.

    • Question 2: Were the right documents retrieved?

      • State to check: Look at the chunks and contexts keys after the retrieval step in the Main Pipeline.

      • Analysis: This is one of the most critical steps. Are the retrieved documents (chunks) relevant to the query? If the chatbot is hallucinating or providing factually incorrect information, the cause is very often irrelevant or outdated documents being retrieved here. This points to a problem with your knowledge base data or the retrieval strategy itself.

    • Question 3: Was the final context assembled correctly?

      • State to check: Look at the context key after the reranker and repacker steps.

      • Analysis: The reranker and repacker components organize the retrieved chunks into the final context that the LLM will use. Did the reranker accidentally discard the most relevant chunk? Did the repacker combine the chunks in a confusing way? If the retrieved chunks were good but the final context is bad, the issue lies in this part of the pipeline.

    • Question 4: Was the final prompt clear?

      • State to check: Look at the final prompt that is sent to the language model for generation.

      • Analysis: If the retrieved context is perfect, the issue might be in the final prompt template itself. Is the prompt ambiguous? Does it contain conflicting instructions? A poorly worded prompt can confuse the LLM, even with the correct context.

    • Question 5: Is it the Language Model?

      • Analysis: If you have verified that the query, the retrieved context, and the final prompt are all perfect, then the issue may be with the language model itself. It might not have the reasoning capability to answer the specific type of question you are asking. At this point, you might consider experimenting with a more powerful model.
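The question-by-question inspection above can be partly automated. The sketch below assumes each verbose log entry is available as a Python dictionary of state keys; the key names come from the questions above, while the "prompt" key, the audit_state helper, and the log format are assumptions for illustration.

```python
# Checkpoints drawn from the key questions above: one state key per
# question, paired with what to ask when inspecting it.
CHECKS = [
    ("standalone_query", "Does the standalone query reflect the user's intent?"),
    ("chunks",           "Are the retrieved chunks relevant to the query?"),
    ("context",          "Did the reranker/repacker keep the best chunks?"),
    ("prompt",           "Is the final prompt clear and unambiguous?"),
]

def audit_state(state):
    """Report which checkpoints are present or empty in a state snapshot,
    so you know where to focus first when tracing a bad response."""
    report = []
    for key, question in CHECKS:
        status = "PRESENT" if state.get(key) else "MISSING/EMPTY"
        report.append(f"[{status}] {key}: {question}")
    return report

snapshot = {"standalone_query": "what is the refund policy",
            "chunks": [],  # empty: retrieval is the first place to look
            "context": "",
            "prompt": "Answer using the context."}
for line in audit_state(snapshot):
    print(line)
```

A missing or empty key does not prove that stage is at fault, but it tells you which of the five questions to investigate manually first.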

Chapter 3: Common Scenarios and Solutions

  • Scenario: The chatbot's answer is factually wrong or irrelevant.

    • Where to look: Start by inspecting the chunks retrieved during the main pipeline. This is the most likely cause. If the chunks are good, inspect the final context and the prompt.

    • Common Fixes: Improve the quality of the documents in your knowledge base. Tune retrieval parameters like top_k. Adjust the repacker or reranker logic.

  • Scenario: The chatbot's citations are wrong or missing.

    • Where to look: All citations are generated based on the retrieved chunks. Inspect the chunks to ensure the cited information is actually present.

    • Common Fixes: If the information is present in the chunks, the issue is likely in the prompt that instructs the LLM on how to generate citations. Make sure this prompt is clear and explicit.
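Since citations are generated from the retrieved chunks, a quick mechanical check is to verify that each citation's quoted text actually appears in some chunk. The citation structure below (a dict with "id" and "quote") is hypothetical; the point is the debugging step described above.

```python
def unsupported_citations(citations, chunks):
    """Return citations whose quoted text is not found in any retrieved chunk.
    Any citation returned here was likely hallucinated by the LLM."""
    return [c for c in citations
            if not any(c["quote"] in chunk for chunk in chunks)]

chunks = ["Refunds are processed within 5 business days.",
          "Shipping is free over $50."]
citations = [
    {"id": 1, "quote": "within 5 business days"},  # supported by chunk 1
    {"id": 2, "quote": "within 30 days"},          # not in any chunk
]
print(unsupported_citations(citations, chunks))  # flags only citation 2
```

If this check passes but citations are still wrong, the problem is more likely in the citation-generation prompt than in retrieval.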
