Debugging a Hallucinated Response
A "hallucination" is when the chatbot provides an answer that sounds plausible but is factually incorrect or not based on the provided source documents.
User Query:
Pada pasal 36, bahasa-bahasa apa yang merupakan sebagian dari kebudayaan Indonesia yang hidup? (In Article 36, what languages are part of the living culture of Indonesia?)
Hallucinated Response:
Berdasarkan UUD 45, bahasa yang diakui adalah Bahasa Indonesia. Bahasa daerah lain tidak disebutkan secara spesifik. (According to the UUD 45, the recognized language is Indonesian. Other regional languages are not specifically mentioned.)
Expected Correct Answer:
Pasal 36 ... Di daerah-daerah yang mempunyai bahasa sendiri, yang dipelihara oleh rakyatnya dengan baik-baik (misalnya bahasa Jawa, Sunda, Madura, dan sebagainya) bahasa-bahasa itu akan dihormati dan dipelihara juga oleh negara. Bahasa-bahasa itu pun merupakan sebagian dari kebudayaan Indonesia yang hidup. (Article 36 ... In regions that have their own languages, which are well maintained by their people (for example Javanese, Sundanese, Madurese, and so on), those languages will also be respected and preserved by the state. Those languages, too, are part of the living culture of Indonesia.)
The Most Likely Cause
In a Retrieval-Augmented Generation (RAG) system, the number one cause of hallucination is failed retrieval. The system was unable to find any relevant document chunks that contained the correct answer. When the Large Language Model (LLM) is asked to answer a question but is given irrelevant context, it will fall back on its own internal knowledge and invent an answer.
The Debugging Workflow (Using Standard Logs)
For many retrieval problems, the standard application logs (with the log level set to DEBUG) are sufficient. You do not always need to enable the full verbose DEBUG_STATE mode, because the standard logs show the inputs and outputs of key components like the retriever. Here is a real-world example of tracing this issue using these logs.
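Because each log entry is a single JSON object, a few lines of Python are enough to pull out everything one component logged for a request. This is only a convenience sketch: it assumes the logs are written one JSON object per line, and the file name used below is a placeholder.

```python
import json

def entries_for(log_path: str, component: str):
    """Yield the parsed log entries emitted by a single pipeline component."""
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not JSON
            if entry.get("name") == component:
                yield entry

# Example: show everything the retriever logged (file name is illustrative).
for entry in entries_for("glchat-beservice.log", "BasicVectorRetriever"):
    print(entry["timestamp"], entry["level"], entry["message"][:120])
```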
Note on Logs: For this case study, we will analyze snippets from a real log file. For those who wish to follow along with the full context, the complete log file can be found here:
1. Isolate the Query and Inspect the Logs. First, we identify the query causing the issue. Then, we run the pipeline and examine the standard application logs generated by the glchat-beservice.
2. Analyze the Query Transformation. We check the logs to see the input and output of the query transformation step. This tells us what query was ultimately sent to the retriever.
Log Snippet:
{ "timestamp": "2025-11-20T11:16:18+0000", "name": "OneToManyQueryTransformer", "level": "DEBUG", "message": "[Start 'OneToManyQueryTransformer'] Processing query: 'Pada pasal 36, bahasa-bahasa apa yang merupakan sebagian dari kebudayaan Indonesia yang hidup'" }, { "timestamp": "2025-11-20T11:16:18+0000", "name": "gllm_core.utils.retry", "level": "WARNING", "message": "Function _invoke failed on attempt 1/3. Retrying in 1.21 seconds. Error: Connection error." }, { "timestamp": "2025-11-20T11:16:20+0000", "name": "gllm_core.utils.retry", "level": "WARNING", "message": "Function _invoke failed on attempt 2/3. Retrying in 2.13 seconds. Error: Connection error." }, { "timestamp": "2025-11-20T11:16:22+0000", "name": "gllm_core.utils.retry", "level": "ERROR", "message": "Function _invoke failed after 3 attempts. Last error: Connection error." }, { "timestamp": "2025-11-20T11:16:22+0000", "name": "OneToManyQueryTransformer", "level": "DEBUG", "message": "[Finished 'OneToManyQueryTransformer'] Successfully produced 1 result(s):\\n - 'Pada pasal 36, bahasa-bahasa apa yang merupakan sebagian dari kebudayaan Indonesia yang hidup'" }Analysis: The logs show both the input and output of the OneToManyQueryTransformer. In this case, the output is the same as the input. (Note: in the full logs, we can see some intermittent connection errors that caused the transformer to fall back to the original query). Since the original query is already clear and specific, this result is acceptable.
Conclusion: The query being sent to the retriever is correct. The problem is not in the query transformation step.
3. Inspect the Retrieved Chunks (The Root Cause). This is the most critical step. We find the log entry from the retriever and carefully examine the chunks it returned.
Log Snippet:
{ "timestamp": "2025-11-20T11:16:22+0000", "name": "BasicVectorRetriever", "level": "DEBUG", "message": "[Finished 'BasicVectorRetriever'] Successfully retrieved 20 chunks.\\n - Rank: 1\\n ID: 144350-StructuredElementChunker-1800-360-4000-0-markdown-True-27\\n Content: ATURAN PERALIHAN\\n\\ndemokratis dan yang hendak menye...\\n Score: 0.74966073\\n Metadata:\\n - title: ATURAN PERALIHAN\\n ... \\n - Rank: 2\\n ID: 144350-StructuredElementChunker-1800-360-4000-0-markdown-True-8\\n Content: BAB XI AGAMA\\nPasal 29 (1) Negara berdasar atas Ket...\\n Score: 0.7456615\\n Metadata:\\n - title: BAB XVI PERUBAHAN UNDANG-UNDANG DASAR\\n ..." }Analysis: We read the
Contentof the top retrieved chunks. The retriever has returned chunks with titles like "ATURAN PERALIHAN" (Transitional Provisions) and "BAB XI AGAMA" (Chapter XI Religion). Although they are from the correct source document, none of them contain the text of Article 36 about regional languages. The retrieval has failed.Conclusion: This is the root cause. The retriever did not provide the LLM with the necessary facts, which will force it to hallucinate.
4. Confirm the Final Response. Finally, we look at what the ResponseSynthesizer generated using this bad context.
Log Snippet:
{ "timestamp": "2025-11-20T11:16:28+0000", "name": "ResponseSynthesizer", "level": "DEBUG", "message": "[Finished 'ResponseSynthesizer'] Successfully synthesized response:\\n'Berdasarkan UUD 45, bahasa yang diakui adalah Bahasa Indonesia. Bahasa daerah lain tidak disebutkan secara spesifik.'" }Analysis: The chatbot generated the exact hallucinated response we saw in the problem description. This confirms that because the retriever provided no relevant information about regional languages, the LLM defaulted to its own (incorrect) general knowledge.
The Fix
Based on the analysis, the root cause was a failure in the document retrieval step.
Root Cause: The vector retriever failed to find documents containing the specific text of Article 36, instead retrieving other, semantically similar but contextually irrelevant articles from the same constitution document.
Common Solutions:
Solution 1: Tune the Retrieval Process:
Increase top_k: In your retriever configuration, you can increase the top_k parameter. This will retrieve more documents, increasing the chance of finding the correct one. Be aware that this can also add more noise to the context.
Use a Reranker: A reranker is a secondary model that takes the initial list of retrieved documents and re-orders them for relevance. Adding a powerful reranker (or upgrading an existing one) is one of the most effective ways to ensure the best documents are at the top of the list (a reranking sketch follows this list).
Implement Hybrid Search: Instead of relying only on semantic (vector) search, you can use a hybrid approach that combines it with traditional keyword search (like BM25). This is very effective for queries that contain specific keywords, codes, or IDs (e.g., "Pasal 36"); see the fusion sketch after this list.
Use a Map-Reduce Strategy for Context: For broad queries that may require information from many documents, a simple top_k might not be enough. Instead, you can implement a Map-Reduce strategy after retrieval (see the sketch after this list):
Retrieve a large number of chunks (e.g., top_k=100).
Map: Group the chunks into smaller batches (e.g., 10 chunks per batch). Send each batch to an LLM with a prompt to summarize it.
Reduce: Take all the summaries from the "Map" step and combine them into a final, dense context that is then used to generate the final answer.
Use a Better Embedding Model: The quality of the embedding model directly impacts the quality of the semantic search. If retrieval is consistently poor across many queries, consider upgrading to a more powerful embedding model.
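To make the reranker bullet concrete, here is a minimal sketch using an off-the-shelf cross-encoder from the sentence-transformers library. The model name, the keep count, and the commented retriever call are illustrative assumptions, not the pipeline's actual reranking interface.

```python
from sentence_transformers import CrossEncoder

# A small, widely available cross-encoder; any reranking model could be swapped in.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep only the most relevant chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]

# Retrieve generously (a larger top_k), then let the reranker pick the best few:
# candidates = retriever.retrieve(query, top_k=50)   # hypothetical retriever call
# context_chunks = rerank(query, candidates, keep=5)
```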
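The hybrid-search bullet can be illustrated with reciprocal rank fusion (RRF), a simple way to merge a keyword ranking (e.g., BM25) with a vector ranking. The sketch below is self-contained; the chunk IDs in the example rankings are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking via RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Keyword search finds the literal phrase "Pasal 36"; vector search ranks it low.
bm25_ranking = ["pasal-36", "pasal-32", "pasal-35"]
vector_ranking = ["aturan-peralihan", "bab-xi-agama", "pasal-36"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# "pasal-36" comes out on top because both rankings contribute to its score.
```

In a real pipeline, the two input rankings would come from the BM25 index and the vector store for the same query.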
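Finally, the Map-Reduce bullet might look like the following sketch. The summarize helper is a placeholder for whatever LLM call your stack exposes, and the batch size and top_k values are only examples.

```python
def summarize(text: str) -> str:
    """Placeholder for an LLM call that condenses text into a short summary."""
    raise NotImplementedError("wire this up to your LLM client")

def map_reduce_context(chunks: list[str], batch_size: int = 10) -> str:
    """Compress a large set of retrieved chunks into one dense context string."""
    # Map: summarize each batch of chunks independently.
    summaries = [
        summarize("\n\n".join(chunks[start:start + batch_size]))
        for start in range(0, len(chunks), batch_size)
    ]
    # Reduce: combine the batch summaries into the final context.
    return summarize("\n\n".join(summaries))

# Usage (hypothetical): retrieve broadly, then compress before generation.
# chunks = retriever.retrieve(query, top_k=100)
# context = map_reduce_context(chunks)
```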
Solution 2: Improve Data and Chunking Strategy:
Data Quality: The most common reason for failed retrieval is that the correct information isn't in the knowledge base, or it's unclear. Ensure a document with the correct, up-to-date answer is ingested.
Chunking Strategy: Review your document chunking strategy. Information might be split across disconnected chunks. For structured documents like legal texts, ensure each logical section (like an article) is a self-contained chunk. This makes it easier for the retriever to find an exact match.
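For a statute-like document such as the UUD 45, one way to keep each article self-contained is to split the text on the article headings before ingestion. The sketch below assumes headings of the form "Pasal <number>"; a production chunker would also need to handle chapter headings and amendments.

```python
import re

def split_into_articles(text: str) -> list[str]:
    """Split a legal text so that each 'Pasal N' heading starts its own chunk."""
    parts = re.split(r"(?=^Pasal\s+\d+)", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

sample = """Pasal 35
Bendera Negara Indonesia ialah Sang Merah Putih.

Pasal 36
Bahasa Negara ialah Bahasa Indonesia."""

for chunk in split_into_articles(sample):
    print("---")
    print(chunk)
```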
General Best Practices for Improving Accuracy
Beyond fixing specific issues, there are general strategies you can implement to make your chatbot more robust and prevent accuracy problems from happening in the first place.
Implement a Safety Net Prompt
One of the most effective ways to prevent hallucinations and build user trust is to explicitly instruct the LLM on how to behave when it doesn't know the answer. This is done by adding a "safety net" instruction to your generation prompt.
What it is: A clear instruction that tells the model to admit when the provided context is insufficient.
Example Instruction: "You are a helpful assistant. Answer the user's question based only on the provided context. If the context does not contain the answer, you must state that you do not have enough information to answer."
Why it works: This simple instruction significantly reduces the likelihood that the LLM will "guess" or invent an answer when the retrieval step fails to provide the correct information. It makes the chatbot more reliable and honest with the user.
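As a minimal sketch, the instruction above can be kept in the system role of every generation call. The message layout and the client object below are placeholders for whatever LLM client your pipeline uses.

```python
SAFETY_NET_SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question based only on the "
    "provided context. If the context does not contain the answer, you must state "
    "that you do not have enough information to answer."
)

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat request that always carries the safety-net instruction."""
    return [
        {"role": "system", "content": SAFETY_NET_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Usage (hypothetical client):
# response = client.chat(messages=build_messages(retrieved_context, user_query))
```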