Adding Document References
This guide will walk you through adding a Reference Formatter component to your existing RAG pipeline that automatically formats and includes source references in your responses, making your answers more credible and traceable.
Reference formatting enhances your RAG responses with automatic source citations, providing transparency and credibility by showing exactly where information came from.
Installation
Linux/macOS (bash):
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore
Windows PowerShell:
# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore
Windows Command Prompt:
# you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore
You can either:
You can refer to the guide whenever you need explanation or want to clarify how each part works.
Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.
Both options will work; choose based on whether you prefer speed or learning by doing!
Project Setup
Extend Your RAG Pipeline Project
Start with your completed RAG pipeline project from the Your First RAG Pipeline tutorial. This tutorial adds no new files, so the project structure stays as is:
<project-name>/
├── data/
│   ├── <index>/...                # preset data index folder
│   ├── chroma.sqlite3             # preset database file
│   └── imaginary_animals.csv      # sample data
├── modules/
│   ├── retriever.py
│   └── response_synthesizer.py
├── .env
├── indexer.py
└── pipeline.py                    # will be updated with the reference formatter
1) Build the Reference Formatter Pipeline
The SimilarityBasedReferenceFormatter analyzes your response against retrieved chunks and automatically creates formatted citations using chunk metadata.
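To make the idea concrete, here is a minimal, library-independent sketch of similarity-based reference formatting. It is not the SimilarityBasedReferenceFormatter implementation: the word-overlap score, the 0.3 threshold, the "content" and "source" keys, the function names, and the sample data are all assumptions chosen for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response: str, chunk_text: str) -> float:
    """Fraction of the chunk's words that also appear in the response."""
    chunk_words = tokenize(chunk_text)
    if not chunk_words:
        return 0.0
    return len(chunk_words & tokenize(response)) / len(chunk_words)


def format_references(response: str, chunks: list[dict], threshold: float = 0.3) -> str:
    """Build a reference list from chunks whose content resembles the response.

    Each chunk is a hypothetical dict with "content" and "metadata" keys;
    the "source" metadata key is an assumption, not a library convention.
    """
    sources: list[str] = []
    for chunk in chunks:
        if overlap_score(response, chunk["content"]) >= threshold:
            source = chunk["metadata"].get("source", "unknown source")
            if source not in sources:  # skip duplicates, keep retrieval order
                sources.append(source)
    if not sources:
        return ""
    lines = [f"{i}. {source}" for i, source in enumerate(sources, start=1)]
    return "References:\n" + "\n".join(lines)


# Made-up sample data for illustration only.
chunks = [
    {"content": "The luminafox glows softly at night.",
     "metadata": {"source": "imaginary_animals.csv"}},
    {"content": "Quarterly revenue grew by ten percent.",
     "metadata": {"source": "finance_report.pdf"}},
]
response = "The luminafox is an animal that glows softly at night."
print(format_references(response, chunks))
```

Only the first chunk overlaps with the response, so only its source is listed. A real formatter would likely use a stronger similarity measure than word overlap, but the flow is the same: score each chunk against the response, keep the ones above a threshold, and render their metadata.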
Create the reference formatter step
Update your pipeline.py (or create a new one) with the reference formatter:
Compose the final pipeline
Chain all steps including the reference formatter:
This creates a pipeline that generates responses with automatic source citations from the retrieved chunks.
Note: The RAGState input state already contains a reference field that gets populated by the reference formatter.
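The way a shared state carries a reference field through chained steps can be sketched without any SDK code. This is a plain-Python illustration under stated assumptions, not the gllm-pipeline API: the step functions, the run_pipeline helper, and every state key except reference are hypothetical stand-ins.

```python
from typing import Any, Callable

State = dict[str, Any]
Step = Callable[[State], State]


def retrieve(state: State) -> State:
    """Stand-in retriever: returns a fixed chunk list."""
    return {"chunks": [{"content": "Sample fact.", "metadata": {"source": "sample.csv"}}]}


def synthesize(state: State) -> State:
    """Stand-in response synthesizer: echoes the retrieved content."""
    return {"response": " ".join(chunk["content"] for chunk in state["chunks"])}


def format_reference(state: State) -> State:
    """Stand-in reference formatter: lists the sources of all chunks."""
    sources = [chunk["metadata"]["source"] for chunk in state["chunks"]]
    return {"reference": "References: " + ", ".join(sources)}


def run_pipeline(steps: list[Step], state: State) -> State:
    """Run steps in order, merging each step's output into a new state."""
    for step in steps:
        state = {**state, **step(state)}
    return state


# The reference field exists from the start and stays empty until the
# formatter step, which must come after response generation, fills it in.
initial_state: State = {"user_query": "Tell me a fact", "chunks": [], "response": "", "reference": ""}
final_state = run_pipeline([retrieve, synthesize, format_reference], initial_state)
print(final_state["response"])
print(final_state["reference"])
```

Because each step only returns the keys it updates, adding the formatter leaves the earlier steps untouched; it simply reads the chunks and response that are already in the state.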
2) Run the Pipeline
Configure and invoke the pipeline
Configure the state and config for direct pipeline invocation:
Observe output
If all the steps run successfully, you will see something like this appended at the end of the result:
Troubleshooting
Common Issues
No references being generated:
Ensure chunks have meaningful metadata
Check that the reference formatter step is included in the pipeline
Verify the response contains content that matches chunk content
Poor reference quality:
Improve chunk metadata with more descriptive information
Ensure chunks are properly indexed with relevant content
Check that the response synthesizer generates content that references the chunks
Reference formatting issues:
Verify the SimilarityBasedReferenceFormatter is properly configured
Check that chunk metadata is in the expected format
Ensure the pipeline state includes the reference field
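Several of the checks above come down to inspecting chunk metadata. A small helper along these lines can flag chunks that would produce empty or unhelpful citations. It assumes chunks are available as plain dicts with a "metadata" key and that "source" is the field your references rely on; both are assumptions, so adjust them to your actual chunk format.

```python
def is_blank(value: object) -> bool:
    """True when a metadata value is missing or only whitespace."""
    return value is None or not str(value).strip()


def find_chunks_missing_metadata(
    chunks: list[dict],
    required_keys: tuple[str, ...] = ("source",),  # hypothetical required field
) -> list[int]:
    """Return indices of chunks whose metadata lacks any required key."""
    problems: list[int] = []
    for index, chunk in enumerate(chunks):
        metadata = chunk.get("metadata") or {}
        if any(is_blank(metadata.get(key)) for key in required_keys):
            problems.append(index)
    return problems


# Made-up chunks: one valid, one with a blank source, one without metadata.
sample_chunks = [
    {"content": "A", "metadata": {"source": "imaginary_animals.csv"}},
    {"content": "B", "metadata": {"source": "  "}},
    {"content": "C"},
]
print(find_chunks_missing_metadata(sample_chunks))
```

Running this over the chunks returned by your retriever before the formatter step narrows down whether missing references come from the data or from the pipeline wiring.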
Debug Tips
Enable debug mode: Set debug: true in your request to see detailed logs
Check chunk metadata: Verify that your chunks have meaningful metadata
Examine response content: Ensure the response actually references the chunk content
Pipeline step order: Confirm the reference formatter step comes after response generation
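The step-order tip can be turned into a quick assertion. The sketch below works on a plain list of step names; how you obtain that list from your pipeline object, and the names "response_synthesizer" and "reference_formatter" themselves, are assumptions rather than SDK conventions.

```python
def formatter_runs_after_generation(
    step_names: list[str],
    generation_step: str = "response_synthesizer",  # hypothetical step name
    formatter_step: str = "reference_formatter",    # hypothetical step name
) -> bool:
    """True only if both steps exist and the formatter comes later."""
    if generation_step not in step_names or formatter_step not in step_names:
        return False
    return step_names.index(formatter_step) > step_names.index(generation_step)


good_order = ["retriever", "response_synthesizer", "reference_formatter"]
bad_order = ["retriever", "reference_formatter", "response_synthesizer"]
print(formatter_runs_after_generation(good_order))
print(formatter_runs_after_generation(bad_order))
```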
Congratulations! You've successfully implemented a Reference Formatter component in your RAG pipeline. Your responses now include formatted references to the source information automatically, which makes them more credible and easier to trace.