Adding Document References
This guide will walk you through adding a Reference Formatter component to your RAG pipeline that automatically formats and includes source references in your responses, making your answers more credible and traceable.
What is a Reference Formatter?
A Reference Formatter is a pipeline component that:
Takes input: Retrieved chunks and the generated response
Analyzes similarity: Matches response content with source chunks
Formats references: Creates clean, readable reference citations
Enhances credibility: Provides traceability for information sources
In short, the component receives the retrieved chunks and the generated response, and returns a formatted list of the sources the response drew on, so users can see where each answer came from.
Installation
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore
Set Up Your Project
We'll build upon the pipeline you created in the Your First RAG Pipeline tutorial. Make sure you have that working before proceeding.
In this tutorial, we will be:
Reusing existing components: Same retriever, repacker, and response synthesizer
Adding reference formatting: New component that processes chunks and response
Enhancing output: Final response includes formatted references
Maintaining state: Uses the existing RAGState, which already contains a reference field
Prepare your repository
Go to the repository you used for Your First RAG Pipeline:
cd my-rag-pipeline
Prepare your .env file:
Ensure you have a file named .env in your project directory with the following content:
CSV_DATA_PATH="data/imaginary_animals.csv"
ELASTICSEARCH_URL="http://localhost:9200/"
EMBEDDING_MODEL="text-embedding-3-small"
LANGUAGE_MODEL="gpt-4o-mini"
INDEX_NAME="first-quest"
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
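A quick way to catch a missing setting before starting the server is to check for the variables listed above. The sketch below is illustrative only: missing_env_vars is a hypothetical helper, not part of the SDK, and it checks presence rather than validity. It reads the process environment, so it assumes the .env file has already been loaded by whatever mechanism your project uses.

```python
import os
from collections.abc import Mapping

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "CSV_DATA_PATH",
    "ELASTICSEARCH_URL",
    "EMBEDDING_MODEL",
    "LANGUAGE_MODEL",
    "INDEX_NAME",
    "OPENAI_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise in a unit test.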
Adjust Folder Structure
Extend your existing project structure to include the reference formatter:
my-rag-pipeline/
├── data/
│ ├── imaginary_animals.csv
├── modules/
│ ├── __init__.py
│ ├── retriever.py
│ ├── repacker.py
│ ├── response_synthesizer.py
│ ├── reference_formatter.py # 👈 New
├── indexer.py
├── pipeline.py
├── reference_formatter_pipeline.py # 👈 New
└── main.py # 👈 Will be modified
Index Your Data
Ensure your data is indexed. If it is not, follow the steps in Index Your Data before proceeding.
Build Core Components of Your Pipeline
Create the Reference Formatter Component
The reference formatter component will analyze the response and chunks to create meaningful references.
Create the reference formatter module
Create modules/reference_formatter.py with the necessary imports and component:
from gllm_rag.response_synthesizer import SimilarityBasedReferenceFormatter


def reference_formatter_component() -> SimilarityBasedReferenceFormatter:
    """Reference formatter component for creating citations from chunks and response.

    Returns:
        SimilarityBasedReferenceFormatter: An instance for formatting references.
    """
    return SimilarityBasedReferenceFormatter()
Key features:
SimilarityBasedReferenceFormatter: Automatically matches response content with source chunks
Intelligent formatting: Creates clean, readable reference citations
Metadata utilization: Uses chunk metadata for meaningful citations
Automatic processing: No manual configuration needed
Understand the reference formatter behavior
The SimilarityBasedReferenceFormatter:
Analyzes response: Examines the generated response text
Matches chunks: Finds which chunks contributed to the response
Extracts metadata: Uses chunk metadata for citation formatting
Formats references: Creates clean, professional citations
Returns formatted text: Combines response with formatted references
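The behavior listed above can be pictured with a small stand-alone sketch. This is not the library's implementation: how SimilarityBasedReferenceFormatter scores similarity is not described in this guide, so the example substitutes simple word overlap. Chunks are modeled as plain dictionaries, and the function names, the 0.5 threshold, and the sample data are all assumptions made for illustration.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, a crude stand-in for a real similarity model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response: str, chunk_text: str) -> float:
    """Fraction of the chunk's tokens that also appear in the response."""
    chunk_tokens = _tokens(chunk_text)
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & _tokens(response)) / len(chunk_tokens)


def format_references(response: str, chunks: list[dict], threshold: float = 0.5) -> str:
    """Cite chunks whose overlap with the response meets the threshold.

    Each cited chunk is labeled by its 'name' metadata, falling back to its id.
    """
    cited: list[str] = []
    for chunk in chunks:
        if overlap_score(response, chunk["content"]) >= threshold:
            label = chunk.get("metadata", {}).get("name", chunk["id"])
            if label not in cited:
                cited.append(label)
    return "Sources: " + ", ".join(cited) if cited else ""


# Made-up sample data for illustration only.
sample_chunks = [
    {"id": "c1", "content": "The Dusk Panther prowls the twilight forests.",
     "metadata": {"name": "Dusk Panther"}},
    {"id": "c2", "content": "Sand Skipper crosses dry desert dunes.",
     "metadata": {"name": "Sand Skipper"}},
]
print(format_references("The Dusk Panther lives in the twilight forests.", sample_chunks))
# prints: Sources: Dusk Panther
```

Only the first chunk is cited because five of its six distinct words appear in the response, while the second chunk shares none.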
Build the Pipeline
Now let's create the pipeline that includes the reference formatter.
Create the reference formatter pipeline file
Create reference_formatter_pipeline.py with the necessary imports:
from gllm_pipeline.pipeline.states import RAGState
from gllm_pipeline.pipeline.pipeline import Pipeline
from gllm_pipeline.steps import bundle, step
# modules/__init__.py must export reference_formatter_component for this import to work.
from modules import (
    repacker_component,
    response_synthesizer_component,
    retriever_component,
    reference_formatter_component,
)
Use the existing RAGState
Since RAGState already contains a reference field, we don't need to create a new state:
# RAGState already contains:
# - user_query: str
# - chunks: list[Chunk]
# - context: str
# - response: str
# - reference: str # 👈 This is what we'll populate
The existing RAGState is perfect for our reference formatter pipeline.
Create component instances
Instantiate your existing components plus the new reference formatter:
retriever = retriever_component()
repacker = repacker_component(mode="context")
response_synthesizer = response_synthesizer_component()
reference_formatter = reference_formatter_component()
These include your original components plus the new reference formatter.
Define the individual pipeline steps
Create the standard pipeline steps including the reference formatter:
retriever_step = step(
    retriever,
    {"query": "user_query"},
    "chunks",
    {"top_k": "top_k"},
)
repacker_step = step(
    repacker,
    {"chunks": "chunks"},
    "context",
)
response_synthesizer_step = step(
    response_synthesizer,
    {"query": "user_query", "context": "context"},
    "response",
)
reference_formatter_step = step(
    reference_formatter,
    {"chunks": "chunks", "response": "response"},
    "reference",
)
Key points:
Standard RAG steps: Same retriever, repacker, and response synthesizer
Reference formatter step: Takes chunks and response as input, outputs formatted references
State flow: Chunks → Context → Response → Reference
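To make the state flow concrete, here is a plain-Python sketch of the idea behind the step definitions: each step reads named keys from the state, calls its component, and writes one output key. It is not how gllm_pipeline is implemented. make_step, run_steps, and the toy components are hypothetical, and the sketch ignores async execution, runtime config such as top_k, and event emission.

```python
from typing import Any, Callable

State = dict[str, Any]
Step = Callable[[State], State]


def make_step(component: Callable[..., Any], input_map: dict[str, str], output_key: str) -> Step:
    """Build a step that calls `component` with arguments read from the state
    (argument name -> state key) and stores the result under `output_key`."""
    def run(state: State) -> State:
        kwargs = {arg: state[key] for arg, key in input_map.items()}
        return {**state, output_key: component(**kwargs)}
    return run


def run_steps(steps: list[Step], state: State) -> State:
    """Apply the steps in order, threading the state through each one."""
    for current_step in steps:
        state = current_step(state)
    return state


# Toy components standing in for the real retriever, repacker, and so on.
def toy_retriever(query: str) -> list[str]:
    return [f"chunk about {query}"]


def toy_repacker(chunks: list[str]) -> str:
    return "\n".join(chunks)


def toy_synthesizer(query: str, context: str) -> str:
    return f"Answer to '{query}' based on: {context}"


def toy_reference_formatter(chunks: list[str], response: str) -> str:
    return f"Sources: {len(chunks)} chunk(s)"


steps = [
    make_step(toy_retriever, {"query": "user_query"}, "chunks"),
    make_step(toy_repacker, {"chunks": "chunks"}, "context"),
    make_step(toy_synthesizer, {"query": "user_query", "context": "context"}, "response"),
    make_step(toy_reference_formatter, {"chunks": "chunks", "response": "response"}, "reference"),
]
final_state = run_steps(steps, {"user_query": "forest animals"})
print(list(final_state))
```

Because the reference step reads both chunks and response, running it before those keys exist raises a KeyError in this sketch, which mirrors why the reference formatter step has to come last.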
Compose the final pipeline
Connect all steps into the complete reference formatter pipeline:
e2e_pipeline_with_reference_formatter = (
    retriever_step
    | repacker_step
    | response_synthesizer_step
    | reference_formatter_step
)
e2e_pipeline_with_reference_formatter.state_type = RAGState
Pipeline flow:
Retriever Step: Searches knowledge base for relevant chunks
Repacker Step: Assembles chunks into context
Response Synthesizer Step: Generates response from context
Reference Formatter Step: Creates formatted references from chunks and response
Modify the Application Code
Here we will update the main.py file to use the reference formatter pipeline.
Update Pipeline Import
Update the import in main.py
Add the import for the new reference formatter pipeline:
from reference_formatter_pipeline import e2e_pipeline_with_reference_formatter
Update the pipeline execution
Modify the run_pipeline function to use the reference formatter pipeline:
async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)  # Change to new pipeline
        ...
Handle Reference Output
The reference formatter will populate the reference field in the state, which you can use in your response.
Update the response handling
Modify your response handling to include the formatted references:
async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)
        # Include reference in the final response
        final_response = {
            "response": state.get("response", ""),
            "reference": state.get("reference", ""),
            "chunks": state.get("chunks", []),
        }
        await event_emitter.emit("Finished pipeline", final_response)
        ...
Key changes:
Reference inclusion: The response now includes formatted references
Enhanced output: Users get both the answer and its sources
Traceability: Easy to verify information sources
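If you want to show the answer and its sources as one block of text, a small helper can join the two fields. This is only a sketch: render_answer is a hypothetical name, and it assumes the final response is a dictionary with the response and reference keys used above.

```python
def render_answer(final_response: dict) -> str:
    """Join the answer and its references, omitting the references when empty."""
    answer = final_response.get("response", "").strip()
    reference = final_response.get("reference", "").strip()
    return f"{answer}\n\n{reference}" if reference else answer


print(render_answer({"response": "The Dusk Panther.", "reference": "Sources: Dusk Panther"}))
```

Skipping the reference line when it is empty avoids showing a dangling blank section for responses that matched no chunks.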
Run Your Application
Now let's test the reference formatter functionality.
Start your server
Run your FastAPI server as before:
poetry run uvicorn main:app --reload
Test with reference formatter
Try this query to see the reference formatter in action:
{
"user_query": "Which animal lives in the forest?",
"top_k": 5,
"debug": true
}
Expected behavior:
The pipeline will retrieve information from your imaginary_animals.csv
You'll see all standard RAG steps plus the reference formatter
The response will include both the answer and formatted references
Analyze the debug output
You should see logs showing the reference formatter in action:
Starting pipeline
[Start 'BasicVectorRetriever'] Processing input:
- query: 'Which animal lives in the forest?'
- top_k: 5
- event_emitter: <gllm_core.event.event_emitter.EventEmitter object at ...>
[Finished 'BasicVectorRetriever'] Successfully retrieved 5 chunks.
[Start 'Repacker'] Repacking 5 chunks.
[Finished 'Repacker'] Successfully repacked chunks: ...
[Start 'StuffResponseSynthesizer'] Processing query: 'Which animal lives in the forest?'
[Finished 'StuffResponseSynthesizer'] Successfully synthesized response:
"All the animals mentioned live in forests:
1. Dusk Panther - Twilight forests of Shadowglade
2. Mossback Tortoise - Damp forests of Evergreen Hollow
3. Mistlynx - Fog-laden forests of Whisperwood
4. Whispering Viper - Dense underbrush of Murmur Jungle
5. Luminafox - Luminescent forests of Nyxland"
[Start 'SimilarityBasedReferenceFormatter'] Formatting references using 5 candidate chunks.
[
Chunk(
'id': <chunk_id>,
'content': The Dusk Panther prowls the twilight forests of Sh...,
'metadata': {'name': <animal_name>},
'score': <retrieval_score>
),
...
]
[Finished 'SimilarityBasedReferenceFormatter'] Successfully formatted references.
Finished pipeline
Examine the formatted references
The reference formatter will create clean, professional citations. Your response might look like:
{
"response": "All the animals mentioned live in forests: 1. Dusk Panther - Twilight forests of Shadowglade 2. Mossback Tortoise - Damp forests of Evergreen Hollow 3. Mistlynx - Fog-laden forests of Whisperwood 4. Whispering Viper - Dense underbrush of Murmur Jungle 5. Luminafox - Luminescent forests of Nyxland",
"reference": "Sources: Dusk Panther (metadata: name), Mossback Tortoise (metadata: name), Mistlynx (metadata: name), Whispering Viper (metadata: name), Luminafox (metadata: name)",
"chunks": [...]
}
Understanding the Flow
Here's what happens in the reference formatter pipeline:
Complete Pipeline Flow
Retrieval: Searches your knowledge base for relevant chunks
Repacking: Assembles retrieved chunks into context
Response Generation: Creates answer from context
Reference Formatting: Analyzes response and chunks to create citations
Final Output: Returns both answer and formatted references
Reference Formatter Process
Input Analysis: Takes the generated response and source chunks
Similarity Matching: Identifies which chunks contributed to the response
Metadata Extraction: Uses chunk metadata for citation formatting
Reference Creation: Generates clean, professional citations
Output: Returns formatted reference string
Extending the Reference System
Multiple Reference Types
You can extend the reference system for different types of sources:
class ExtendedRAGState(RAGState):
    academic_references: str
    web_references: str
    internal_references: str
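One way to fill separate reference fields is to split the retrieved chunks by source type before formatting each group. The sketch below assumes a hypothetical source_type metadata key, which the chunks in this tutorial do not have, and it models chunks as plain dictionaries rather than the SDK's Chunk objects.

```python
from collections import defaultdict


def group_chunks_by_source_type(chunks: list[dict], default: str = "internal") -> dict[str, list[dict]]:
    """Bucket chunks by the hypothetical `source_type` metadata key.

    Chunks without that key fall into the `default` bucket.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for chunk in chunks:
        source_type = chunk.get("metadata", {}).get("source_type", default)
        groups[source_type].append(chunk)
    return dict(groups)


example = group_chunks_by_source_type([
    {"id": "1", "metadata": {"source_type": "web"}},
    {"id": "2", "metadata": {"source_type": "academic"}},
    {"id": "3", "metadata": {}},
])
print(sorted(example))
```

Each group could then be handed to its own formatter step that writes to the matching field of the extended state.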
Custom Reference Formatters
Create specialized reference formatters for different content types:
def academic_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references in academic citation style."""
    return SimilarityBasedReferenceFormatter(
        # Academic citation configuration
    )


def web_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references as web links."""
    return SimilarityBasedReferenceFormatter(
        # Web link configuration
    )
Troubleshooting
Common Issues
No references being generated:
Ensure chunks have meaningful metadata
Check that the reference formatter step is included in the pipeline
Verify the response contains content that matches chunk content
Poor reference quality:
Improve chunk metadata with more descriptive information
Ensure chunks are properly indexed with relevant content
Check that the response synthesizer generates content that references the chunks
Reference formatting issues:
Verify the SimilarityBasedReferenceFormatter is properly configured
Check that chunk metadata is in the expected format
Ensure the pipeline state includes the reference field
Debug Tips
Enable debug mode: Set debug: true in your request to see detailed logs
Check chunk metadata: Verify that your chunks have meaningful metadata
Examine response content: Ensure the response actually references the chunk content
Pipeline step order: Confirm the reference formatter step comes after response generation
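The metadata check can be automated with a short helper. This is a sketch under assumptions: chunks_missing_metadata is a hypothetical function, chunks are modeled as dictionaries with id and metadata keys, and name is used as the required key because it is the metadata field shown in the debug output above.

```python
def chunks_missing_metadata(chunks: list[dict], required_keys: tuple[str, ...] = ("name",)) -> list[str]:
    """Return the ids of chunks whose metadata lacks a required key or has it empty."""
    flagged: list[str] = []
    for chunk in chunks:
        metadata = chunk.get("metadata") or {}
        if any(not metadata.get(key) for key in required_keys):
            flagged.append(chunk.get("id", "<no id>"))
    return flagged


print(chunks_missing_metadata([
    {"id": "a", "metadata": {"name": "Dusk Panther"}},
    {"id": "b", "metadata": {}},
]))
# prints: ['b']
```

Running a check like this on the retrieved chunks before the reference formatter step helps distinguish a metadata problem from a matching problem.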
Congratulations! You've successfully implemented a Reference Formatter component in your RAG pipeline. This enhancement makes your responses more credible and traceable by automatically including formatted references to the source information, significantly improving the transparency and reliability of your AI-powered answers.