Adding Document References

This guide walks you through adding a Reference Formatter component to your RAG pipeline. The component formats source references and includes them alongside your responses, making your answers more credible and traceable.

This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository and index your data.

Prerequisites

This tutorial specifically requires:

Completion of the Your First RAG Pipeline tutorial.
All setup steps listed on the Prerequisites page.
An Elasticsearch vector data store that is already set up and available for use. Refer to Supported Vector Data Store for the tutorial.

What is a Reference Formatter?

A Reference Formatter is a pipeline component that:

Takes input: Retrieved chunks and the generated response
Analyzes similarity: Matches response content with source chunks
Formats references: Creates clean, readable reference citations
Enhances credibility: Provides traceability for information sources

In short, the component takes the retrieved chunks and the generated response as input, determines which chunks the response draws on, and formats them as references so users can see where the information came from.

Installation

# Linux/macOS (bash or zsh): you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

# Windows PowerShell: you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

REM Windows Command Prompt: you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

Set Up Your Project

We'll build upon the pipeline you created in the Your First RAG Pipeline tutorial. Make sure you have that working before proceeding.

In this tutorial, we will be:

Reusing existing components: Same retriever, repacker, and response synthesizer
Adding reference formatting: New component that processes chunks and response
Enhancing output: Final response includes formatted references
Maintaining state: Uses existing RAGState which already contains reference field

Prepare your repository

Go to the repository you used for the Your First RAG Pipeline tutorial:

cd my-rag-pipeline

Prepare your .env file:

Ensure you have a file named .env in your project directory with the following content:

CSV_DATA_PATH="data/imaginary_animals.csv"
ELASTICSEARCH_URL="http://localhost:9200/"
EMBEDDING_MODEL="text-embedding-3-small"
LANGUAGE_MODEL="gpt-4o-mini"
INDEX_NAME="first-quest"
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

This is an example .env file. Adjust the variables to suit your needs.
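
Before running the pipeline, it can help to confirm that these variables are actually set. The sketch below uses only the standard library and assumes the variables have already been loaded into the process environment (for example by your shell or by a dotenv loader). `REQUIRED_VARS` and `missing_env_vars` are illustrative names, not part of the SDK.

```python
import os
from collections.abc import Mapping

# Variable names taken from the example .env file above.
REQUIRED_VARS = (
    "CSV_DATA_PATH",
    "ELASTICSEARCH_URL",
    "EMBEDDING_MODEL",
    "LANGUAGE_MODEL",
    "INDEX_NAME",
    "OPENAI_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running this check once at startup surfaces a missing key before the first request reaches the retriever or the language model.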

Adjust Folder Structure

Extend your existing project structure to include the reference formatter:

my-rag-pipeline/
├── data/
│   └── imaginary_animals.csv
├── modules/
│   ├── __init__.py
│   ├── retriever.py
│   ├── repacker.py
│   ├── response_synthesizer.py
│   └── reference_formatter.py    # 👈 New
├── indexer.py
├── pipeline.py
├── reference_formatter_pipeline.py    # 👈 New
└── main.py                         # 👈 Will be modified

Index Your Data

Ensure your data is indexed. If it is not, follow the steps in Index Your Data before proceeding.

Build Core Components of Your Pipeline

Create the Reference Formatter Component

The reference formatter component will analyze the response and chunks to create meaningful references.

Create the reference formatter module

Create modules/reference_formatter.py with the necessary imports and component:

from gllm_rag.response_synthesizer import SimilarityBasedReferenceFormatter

def reference_formatter_component() -> SimilarityBasedReferenceFormatter:
    """Reference formatter component for creating citations from chunks and response.

    Returns:
        SimilarityBasedReferenceFormatter: An instance for formatting references.
    """
    return SimilarityBasedReferenceFormatter()

Key features:

SimilarityBasedReferenceFormatter: Automatically matches response content with source chunks
Intelligent formatting: Creates clean, readable reference citations
Metadata utilization: Uses chunk metadata for meaningful citations
Automatic processing: No manual configuration needed

Understand the reference formatter behavior

The SimilarityBasedReferenceFormatter:

Analyzes response: Examines the generated response text
Matches chunks: Finds which chunks contributed to the response
Extracts metadata: Uses chunk metadata for citation formatting
Formats references: Creates clean, professional citations
Returns formatted text: Combines response with formatted references
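
To build intuition for these steps, here is a toy, standard-library-only sketch of similarity-based matching. It is not the `SimilarityBasedReferenceFormatter` implementation, whose internals this tutorial does not describe. The `ToyChunk` class, the `name` metadata key, the word-overlap score, the `0.5` threshold, and the sample data are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a retrieved chunk (not the SDK's Chunk class)."""
    content: str
    metadata: dict = field(default_factory=dict)


def _words(text: str) -> set[str]:
    """Lowercase a text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(response: str, chunk: ToyChunk) -> float:
    """Fraction of the chunk's words that also appear in the response."""
    chunk_words = _words(chunk.content)
    if not chunk_words:
        return 0.0
    return len(chunk_words & _words(response)) / len(chunk_words)


def toy_format_references(response: str, chunks: list[ToyChunk], threshold: float = 0.5) -> str:
    """Cite the chunks whose overlap with the response reaches the threshold."""
    names = [
        chunk.metadata.get("name", "unknown source")
        for chunk in chunks
        if overlap_score(response, chunk) >= threshold
    ]
    return "Sources: " + ", ".join(names) if names else ""


# Made-up sample data for the sketch.
chunks = [
    ToyChunk("The Dusk Panther prowls the twilight forests", {"name": "Dusk Panther"}),
    ToyChunk("The Sand Skimmer glides over desert dunes", {"name": "Sand Skimmer"}),
]
response = "The Dusk Panther lives in the twilight forests."
print(toy_format_references(response, chunks))  # Sources: Dusk Panther
```

Only the first chunk is cited because five of its six distinct words appear in the response, while the second chunk shares just one word with it.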

Build the Pipeline

Now let's create the pipeline that includes the reference formatter.

Create the reference formatter pipeline file

Create reference_formatter_pipeline.py with the necessary imports:

from gllm_pipeline.pipeline.states import RAGState
from gllm_pipeline.pipeline.pipeline import Pipeline
from gllm_pipeline.steps import bundle, step

from modules import (
    repacker_component,
    response_synthesizer_component,
    retriever_component,
    reference_formatter_component,
)

Use the existing RAGState

Since RAGState already contains a reference field, we don't need to create a new state:

# RAGState already contains:
# - user_query: str
# - chunks: list[Chunk]
# - context: str
# - response: str
# - reference: str  # 👈 This is what we'll populate

The existing RAGState is perfect for our reference formatter pipeline.
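
If it helps to picture the state, the fields listed above can be modeled as a `TypedDict`. This is only an illustrative mirror: `SketchRAGState` is a hypothetical name, the real `RAGState` is defined in `gllm_pipeline` and may carry additional fields, and `chunks` is typed as `list[Any]` here to avoid depending on the SDK's `Chunk` class.

```python
from typing import Any, TypedDict


class SketchRAGState(TypedDict, total=False):
    """Illustrative mirror of the fields listed above (not the SDK class)."""
    user_query: str
    chunks: list[Any]
    context: str
    response: str
    reference: str  # filled in by the reference formatter step


# Before the pipeline runs, only the query is known.
state: SketchRAGState = {"user_query": "Which animal lives in the forest?"}

# After the final step, `reference` sits next to `response` in the same state.
state["response"] = "The Dusk Panther lives in the twilight forests."
state["reference"] = "Sources: Dusk Panther"
print(sorted(state))  # ['reference', 'response', 'user_query']
```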

Create component instances

Instantiate your existing components plus the new reference formatter:

retriever = retriever_component()
repacker = repacker_component(mode="context")
response_synthesizer = response_synthesizer_component()
reference_formatter = reference_formatter_component()

These include your original components plus the new reference formatter.

Define the individual pipeline steps

Create the standard pipeline steps including the reference formatter:

retriever_step = step(
    retriever,
    {"query": "user_query"},
    "chunks",
    {"top_k": "top_k"},
)

repacker_step = step(
    repacker,
    {"chunks": "chunks"},
    "context",
)

response_synthesizer_step = step(
    response_synthesizer,
    {"query": "user_query", "context": "context"},
    "response",
)

reference_formatter_step = step(
    reference_formatter,
    {"chunks": "chunks", "response": "response"},
    "reference",
)

Key points:

Standard RAG steps: Same retriever, repacker, and response synthesizer
Reference formatter step: Takes chunks and response as input, outputs formatted references
State flow: Chunks → Context → Response → Reference
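
One plausible reading of the `step(...)` calls above: the second argument maps component parameter names to state keys, and the third names the state key that receives the output. The toy `make_step` helper below mimics that reading in plain Python. It is a mental model only, not how `gllm_pipeline` implements `step`, and it ignores both the fourth argument (`{"top_k": "top_k"}`) and async execution. `fake_reference_formatter` is a hypothetical component.

```python
from typing import Any, Callable

State = dict[str, Any]


def make_step(component: Callable[..., Any], input_map: dict[str, str], output_key: str) -> Callable[[State], State]:
    """Build a function that feeds state values to `component` and stores its result."""
    def run(state: State) -> State:
        # Look up each component argument under its mapped state key.
        kwargs = {arg: state[key] for arg, key in input_map.items()}
        # Return a new state with the component's output stored under `output_key`.
        return {**state, output_key: component(**kwargs)}
    return run


def fake_reference_formatter(chunks: list[str], response: str) -> str:
    """Hypothetical component: cite every chunk mentioned verbatim in the response."""
    cited = [chunk for chunk in chunks if chunk in response]
    return "Sources: " + ", ".join(cited)


reference_step = make_step(
    fake_reference_formatter,
    {"chunks": "chunks", "response": "response"},
    "reference",
)

state = {"chunks": ["Dusk Panther", "Mistlynx"], "response": "The Dusk Panther lives in forests."}
print(reference_step(state)["reference"])  # Sources: Dusk Panther
```

Seen this way, the reference formatter step needs nothing new from the pipeline: it reads two keys that earlier steps already wrote and adds one more.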

Compose the final pipeline

Connect all steps into the complete reference formatter pipeline:

e2e_pipeline_with_reference_formatter = (
    retriever_step 
    | repacker_step 
    | response_synthesizer_step 
    | reference_formatter_step
)

e2e_pipeline_with_reference_formatter.state_type = RAGState

Pipeline flow:

Retriever Step: Searches knowledge base for relevant chunks
Repacker Step: Assembles chunks into context
Response Synthesizer Step: Generates response from context
Reference Formatter Step: Creates formatted references from chunks and response
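
The `|` operator chains steps so that each one sees the state produced by the previous one. A toy version of that idea, using Python's `__or__` hook, is sketched below. `ToyStep` and the lambda components are hypothetical and only imitate the ordering behavior; the real pipeline is asynchronous and typed, as the rest of this tutorial shows.

```python
from typing import Any, Callable

State = dict[str, Any]


class ToyStep:
    """Hypothetical step wrapper that supports `a | b` chaining."""

    def __init__(self, fn: Callable[[State], State]) -> None:
        self.fn = fn

    def __or__(self, other: "ToyStep") -> "ToyStep":
        # Run this step first, then pass its output state to `other`.
        return ToyStep(lambda state: other.fn(self.fn(state)))

    def invoke(self, state: State) -> State:
        return self.fn(state)


retrieve = ToyStep(lambda s: {**s, "chunks": ["Dusk Panther prowls twilight forests"]})
repack = ToyStep(lambda s: {**s, "context": "\n".join(s["chunks"])})
synthesize = ToyStep(lambda s: {**s, "response": f"Answer based on: {s['context']}"})
format_refs = ToyStep(lambda s: {**s, "reference": f"Sources: {len(s['chunks'])} chunk(s)"})

toy_pipeline = retrieve | repack | synthesize | format_refs
final_state = toy_pipeline.invoke({"user_query": "Which animal lives in the forest?"})
print(final_state["reference"])  # Sources: 1 chunk(s)
```

In this toy model, placing `format_refs` before `retrieve` would fail with a `KeyError`, since `chunks` would not exist yet. This illustrates why the reference formatter belongs at the end of the chain.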

Modify the Application Code

Here we will update the main.py file to use the reference formatter pipeline.

Update Pipeline Import

Update the import in main.py

Add the import for the new reference formatter pipeline:

from reference_formatter_pipeline import e2e_pipeline_with_reference_formatter

Update the pipeline execution

Modify the run_pipeline function to use the reference formatter pipeline:

async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)  # Change to new pipeline
    ...

Handle Reference Output

The reference formatter will populate the reference field in the state, which you can use in your response.

Update the response handling

Modify your response handling to include the formatted references:

async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)
        
        # Include reference in the final response
        final_response = {
            "response": state.get("response", ""),
            "reference": state.get("reference", ""),
            "chunks": state.get("chunks", [])
        }
        
        await event_emitter.emit("Finished pipeline", final_response)
    ...

Key changes:

Reference inclusion: The response now includes formatted references
Enhanced output: Users get both the answer and its sources
Traceability: Easy to verify information sources
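
The dictionary built above can be factored into a small pure function, which makes the fallback behavior easy to test in isolation. `build_final_response` is a hypothetical helper name, and the sketch assumes the state behaves like a plain `dict`, as the `state.get(...)` calls in the snippet suggest.

```python
from typing import Any


def build_final_response(state: dict[str, Any]) -> dict[str, Any]:
    """Collect the answer, its references, and the source chunks from the state."""
    return {
        "response": state.get("response", ""),
        "reference": state.get("reference", ""),  # empty if no references were produced
        "chunks": state.get("chunks", []),
    }


# A state with no `reference` key still yields a well-formed payload.
print(build_final_response({"response": "The Dusk Panther lives in forests."}))
```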

Run Your Application

Now let's test the reference formatter functionality.

Start your server

Run your FastAPI server as before:

poetry run uvicorn main:app --reload

Test with reference formatter

Try this query to see the reference formatter in action:

{
  "user_query": "Which animal lives in the forest?",
  "top_k": 5,
  "debug": true
}

Expected behavior:

The pipeline will retrieve information from your imaginary_animals.csv
You'll see all standard RAG steps plus the reference formatter
The response will include both the answer and formatted references

Analyze the debug output

You should see logs showing the reference formatter in action:

Starting pipeline
[Start 'BasicVectorRetriever'] Processing input:
    - query: 'Which animal lives in the forest?'
    - top_k: 5
    - event_emitter: <gllm_core.event.event_emitter.EventEmitter object at ...>
[Finished 'BasicVectorRetriever'] Successfully retrieved 5 chunks.
[Start 'Repacker'] Repacking 5 chunks.
[Finished 'Repacker'] Successfully repacked chunks: ...
[Start 'StuffResponseSynthesizer'] Processing query: 'Which animal lives in the forest?'
[Finished 'StuffResponseSynthesizer'] Successfully synthesized response:
"All the animals mentioned live in forests:
1. Dusk Panther - Twilight forests of Shadowglade
2. Mossback Tortoise - Damp forests of Evergreen Hollow
3. Mistlynx - Fog-laden forests of Whisperwood
4. Whispering Viper - Dense underbrush of Murmur Jungle
5. Luminafox - Luminescent forests of Nyxland"
[Start 'SimilarityBasedReferenceFormatter'] Formatting references using 5 candidate chunks.
[
    Chunk(
      'id': <chunk_id>,
      'content': The Dusk Panther prowls the twilight forests of Sh...,
      'metadata': {'name': <animal_name>},
      'score': <retrieval_score>
    ), 
    ...
]
[Finished 'SimilarityBasedReferenceFormatter'] Successfully formatted references.
Finished pipeline

Examine the formatted references

The reference formatter will create clean, professional citations. Your response might look like:

{
  "response": "All the animals mentioned live in forests: 1. Dusk Panther - Twilight forests of Shadowglade 2. Mossback Tortoise - Damp forests of Evergreen Hollow 3. Mistlynx - Fog-laden forests of Whisperwood 4. Whispering Viper - Dense underbrush of Murmur Jungle 5. Luminafox - Luminescent forests of Nyxland",
  "reference": "Sources: Dusk Panther (metadata: name), Mossback Tortoise (metadata: name), Mistlynx (metadata: name), Whispering Viper (metadata: name), Luminafox (metadata: name)",
  "chunks": [...]
}

Understanding the Flow

Here's what happens in the reference formatter pipeline:

Complete Pipeline Flow

Retrieval: Searches your knowledge base for relevant chunks
Repacking: Assembles retrieved chunks into context
Response Generation: Creates answer from context
Reference Formatting: Analyzes response and chunks to create citations
Final Output: Returns both answer and formatted references

Reference Formatter Process

Input Analysis: Takes the generated response and source chunks
Similarity Matching: Identifies which chunks contributed to the response
Metadata Extraction: Uses chunk metadata for citation formatting
Reference Creation: Generates clean, professional citations
Output: Returns formatted reference string

Extending the Reference System

Multiple Reference Types

You can extend the reference system for different types of sources:

from gllm_pipeline.pipeline.states import RAGState

class ExtendedRAGState(RAGState):
    academic_references: str
    web_references: str
    internal_references: str
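
To fill separate fields like these, chunks would need to be grouped by source type before formatting. The sketch below shows one way to do that grouping in plain Python. It assumes each chunk's metadata carries a `source_type` key with values such as `academic`, `web`, or `internal`. That key, the plain-dict chunk shape, the sample data, and the `group_by_source_type` helper are all hypothetical and not part of the SDK.

```python
from collections import defaultdict
from typing import Any


def group_by_source_type(chunks: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Bucket chunks by their (assumed) `source_type` metadata value."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for chunk in chunks:
        source_type = chunk.get("metadata", {}).get("source_type", "internal")
        groups[source_type].append(chunk)
    return dict(groups)


# Made-up sample chunks for the sketch.
chunks = [
    {"content": "Field study notes", "metadata": {"source_type": "academic"}},
    {"content": "Blog post excerpt", "metadata": {"source_type": "web"}},
    {"content": "Team wiki entry", "metadata": {}},  # falls back to "internal"
]
groups = group_by_source_type(chunks)
print({source_type: len(items) for source_type, items in groups.items()})
```

Each group could then be passed to its own formatter step, with the output written to the matching field of the extended state.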

Custom Reference Formatters

Create specialized reference formatters for different content types:

from gllm_rag.response_synthesizer import SimilarityBasedReferenceFormatter

def academic_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references in academic citation style."""
    return SimilarityBasedReferenceFormatter(
        # Academic citation configuration
    )

def web_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references as web links."""
    return SimilarityBasedReferenceFormatter(
        # Web link configuration
    )
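
Since the constructor options of `SimilarityBasedReferenceFormatter` are not shown in this tutorial, the sketch below illustrates the citation-style idea with plain functions instead. The metadata keys (`author`, `year`, `title`, `url`), both function names, and the sample metadata are assumptions for illustration.

```python
def format_academic(metadata: dict[str, str]) -> str:
    """Render metadata as an author-year style citation."""
    author = metadata.get("author", "Unknown author")
    year = metadata.get("year", "n.d.")
    title = metadata.get("title", "Untitled")
    return f"{author} ({year}). {title}."


def format_web(metadata: dict[str, str]) -> str:
    """Render metadata as a title followed by its link."""
    title = metadata.get("title", "Untitled")
    url = metadata.get("url", "")
    return f"{title} <{url}>" if url else title


# Made-up sample metadata for the sketch.
meta = {
    "author": "A. Researcher",
    "year": "2024",
    "title": "Habitats of Imaginary Animals",
    "url": "https://example.com/habitats",
}
print(format_academic(meta))  # A. Researcher (2024). Habitats of Imaginary Animals.
print(format_web(meta))       # Habitats of Imaginary Animals <https://example.com/habitats>
```

Both functions fall back to neutral placeholders when a key is missing, so sparse metadata degrades the citation rather than raising an error.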

Troubleshooting

Common Issues

No references being generated:
- Ensure chunks have meaningful metadata
- Check that the reference formatter step is included in the pipeline
- Verify the response contains content that matches chunk content

Poor reference quality:
- Improve chunk metadata with more descriptive information
- Ensure chunks are properly indexed with relevant content
- Check that the response synthesizer generates content that references the chunks

Reference formatting issues:
- Verify the SimilarityBasedReferenceFormatter is properly configured
- Check that chunk metadata is in the expected format
- Ensure the pipeline state includes the reference field

Debug Tips

Enable debug mode: Set debug: true in your request to see detailed logs
Check chunk metadata: Verify that your chunks have meaningful metadata
Examine response content: Ensure the response actually references the chunk content
Pipeline step order: Confirm the reference formatter step comes after response generation
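
Two of these checks are easy to automate. The sketch below flags chunks with empty metadata and verifies that the formatter runs after response generation. It works on plain dictionaries and a list of step names. Both shapes, and the helper names, are assumptions made for the example rather than SDK APIs.

```python
from typing import Any


def chunks_missing_metadata(chunks: list[dict[str, Any]]) -> list[int]:
    """Return the positions of chunks whose metadata is absent or empty."""
    return [i for i, chunk in enumerate(chunks) if not chunk.get("metadata")]


def formatter_runs_after_synthesizer(step_names: list[str]) -> bool:
    """Check that the reference formatter comes after the response synthesizer."""
    try:
        return step_names.index("reference_formatter") > step_names.index("response_synthesizer")
    except ValueError:
        # One of the two steps is missing from the pipeline entirely.
        return False


chunks = [
    {"content": "The Dusk Panther prowls...", "metadata": {"name": "Dusk Panther"}},
    {"content": "An unlabeled chunk", "metadata": {}},
]
steps = ["retriever", "repacker", "response_synthesizer", "reference_formatter"]

print(chunks_missing_metadata(chunks))          # [1]
print(formatter_runs_after_synthesizer(steps))  # True
```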

📂 Complete Tutorial Files

Coming soon!

Congratulations! You've successfully implemented a Reference Formatter component in your RAG pipeline. This enhancement makes your responses more credible and traceable by automatically including formatted references to the source information, significantly improving the transparency and reliability of your AI-powered answers.

