Adding Document References

This guide shows how to add a Reference Formatter component to your RAG pipeline. The component formats source references and returns them alongside each response, so readers can trace an answer back to the chunks it came from.

This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository and index your data.

Prerequisites

This tutorial specifically requires:

What is a Reference Formatter?

A Reference Formatter is a pipeline component that:

  • Takes input: Retrieved chunks and the generated response

  • Analyzes similarity: Matches response content with source chunks

  • Formats references: Creates clean, readable reference citations

  • Enhances credibility: Provides traceability for information sources

In short, the component receives the retrieved chunks and the generated response, works out which chunks the response draws on, and returns references that show users where the information came from.

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

Set Up Your Project

We'll build upon the pipeline you created in the Your First RAG Pipeline tutorial. Make sure you have that working before proceeding.

In this tutorial, we will be:

  • Reusing existing components: Same retriever, repacker, and response synthesizer

  • Adding reference formatting: New component that processes chunks and response

  • Enhancing output: Final response includes formatted references

  • Maintaining state: Uses the existing RAGState, which already contains a reference field

Prepare your repository

1. Go to the repository you use for Your First RAG Pipeline:

cd my-rag-pipeline

2. Prepare your .env file:

Ensure you have a file named .env in your project directory with the following content:

CSV_DATA_PATH="data/imaginary_animals.csv"
ELASTICSEARCH_URL="http://localhost:9200/"
EMBEDDING_MODEL="text-embedding-3-small"
LANGUAGE_MODEL="gpt-4o-mini"
INDEX_NAME="first-quest"
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

This is an example .env file. Adjust the variables to suit your setup.
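Before running the pipeline, it can help to confirm that every variable listed above is present. The following is a minimal sketch using only the standard library. The helper names `parse_env_text` and `missing_keys` are hypothetical and not part of the SDK, and your project may load the file differently (for example through a dotenv loader).

```python
# Keys taken from the example .env file above.
REQUIRED_KEYS = (
    "CSV_DATA_PATH",
    "ELASTICSEARCH_URL",
    "EMBEDDING_MODEL",
    "LANGUAGE_MODEL",
    "INDEX_NAME",
    "OPENAI_API_KEY",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Made-up sample text standing in for the contents of a .env file.
sample = 'INDEX_NAME="first-quest"\nLANGUAGE_MODEL="gpt-4o-mini"\n'
print(missing_keys(parse_env_text(sample)))
```

To check a real file, pass the text of `.env` (for example `Path(".env").read_text()`) to `parse_env_text`; an empty list from `missing_keys` means all six variables are set.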

Adjust Folder Structure

Extend your existing project structure to include the reference formatter:

my-rag-pipeline/
├── data/
│   └── imaginary_animals.csv
├── modules/
│   ├── __init__.py
│   ├── retriever.py
│   ├── repacker.py
│   ├── response_synthesizer.py
│   └── reference_formatter.py         # 👈 New
├── indexer.py
├── pipeline.py
├── reference_formatter_pipeline.py    # 👈 New
└── main.py                            # 👈 Will be modified

Index Your Data

Make sure your data is indexed. If it is not, follow the steps in Index Your Data before proceeding.

Build Core Components of Your Pipeline

Create the Reference Formatter Component

The reference formatter component will analyze the response and chunks to create meaningful references.

1. Create the reference formatter module

Create modules/reference_formatter.py with the necessary imports and component:

from gllm_rag.response_synthesizer import SimilarityBasedReferenceFormatter

def reference_formatter_component() -> SimilarityBasedReferenceFormatter:
    """Reference formatter component for creating citations from chunks and response.

    Returns:
        SimilarityBasedReferenceFormatter: An instance for formatting references.
    """
    return SimilarityBasedReferenceFormatter()

Key features:

  • SimilarityBasedReferenceFormatter: Automatically matches response content with source chunks

  • Intelligent formatting: Creates clean, readable reference citations

  • Metadata utilization: Uses chunk metadata for meaningful citations

  • Automatic processing: No manual configuration needed

2. Understand the reference formatter behavior

The SimilarityBasedReferenceFormatter:

  • Analyzes response: Examines the generated response text

  • Matches chunks: Finds which chunks contributed to the response

  • Extracts metadata: Uses chunk metadata for citation formatting

  • Formats references: Creates clean, professional citations

  • Returns formatted text: Combines response with formatted references
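To make these steps concrete, here is a toy formatter that follows the same analyze, match, extract, and format sequence. It is a sketch only and does not reproduce how SimilarityBasedReferenceFormatter works internally: `ToyChunk`, `overlap_score`, `format_references`, the word-overlap measure, and the 0.5 threshold are all assumptions chosen for illustration, and the sample sentences are made up.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    """Hypothetical stand-in for the SDK's Chunk type."""
    content: str
    metadata: dict = field(default_factory=dict)


def _tokens(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response: str, chunk_text: str) -> float:
    """Fraction of the chunk's distinct words that also appear in the response."""
    chunk_tokens = _tokens(chunk_text)
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & _tokens(response)) / len(chunk_tokens)


def format_references(response: str, chunks: list[ToyChunk], threshold: float = 0.5) -> str:
    """Cite, by metadata name, every chunk whose overlap meets the threshold."""
    names: list[str] = []
    for chunk in chunks:
        if overlap_score(response, chunk.content) >= threshold:
            name = chunk.metadata.get("name", "unknown source")
            if name not in names:  # avoid citing the same source twice
                names.append(name)
    return "Sources: " + ", ".join(names) if names else ""


chunks = [
    ToyChunk("The Dusk Panther prowls the twilight forests", {"name": "Dusk Panther"}),
    ToyChunk("Coral crabs scuttle along sunny beaches", {"name": "Coral Crab"}),
]
print(format_references("The Dusk Panther lives in twilight forests.", chunks))
# prints: Sources: Dusk Panther
```

The second chunk shares no words with the response, so it is left out of the citation. This is also why chunks without a `name` in their metadata produce uninformative references in this sketch.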

Build the Pipeline

Now let's create the pipeline that includes the reference formatter.

1. Create the reference formatter pipeline file

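Create reference_formatter_pipeline.py with the necessary imports. The import from modules only resolves if modules/__init__.py exports reference_formatter_component alongside the existing components: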

from gllm_pipeline.pipeline.states import RAGState
from gllm_pipeline.pipeline.pipeline import Pipeline
from gllm_pipeline.steps import bundle, step

from modules import (
    repacker_component,
    response_synthesizer_component,
    retriever_component,
    reference_formatter_component,
)

2. Use the existing RAGState

Since RAGState already contains a reference field, we don't need to create a new state:

# RAGState already contains:
# - user_query: str
# - chunks: list[Chunk]
# - context: str
# - response: str
# - reference: str  # 👈 This is what we'll populate

No state changes are needed: the reference formatter step writes its output to the existing reference field.
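For readers who want to see the shape of the state in code, the sketch below mirrors the fields listed above as a `TypedDict`. `SketchRAGState` and `initial_state` are hypothetical names for illustration only. The real RAGState is defined in gllm_pipeline and may carry additional fields or differ in how it is declared.

```python
from typing import Any, TypedDict


class SketchRAGState(TypedDict, total=False):
    """Illustrative mirror of the RAGState fields listed above (not the SDK class)."""
    user_query: str
    chunks: list[Any]  # the SDK uses list[Chunk]; Any keeps this sketch self-contained
    context: str
    response: str
    reference: str


def initial_state(user_query: str) -> SketchRAGState:
    """Start with only the query; later steps fill in the remaining keys."""
    return SketchRAGState(user_query=user_query)


state = initial_state("Which animal lives in the forest?")
# The reference formatter step is the one that writes this key.
state["reference"] = "Sources: ..."
```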

3. Create component instances

Instantiate your existing components plus the new reference formatter:

retriever = retriever_component()
repacker = repacker_component(mode="context")
response_synthesizer = response_synthesizer_component()
reference_formatter = reference_formatter_component()

These include your original components plus the new reference formatter.

4. Define the individual pipeline steps

Create the standard pipeline steps including the reference formatter:

retriever_step = step(
    retriever,
    {"query": "user_query"},
    "chunks",
    {"top_k": "top_k"},
)

repacker_step = step(
    repacker,
    {"chunks": "chunks"},
    "context",
)

response_synthesizer_step = step(
    response_synthesizer,
    {"query": "user_query", "context": "context"},
    "response",
)

reference_formatter_step = step(
    reference_formatter,
    {"chunks": "chunks", "response": "response"},
    "reference",
)

Key points:

  • Standard RAG steps: Same retriever, repacker, and response synthesizer

  • Reference formatter step: Takes chunks and response as input, outputs formatted references

  • State flow: Chunks → Context → Response → Reference

5. Compose the final pipeline

Connect all steps into the complete reference formatter pipeline:

e2e_pipeline_with_reference_formatter = (
    retriever_step 
    | repacker_step 
    | response_synthesizer_step 
    | reference_formatter_step
)

e2e_pipeline_with_reference_formatter.state_type = RAGState

Pipeline flow:

  1. Retriever Step: Searches knowledge base for relevant chunks

  2. Repacker Step: Assembles chunks into context

  3. Response Synthesizer Step: Generates response from context

  4. Reference Formatter Step: Creates formatted references from chunks and response
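The flow above can be pictured with a toy version of the step mechanism: each step reads the state keys named in its input mapping, calls its component, and stores the result under its output key. This is a sketch under stated assumptions, not the gllm_pipeline implementation. `toy_step`, `run_toy_pipeline`, and the four `fake_*` components are hypothetical, the code is synchronous for brevity, and the runtime-config mapping used by the real retriever step is left out.

```python
from typing import Any, Callable

State = dict[str, Any]


def toy_step(component: Callable[..., Any], input_map: dict[str, str], output_key: str) -> Callable[[State], State]:
    """Build a step that maps state keys to component arguments and stores the result."""
    def run(state: State) -> State:
        kwargs = {arg: state[key] for arg, key in input_map.items()}
        return {**state, output_key: component(**kwargs)}
    return run


def run_toy_pipeline(steps: list[Callable[[State], State]], state: State) -> State:
    """Apply the steps in order, passing the growing state along."""
    for run in steps:
        state = run(state)
    return state


# Stand-in components with the same input names as the steps in this tutorial.
def fake_retriever(query: str) -> list[str]:
    return [f"fact about {query}"]


def fake_repacker(chunks: list[str]) -> str:
    return "\n".join(chunks)


def fake_synthesizer(query: str, context: str) -> str:
    return f"Answer to '{query}' based on: {context}"


def fake_formatter(chunks: list[str], response: str) -> str:
    # Only cite sources when there is a response to attach them to.
    return f"Sources: {len(chunks)} chunk(s)" if response else ""


steps = [
    toy_step(fake_retriever, {"query": "user_query"}, "chunks"),
    toy_step(fake_repacker, {"chunks": "chunks"}, "context"),
    toy_step(fake_synthesizer, {"query": "user_query", "context": "context"}, "response"),
    toy_step(fake_formatter, {"chunks": "chunks", "response": "response"}, "reference"),
]
final_state = run_toy_pipeline(steps, {"user_query": "forest animals"})
print(final_state["reference"])
```

Because the formatter step reads both `chunks` and `response`, it has to run after the response synthesizer. Placing it earlier in this sketch raises a `KeyError` for the missing `response` key.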

Modify the Application Code

Here we will update the main.py file to use the reference formatter pipeline.

Update Pipeline Import

1. Update the import in main.py

Add the import for the new reference formatter pipeline:

from reference_formatter_pipeline import e2e_pipeline_with_reference_formatter

2. Update the pipeline execution

Modify the run_pipeline function to use the reference formatter pipeline:

async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)  # Change to new pipeline
    ...

Handle Reference Output

The reference formatter will populate the reference field in the state, which you can use in your response.

1. Update the response handling

Modify your response handling to include the formatted references:

async def run_pipeline(state: dict, config: dict):
    ...
    try:
        await event_emitter.emit("Starting pipeline")
        await e2e_pipeline_with_reference_formatter.invoke(state, config)
        
        # Include reference in the final response
        final_response = {
            "response": state.get("response", ""),
            "reference": state.get("reference", ""),
            "chunks": state.get("chunks", [])
        }
        
        await event_emitter.emit("Finished pipeline", final_response)
    ...

Key changes:

  • Reference inclusion: The response now includes formatted references

  • Enhanced output: Users get both the answer and its sources

  • Traceability: Easy to verify information sources
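If you want to show the answer and its sources as a single block of text, a small helper can assemble it from the final state. The sketch below assumes the state is a plain dict with the `response`, `reference`, and `chunks` keys used in this tutorial. `build_final_response` and `render_answer` are hypothetical helper names, not SDK functions.

```python
from typing import Any


def build_final_response(state: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields returned to the caller, with safe defaults for missing keys."""
    return {
        "response": state.get("response", ""),
        "reference": state.get("reference", ""),
        "chunks": state.get("chunks", []),
    }


def render_answer(final_response: dict[str, Any]) -> str:
    """Append the references below the answer, or return the answer alone when none exist."""
    answer = final_response.get("response", "").strip()
    reference = final_response.get("reference", "").strip()
    return f"{answer}\n\n{reference}" if reference else answer


payload = build_final_response(
    {"response": "The Dusk Panther lives in the forest.", "reference": "Sources: Dusk Panther"}
)
print(render_answer(payload))
```

Falling back to the bare answer keeps the output clean when the formatter finds no matching chunk.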

Run Your Application

Now let's test the reference formatter functionality.

1. Start your server

Run your FastAPI server as before:

poetry run uvicorn main:app --reload

2. Test with reference formatter

Try this query to see the reference formatter in action:

{
  "user_query": "Which animal lives in the forest?",
  "top_k": 5,
  "debug": true
}

Expected behavior:

  • The pipeline will retrieve information from your imaginary_animals.csv

  • You'll see all standard RAG steps plus the reference formatter

  • The response will include both the answer and formatted references
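You can also send the query from a script. The sketch below builds the request with the standard library. The endpoint path `/run` and the use of POST are assumptions, since the route is defined in your own main.py; port 8000 is the uvicorn default. `build_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Hypothetical endpoint: replace the path with the route your main.py actually exposes.
ENDPOINT_URL = "http://localhost:8000/run"

PAYLOAD = {
    "user_query": "Which animal lives in the forest?",
    "top_k": 5,
    "debug": True,
}


def build_request(payload: dict, url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_request(PAYLOAD)
```

With the server running, `print(send(request))` shows the raw body returned by your endpoint.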

3. Analyze the debug output

You should see logs showing the reference formatter in action:

Starting pipeline
[Start 'BasicVectorRetriever'] Processing input:
    - query: 'Which animal lives in the forest?'
    - top_k: 5
    - event_emitter: <gllm_core.event.event_emitter.EventEmitter object at ...>
[Finished 'BasicVectorRetriever'] Successfully retrieved 5 chunks.
[Start 'Repacker'] Repacking 5 chunks.
[Finished 'Repacker'] Successfully repacked chunks: ...
[Start 'StuffResponseSynthesizer'] Processing query: 'Which animal lives in the forest?'
[Finished 'StuffResponseSynthesizer'] Successfully synthesized response:
"All the animals mentioned live in forests:
1. Dusk Panther - Twilight forests of Shadowglade
2. Mossback Tortoise - Damp forests of Evergreen Hollow
3. Mistlynx - Fog-laden forests of Whisperwood
4. Whispering Viper - Dense underbrush of Murmur Jungle
5. Luminafox - Luminescent forests of Nyxland"
[Start 'SimilarityBasedReferenceFormatter'] Formatting references using 5 candidate chunks.
[
    Chunk(
      'id': <chunk_id>,
      'content': The Dusk Panther prowls the twilight forests of Sh...,
      'metadata': {'name': <animal_name>},
      'score': <retrieval_score>
    ), 
    ...
]
[Finished 'SimilarityBasedReferenceFormatter'] Successfully formatted references.
Finished pipeline

4. Examine the formatted references

The final payload contains the answer together with the formatted references. Your response might look like:

{
  "response": "All the animals mentioned live in forests: 1. Dusk Panther - Twilight forests of Shadowglade 2. Mossback Tortoise - Damp forests of Evergreen Hollow 3. Mistlynx - Fog-laden forests of Whisperwood 4. Whispering Viper - Dense underbrush of Murmur Jungle 5. Luminafox - Luminescent forests of Nyxland",
  "reference": "Sources: Dusk Panther (metadata: name), Mossback Tortoise (metadata: name), Mistlynx (metadata: name), Whispering Viper (metadata: name), Luminafox (metadata: name)",
  "chunks": [...]
}

Understanding the Flow

Here's what happens in the reference formatter pipeline:

Complete Pipeline Flow

  1. Retrieval: Searches your knowledge base for relevant chunks

  2. Repacking: Assembles retrieved chunks into context

  3. Response Generation: Creates answer from context

  4. Reference Formatting: Analyzes response and chunks to create citations

  5. Final Output: Returns both answer and formatted references

Reference Formatter Process

  1. Input Analysis: Takes the generated response and source chunks

  2. Similarity Matching: Identifies which chunks contributed to the response

  3. Metadata Extraction: Uses chunk metadata for citation formatting

  4. Reference Creation: Generates clean, professional citations

  5. Output: Returns formatted reference string

Extending the Reference System

Multiple Reference Types

You can extend the reference system for different types of sources:

from gllm_pipeline.pipeline.states import RAGState


class ExtendedRAGState(RAGState):
    academic_references: str
    web_references: str
    internal_references: str
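One way to populate these fields is to group cited sources by type before formatting. The sketch below assumes each chunk is a dict whose metadata carries a `name` and a `source_type` key. Both the `source_type` key and the `group_references` helper are hypothetical, so adapt them to whatever metadata your indexer stores.

```python
from collections import defaultdict
from typing import Any

# Maps a hypothetical metadata value to the matching ExtendedRAGState field.
FIELD_BY_SOURCE_TYPE = {
    "academic": "academic_references",
    "web": "web_references",
    "internal": "internal_references",
}


def group_references(chunks: list[dict[str, Any]]) -> dict[str, str]:
    """Return one comma-separated reference string per state field."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        metadata = chunk.get("metadata") or {}
        field_name = FIELD_BY_SOURCE_TYPE.get(metadata.get("source_type"))
        name = metadata.get("name")
        if field_name and name:  # skip chunks with an unknown type or no name
            grouped[field_name].append(name)
    return {
        field_name: ", ".join(grouped.get(field_name, []))
        for field_name in FIELD_BY_SOURCE_TYPE.values()
    }


# Made-up sample chunks for illustration.
sample_chunks = [
    {"metadata": {"name": "Paper A", "source_type": "academic"}},
    {"metadata": {"name": "Team Wiki", "source_type": "internal"}},
    {"metadata": {"name": "Paper B", "source_type": "academic"}},
    {"metadata": {"name": "Untyped note"}},
]
print(group_references(sample_chunks))
```

Every field is always present in the result, with an empty string when no chunk of that type was cited, so downstream code does not need to check for missing keys.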

Custom Reference Formatters

Create specialized reference formatters for different content types:

from gllm_rag.response_synthesizer import SimilarityBasedReferenceFormatter


def academic_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references in academic citation style."""
    return SimilarityBasedReferenceFormatter(
        # Academic citation configuration
    )

def web_reference_formatter() -> SimilarityBasedReferenceFormatter:
    """Formats references as web links."""
    return SimilarityBasedReferenceFormatter(
        # Web link configuration
    )

Troubleshooting

Common Issues

  1. No references being generated:

    • Ensure chunks have meaningful metadata

    • Check that the reference formatter step is included in the pipeline

    • Verify the response contains content that matches chunk content

  2. Poor reference quality:

    • Improve chunk metadata with more descriptive information

    • Ensure chunks are properly indexed with relevant content

    • Check that the response synthesizer generates content that references the chunks

  3. Reference formatting issues:

    • Verify the SimilarityBasedReferenceFormatter is properly configured

    • Check that chunk metadata is in the expected format

    • Ensure the pipeline state includes the reference field

Debug Tips

  1. Enable debug mode: Set debug: true in your request to see detailed logs

  2. Check chunk metadata: Verify that your chunks have meaningful metadata

  3. Examine response content: Ensure the response actually references the chunk content

  4. Pipeline step order: Confirm the reference formatter step comes after response generation
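Several of these checks can be automated against the final pipeline state. The helper below is a sketch: `diagnose_state` is a hypothetical name, the state is assumed to be a plain dict, and chunks are assumed to expose `metadata` either as a dict key or as an attribute.

```python
from typing import Any


def _metadata_of(chunk: Any) -> Any:
    """Read metadata from a dict-style or attribute-style chunk."""
    if isinstance(chunk, dict):
        return chunk.get("metadata")
    return getattr(chunk, "metadata", None)


def diagnose_state(state: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in the final state (empty list if none)."""
    problems: list[str] = []
    chunks = state.get("chunks") or []
    if not chunks:
        problems.append("no chunks retrieved")
    else:
        empty = sum(1 for chunk in chunks if not _metadata_of(chunk))
        if empty:
            problems.append(f"{empty} of {len(chunks)} chunks have no metadata")
    if not state.get("response"):
        problems.append("response is empty")
    if "reference" not in state:
        problems.append("reference key is missing: check that the formatter step is in the pipeline")
    elif not state["reference"]:
        problems.append("reference is empty: the response may not match any chunk")
    return problems


healthy = {
    "chunks": [{"metadata": {"name": "Dusk Panther"}}],
    "response": "The Dusk Panther lives in the forest.",
    "reference": "Sources: Dusk Panther",
}
print(diagnose_state(healthy))
# prints: []
```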

📂 Complete Tutorial Files

Coming soon!


Congratulations! You've added a Reference Formatter component to your RAG pipeline. Each response now comes with formatted references to its source chunks, so users can verify where the information came from.
