Adding Document References

This guide will walk you through adding a Reference Formatter component to your existing RAG pipeline that automatically formats and includes source references in your responses, making your answers more credible and traceable.

Reference formatting adds automatic source citations to your RAG responses, so readers can see exactly where each piece of information came from.


This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository.

Prerequisites

This example specifically requires:

  1. Completion of the Your First RAG Pipeline tutorial, which this guide builds on directly

  2. Completion of all setup steps listed on the Prerequisites page

  3. A working OpenAI API key configured in your environment variables

You should be familiar with these concepts and components:

  1. Components in Your First RAG Pipeline - Required foundation

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

You can either:

  1. Refer to this guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options will work. Choose based on whether you prefer speed or learning by doing!

Project Setup

1

Extend Your RAG Pipeline Project

Start with your completed RAG pipeline project from the Your First RAG Pipeline tutorial. This tutorial does not add any new files, so the structure stays as is:

<project-name>/
├── data/
│   ├── <index>/...                     # preset data index folder
│   ├── chroma.sqlite3                  # preset database file
│   └── imaginary_animals.csv           # sample data
├── modules/
│   ├── retriever.py
│   └── response_synthesizer.py
├── .env
├── indexer.py
└── pipeline.py    # 👈 Will be updated with reference formatter

1) Build the Reference Formatter Pipeline

The SimilarityBasedReferenceFormatter analyzes your response against retrieved chunks and automatically creates formatted citations using chunk metadata.
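To make the idea concrete, here is a minimal plain-Python sketch of similarity-based reference formatting. It is not the gllm implementation or its API: the token-overlap score, the 0.5 threshold, the chunk dictionary shape (`content`, `metadata`), the `source` metadata key, and the sample texts are all assumptions chosen for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response: str, chunk_text: str) -> float:
    """Fraction of the chunk's tokens that also appear in the response."""
    chunk_tokens = tokenize(chunk_text)
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & tokenize(response)) / len(chunk_tokens)


def format_references(response: str, chunks: list[dict], threshold: float = 0.5) -> str:
    """Build a reference block from the metadata of chunks similar to the response."""
    lines: list[str] = []
    for chunk in chunks:
        if overlap_score(response, chunk["content"]) >= threshold:
            # "source" is a hypothetical metadata key used for this sketch.
            entry = f"- {chunk['metadata'].get('source', 'unknown source')}"
            if entry not in lines:  # avoid listing the same source twice
                lines.append(entry)
    if not lines:
        return ""
    return "References:\n" + "\n".join(lines)


# Made-up sample data, for illustration only.
chunks = [
    {"content": "The Luminafox glows softly at night.",
     "metadata": {"source": "imaginary_animals.csv"}},
    {"content": "Aquaflare swims in volcanic lakes.",
     "metadata": {"source": "other.csv"}},
]
response = "The Luminafox glows softly at night in the forest."
print(format_references(response, chunks))
```

Only the first chunk clears the threshold here, so only its source is listed. A real formatter would likely use a stronger similarity measure, such as embeddings, but the flow is the same: score each chunk against the response, keep the matches, and format their metadata.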

1

Create the reference formatter step

Update your pipeline.py (or create a new one) with the reference formatter:

2

Compose the final pipeline

Chain all steps including the reference formatter:

This creates a pipeline that generates responses with automatic source citations from the retrieved chunks.

🧠 The RAGState input state already contains a reference field that gets populated by the reference formatter.
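The sketch below illustrates, in plain Python, why the reference formatter sits at the end of the chain and how it fills a pre-declared `reference` field. It does not use the gllm pipeline API. The step functions, the `run` helper, and every state key other than `reference` are hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical stand-in for the pipeline state: a plain dict with a
# pre-declared "reference" field that the last step fills in.
State = dict
Step = Callable[[State], State]


def retrieve(state: State) -> State:
    # Placeholder retrieval: returns one hard-coded chunk.
    chunk = {"content": "placeholder chunk text",
             "metadata": {"source": "imaginary_animals.csv"}}
    return {"chunks": [chunk]}


def synthesize(state: State) -> State:
    # Placeholder generation: echoes the first chunk instead of calling an LLM.
    return {"response": f"Answer based on: {state['chunks'][0]['content']}"}


def format_reference(state: State) -> State:
    # Reads the chunks produced earlier, so it must run after retrieval
    # and, in a real formatter, after response generation as well.
    sources = [c["metadata"]["source"] for c in state["chunks"]]
    return {"reference": "References:\n" + "\n".join(f"- {s}" for s in sources)}


def run(steps: list[Step], state: State) -> State:
    # Each step returns a partial update that is merged into the state.
    for step in steps:
        state = {**state, **step(state)}
    return state


initial_state = {"user_query": "example question", "chunks": [],
                 "response": "", "reference": ""}
final_state = run([retrieve, synthesize, format_reference], initial_state)
print(final_state["response"])
print(final_state["reference"])
```

Because each step only returns the keys it produces, the `reference` field stays empty until the final step runs.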

2) Run the Pipeline


When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

You can safely ignore this message, since the pipeline does not use Chroma's telemetry feature. Your pipeline should still work.
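If the message clutters your output, one option is to raise the level of the logger that emits it. This is a sketch using only Python's standard `logging` module. The logger name is taken from the log line above, and it is an assumption that Chroma logs this message under that name in your installed version.

```python
import logging

# Logger name copied from the log line above (assumed to be the emitting logger).
TELEMETRY_LOGGER = "chromadb.telemetry.product.posthog"

# Only CRITICAL records from this logger will pass; ERROR records are dropped.
logging.getLogger(TELEMETRY_LOGGER).setLevel(logging.CRITICAL)
```

Place this near the top of pipeline.py, before the pipeline runs, so the level is already set when the telemetry call is made.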

1

Configure and invoke the pipeline

Configure the state and config for direct pipeline invocation:

2

Observe output

If all the steps run successfully, you will see something like this appended at the end of the result:

Troubleshooting

Common Issues

  1. No references being generated:

    • Ensure chunks have meaningful metadata

    • Check that the reference formatter step is included in the pipeline

    • Verify the response contains content that matches chunk content

  2. Poor reference quality:

    • Improve chunk metadata with more descriptive information

    • Ensure chunks are properly indexed with relevant content

    • Check that the response synthesizer generates content that references the chunks

  3. Reference formatting issues:

    • Verify the SimilarityBasedReferenceFormatter is properly configured

    • Check that chunk metadata is in the expected format

    • Ensure the pipeline state includes the reference field
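Several of the issues above come down to missing or empty chunk metadata. The helper below is a small sketch for spotting such chunks before they reach the formatter. It assumes chunks are available as dictionaries with a `metadata` mapping, and the `source` key is a hypothetical example. Adapt both to the chunk objects and metadata fields your retriever actually returns.

```python
def find_chunks_missing_metadata(chunks: list[dict],
                                 required_keys: tuple[str, ...] = ("source",)) -> list[int]:
    """Return the indexes of chunks lacking a non-empty value for any required key."""
    missing = []
    for index, chunk in enumerate(chunks):
        metadata = chunk.get("metadata") or {}
        if any(not metadata.get(key) for key in required_keys):
            missing.append(index)
    return missing


# Made-up chunks: the second has empty metadata, the third has none at all.
sample_chunks = [
    {"content": "a", "metadata": {"source": "imaginary_animals.csv"}},
    {"content": "b", "metadata": {}},
    {"content": "c"},
]
print(find_chunks_missing_metadata(sample_chunks))
```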

Debug Tips

  1. Enable debug mode: Set debug: true in your request to see detailed logs

  2. Check chunk metadata: Verify that your chunks have meaningful metadata

  3. Examine response content: Ensure the response actually references the chunk content

  4. Pipeline step order: Confirm the reference formatter step comes after response generation
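The step-order tip can be turned into a quick assertion. This sketch works on a plain list of step names. The names `response_synthesizer` and `reference_formatter` are hypothetical labels, not identifiers from the SDK, so substitute whatever names your pipeline uses.

```python
def formatter_runs_after_generation(step_names: list[str],
                                    generation: str = "response_synthesizer",
                                    formatter: str = "reference_formatter") -> bool:
    """Return True only if both steps exist and the formatter comes later."""
    if generation not in step_names or formatter not in step_names:
        return False
    return step_names.index(formatter) > step_names.index(generation)


good_order = ["retriever", "response_synthesizer", "reference_formatter"]
bad_order = ["retriever", "reference_formatter", "response_synthesizer"]
print(formatter_runs_after_generation(good_order))
print(formatter_runs_after_generation(bad_order))
```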


Congratulations! You've added a Reference Formatter component to your RAG pipeline. Responses now include formatted references to their source chunks, which makes your answers easier to verify and trace.
