Caching

This guide will walk you through implementing caching in your AI pipelines to eliminate redundant computations and improve performance. We'll explore how pipeline caching can transform expensive, repetitive operations into instant responses.

Caching functionality gives you control over performance optimization in your pipeline, providing flexibility to cache at different levels based on your specific needs. For example, you can implement step-level caching for expensive operations, pipeline-level caching for complete workflows, or combine both for maximum efficiency.

This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository.

Prerequisites

This example specifically requires:

  1. Completion of the Your First RAG Pipeline tutorial - this builds directly on top of it

  2. Completion of all setup steps listed on the Prerequisites page

  3. A working OpenAI API key configured in your environment variables

You should be familiar with these concepts and components:

  1. Components in Your First RAG Pipeline - Required foundation

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

You can either:

  1. Refer to the guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options will work—choose based on whether you prefer speed or learning by doing!

Project Setup

Step 1: Extend Your RAG Pipeline Project

Start with your completed RAG pipeline project from the previous tutorial. The caching functionality works with any pipeline component; we'll demonstrate it with your existing RAG pipeline.

Your existing structure is already complete:

<your-project>/
├── data/
│   └── imaginary_animals.csv       
├── modules/
│   ├── __init__.py                 
│   ├── retriever.py                
│   └── response_synthesizer.py    
├── pipeline.py                     # 👈 Will be updated with caching
├── indexer.py                     
└── .env                           

Understanding Pipeline Caching

When you deploy pipelines to production, you quickly discover a common pattern: the same inputs get processed over and over again. Users ask similar questions, run identical analyses, or trigger the same computational workflows repeatedly.

The GLLM Pipeline framework provides two levels of caching that work seamlessly together: pipeline-level caching and step-level caching. Pipeline-level caching stores the entire pipeline's output for a given input, while step-level caching stores individual step results within the pipeline execution.

Our caching system uses a vector data store as the cache backend, which provides several advantages: semantic similarity matching (so similar inputs can benefit from cached results), scalable storage, and fast retrieval performance.
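To illustrate why a vector store works well as a cache backend, here is a minimal, framework-agnostic sketch of semantic cache lookup. The embeddings, threshold, and cached answer below are made-up values for illustration; in the real system the cache data store performs this similarity search for you.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy cache of (query embedding, cached pipeline output) pairs.
cache = [
    (np.array([0.9, 0.1, 0.0]), "Cached answer about an imaginary animal"),
]

def lookup(query_embedding: np.ndarray, threshold: float = 0.9):
    """Return a cached answer if a stored query is semantically close enough."""
    for stored_embedding, answer in cache:
        if cosine(query_embedding, stored_embedding) >= threshold:
            return answer  # cache hit: skip the expensive pipeline run
    return None  # cache miss: run the pipeline, then store the new result

print(lookup(np.array([0.88, 0.15, 0.02])))  # similar query -> cache hit
print(lookup(np.array([0.0, 0.2, 0.95])))    # unrelated query -> None (miss)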

1) Set Up Your Cache Data Store

Step 1: Create the cache data store

Before implementing any caching option, you need to set up a cache data store. Add this to your pipeline file:
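As a sketch only: the snippet below assumes a Chroma-backed vector data store class from the gllm-datastore package. The import path, class name, constructor arguments, and embedding model shown here are assumptions - check the Vector Data Store guide for the exact API in your SDK version.

# pipeline.py (excerpt)
# Hypothetical cache data store setup - verify names against gllm-datastore.
from gllm_datastore.vector_data_store import ChromaVectorDataStore

cache_store = ChromaVectorDataStore(
    collection_name="pipeline_cache",          # collection that holds cached entries
    persist_directory="./cache",               # keep cached results across runs
    embedding_model="text-embedding-3-small",  # used for semantic matching
)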

You can also configure the cache store's matching strategy (exact match, fuzzy match, or semantic match) by following the guide on the Vector Data Store page.

2) Choose Your Caching Strategy

Pipeline caching allows you to optimize performance at different levels based on your specific needs. Each approach can be implemented independently, giving you flexibility to choose the right caching strategy for your use case:

  1. Step-Level Caching: Caches individual step results within pipeline execution

  2. Pipeline-Level Caching: Caches complete pipeline outputs for given inputs

  3. Multi-Level Caching: Combines both approaches for maximum efficiency

You can choose any combination of these options based on your performance requirements and use cases.

Option 1: Pipeline-Level Caching

When to use: Cache complete pipeline results when users frequently run identical workflows with the same inputs.

Step 1: Enable caching for the entire pipeline

Create your pipeline with caching enabled:
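As a sketch, reusing the retriever and response synthesizer steps you built in the Your First RAG Pipeline tutorial: the key change is passing the cache store when you assemble the pipeline. The build_rag_pipeline helper below is a placeholder for your existing pipeline construction code; only the cache_store argument is new, and its exact placement may differ in your SDK version.

# pipeline.py (excerpt)
# retrieval_step and synthesis_step come from the previous tutorial.
pipeline = build_rag_pipeline(        # placeholder for your existing pipeline assembly
    steps=[retrieval_step, synthesis_step],
    cache_store=cache_store,          # complete pipeline outputs are cached here
)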

Benefits:

  • Maximum performance for repeated identical queries

  • Simple implementation - just add the cache_store parameter

  • Best for production environments with repetitive usage patterns

Option 2: Multi-Level Caching

We can also use step-level caching alongside pipeline caching. If the pipeline-level cache does not return a hit, each step with an active cache checks for a hit individually.

Step 1: Enable caching at the step and pipeline levels

Create your step and pipeline with caching enabled:
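A sketch of the multi-level setup: attach the cache store to the individual steps as well as to the pipeline. As before, step(...) and build_rag_pipeline(...) stand in for however you construct steps and assemble the pipeline in your project; the idea is simply that each level receives a cache_store.

# pipeline.py (excerpt)
retrieval_step = step(retriever, cache_store=cache_store)              # step-level cache
synthesis_step = step(response_synthesizer, cache_store=cache_store)   # step-level cache

pipeline = build_rag_pipeline(
    steps=[retrieval_step, synthesis_step],
    cache_store=cache_store,   # pipeline-level cache, checked before any steps run
)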

3) Run the Pipeline

When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

Don't worry about this; we do not use this Chroma telemetry feature, and your pipeline will still work.

Step 1: Configure the pipeline state for testing

Set up test cases to demonstrate caching behavior:
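For example, a small set of queries against the data/imaginary_animals.csv dataset works well. The animal name below is made up for illustration, so substitute queries that match your own data.

# Queries used to demonstrate cache behavior.
test_queries = [
    "What does the Lumifox eat?",    # first run: cache miss, populates both caches
    "What does the Lumifox eat?",    # identical query: pipeline-level cache hit
    "What is the Lumifox's diet?",   # similar query: may hit via semantic matching
]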

Step 2: Run the pipeline for the first time (cache miss)

This execution will populate both step-level and pipeline-level caches.
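A sketch of timing the first run, assuming the pipeline exposes an async invoke method that takes the pipeline state. The method name and the "user_query" state key are assumptions - use whatever invocation you already have from the previous tutorial.

import asyncio
import time

async def run_once(query: str) -> None:
    start = time.perf_counter()
    result = await pipeline.invoke({"user_query": query})  # hypothetical invocation
    print(f"{time.perf_counter() - start:.2f}s -> {result}")

asyncio.run(run_once(test_queries[0]))  # cache miss: full retrieval + generation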

Step 3: Run the same pipeline again (cache hit)

You should see an improvement on the second run.
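Reusing the run_once helper from the previous step, the repeated and near-duplicate queries should return noticeably faster; exact timings depend on your model, data store, and matching configuration.

asyncio.run(run_once(test_queries[1]))  # identical query: pipeline-level cache hit
asyncio.run(run_once(test_queries[2]))  # similar query: may hit via semantic matching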

Troubleshooting

  1. Cache not providing expected speedup:

    1. Verify debug logs show cache hits/misses as expected

    2. Ensure your inputs are similar enough to trigger cache hits

  2. General caching issues:

    1. Verify your cache data store is properly initialized

    2. Check that cache keys are being generated consistently

    3. Monitor cache hit/miss rates to optimize cache configuration

    4. Test cache behavior with various input patterns


Congratulations! You've successfully enhanced your RAG pipeline with multi-level caching functionality. Your pipeline can now eliminate redundant computations and provide dramatic performance improvements for repeated or similar requests. This caching system scales with your application and provides intelligent matching for optimal cache utilization.
