Caching

This guide will walk you through implementing caching in your AI pipelines to eliminate redundant computations and improve performance. We'll explore how pipeline caching can transform expensive, repetitive operations into instant responses.

Caching functionality gives you control over performance optimization in your pipeline, providing flexibility to cache at different levels based on your specific needs. For example, you can implement step-level caching for expensive operations, pipeline-level caching for complete workflows, or combine both for maximum efficiency.

This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository.

Prerequisites

This example specifically requires:

  1. Completion of the Your First RAG Pipeline tutorial - this builds directly on top of it

  2. Completion of all setup steps listed on the Prerequisites page

  3. A working OpenAI API key configured in your environment variables

You should be familiar with these concepts and components:

  1. Components in Your First RAG Pipeline - Required foundation

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

You can either:

  1. Refer to the guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options will work—choose based on whether you prefer speed or learning by doing!

Project Setup

Step 1: Extend Your RAG Pipeline Project

Start with your completed RAG pipeline project from the previous tutorial. The caching functionality works with any pipeline component; we'll demonstrate it with your existing RAG pipeline.

Your existing structure is already complete:

<your-project>/
├── data/
│   └── imaginary_animals.csv       
├── modules/
│   ├── __init__.py                 
│   ├── retriever.py                
│   └── response_synthesizer.py    
├── pipeline.py                     # 👈 Will be updated with caching
├── indexer.py                     
└── .env                           

Understanding Pipeline Caching

When you deploy pipelines to production, you quickly discover a common pattern: the same inputs get processed over and over again. Users ask similar questions, run identical analyses, or trigger the same computational workflows repeatedly.

The GLLM Pipeline framework provides two levels of caching that work seamlessly together: pipeline-level caching and step-level caching. Pipeline-level caching stores the entire pipeline's output for a given input, while step-level caching stores individual step results within the pipeline execution.

Our caching system uses a vector data store as the cache backend, which provides several advantages: semantic similarity matching (so similar inputs can benefit from cached results), scalable storage, and fast retrieval performance.
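To illustrate why a vector store works well as a cache backend, here is a minimal, framework-agnostic sketch of semantic cache lookup. The embeddings, threshold, and cached answer below are made-up values for illustration; in the real system the cache data store performs this similarity search for you.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy cache of (query embedding, cached pipeline output) pairs.
cache = [
    (np.array([0.9, 0.1, 0.0]), "Cached answer about an imaginary animal"),
]

def lookup(query_embedding: np.ndarray, threshold: float = 0.9):
    """Return a cached answer if a stored query is semantically close enough."""
    for stored_embedding, answer in cache:
        if cosine(query_embedding, stored_embedding) >= threshold:
            return answer  # cache hit: skip the expensive pipeline run
    return None  # cache miss: run the pipeline, then store the new result

print(lookup(np.array([0.88, 0.15, 0.02])))  # similar query -> cache hit
print(lookup(np.array([0.0, 0.2, 0.95])))    # unrelated query -> None (miss)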

1) Set Up Your Cache Data Store

Step 1: Create the cache data store

Before implementing any caching option, you need to set up a cache data store. Add this to your pipeline file:
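As a sketch only: the snippet below assumes a Chroma-backed vector data store class from the gllm-datastore package. The import path, class name, constructor arguments, and embedding model shown here are assumptions - check the Vector Data Store guide for the exact API in your SDK version.

# pipeline.py (excerpt)
# Hypothetical cache data store setup - verify names against gllm-datastore.
from gllm_datastore.vector_data_store import ChromaVectorDataStore

cache_store = ChromaVectorDataStore(
    collection_name="pipeline_cache",          # collection that holds cached entries
    persist_directory="./cache",               # keep cached results across runs
    embedding_model="text-embedding-3-small",  # used for semantic matching
)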

You can also configure the cache store's matching strategy (exact match, fuzzy match, or semantic match) by following the guide on the Vector Data Store page.

2) Choose Your Caching Strategy

Pipeline caching allows you to optimize performance at different levels based on your specific needs. Each approach can be implemented independently, giving you flexibility to choose the right caching strategy for your use case:

  1. Step-Level Caching: Caches individual step results within pipeline execution

  2. Pipeline-Level Caching: Caches complete pipeline outputs for given inputs

  3. Multi-Level Caching: Combines both approaches for maximum efficiency

You can choose any combination of these options based on your performance requirements and use cases.

Option 1: Pipeline-Level Caching

When to use: Cache complete pipeline results when users frequently run identical workflows with the same inputs.

Step 1: Enable caching for the entire pipeline

Create your pipeline with caching enabled:
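As a sketch, reusing the retriever and response synthesizer steps you built in the Your First RAG Pipeline tutorial: the key change is passing the cache store when you assemble the pipeline. The build_rag_pipeline helper below is a placeholder for your existing pipeline construction code; only the cache_store argument is new, and its exact placement may differ in your SDK version.

# pipeline.py (excerpt)
# retrieval_step and synthesis_step come from the previous tutorial.
pipeline = build_rag_pipeline(        # placeholder for your existing pipeline assembly
    steps=[retrieval_step, synthesis_step],
    cache_store=cache_store,          # complete pipeline outputs are cached here
)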

Benefits:

  • Maximum performance for repeated identical queries

  • Simple implementation - just add the cache_store parameter

  • Best for production environments with repetitive usage patterns

Option 2: Multi-Level Caching

We can also use step-level caching alongside pipeline caching. If the pipeline-level cache does not return a hit, each step with an active cache checks for a hit individually.

Step 1: Enable caching at the step and pipeline levels

Create your step and pipeline with caching enabled:
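A sketch of the multi-level setup: attach the cache store to the individual steps as well as to the pipeline. As before, step(...) and build_rag_pipeline(...) stand in for however you construct steps and assemble the pipeline in your project; the idea is simply that each level receives a cache_store.

# pipeline.py (excerpt)
retrieval_step = step(retriever, cache_store=cache_store)              # step-level cache
synthesis_step = step(response_synthesizer, cache_store=cache_store)   # step-level cache

pipeline = build_rag_pipeline(
    steps=[retrieval_step, synthesis_step],
    cache_store=cache_store,   # pipeline-level cache, checked before any steps run
)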

3) Run the Pipeline

When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

Don't worry about this; we do not use this Chroma telemetry feature, and your pipeline will still work.

Step 1: Configure the pipeline state for testing

Set up test cases to demonstrate caching behavior:
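For example, a small set of queries against the data/imaginary_animals.csv dataset works well. The animal name below is made up for illustration, so substitute queries that match your own data.

# Queries used to demonstrate cache behavior.
test_queries = [
    "What does the Lumifox eat?",    # first run: cache miss, populates both caches
    "What does the Lumifox eat?",    # identical query: pipeline-level cache hit
    "What is the Lumifox's diet?",   # similar query: may hit via semantic matching
]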

Step 2: Run the pipeline for the first time (cache miss)

This execution will populate both step-level and pipeline-level caches.
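A sketch of timing the first run, assuming the pipeline exposes an async invoke method that takes the pipeline state. The method name and the "user_query" state key are assumptions - use whatever invocation you already have from the previous tutorial.

import asyncio
import time

async def run_once(query: str) -> None:
    start = time.perf_counter()
    result = await pipeline.invoke({"user_query": query})  # hypothetical invocation
    print(f"{time.perf_counter() - start:.2f}s -> {result}")

asyncio.run(run_once(test_queries[0]))  # cache miss: full retrieval + generation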

Step 3: Run the same pipeline again (cache hit)

You should see an improvement on the second run.
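Reusing the run_once helper from the previous step, the repeated and near-duplicate queries should return noticeably faster; exact timings depend on your model, data store, and matching configuration.

asyncio.run(run_once(test_queries[1]))  # identical query: pipeline-level cache hit
asyncio.run(run_once(test_queries[2]))  # similar query: may hit via semantic matching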

Troubleshooting

  1. Cache not providing expected speedup:

    1. Verify debug logs show cache hits/misses as expected

    2. Ensure your inputs are similar enough to trigger cache hits

  2. General caching issues:

    1. Verify your cache data store is properly initialized

    2. Check that cache keys are being generated consistently

    3. Monitor cache hit/miss rates to optimize cache configuration

    4. Test cache behavior with various input patterns


Congratulations! You've successfully enhanced your RAG pipeline with multi-level caching functionality. Your pipeline can now eliminate redundant computations and provide dramatic performance improvements for repeated or similar requests. This caching system scales with your application and provides intelligent matching for optimal cache utilization.
