Subgraphs
This guide will walk you through using pipeline subgraphs to break down complex pipelines into manageable, maintainable components. We'll transform a monolithic RAG pipeline that has become unwieldy into a well-organized system using focused subgraphs.
Pipeline subgraphs solve the complexity problem by providing a way to break down large pipelines into smaller, manageable pieces. Instead of having one massive pipeline with 20+ steps, you can create focused subgraphs, each responsible for a specific part of your workflow with its own clean state schema and testing strategy.
Important Note: The pipeline components used in this tutorial (QueryProcessor, DocumentRetriever, etc.) are simplified examples for demonstration purposes. In practice, you would replace these with your actual component implementations. This guide focuses on subgraph architecture patterns rather than component implementation details.
Installation
macOS / Linux:

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-pipeline gllm-rag gllm-core gllm-generation gllm-inference gllm-retrieval gllm-misc gllm-datastore

Windows (Command Prompt):

FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-pipeline gllm-rag gllm-core gllm-generation gllm-inference gllm-retrieval gllm-misc gllm-datastore

Project Setup
Understanding the Problem
When you first start building pipelines, everything seems straightforward. You create a few steps, connect them together, and your pipeline works perfectly.
Your pipeline starts simple: take a user query, retrieve some documents, and generate a response. But then you realize you need query preprocessing, document filtering, reranking, context building, prompt engineering, etc. Suddenly, your pipeline has grown from 3 steps to 15 steps.
Key Pain Points:
Pipeline Bloat: Simple pipelines grow into 15+ step monsters
State Chaos: Dozens of intermediate variables with unclear purposes
Testing Nightmares: Can't test components in isolation
Maintenance Difficulties: Changes in one area break unrelated functionality
The Subgraph Solution
Subgraphs solve these problems by providing:
Modular Design: Break complex pipelines into focused, manageable pieces
State Isolation: Each subgraph has its own clean state schema
Reusability: Use the same subgraph in multiple pipelines
Clearer Organization: Logical grouping of related functionality
Easier Maintenance: Modify one subgraph without affecting others
Project Structure
Create your project structure for the subgraph refactoring:
<your-project>/
├── modules/
│ └── [your actual components]
├── pipeline_builder.py # 👈 Single file with all subgraphs

Following real-world patterns, we'll organize all subgraphs within a single pipeline builder class, just like production implementations.
Problem: The Monolithic Pipeline with Tons of Steps
Let's first examine a typical complex RAG pipeline that has become unwieldy:
The Monolithic Pipeline State
Notice how this state has 13 different variables - it's becoming impossible to track what each one does and when it's used.
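As an illustration, a monolithic state with 13 variables might look like the sketch below. The variable names here are hypothetical placeholders, not taken from an actual implementation:

```python
from typing import TypedDict

# Hypothetical monolithic state: 13 loosely related variables in one schema.
# Query-handling, retrieval, and generation concerns are all mixed together.
class MonolithicRAGState(TypedDict):
    user_query: str
    cleaned_query: str
    expanded_query: str
    query_embedding: list[float]
    raw_documents: list[dict]
    filtered_documents: list[dict]
    reranked_documents: list[dict]
    context: str
    prompt: str
    response: str
    validated_response: str
    retrieval_metadata: dict
    generation_metadata: dict
```

Every step in the pipeline can read or write any of these fields, which is exactly what makes the data flow hard to trace.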
The Monolithic Pipeline Implementation
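A condensed sketch of the shape of such a pipeline follows. The step bodies are trivial placeholders standing in for real components; the point is that every step is coupled through one shared state dict:

```python
# Placeholder steps — in a real pipeline these would be your components.
# Note how every step reads from and writes into the same shared dict.
def clean_query(s): s["cleaned_query"] = s["user_query"].strip(); return s
def retrieve_documents(s): s["raw_documents"] = [{"text": "some doc"}]; return s
def build_context(s): s["context"] = " ".join(d["text"] for d in s["raw_documents"]); return s
def generate(s): s["response"] = f"Answer based on: {s['context']}"; return s

def run_monolithic_pipeline(state: dict) -> dict:
    # Imagine 15 of these steps, all coupled through one state dict:
    # changing any step's inputs or outputs can silently break the others.
    for step in [clean_query, retrieve_documents, build_context, generate]:
        state = step(state)
    return state
```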
This pipeline has several problems:
Hard to understand what each section does
Cluttered state with intermediate variables
Difficult to test individual components
Risky changes - modifying one part can break others
Solution: Building with Subgraphs
Now let's refactor this into a clean, modular pipeline builder following real-world patterns:
1) Create the Pipeline Builder Class
Create the main pipeline builder
Create pipeline_builder.py:
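A skeleton of the builder class might look like this. Method names and return types are illustrative; the concrete subgraph and pipeline types come from your pipeline library:

```python
# pipeline_builder.py — skeleton only; the method bodies and the actual
# subgraph/pipeline types come from your pipeline library (e.g. gllm-pipeline).
class RAGPipelineBuilder:
    """Builds the main pipeline from three focused subgraphs."""

    def build(self):
        # Each subgraph is constructed by its own dedicated method.
        query_subgraph = self._build_query_processing_subgraph()
        retrieval_subgraph = self._build_retrieval_subgraph()
        generation_subgraph = self._build_generation_subgraph()
        return [query_subgraph, retrieval_subgraph, generation_subgraph]

    def _build_query_processing_subgraph(self): ...

    def _build_retrieval_subgraph(self): ...

    def _build_generation_subgraph(self): ...
```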
Benefits:
Clean organization: Each subgraph is a separate method
Real-world pattern: Matches production implementations
Easy to understand: Clear separation of concerns
Define the clean main state
Notice how the main state only contains 6 essential variables instead of the original 13!
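A sketch of what the trimmed-down main state could look like (field names are illustrative assumptions):

```python
from typing import TypedDict

# Hypothetical main state: only the 6 variables the top-level pipeline needs.
# All intermediate variables now live inside the subgraphs' own states.
class MainState(TypedDict):
    user_query: str
    processed_query: str
    context: str
    response: str
    retrieval_metadata: dict
    generation_metadata: dict
```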
2) Build Individual Subgraphs
Create the Query Processing Subgraph
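A minimal, library-agnostic sketch of this subgraph, with placeholder processing logic and explicit input/output maps (all names here are illustrative assumptions, not real gllm-pipeline API):

```python
from typing import TypedDict

# Hypothetical subgraph state: only the 3 variables query processing touches.
class QueryProcessingState(TypedDict):
    user_query: str
    cleaned_query: str
    processed_query: str

def query_processing_subgraph(state: QueryProcessingState) -> QueryProcessingState:
    # Placeholder logic — swap in your real QueryProcessor component.
    state["cleaned_query"] = state["user_query"].strip()
    state["processed_query"] = state["cleaned_query"].lower()
    return state

# Explicit contract with the main pipeline state: what comes in, what goes out.
INPUT_MAP = {"user_query": "user_query"}
OUTPUT_MAP = {"processed_query": "processed_query"}
```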
Benefits:
Clear responsibility: Only handles query processing
Clean state: Just 3 relevant variables
Easy testing: Can test query processing in isolation
Explicit mapping: Clear input/output contracts
Create the Retrieval Subgraph
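A sketch of the retrieval subgraph in the same style, again with placeholder logic standing in for your real retriever and context builder:

```python
from typing import TypedDict

# Hypothetical state for the retrieval subgraph — retrieval-related variables only.
class RetrievalState(TypedDict):
    processed_query: str
    documents: list[dict]
    context: str

def retrieval_subgraph(state: RetrievalState) -> RetrievalState:
    # Placeholder retrieval and context building — replace with your
    # DocumentRetriever / reranker / context-builder components.
    state["documents"] = [{"text": f"Passage relevant to: {state['processed_query']}"}]
    state["context"] = "\n".join(doc["text"] for doc in state["documents"])
    return state
```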
Benefits:
Focused functionality: Only handles document retrieval and context building
Isolated state: Contains only retrieval-related variables
Reusable: Can be used in different types of pipelines
Create the Generation Subgraph
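A sketch of the generation subgraph, showing the prompt → generation → validation flow with a stubbed LLM call (the stub is a hypothetical placeholder for your real generator):

```python
from typing import TypedDict

# Hypothetical state for the generation subgraph.
class GenerationState(TypedDict):
    processed_query: str
    context: str
    prompt: str
    response: str

def generate_with_llm(prompt: str) -> str:
    # Stub — replace with your real LLM component (e.g. from gllm-generation).
    return f"[LLM answer for prompt of {len(prompt)} chars]"

def generation_subgraph(state: GenerationState) -> GenerationState:
    # Prompt building → generation → a simple validation check.
    state["prompt"] = f"Context:\n{state['context']}\n\nQuestion: {state['processed_query']}"
    state["response"] = generate_with_llm(state["prompt"])
    assert state["response"], "validation: response must not be empty"
    return state
```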
Benefits:
Single responsibility: Only handles response generation
Clear data flow: Easy to understand prompt → generation → validation flow
Independent testing: Can test generation logic separately
3) Run the Subgraph Pipeline
Complete the pipeline builder
Add the complete implementation:
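Conceptually, composing subgraphs means running each one on its own slice of state and mapping declared outputs back into the main state. The following is a library-agnostic sketch of that mechanic, with tiny placeholder subgraphs; your pipeline library most likely handles this wiring for you:

```python
# Library-agnostic sketch of subgraph composition with explicit state mapping.
def run_subgraph(subgraph, main_state: dict, input_map: dict, output_map: dict) -> dict:
    # Build the subgraph's isolated state from the main state.
    sub_state = {sub_key: main_state[main_key] for sub_key, main_key in input_map.items()}
    sub_state = subgraph(sub_state)
    # Copy only the declared outputs back into the main state.
    for sub_key, main_key in output_map.items():
        main_state[main_key] = sub_state[sub_key]
    return main_state

# Tiny placeholder subgraphs, chained through the main state.
def process_query(s): return {**s, "processed_query": s["user_query"].strip().lower()}
def retrieve(s): return {**s, "context": f"Docs about {s['processed_query']}"}
def generate(s): return {**s, "response": f"Answer using: {s['context']}"}

main_state = {"user_query": "  What Is RAG?  "}
main_state = run_subgraph(process_query, main_state, {"user_query": "user_query"}, {"processed_query": "processed_query"})
main_state = run_subgraph(retrieve, main_state, {"processed_query": "processed_query"}, {"context": "context"})
main_state = run_subgraph(generate, main_state, {"context": "context"}, {"response": "response"})
```

Note that each subgraph only ever sees the keys declared in its input map, which is what keeps the subgraph states isolated from one another.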
Run the pipeline
Observe the improved output
You should see much cleaner debug output with clearly separated subgraph execution:
Benefits of the subgraph output:
Clear boundaries: Easy to see where each logical unit starts and ends
Focused debugging: Problems can be isolated to specific subgraphs
Progress tracking: Better visibility into pipeline execution progress
Comparison: Before vs After
By transforming the monolithic pipeline into subgraphs, we achieved:
Before: 15 steps in one pipeline → After: 3 focused subgraphs
Before: 13 cluttered state variables → After: 6 clean state variables
Before: Hard to test individual components → After: Easy independent testing
Before: Changes risk breaking other parts → After: Isolated modifications
Before: Unclear responsibilities → After: Single responsibility per subgraph
Troubleshooting
Common Issues
State mapping errors between subgraphs:
Ensure all required input states are properly mapped in input_map
Verify that output states from one subgraph match input requirements of the next
Check that variable names are consistent across subgraph boundaries
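One way to catch mapping mistakes early is a small validation helper. This is an illustrative utility, not part of any library:

```python
def validate_mapping(input_map: dict, upstream_outputs: set, subgraph_inputs: set) -> list[str]:
    """Return human-readable problems with a subgraph's input mapping."""
    problems = []
    for sub_key, main_key in input_map.items():
        if main_key not in upstream_outputs:
            problems.append(f"'{main_key}' is not produced upstream")
        if sub_key not in subgraph_inputs:
            problems.append(f"'{sub_key}' is not an input of this subgraph")
    return problems
```

Running this check for every subgraph boundary at build time turns silent key mismatches into explicit error messages.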
Subgraph isolation breaking shared dependencies:
Make sure each subgraph includes all components it needs
Avoid assuming components are available from other subgraphs
Consider creating shared component factories for reusable dependencies
Complex debugging across multiple subgraphs:
Use meaningful names for each subgraph for easier identification
Enable debug mode to see subgraph boundaries in execution logs
Test individual subgraphs in isolation before combining them
Component implementation confusion:
Remember that the example components (QueryProcessor, etc.) are placeholders
Replace with your actual component implementations
Focus on the subgraph structure patterns, not the specific components
Debug Tips
Test subgraphs individually: Each subgraph should work independently with its own state
Use descriptive subgraph names: This makes debugging much easier
Enable debug logging: Set debug: true to see subgraph execution boundaries
Validate state schemas: Ensure each subgraph's state schema matches its actual usage
Map states explicitly: Always use explicit state mapping rather than relying on defaults
Follow production patterns: Organize subgraphs as methods within a pipeline builder class
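For example, a subgraph that owns its own state can be exercised with a plain unit test, with no retrieval or generation components in sight (the subgraph below is a placeholder; test yours the same way):

```python
# Placeholder subgraph — replace with one of your real subgraph builders.
def query_processing_subgraph(state: dict) -> dict:
    state["processed_query"] = state["user_query"].strip().lower()
    return state

def test_query_processing_in_isolation():
    # Only this subgraph's own state is needed — no other components.
    result = query_processing_subgraph({"user_query": "  Hello World  "})
    assert result["processed_query"] == "hello world"

test_query_processing_in_isolation()
```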
Congratulations! You've successfully learned how to use pipeline subgraphs to transform complex, monolithic pipelines into clean, maintainable, and testable modular systems. By following production patterns with a single pipeline builder class, your subgraph organization will be both powerful and practical for real-world applications.