Implement Semantic Routing
This guide will walk you through setting up semantic routing in your RAG pipeline to intelligently route different types of queries to specialized handlers.
Installation
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc
Set Up Your Project
Prepare your repository
Let’s prepare your workspace step by step.
Create a new project folder:
mkdir my-semantic-routing-pipeline
cd my-semantic-routing-pipeline
Prepare your .env file
Create a file named .env in your project directory with the following content:
EMBEDDING_MODEL="text-embedding-3-small"
LANGUAGE_MODEL="gpt-4.1"
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Arrange your project structure to include the semantic routing components:
my-semantic-routing-pipeline/
├── modules/
│   ├── __init__.py
│   ├── semantic_router.py   # 👈 New
│   └── handlers.py          # 👈 New
├── router_pipeline.py       # 👈 New
└── main.py
Build Semantic Routing Components
Now let's build the components that will enable intelligent query routing.
Create the Semantic Router
The semantic router analyzes incoming queries and determines which specialized handler should process them. It uses embedding similarity to match queries against predefined route examples.
Load environment settings and dependencies
Create modules/semantic_router.py and start with the basic imports:
import os
from dotenv import load_dotenv
from gllm_misc.router.similarity_based_router import SimilarityBasedRouter
from gllm_inference.em_invoker import OpenAIEMInvoker
load_dotenv()
Set up the embedding model for routing
The semantic router needs an embedding model to understand query meanings:
em_invoker_openai = OpenAIEMInvoker(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name=os.environ["EMBEDDING_MODEL"],  # "text-embedding-3-small" from .env
)
🧠 We use the same embedding model as in your retriever for consistency.
Define route examples
Route examples are the core of semantic routing: for each route category, you define example queries that incoming queries will be matched against:
def semantic_router_component():
    # Define route examples for different categories
    route_examples = {
        "code_generation": [
            "Write a Python script that reads a CSV file, filters rows where the 'status' column is 'active', and saves the result to a new CSV",
            "Generate a Java function that takes a list of integers and returns a new list containing only the prime numbers",
            "Create a SQL query to join two tables: orders and customers, returning the customer name, order date, and total amount for orders placed in the last 30 days",
            "Generate a Dockerfile for a Flask application running on Python 3.11, exposing port 5000",
            "Write a Python code to sort a dataframe based on the 'date' and 'value' columns",
            "Write a Python code to calculate the average of a list of numbers",
            "Write a Python code to calculate the median of a list of numbers",
            "Write a Python code to calculate the mode of a list of numbers",
            "Write a Python code to calculate the standard deviation of a list of numbers",
            "Write a Python code to calculate the variance of a list of numbers",
            "Write a Python code to calculate the correlation between two lists of numbers",
        ],
        "general": [
            "What is the capital of France?",
            "General knowledge question",
            "Tell me about history",
            "What is the meaning of life?",
            "How does photosynthesis work?",
            "What are the benefits of exercise?",
            "Tell me about space exploration",
            "What is machine learning?",
            "How do plants grow?",
            "What is the population of Tokyo?",
        ],
    }
How it works:
The router compares incoming queries against these examples using embedding similarity
More examples generally mean better routing accuracy
Examples should be diverse and representative of each category
Create the similarity-based router
Finally, still inside semantic_router_component, instantiate the router with your configuration:
    similarity_router = SimilarityBasedRouter(
        em_invoker=em_invoker_openai,
        route_examples=route_examples,
        default_route="general",
        similarity_threshold=0.6,
    )
    return similarity_router
Parameters explained:
em_invoker: The embedding model used to calculate similarities
route_examples: Your predefined examples for each route
default_route: Fallback route when no good match is found
similarity_threshold: Minimum similarity score required to match a route (0.6 = 60% similarity)
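To build intuition for these parameters, here is a toy re-implementation of the routing idea in plain Python. This is a sketch, not the SimilarityBasedRouter internals: the 2-D vectors are hand-written stand-ins for real embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route_query(query_vec, example_vecs_by_route, default_route, threshold):
    # Best similarity per route, taken over that route's example vectors.
    best_route, best_score = default_route, 0.0
    for route, vecs in example_vecs_by_route.items():
        score = max(cosine(query_vec, v) for v in vecs)
        if score > best_score:
            best_route, best_score = route, score
    # Fall back to the default route when no example is similar enough.
    return best_route if best_score >= threshold else default_route

# Hand-written 2-D "embeddings" standing in for real model output.
examples = {
    "code_generation": [[1.0, 0.1], [0.9, 0.2]],
    "general": [[0.1, 1.0], [0.2, 0.9]],
}
print(route_query([0.95, 0.15], examples, "general", 0.6))   # code_generation
print(route_query([0.5, 0.5], examples, "general", 0.99))    # general (below threshold)
```

Raising the threshold makes the router stricter and pushes ambiguous queries to the default route, which is exactly the trade-off you tune with similarity_threshold.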
Create Specialized Handlers
Different types of queries need different handling approaches. Let's create specialized response synthesizers for each route type.
Create the handlers file
Create modules/handlers.py with the necessary imports:
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_generation.response_synthesizer import StuffResponseSynthesizer
from gllm_inference.request_processor import LMRequestProcessor
from gllm_inference.prompt_builder import PromptBuilder
Create the code generation handler
This handler is optimized for generating code responses:
def code_generation_handler() -> StuffResponseSynthesizer:
    """Create a response synthesizer that handles code generation queries."""
    lm_invoker = OpenAILMInvoker(model_name="gpt-4.1")
    system_template = """You are a helpful assistant that provides code based on the user's query.
Be concise and to the point. Answer with only the code, no other text."""
    prompt_builder = PromptBuilder(
        system_template=system_template,
        user_template="User's query: {query}",
    )
    return StuffResponseSynthesizer(
        LMRequestProcessor(
            lm_invoker=lm_invoker,
            prompt_builder=prompt_builder,
        ),
    )
Key features:
Uses a specialized system prompt for code generation
Configured to return concise, code-focused responses
No retrieval needed; this is pure generation
Create the general query handler
This handler is optimized for general knowledge questions:
def general_query_handler() -> StuffResponseSynthesizer:
    """Create a response synthesizer that handles general knowledge queries."""
    lm_invoker = OpenAILMInvoker(model_name="gpt-4.1")
    system_template = """You are a helpful assistant that provides accurate and informative answers to general knowledge questions.
Be concise but thorough in your responses."""
    prompt_builder = PromptBuilder(
        system_template=system_template,
        user_template="Question: {query}",
    )
    return StuffResponseSynthesizer(
        LMRequestProcessor(
            lm_invoker=lm_invoker,
            prompt_builder=prompt_builder,
        )
    )
Key differences:
Different system prompt optimized for general knowledge
Encourages thorough but concise responses
Could be extended to use different models or retrieval strategies
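To see what these two templates amount to, here is a sketch of how a prompt builder typically combines them into chat messages. The message format below is an assumption for illustration, not the PromptBuilder internals:

```python
def build_messages(system_template, user_template, **kwargs):
    # Fill the user template's {placeholders} and pair it with the system prompt.
    return [
        {"role": "system", "content": system_template},
        {"role": "user", "content": user_template.format(**kwargs)},
    ]

messages = build_messages(
    "You are a helpful assistant that provides code based on the user's query.",
    "User's query: {query}",
    query="Write a Python function to calculate factorial",
)
print(messages[1]["content"])  # User's query: Write a Python function to calculate factorial
```

Because each handler owns its own system template, swapping handlers effectively swaps the model's instructions per query type.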
Build the Pipeline
Now we'll create a new pipeline that combines semantic routing with conditional execution.
Understanding Conditional Steps
A ConditionalStep allows your pipeline to make decisions about which path to take based on runtime conditions. Here's how it works:
Router Step: Analyzes the query and determines the route
Conditional Step: Uses the route to decide which handler to execute
Handler Execution: Runs the appropriate specialized handler
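These three steps can be sketched in plain Python. This is a hypothetical model of the mechanism for intuition, not the gllm_pipeline implementation:

```python
def conditional_step(state, condition, branches, input_key, output_key):
    # Router step: the condition inspects the query and picks a route name.
    route = condition(state[input_key])
    # Conditional step: look up the handler registered under that route.
    handler = branches[route]
    # Handler execution: run it and store the result in the state.
    state[output_key] = handler(state[input_key])
    return state

# Toy condition and handlers standing in for the router and synthesizers.
condition = lambda q: "code_generation" if "python" in q.lower() else "general"
branches = {
    "code_generation": lambda q: "CODE: " + q,
    "general": lambda q: "ANSWER: " + q,
}

state = {"user_query": "Write a Python function to calculate factorial"}
state = conditional_step(state, condition, branches, "user_query", "response")
print(state["response"])  # CODE: Write a Python function to calculate factorial
```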
Create the Routing Pipeline
Create the routing pipeline file
Create router_pipeline.py
with the necessary imports:
from gllm_pipeline.pipeline import Pipeline  # adjust this import path if your gllm_pipeline version differs
from gllm_pipeline.pipeline.states import RAGState
from gllm_pipeline.steps import step
from gllm_pipeline.steps.conditional_step import ConditionalStep
from modules import (
    code_generation_handler,
    general_query_handler,
    semantic_router_component,
)
Define custom state
Extend the default RAGState to include routing information:
class RouterState(RAGState):
    route: str
This adds a route field to track which route was selected.
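Assuming RAGState behaves like a TypedDict-style state class (an assumption here; check the library docs), the extension mechanism looks like this in plain Python, with BaseState standing in for RAGState:

```python
from typing import TypedDict

class BaseState(TypedDict):
    # Stand-in for the library's RAGState.
    user_query: str
    response: str

class RouterState(BaseState):
    # The extra field the router writes its decision into.
    route: str

state: RouterState = {"user_query": "hi", "response": "", "route": "general"}
print(state["route"])  # general
```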
Create component instances
Instantiate all the components you'll need:
semantic_router_component = semantic_router_component()
code_generation_component = code_generation_handler()
general_query_component = general_query_handler()
Create individual handler steps
Define steps for each specialized handler:
code_generation_step = step(
    code_generation_component,
    {"query": "user_query"},
    "response",
)
general_query_step = step(
    general_query_component,
    {"query": "user_query"},
    "response",
)
Both steps take the user query and output a response, but use different handlers.
Create the conditional step
This is where the magic happens. The conditional step chooses which handler to execute:
conditional_step = ConditionalStep(
    name="conditional_step",
    branches={"code_generation": code_generation_step, "general": general_query_step},
    condition=semantic_router_component,
    input_state_map={"source": "user_query"},
    output_state="response",
)
Parameters explained:
branches: Maps route names to their corresponding steps
condition: The router component that determines which branch to take
input_state_map: Passes the user query to the router for decision making
output_state: Where to store the final response
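A plain-Python model of how these parameters interact (hypothetical, for intuition only): input_state_map renames state keys into the arguments the condition expects, and output_state names the key where the chosen branch's result is stored.

```python
def run_conditional(state, condition, branches, input_state_map, output_state):
    # Rename state keys into the condition's expected argument names,
    # e.g. {"source": "user_query"} feeds state["user_query"] as `source`.
    inputs = {arg: state[key] for arg, key in input_state_map.items()}
    route = condition(**inputs)
    # Dispatch to the branch registered for the selected route.
    result = branches[route](state["user_query"])
    return {**state, "route": route, output_state: result}

condition = lambda source: "code_generation" if "sql" in source.lower() else "general"
branches = {
    "code_generation": lambda q: "generated SQL for: " + q,
    "general": lambda q: "general answer for: " + q,
}

new_state = run_conditional(
    {"user_query": "Create a SQL query to find all users"},
    condition,
    branches,
    input_state_map={"source": "user_query"},
    output_state="response",
)
print(new_state["route"], "->", new_state["response"])
```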
Compose the final pipeline
Connect the router and conditional steps:
e2e_pipeline_with_semantic_router = Pipeline([conditional_step], state_type=RouterState)
This creates a pipeline that:
Routes the query to determine the appropriate handler
Conditionally executes the selected handler
Returns the specialized response
Run the Application
Now let's test the semantic routing functionality with different types of queries.
Start your server
Run your FastAPI server as before:
poetry run uvicorn main:app --reload
You should see something like:
INFO: Uvicorn running on http://127.0.0.1:8000
Test with code generation queries
Modify the prompts in the run.py file. Try these queries with debug: true to see the routing in action:
Code Generation Examples:
{
  "user_query": "Write a Python function to calculate factorial",
  "debug": true
}
{
  "user_query": "Create a SQL query to find all users who logged in today",
  "debug": true
}
You should see in the debug logs that these get routed to the code_generation handler.
Test with general knowledge queries
Try these general knowledge questions:
General Knowledge Examples:
{
  "user_query": "What is the capital of Japan?",
  "debug": true
}
{
  "user_query": "How does photosynthesis work?",
  "debug": true
}
These should be routed to the general handler.
Verify routing decisions
With debug: true, you should see logs showing:
Which route was selected
The similarity scores for each route
Which handler was executed
The specialized response format
Observe how the similarity threshold affects routing decisions.
Example debug output:
Starting pipeline
[Start 'SimilarityBasedRouter'] Routing input source: 'Generate python code calculate the average of a list of numbers'
[Finished 'SimilarityBasedRouter'] Successfully selected route: 'code_generation'
[Start 'StuffResponseSynthesizer'] Processing query: 'Generate python code calculate the average of a list of numbers'
Understanding the Flow
Here's what happens when a query comes in:
Query Analysis: The semantic router compares the incoming query against all route examples using embedding similarity
Route Selection: The route with the highest similarity score (above the threshold) is selected
Conditional Execution: The ConditionalStep executes the appropriate handler based on the selected route
Specialized Processing: The specialized handler processes the query with its optimized prompt and model configuration
Response Generation: The handler returns a response tailored to the query type
Troubleshooting
Routes not working as expected:
Check your route examples - they should be representative and diverse
Verify the similarity threshold isn't too high or too low
Add more examples for better classification
All queries going to default route:
Lower the similarity threshold
Add more diverse examples to your route categories
Check that your embedding model is working correctly
Wrong route selection:
Review and improve your route examples
Consider adding negative examples or adjusting thresholds
Use debug mode to see similarity scores
📂 Complete Tutorial Files
Congratulations! You've successfully implemented semantic routing in your RAG pipeline. This intelligent routing system will help you deliver more relevant and specialized responses based on the type of query your users submit.