Implement Semantic Routing
This guide will walk you through setting up semantic routing in your RAG pipeline to intelligently route different types of queries to specialized handlers.
Installation
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc
Set Up Your Project
Prepare your repository
Let’s prepare your workspace step by step.
Create a new project folder:
mkdir my-semantic-routing-pipeline
cd my-semantic-routing-pipeline
Prepare your .env file
Create a file named .env in your project directory with the following content:
EMBEDDING_MODEL="text-embedding-3-small"
LANGUAGE_MODEL="gpt-4.1"
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Arrange your project structure to include the semantic routing components:
my-semantic-routing-pipeline/
├── modules/
│   ├── __init__.py
│   ├── semantic_router.py   # 👈 New
│   └── handlers.py          # 👈 New
├── router_pipeline.py       # 👈 New
└── main.py
Build Semantic Routing Components
Now let's build the components that will enable intelligent query routing.
Create the Semantic Router
The semantic router analyzes incoming queries and determines which specialized handler should process them. It uses embedding similarity to match queries against predefined route examples.
Load environment settings and dependencies
Create modules/semantic_router.py and start with the basic imports:
import os
from dotenv import load_dotenv
from gllm_misc.router.similarity_based_router import SimilarityBasedRouter
from gllm_inference.em_invoker import OpenAIEMInvoker
load_dotenv()
Set up the embedding model for routing
The semantic router needs an embedding model to understand query meanings:
em_invoker_openai = OpenAIEMInvoker(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name=os.environ["EMBEDDING_MODEL"],  # "text-embedding-3-small" from .env
)
🧠 We use the same embedding model as in your retriever for consistency.
Define route examples
Route examples are the core of semantic routing: for each route category, you define example queries that incoming queries will be matched against:
def semantic_router_component():
    # Define route examples for different categories
    route_examples = {
        "code_generation": [
            "Write a Python script that reads a CSV file, filters rows where the 'status' column is 'active', and saves the result to a new CSV",
            "Generate a Java function that takes a list of integers and returns a new list containing only the prime numbers",
            "Create a SQL query to join two tables: orders and customers, returning the customer name, order date, and total amount for orders placed in the last 30 days",
            "Generate a Dockerfile for a Flask application running on Python 3.11, exposing port 5000",
            "Write a Python code to sort a dataframe based on the 'date' and 'value' columns",
            "Write a Python code to calculate the average of a list of numbers",
            "Write a Python code to calculate the median of a list of numbers",
            "Write a Python code to calculate the mode of a list of numbers",
            "Write a Python code to calculate the standard deviation of a list of numbers",
            "Write a Python code to calculate the variance of a list of numbers",
            "Write a Python code to calculate the correlation between two lists of numbers",
        ],
        "general": [
            "What is the capital of France?",
            "General knowledge question",
            "Tell me about history",
            "What is the meaning of life?",
            "How does photosynthesis work?",
            "What are the benefits of exercise?",
            "Tell me about space exploration",
            "What is machine learning?",
            "How do plants grow?",
            "What is the population of Tokyo?",
        ],
    }
How it works:
The router compares incoming queries against these examples using embedding similarity
More examples generally mean better routing accuracy
Examples should be diverse and representative of each category
Create the similarity-based router
Finally, still inside semantic_router_component, instantiate the router with your configuration:
    similarity_router = SimilarityBasedRouter(
        em_invoker=em_invoker_openai,
        route_examples=route_examples,
        default_route="general",
        similarity_threshold=0.6,
    )
    return similarity_router
Parameters explained:
em_invoker: The embedding model used to calculate similarities
route_examples: Your predefined examples for each route
default_route: Fallback route when no good match is found
similarity_threshold: Minimum similarity score required to match a route (0.6 = 60% similarity)
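To build intuition for these parameters, here is a toy re-implementation of the routing idea in plain Python. This is a sketch, not the SimilarityBasedRouter internals: the 2-D vectors are hand-written stand-ins for real embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route_query(query_vec, example_vecs_by_route, default_route, threshold):
    # Best similarity per route, taken over that route's example vectors.
    best_route, best_score = default_route, 0.0
    for route, vecs in example_vecs_by_route.items():
        score = max(cosine(query_vec, v) for v in vecs)
        if score > best_score:
            best_route, best_score = route, score
    # Fall back to the default route when no example is similar enough.
    return best_route if best_score >= threshold else default_route

# Hand-written 2-D "embeddings" standing in for real model output.
examples = {
    "code_generation": [[1.0, 0.1], [0.9, 0.2]],
    "general": [[0.1, 1.0], [0.2, 0.9]],
}
print(route_query([0.95, 0.15], examples, "general", 0.6))   # code_generation
print(route_query([0.5, 0.5], examples, "general", 0.99))    # general (below threshold)
```

Raising the threshold makes the router stricter and pushes ambiguous queries to the default route, which is exactly the trade-off you tune with similarity_threshold.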
Create Specialized Handlers
Different types of queries need different handling approaches. Let's create specialized response synthesizers for each route type.
Create the handlers file
Create modules/handlers.py with the necessary imports:
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_generation.response_synthesizer import StuffResponseSynthesizer
from gllm_inference.request_processor import LMRequestProcessor
from gllm_inference.prompt_builder import PromptBuilder
Create the code generation handler
This handler is optimized for generating code responses:
def code_generation_handler() -> StuffResponseSynthesizer:
    """Create a response synthesizer that handles code generation queries."""
    lm_invoker = OpenAILMInvoker(model_name="gpt-4.1")
    system_template = """You are a helpful assistant that provides code based on the user's query.
Be concise and to the point. Answer with only the code, no other text."""
    prompt_builder = PromptBuilder(
        system_template=system_template,
        user_template="User's query: {query}",
    )
    return StuffResponseSynthesizer(
        LMRequestProcessor(
            lm_invoker=lm_invoker,
            prompt_builder=prompt_builder,
        ),
    )
Key features:
Uses a specialized system prompt for code generation
Configured to return concise, code-focused responses
No retrieval needed; this is pure generation
Create the general query handler
This handler is optimized for general knowledge questions:
def general_query_handler() -> StuffResponseSynthesizer:
    """Create a response synthesizer that handles general knowledge queries."""
    lm_invoker = OpenAILMInvoker(model_name="gpt-4.1")
    system_template = """You are a helpful assistant that provides accurate and informative answers to general knowledge questions.
Be concise but thorough in your responses."""
    prompt_builder = PromptBuilder(
        system_template=system_template,
        user_template="Question: {query}",
    )
    return StuffResponseSynthesizer(
        LMRequestProcessor(
            lm_invoker=lm_invoker,
            prompt_builder=prompt_builder,
        )
    )
Key differences:
Different system prompt optimized for general knowledge
Encourages thorough but concise responses
Could be extended to use different models or retrieval strategies
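To see what these two templates amount to, here is a sketch of how a prompt builder typically combines them into chat messages. The message format below is an assumption for illustration, not the PromptBuilder internals:

```python
def build_messages(system_template, user_template, **kwargs):
    # Fill the user template's {placeholders} and pair it with the system prompt.
    return [
        {"role": "system", "content": system_template},
        {"role": "user", "content": user_template.format(**kwargs)},
    ]

messages = build_messages(
    "You are a helpful assistant that provides code based on the user's query.",
    "User's query: {query}",
    query="Write a Python function to calculate factorial",
)
print(messages[1]["content"])  # User's query: Write a Python function to calculate factorial
```

Because each handler owns its own system template, swapping handlers effectively swaps the model's instructions per query type.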
Build the Pipeline
Now we'll create a new pipeline that combines semantic routing with conditional execution.
Understanding Conditional Steps
A ConditionalStep allows your pipeline to make decisions about which path to take based on runtime conditions. Here's how it works:
Router Step: Analyzes the query and determines the route
Conditional Step: Uses the route to decide which handler to execute
Handler Execution: Runs the appropriate specialized handler
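These three steps can be sketched in plain Python. This is a hypothetical model of the mechanism for intuition, not the gllm_pipeline implementation:

```python
def conditional_step(state, condition, branches, input_key, output_key):
    # Router step: the condition inspects the query and picks a route name.
    route = condition(state[input_key])
    # Conditional step: look up the handler registered under that route.
    handler = branches[route]
    # Handler execution: run it and store the result in the state.
    state[output_key] = handler(state[input_key])
    return state

# Toy condition and handlers standing in for the router and synthesizers.
condition = lambda q: "code_generation" if "python" in q.lower() else "general"
branches = {
    "code_generation": lambda q: "CODE: " + q,
    "general": lambda q: "ANSWER: " + q,
}

state = {"user_query": "Write a Python function to calculate factorial"}
state = conditional_step(state, condition, branches, "user_query", "response")
print(state["response"])  # CODE: Write a Python function to calculate factorial
```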
Create the Routing Pipeline
Create the routing pipeline file
Create router_pipeline.py
with the necessary imports:
from gllm_pipeline.pipeline import Pipeline  # adjust this import path if your gllm_pipeline version differs
from gllm_pipeline.pipeline.states import RAGState
from gllm_pipeline.steps import step
from gllm_pipeline.steps.conditional_step import ConditionalStep
from modules import (
    code_generation_handler,
    general_query_handler,
    semantic_router_component,
)
Define custom state
Extend the default RAGState to include routing information:
class RouterState(RAGState):
    route: str
This adds a route field to track which route was selected.
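Assuming RAGState behaves like a TypedDict-style state class (an assumption here; check the library docs), the extension mechanism looks like this in plain Python, with BaseState standing in for RAGState:

```python
from typing import TypedDict

class BaseState(TypedDict):
    # Stand-in for the library's RAGState.
    user_query: str
    response: str

class RouterState(BaseState):
    # The extra field the router writes its decision into.
    route: str

state: RouterState = {"user_query": "hi", "response": "", "route": "general"}
print(state["route"])  # general
```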
Create component instances
Instantiate all the components you'll need:
semantic_router_component = semantic_router_component()
code_generation_component = code_generation_handler()
general_query_component = general_query_handler()
Create individual handler steps
Define steps for each specialized handler:
code_generation_step = step(
    code_generation_component,
    {"query": "user_query"},
    "response",
)
general_query_step = step(
    general_query_component,
    {"query": "user_query"},
    "response",
)
Both steps take the user query and output a response, but use different handlers.
Create the conditional step
This is where the magic happens. The conditional step chooses which handler to execute:
conditional_step = ConditionalStep(
    name="conditional_step",
    branches={"code_generation": code_generation_step, "general": general_query_step},
    condition=semantic_router_component,
    input_state_map={"source": "user_query"},
    output_state="response",
)
Parameters explained:
branches: Maps route names to their corresponding steps
condition: The router component that determines which branch to take
input_state_map: Passes the user query to the router for decision making
output_state: Where to store the final response
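A plain-Python model of how these parameters interact (hypothetical, for intuition only): input_state_map renames state keys into the arguments the condition expects, and output_state names the key where the chosen branch's result is stored.

```python
def run_conditional(state, condition, branches, input_state_map, output_state):
    # Rename state keys into the condition's expected argument names,
    # e.g. {"source": "user_query"} feeds state["user_query"] as `source`.
    inputs = {arg: state[key] for arg, key in input_state_map.items()}
    route = condition(**inputs)
    # Dispatch to the branch registered for the selected route.
    result = branches[route](state["user_query"])
    return {**state, "route": route, output_state: result}

condition = lambda source: "code_generation" if "sql" in source.lower() else "general"
branches = {
    "code_generation": lambda q: "generated SQL for: " + q,
    "general": lambda q: "general answer for: " + q,
}

new_state = run_conditional(
    {"user_query": "Create a SQL query to find all users"},
    condition,
    branches,
    input_state_map={"source": "user_query"},
    output_state="response",
)
print(new_state["route"], "->", new_state["response"])
```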
Compose the final pipeline
Connect the router and conditional steps:
e2e_pipeline_with_semantic_router = Pipeline([conditional_step], state_type=RouterState)
This creates a pipeline that:
Routes the query to determine the appropriate handler
Conditionally executes the selected handler
Returns the specialized response
Run the Application
Now let's test the semantic routing functionality with different types of queries.
Start your server
Run your FastAPI server as before:
poetry run uvicorn main:app --reload
You should see something like:
INFO: Uvicorn running on http://127.0.0.1:8000
Test with code generation queries
Modify the prompts in the run.py file. Try these queries with debug: true to see the routing in action:
Code Generation Examples:
{
  "user_query": "Write a Python function to calculate factorial",
  "debug": true
}
{
  "user_query": "Create a SQL query to find all users who logged in today",
  "debug": true
}
You should see in the debug logs that these get routed to the code_generation handler.
Test with general knowledge queries
Try these general knowledge questions:
General Knowledge Examples:
{
  "user_query": "What is the capital of Japan?",
  "debug": true
}
{
  "user_query": "How does photosynthesis work?",
  "debug": true
}
These should be routed to the general handler.
Verify routing decisions
With debug: true, you should see logs showing:
Which route was selected
The similarity scores for each route
Which handler was executed
The specialized response format
Observe how the similarity threshold affects routing decisions.
Example debug output:
Starting pipeline
[Start 'SimilarityBasedRouter'] Routing input source: 'Generate python code calculate the average of a list of numbers'
[Finished 'SimilarityBasedRouter'] Successfully selected route: 'code_generation'
[Start 'StuffResponseSynthesizer'] Processing query: 'Generate python code calculate the average of a list of numbers'
Understanding the Flow
Here's what happens when a query comes in:
Query Analysis: The semantic router compares the incoming query against all route examples using embedding similarity
Route Selection: The route with the highest similarity score (above the threshold) is selected
Conditional Execution: The ConditionalStep executes the appropriate handler based on the selected route
Specialized Processing: The specialized handler processes the query with its optimized prompt and model configuration
Response Generation: The handler returns a response tailored to the query type
Troubleshooting
Routes not working as expected:
Check your route examples - they should be representative and diverse
Verify the similarity threshold isn't too high or too low
Add more examples for better classification
All queries going to default route:
Lower the similarity threshold
Add more diverse examples to your route categories
Check that your embedding model is working correctly
Wrong route selection:
Review and improve your route examples
Consider adding negative examples or adjusting thresholds
Use debug mode to see similarity scores
📂 Complete Tutorial Files
Congratulations! You've successfully implemented semantic routing in your RAG pipeline. This intelligent routing system will help you deliver more relevant and specialized responses based on the type of query your users submit.