Multimodal Input Handling

This guide walks you through adding multimodal input handling to your existing RAG pipeline, so that it can process more than just text inputs.

This tutorial extends the Your First RAG Pipeline tutorial. Ensure you have followed the instructions to set up your repository.

Prerequisites

This example specifically requires:

Completion of the Your First RAG Pipeline tutorial - this builds directly on top of it
Completion of all setup steps listed on the Prerequisites page
A working OpenAI API key configured in your environment variables

You should be familiar with these concepts and components:

Your First RAG Pipeline - Required foundation
Routing
switch

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

REM Windows Command Prompt; you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-rag gllm-core gllm-generation gllm-inference gllm-pipeline gllm-retrieval gllm-misc gllm-datastore

How to Use this Guide

You can either:

Download or copy the complete guide file(s) to get everything ready instantly by heading to the 📂 Complete Guide Files section at the end of this page. You can refer back to the guide whenever you need an explanation of how each part works.
Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options will work. Choose based on whether you prefer speed or learning by doing!

Project Setup

Start From Your RAG Pipeline Project

Start with your completed RAG pipeline project from the Your First RAG Pipeline tutorial. This tutorial does not add any new files, so the structure stays as is:

<project-name>/
├── data/
│   ├── <index>/...                     # preset data index folder
│   ├── chroma.sqlite3                  # preset database file
│   └── imaginary_animals.csv           # sample data
├── modules/
│   ├── retriever.py
│   └── response_synthesizer.py
├── .env
├── indexer.py                    
└── pipeline.py    # 👈 Will be adjusted for multimodal input handling

1) Adding Multimodal Inputs Handling

Extending the Pipeline

Let's adjust the pipeline to handle multimodal inputs. In this tutorial, let's assume that the attachment files are passed as local paths through the pipeline state.

Define the extended state

Create a custom state that includes the attachment files as input as well as the extra contents list to be passed to the response synthesizer:

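from gllm_inference.schema import MessageContent

# RAGState is the base state class this tutorial extends; import it from your
# installed gllm-pipeline version if pipeline.py does not already do so.
class MultimodalRAGState(RAGState):
    attachments: list[str]  # local paths of the attached files
    extra_contents: list[MessageContent]  # contents passed to the response synthesizer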
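To make the shape of the extended state concrete, here is a self-contained sketch. It assumes that the state behaves like a `TypedDict`; `BaseRAGState` and `MultimodalRAGStateSketch` are hypothetical stand-ins so the snippet runs without the SDK, and the base keys shown are only the ones used on this page.

```python
from typing import Any, TypedDict, get_type_hints


class BaseRAGState(TypedDict, total=False):
    """Hypothetical stand-in for RAGState, reduced to the keys used on this page."""

    user_query: str
    chunks: list[Any]
    response: str


class MultimodalRAGStateSketch(BaseRAGState, total=False):
    attachments: list[str]     # local file paths supplied by the caller
    extra_contents: list[Any]  # filled in later by the formatting step


# The caller provides only the inputs; later steps add the remaining keys.
initial_state: MultimodalRAGStateSketch = {
    "user_query": "Aquatic animals",
    "attachments": ["dog.png"],
}

# Collect every declared key, including the inherited ones.
state_keys = sorted(get_type_hints(MultimodalRAGStateSketch))
```

Note that `extra_contents` is declared on the state but absent from the initial input: it is produced inside the pipeline rather than supplied by the caller.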

Create a function to create the extra contents

Our goal is to pass the input attachments as Attachment objects to the response synthesizer's extra_contents parameter. To do this, let's create a custom function!

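from typing import Any

# Attachment is assumed to live alongside MessageContent in gllm_inference.schema.
from gllm_inference.schema import Attachment, MessageContent

def format_extra_contents(inputs: dict[str, Any]) -> list[MessageContent]:
    # The attachments arrive as local file paths, not raw bytes.
    attachments: list[str] = inputs["attachments"]
    return [Attachment.from_path(path) for path in attachments]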
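If some requests carry no attachments at all, a more defensive variant may be useful. The sketch below is an illustration rather than part of the SDK: `format_extra_contents_safe` and its `load` parameter are hypothetical names, and `fake_load` stands in for `Attachment.from_path` so the snippet runs on its own.

```python
from typing import Any, Callable


def format_extra_contents_safe(
    inputs: dict[str, Any],
    load: Callable[[str], Any],
) -> list[Any]:
    """Convert attachment paths to content objects, tolerating a missing key."""
    # Treat a missing key, None, or an empty list the same way: no extra contents.
    paths = inputs.get("attachments") or []
    return [load(path) for path in paths]


# Stand-in loader so the sketch runs without the SDK; in the pipeline you
# would pass Attachment.from_path instead.
def fake_load(path: str) -> dict[str, str]:
    return {"path": path}


text_only = format_extra_contents_safe({}, fake_load)
with_image = format_extra_contents_safe({"attachments": ["dog.png"]}, fake_load)
```

Because this variant takes a second argument, you would need to bind the loader first (for example with `functools.partial`) before handing it to a step that calls the function with the inputs dictionary only.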

Update the response synthesizer with a new prompt

Replace the response synthesizer's prompt with one that exercises the multimodal functionality.

import os

from dotenv import load_dotenv
# Import the class that is actually called below.
from gllm_generation.response_synthesizer import ResponseSynthesizer
from gllm_inference.builder import build_lm_request_processor

load_dotenv()

response_synthesizer = ResponseSynthesizer.stuff(
    lm_request_processor=build_lm_request_processor(
        model_id=os.getenv("LANGUAGE_MODEL"),
        credentials=os.getenv("OPENAI_API_KEY"),
        system_template="""Create an imaginary animal that is similar to the animal in the picture. Context: {context}""",
        user_template="Question: {query}",
    )
)

Update the pipeline steps

Define the step to format extra contents and add the extra content param to the response synthesizer.

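# Assumed import path; `step` should already be imported in pipeline.py
# from the previous tutorial.
from gllm_pipeline.steps import step, transform

format_extra_contents_step = transform(  # 👈 New step
    format_extra_contents,
    ["attachments"],
    "extra_contents",
)

response_synthesizer_step = step(
    response_synthesizer,
    {
        "query": "user_query",
        "chunks": "chunks",
        "extra_contents": "extra_contents",  # 👈 New parameter
    },
    "response",
)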
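As a mental model for the new step, the plain-Python sketch below mimics the data flow implied by the call above: read the listed input keys from the state, pass them to the function as a dictionary, and write the result under the output key. This is not how gllm-pipeline implements `transform`; `apply_transform` and `describe_attachments` are hypothetical helpers used only for illustration.

```python
from typing import Any, Callable


def apply_transform(
    state: dict[str, Any],
    func: Callable[[dict[str, Any]], Any],
    input_keys: list[str],
    output_key: str,
) -> dict[str, Any]:
    """Illustrative only: read the input keys, then write one output key."""
    inputs = {key: state[key] for key in input_keys}
    # Return a new dictionary so the original state is left untouched.
    return {**state, output_key: func(inputs)}


def describe_attachments(inputs: dict[str, Any]) -> list[str]:
    # Stand-in for format_extra_contents; it only tags each path.
    return [f"attachment:{path}" for path in inputs["attachments"]]


state = {"user_query": "Aquatic animals", "attachments": ["dog.png"]}
new_state = apply_transform(
    state, describe_attachments, ["attachments"], "extra_contents"
)
```

The response synthesizer step can then pick up `extra_contents` from the state in the same way it picks up `user_query` and `chunks`.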

Compose the final pipeline

Chain all steps to create the complete multimodal pipeline:

e2e_pipeline = format_extra_contents_step | retrieve_step | response_synthesizer_step
e2e_pipeline.state_type = MultimodalRAGState

This creates a pipeline that can handle multimodal input files.

2) Run the Pipeline

When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

You can ignore this message. It comes from Chroma's telemetry, which this tutorial does not use, and your pipeline should still work.

Now that the pipeline is all set, let's try it!

Configure the pipeline state for testing

import asyncio


async def main():
    state = {
        "user_query": "Aquatic animals",
        "attachments": ["dog.png"],  # local path to the attached image
    }
    config = {"top_k": 5}
    result = await e2e_pipeline.invoke(state, config)
    print(f"Pipeline result: {result['response']}")


if __name__ == "__main__":
    asyncio.run(main())

And that's it! Your pipeline should now be able to handle the attached multimodal files!

Troubleshooting

Attachment loading fails:
1. Ensure that the file exists at the given local path.
2. Ensure that the path is valid. Check whether you are using a full path or a relative path.

LM invocation fails:
1. Ensure that the model you're using supports the attachment type and extension.
2. Ensure that the attachment size does not exceed the model token limit.
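Several of the checks above can be run before invoking the pipeline. The following is a minimal sketch using only the standard library; `check_attachments` is a hypothetical helper, and `MAX_BYTES` is an arbitrary illustrative value. File size in bytes is only a rough proxy for a model's token limit, so consult your model's documentation for the real constraint.

```python
import mimetypes
from pathlib import Path

# Hypothetical limit for illustration; check your model's documentation.
MAX_BYTES = 5 * 1024 * 1024


def check_attachments(paths: list[str], max_bytes: int = MAX_BYTES) -> list[str]:
    """Return a human-readable problem for each attachment that looks unusable."""
    problems = []
    for raw in paths:
        # Resolve relative paths against the current working directory.
        path = Path(raw).expanduser().resolve()
        if not path.is_file():
            problems.append(f"{raw}: not found (resolved to {path})")
            continue
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None:
            problems.append(f"{raw}: unrecognized file extension")
        if path.stat().st_size > max_bytes:
            problems.append(f"{raw}: larger than {max_bytes} bytes")
    return problems


problems = check_attachments(["definitely_missing_file.png"])
```

Printing the resolved path in the message helps with the relative-versus-full-path issue, since it shows exactly where the file was looked up.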

Congratulations! You've successfully enhanced your RAG pipeline with multimodal input handling, allowing your application to process more than just text inputs!
