[BETA] Realtime Session



What’s a Realtime Session?

The realtime session is a unified interface designed to help you interact with language models that support realtime interactions. In this tutorial, you'll learn how to run a realtime session using the GoogleRealtimeSession module in just a few lines of code.

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

  2. Setting a Gemini API key in the GOOGLE_API_KEY environment variable.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-inference

Quickstart

Let’s jump into a basic example using GoogleRealtimeSession.

from dotenv import load_dotenv
load_dotenv()

import asyncio
from gllm_inference.realtime_session import GoogleRealtimeSession

realtime_session = GoogleRealtimeSession(model_name="gemini-2.5-flash-native-audio-preview-12-2025")
asyncio.run(realtime_session.start())

Notice that after the realtime session starts, the following message appears in the console:

The conversation starts:

The realtime session modules use a set of input and output streamers to define the input sources and output destinations when interacting with the language model. By default, GoogleRealtimeSession uses the following IO streamers:

  1. KeyboardInputStreamer: Sends text inputs typed on the keyboard to the model.

  2. ConsoleOutputStreamer: Displays text outputs from the model on the console.

This means that, by default, the GoogleRealtimeSession module supports text inputs and text outputs. Try typing on your keyboard to start interacting with the model!
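For illustration, here is what explicitly passing the default streamers could look like. This is a minimal sketch: the streamer import path and the input_streamers / output_streamers parameter names are assumptions, so check the API Reference for the exact signature.

from dotenv import load_dotenv
load_dotenv()

import asyncio
from gllm_inference.realtime_session import GoogleRealtimeSession

# Assumption: the default streamers are importable from the same package and
# are passed via `input_streamers` / `output_streamers`; verify against the API Reference.
from gllm_inference.realtime_session import ConsoleOutputStreamer, KeyboardInputStreamer

realtime_session = GoogleRealtimeSession(
    model_name="gemini-2.5-flash-native-audio-preview-12-2025",
    input_streamers=[KeyboardInputStreamer()],    # text typed on the keyboard
    output_streamers=[ConsoleOutputStreamer()],   # text printed to the console
)
asyncio.run(realtime_session.start())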

Interaction Example:

When you're done, you can type /quit to end the conversation.

Ending the conversation:

IO Streamer Customization

Now, let's try using other kinds of IO streamers! In the example below, we're going to utilize the LinuxMicInputStreamer and LinuxSpeakerOutputStreamer to converse with the model via audio inputs and audio outputs!
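Here is a minimal sketch of what that could look like. The import path and the input_streamers / output_streamers parameter names are assumptions, so consult the API Reference for the exact usage.

from dotenv import load_dotenv
load_dotenv()

import asyncio
from gllm_inference.realtime_session import GoogleRealtimeSession

# Assumption: the audio streamers are importable from the same package and are
# passed via `input_streamers` / `output_streamers`; verify against the API Reference.
from gllm_inference.realtime_session import LinuxMicInputStreamer, LinuxSpeakerOutputStreamer

realtime_session = GoogleRealtimeSession(
    model_name="gemini-2.5-flash-native-audio-preview-12-2025",
    input_streamers=[LinuxMicInputStreamer()],        # audio captured from the microphone
    output_streamers=[LinuxSpeakerOutputStreamer()],  # audio played through the speakers
)
asyncio.run(realtime_session.start())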


The conversation starts:

Try speaking through your microphone and have fun conversing with the language models in realtime!

After you're done, try combining them with our default IO streamers and see what happens!

Tool Calling

Tool calling lets a language model invoke external functions and APIs during the conversation to help it solve a task, enabling dynamic computation, data retrieval, and complex workflows.

For more information about tool definitions, please refer to this guide.

Now, let's try adding tool calling capabilities to our GoogleRealtimeSession module!
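The sketch below uses a hypothetical get_weather function for illustration only. The tools parameter name and the accepted tool format are assumptions; refer to the tool definition guide and the API Reference for the supported way to register tools.

from dotenv import load_dotenv
load_dotenv()

import asyncio
from gllm_inference.realtime_session import GoogleRealtimeSession

# Hypothetical tool used for illustration only.
def get_weather(city: str) -> str:
    """Return a dummy weather report for the given city."""
    return f"The weather in {city} is sunny, around 25°C."

# Assumption: tools are registered via a `tools` parameter; verify the actual
# parameter name and tool format in the API Reference.
realtime_session = GoogleRealtimeSession(
    model_name="gemini-2.5-flash-native-audio-preview-12-2025",
    tools=[get_weather],
)
asyncio.run(realtime_session.start())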


The conversation starts:

Now try asking a question about the weather in your city!

Interaction Example:

Once again, you can type /quit to end the conversation.

Ending the conversation:

Integration with External System

Now that we've successfully tested the realtime session module locally, let's learn how to integrate it as part of a larger system!

To communicate with external systems, the realtime session modules rely on the following IO streamers:

  1. EventInputStreamer: Enables an external system to push RealtimeEvent objects as inputs to the realtime session module.

  2. EventOutputStreamer: Streams the realtime session module's output events through the event emitter, allowing the system to consume the outputs as standard events.

Let's try to simulate a simple integration with an external system using the GoogleRealtimeSession:
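The sketch below illustrates the idea. The import paths, constructor arguments, and the way inputs are pushed are assumptions made for illustration; check the API Reference for the actual interface.

from dotenv import load_dotenv
load_dotenv()

import asyncio
from gllm_inference.realtime_session import GoogleRealtimeSession

# Assumption: the event streamers are importable from the same package and are
# wired up via `input_streamers` / `output_streamers`; verify against the API Reference.
from gllm_inference.realtime_session import EventInputStreamer, EventOutputStreamer

async def main():
    input_streamer = EventInputStreamer()     # the external system pushes RealtimeEvent inputs here
    output_streamer = EventOutputStreamer()   # output events are streamed through the event emitter

    realtime_session = GoogleRealtimeSession(
        model_name="gemini-2.5-flash-native-audio-preview-12-2025",
        input_streamers=[input_streamer],
        output_streamers=[output_streamer],
    )

    # In a real integration, the external system would push RealtimeEvent
    # objects into the input streamer and consume the emitted output events
    # instead of a human typing or speaking.
    await realtime_session.start()

asyncio.run(main())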

The conversation starts:

Please note that in this example, you don't need to do anything, as we've already defined the inputs through the script. Simply observe and wait until the realtime session receives the termination activity event and ends the session.

Output example:

In this example, we simply print the events streamed by the event emitter regardless of their type, which causes the text and audio outputs to be mixed in the console. In an actual system, please handle each type of output event according to your requirements, as sketched below.
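As a sketch only: the type, text, and audio attributes below are assumptions about the event shape, and play_audio / log_event are hypothetical placeholders; adapt them to the actual RealtimeEvent structure and your own pipeline.

def play_audio(audio_bytes: bytes) -> None:
    """Placeholder: forward audio bytes to your playback or streaming pipeline."""

def log_event(event) -> None:
    """Placeholder: record any other event types (e.g. activity events)."""

def handle_output_event(event) -> None:
    # Route each output event type to its own destination instead of
    # printing everything to the console.
    if event.type == "text":
        print(event.text, end="", flush=True)   # e.g. render in a chat UI
    elif event.type == "audio":
        play_audio(event.audio)
    else:
        log_event(event)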

Future Plans

In the future, more IO streamers may be added to allow for a more robust realtime experience. These may include, but are not limited to:

  1. Input streamers

    1. FileInputStreamer

    2. ScreenCaptureInputStreamer

    3. CameraInputStreamer

    4. WindowsMicInputStreamer

    5. MacMicInputStreamer

  2. Output streamers

    1. FileOutputStreamer

    2. WindowsSpeakerOutputStreamer

    3. MacSpeakerOutputStreamer
