Stream LM Output

This guide will walk you through creating real-time streaming responses using LM Request Processor (LMRP) with event handling.

Streaming output with LMRP lets you receive AI responses in real time as tokens are generated, providing immediate feedback to users while retaining the full LMRP pipeline capabilities, including prompt formatting and processing.

For example, when asking about Tokyo travel recommendations, instead of waiting for the complete response, you can see each word appearing progressively: "Here" → "are" → "some" → "great" → "activities" → "in" → "Tokyo" → "..."

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

  2. A working OpenAI API key configured in your environment variables.

You should also be familiar with the core LMRP concepts and components introduced earlier in this documentation.

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-core

You can either:

  1. Refer to this guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options work; choose based on whether you prefer speed or learning by doing!

Project Setup

1. Environment Configuration

Ensure you have a file named .env in your project directory with the following content:

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

Replace <YOUR_OPENAI_API_KEY> with your actual OpenAI API key.
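
If your script loads the key itself rather than relying on the shell environment, a minimal sketch using python-dotenv (an assumption; any .env loader works) looks like this:

# Load environment variables from .env (assumes the python-dotenv package is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"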


Build Your Streaming LMRP System

1) Set Up Event Handling Components

The event system manages real-time token streaming from the language model:

1. Import Required Libraries

Start by importing the necessary dependencies for streaming:
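
A minimal import sketch is shown below. The gllm module paths are assumptions inferred from the component names used in this guide, so verify them against the full project code on GitHub:

import asyncio

# Assumed module paths -- verify against the project code on GitHub.
from gllm_core.event import EventEmitter
from gllm_core.event_handler import StreamEventHandler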

2. Create the Event System

Set up the streaming components that will handle real-time events:
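
A sketch of the wiring, using the class names from this guide; the EventEmitter constructor arguments are an assumption:

# The handler captures streamed tokens; the emitter distributes events to its handlers.
event_handler = StreamEventHandler()
event_emitter = EventEmitter([event_handler])  # constructor arguments are an assumption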

The StreamEventHandler captures streaming events, while EventEmitter manages event distribution to handlers.

2) Configure LMRP Components

The LMRP components work together to process prompts and generate streaming responses:

1. Set up LM Invoker and Prompt Builder

The LM invoker handles model communication while the prompt builder formats your templates consistently.
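
A sketch of the two components, assuming an OpenAI-backed invoker and a single-placeholder template; the import paths, class names (OpenAILMInvoker, PromptBuilder), and constructor arguments are assumptions, so check the project code for the exact API:

# Assumed imports and class names -- adjust to match the project code.
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_inference.prompt_builder import PromptBuilder

lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")  # hypothetical model name
prompt_builder = PromptBuilder(
    user_template="Give me some great activities to do in {city}.",  # {city} is filled at request time
)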

2. Create the LM Request Processor

This combines your prompt formatting and model invocation into a complete processing pipeline.
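
A sketch, assuming the processor is constructed directly from the two components above (the import path and constructor signature are assumptions):

# Assumed import path -- adjust to match the project code.
from gllm_inference.request_processor import LMRequestProcessor

# Combines prompt formatting and model invocation into a single pipeline.
lm_request_processor = LMRequestProcessor(
    prompt_builder=prompt_builder,
    lm_invoker=lm_invoker,
)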

3) Implement Concurrent Streaming

Concurrent execution lets you process the request and stream tokens simultaneously:

1. Create the Processing Task

Use asyncio.create_task() to run the processor concurrently with streaming:
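
A sketch, assuming the processor exposes an async process() method that takes the prompt variables and an event_emitter keyword argument (both the method name and its arguments are assumptions):

# Run the request concurrently so tokens can be consumed while it executes.
task = asyncio.create_task(
    lm_request_processor.process(
        {"city": "Tokyo"},            # prompt variables for the template (hypothetical)
        event_emitter=event_emitter,  # enables token streaming
    )
)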

The event_emitter parameter enables streaming - without it, you'd only get the final response.

2. Process Streaming Events

Iterate through streaming events as they arrive:
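
A sketch of the consumption loop; the stream() accessor on the handler is a hypothetical name, while the event fields (value, level, type, timestamp) follow the descriptions later in this guide:

# Consume events as they arrive; run this inside the same async function as the task above.
async for event in event_handler.stream():  # hypothetical accessor for the event stream
    if event.type == "response":             # only generated content
        print(event.value, end="", flush=True)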

Each event contains token information with metadata like timestamp and content type.

3. Clean Up Resources

Ensure proper cleanup after streaming completes:
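
A sketch, assuming the emitter exposes an async close() method (an assumption):

# Wait for the processor to finish and release the event system.
result = await task          # the final, complete response
await event_emitter.close()  # hypothetical cleanup call -- check the real API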

This waits for the processor to finish and properly closes the event system.

📂 Complete Guide Files

Here's the full implementation that brings everything together:
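
The authoritative version lives in the GitHub project linked above; the condensed sketch below simply assembles the steps from this guide, with module paths, class names, and method signatures remaining assumptions:

import asyncio

from dotenv import load_dotenv

# Assumed module paths and class names -- verify against the project code.
from gllm_core.event import EventEmitter
from gllm_core.event_handler import StreamEventHandler
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_inference.prompt_builder import PromptBuilder
from gllm_inference.request_processor import LMRequestProcessor


async def main():
    load_dotenv()  # makes OPENAI_API_KEY from .env available

    # Event system: the handler captures tokens, the emitter distributes events.
    event_handler = StreamEventHandler()
    event_emitter = EventEmitter([event_handler])

    # LMRP components: prompt formatting plus model invocation.
    prompt_builder = PromptBuilder(user_template="Give me some great activities to do in {city}.")
    lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")
    processor = LMRequestProcessor(prompt_builder=prompt_builder, lm_invoker=lm_invoker)

    # Run the request concurrently and stream tokens as they arrive.
    task = asyncio.create_task(processor.process({"city": "Tokyo"}, event_emitter=event_emitter))
    async for event in event_handler.stream():
        if event.type == "response":
            print(event.value, end="", flush=True)

    # Clean up: wait for the final response and close the event system.
    await task
    await event_emitter.close()


if __name__ == "__main__":
    asyncio.run(main())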

Run the Streaming Example

1. Execute the Script

Run your streaming LMRP script:
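
Assuming you saved the implementation as stream_lm_output.py (a hypothetical filename), run it from your project directory:

python stream_lm_output.py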

2. Observe Real-time Output

When you run the streaming example, you'll see tokens printed in real time as the model generates them.

Each token includes:

  • value: The token or text fragment

  • level: Log level ('INFO' for response tokens)

  • type: Event type ('response' for generated content)

  • timestamp: Precise generation timestamp

Tips

Alternative Implementation Patterns

Pattern 1: Real-time Display
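
A sketch, under the same assumptions as above, that prints each token as soon as it arrives:

async for event in event_handler.stream():
    if event.type == "response":
        print(event.value, end="", flush=True)  # flush so tokens appear immediately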

Pattern 2: Collecting Full Response
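
A sketch that accumulates the streamed tokens into a single string before using it:

full_response = ""
async for event in event_handler.stream():
    if event.type == "response":
        full_response += event.value  # build up the complete text
print(full_response)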

Pattern 3: Conditional Processing
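
A sketch that filters events by type and level before handing tokens to downstream logic (handle_token is a hypothetical callback):

async for event in event_handler.stream():
    if event.type != "response" or event.level != "INFO":
        continue  # ignore non-response or non-INFO events
    handle_token(event.value)  # hypothetical downstream handler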

When to Use Streaming vs Standard Processing

Use Streaming When:

  • User experience is a priority (immediate feedback)

  • Generating long responses (articles, explanations)

  • Building interactive applications

  • You need to process partial responses

Use Standard Processing When:

  • Simple, quick responses

  • Batch processing scenarios

  • The final response structure is needed before proceeding

  • Processing structured outputs that require complete validation

