Stream LM Output

This guide will walk you through creating real-time streaming responses using LM Request Processor (LMRP) with event handling.

Streaming output with LMRP lets you receive AI responses in real time as tokens are generated, providing immediate feedback to users while retaining the full LMRP pipeline capabilities, including prompt formatting and processing.

For example, when asking about Tokyo travel recommendations, instead of waiting for the complete response, you can see each word appearing progressively: "Here" → "are" → "some" → "great" → "activities" → "in" → "Tokyo" → "..."

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

  2. A working OpenAI API key configured in your environment variables.

You should also be familiar with the core LMRP concepts and components introduced earlier in this documentation.

View full project code on GitHub

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-core

You can either:

  1. Refer to this guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options work; choose based on whether you prefer speed or learning by doing!

Project Setup

1. Environment Configuration

Ensure you have a file named .env in your project directory with the following content:

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

Replace <YOUR_OPENAI_API_KEY> with your actual OpenAI API key.
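
If your script loads the key itself rather than relying on the shell environment, a minimal sketch using python-dotenv (an assumption; any .env loader works) looks like this:

# Load environment variables from .env (assumes the python-dotenv package is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"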


Build Your Streaming LMRP System

1) Set Up Event Handling Components

The event system manages real-time token streaming from the language model:

1. Import Required Libraries

Start by importing the necessary dependencies for streaming:
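
A minimal import sketch is shown below. The gllm module paths are assumptions inferred from the component names used in this guide, so verify them against the full project code on GitHub:

import asyncio

# Assumed module paths -- verify against the project code on GitHub.
from gllm_core.event import EventEmitter
from gllm_core.event_handler import StreamEventHandler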

2. Create the Event System

Set up the streaming components that will handle real-time events:
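
A sketch of the wiring, using the class names from this guide; the EventEmitter constructor arguments are an assumption:

# The handler captures streamed tokens; the emitter distributes events to its handlers.
event_handler = StreamEventHandler()
event_emitter = EventEmitter([event_handler])  # constructor arguments are an assumption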

The StreamEventHandler captures streaming events, while EventEmitter manages event distribution to handlers.

2) Configure LMRP Components

The LMRP components work together to process prompts and generate streaming responses:

1. Set up LM Invoker and Prompt Builder

The LM invoker handles model communication while the prompt builder formats your templates consistently.
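
A sketch of the two components, assuming an OpenAI-backed invoker and a single-placeholder template; the import paths, class names (OpenAILMInvoker, PromptBuilder), and constructor arguments are assumptions, so check the project code for the exact API:

# Assumed imports and class names -- adjust to match the project code.
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_inference.prompt_builder import PromptBuilder

lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")  # hypothetical model name
prompt_builder = PromptBuilder(
    user_template="Give me some great activities to do in {city}.",  # {city} is filled at request time
)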

2. Create the LM Request Processor

This combines your prompt formatting and model invocation into a complete processing pipeline.
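
A sketch, assuming the processor is constructed directly from the two components above (the import path and constructor signature are assumptions):

# Assumed import path -- adjust to match the project code.
from gllm_inference.request_processor import LMRequestProcessor

# Combines prompt formatting and model invocation into a single pipeline.
lm_request_processor = LMRequestProcessor(
    prompt_builder=prompt_builder,
    lm_invoker=lm_invoker,
)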

3) Implement Concurrent Streaming

Concurrent execution lets you process the request and stream tokens simultaneously:

1. Create the Processing Task

Use asyncio.create_task() to run the processor concurrently with streaming:
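
A sketch, assuming the processor exposes an async process() method that takes the prompt variables and an event_emitter keyword argument (both the method name and its arguments are assumptions):

# Run the request concurrently so tokens can be consumed while it executes.
task = asyncio.create_task(
    lm_request_processor.process(
        {"city": "Tokyo"},            # prompt variables for the template (hypothetical)
        event_emitter=event_emitter,  # enables token streaming
    )
)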

The event_emitter parameter enables streaming - without it, you'd only get the final response.

2. Process Streaming Events

Iterate through streaming events as they arrive:
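
A sketch of the consumption loop; the stream() accessor on the handler is a hypothetical name, while the event fields (value, level, type, timestamp) follow the descriptions later in this guide:

# Consume events as they arrive; run this inside the same async function as the task above.
async for event in event_handler.stream():  # hypothetical accessor for the event stream
    if event.type == "response":             # only generated content
        print(event.value, end="", flush=True)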

Each event contains token information with metadata like timestamp and content type.

3. Clean Up Resources

Ensure proper cleanup after streaming completes:
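
A sketch, assuming the emitter exposes an async close() method (an assumption):

# Wait for the processor to finish and release the event system.
result = await task          # the final, complete response
await event_emitter.close()  # hypothetical cleanup call -- check the real API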

This waits for the processor to finish and properly closes the event system.

📂 Complete Guide Files

Here's the full implementation that brings everything together:
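
The authoritative version lives in the GitHub project linked above; the condensed sketch below simply assembles the steps from this guide, with module paths, class names, and method signatures remaining assumptions:

import asyncio

from dotenv import load_dotenv

# Assumed module paths and class names -- verify against the project code.
from gllm_core.event import EventEmitter
from gllm_core.event_handler import StreamEventHandler
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_inference.prompt_builder import PromptBuilder
from gllm_inference.request_processor import LMRequestProcessor


async def main():
    load_dotenv()  # makes OPENAI_API_KEY from .env available

    # Event system: the handler captures tokens, the emitter distributes events.
    event_handler = StreamEventHandler()
    event_emitter = EventEmitter([event_handler])

    # LMRP components: prompt formatting plus model invocation.
    prompt_builder = PromptBuilder(user_template="Give me some great activities to do in {city}.")
    lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")
    processor = LMRequestProcessor(prompt_builder=prompt_builder, lm_invoker=lm_invoker)

    # Run the request concurrently and stream tokens as they arrive.
    task = asyncio.create_task(processor.process({"city": "Tokyo"}, event_emitter=event_emitter))
    async for event in event_handler.stream():
        if event.type == "response":
            print(event.value, end="", flush=True)

    # Clean up: wait for the final response and close the event system.
    await task
    await event_emitter.close()


if __name__ == "__main__":
    asyncio.run(main())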

Run the Streaming Example

1. Execute the Script

Run your streaming LMRP script:
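
Assuming you saved the implementation as stream_lm_output.py (a hypothetical filename), run it from your project directory:

python stream_lm_output.py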

2. Observe Real-time Output

When you run the streaming example, you'll see tokens printed in real time as the model generates them.

Each token includes:

  • value: The token or text fragment

  • level: Log level ('INFO' for response tokens)

  • type: Event type ('response' for generated content)

  • timestamp: Precise generation timestamp

Tips

Alternative Implementation Patterns

Pattern 1: Real-time Display
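
A sketch, under the same assumptions as above, that prints each token as soon as it arrives:

async for event in event_handler.stream():
    if event.type == "response":
        print(event.value, end="", flush=True)  # flush so tokens appear immediately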

Pattern 2: Collecting Full Response
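
A sketch that accumulates the streamed tokens into a single string before using it:

full_response = ""
async for event in event_handler.stream():
    if event.type == "response":
        full_response += event.value  # build up the complete text
print(full_response)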

Pattern 3: Conditional Processing
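
A sketch that filters events by type and level before handing tokens to downstream logic (handle_token is a hypothetical callback):

async for event in event_handler.stream():
    if event.type != "response" or event.level != "INFO":
        continue  # ignore non-response or non-INFO events
    handle_token(event.value)  # hypothetical downstream handler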

When to Use Streaming vs Standard Processing

Use Streaming When:

  • User experience is a priority (immediate feedback)

  • Generating long responses (articles, explanations)

  • Building interactive applications

  • You need to process partial responses

Use Standard Processing When:

  • Simple, quick responses

  • Batch processing scenarios

  • The final response structure is needed before proceeding

  • Processing structured outputs that require complete validation

