Stream LM Output
This guide walks you through creating real-time streaming responses with the LM Request Processor (LMRP) and its event handling.
Streaming output with LMRP lets you receive AI responses in real time as tokens are generated, giving users immediate feedback while retaining LMRP's full pipeline capabilities, including prompt formatting and processing.
For example, when asking about Tokyo travel recommendations, instead of waiting for the complete response, you can see each word appearing progressively: "Here" → "are" → "some" → "great" → "activities" → "in" → "Tokyo" → "..."
Installation
macOS / Linux

```bash
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-core
```

Windows (Command Prompt)

```bat
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-core
```

You can either:
You can refer to the guide whenever you need explanation or want to clarify how each part works.
Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.
Both options will work—choose based on whether you prefer speed or learning by doing!
Project Setup
Environment Configuration
Ensure you have a file named .env in your project directory with the following content:
```
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
```

Build Your Streaming LMRP System
1) Set Up Event Handling Components
The event system manages real-time token streaming from the language model:
Import Required Libraries
Start by importing the necessary dependencies for streaming:
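A minimal import sketch is shown below. The module paths and class names (EventEmitter, OpenAILMInvoker, PromptBuilder, LMRequestProcessor) are assumptions based on this guide's terminology; check your installed gllm-core and gllm-inference versions for the exact locations.

```python
# Assumed module paths and class names -- verify against your SDK version.
import asyncio

from gllm_core.event import EventEmitter                          # streaming event system (assumed)
from gllm_inference.lm_invoker import OpenAILMInvoker             # model communication (assumed)
from gllm_inference.prompt_builder import PromptBuilder           # prompt templating (assumed)
from gllm_inference.request_processor import LMRequestProcessor   # processing pipeline (assumed)
```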
Create the Event System
Set up the streaming components that will handle real-time events:
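Here is one way the event plumbing might look, assuming EventEmitter accepts a list of handlers that receive each emitted event. The queue-based handler below is a hypothetical helper so the main coroutine can consume tokens as they arrive; your SDK may ship its own streaming handler instead.

```python
# Hypothetical queue-based event handler -- the EventEmitter(handlers=...)
# signature is an assumption; adapt to the handlers your SDK provides.
event_queue: asyncio.Queue = asyncio.Queue()


class QueueEventHandler:
    """Forwards every emitted event into an asyncio queue for later consumption."""

    def __init__(self, queue: asyncio.Queue):
        self.queue = queue

    async def handle(self, event):
        await self.queue.put(event)


event_emitter = EventEmitter(handlers=[QueueEventHandler(event_queue)])
```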
2) Configure LMRP Components
The LMRP components work together to process prompts and generate streaming responses:
Set up LM Invoker and Prompt Builder
The LM invoker handles model communication while the prompt builder formats your templates consistently.
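A sketch of this step follows. The constructor names and arguments (model_name, system_template, user_template) are assumptions, and the Tokyo travel prompt simply mirrors the example at the top of this guide.

```python
# Assumed constructors -- adjust class names and parameters to your SDK version.
lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")  # reads OPENAI_API_KEY from the environment

prompt_builder = PromptBuilder(
    system_template="You are a helpful travel assistant.",
    user_template="Give me some great activities to do in {city}.",
)
```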
Create the LM Request Processor
This combines your prompt formatting and model invocation into a complete processing pipeline.
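Combining the two components might look like this; the keyword arguments are assumptions.

```python
# The request processor chains prompt formatting and model invocation.
lm_request_processor = LMRequestProcessor(
    prompt_builder=prompt_builder,
    lm_invoker=lm_invoker,
)
```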
3) Implement Concurrent Streaming
The concurrent execution allows you to process the request and stream tokens simultaneously:
Create the Processing Task
Use asyncio.create_task() to run the processor concurrently with streaming:
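A sketch of the concurrent kickoff, assuming the processor exposes an async process() method that accepts the prompt variables and the event emitter:

```python
# Start processing in the background so tokens can be consumed while the
# model is still generating. The process() signature is an assumption.
task = asyncio.create_task(
    lm_request_processor.process(
        prompt_kwargs={"city": "Tokyo"},
        event_emitter=event_emitter,
    )
)
```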
Process Streaming Events
Iterate through streaming events as they arrive:
Each event contains token information with metadata like timestamp and content type.
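With the queue-based handler from step 1, consuming events might look like the sketch below. The field names (type, value) follow the event fields described later in this guide; the exact attribute access may differ.

```python
# Drain events until the background task is done and the queue is empty.
while not (task.done() and event_queue.empty()):
    try:
        event = await asyncio.wait_for(event_queue.get(), timeout=0.1)
    except asyncio.TimeoutError:
        continue  # no event yet; re-check whether the task has finished
    if event.type == "response":
        print(event.value, end="", flush=True)  # show each token as it arrives
```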
Clean Up Resources
Ensure proper cleanup after streaming completes:
This waits for the processor to finish and properly closes the event system.
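A possible cleanup step, assuming the emitter exposes an async close() method (your SDK may use a different teardown hook or handle this automatically):

```python
# Wait for the processing task to finish, then release event-system resources.
result = await task          # final, fully assembled response (shape depends on the SDK)
await event_emitter.close()  # assumed teardown hook
```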
📂 Complete Guide Files
Here's the full implementation that brings everything together:
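Below is a consolidated sketch of the pieces above, saved as stream_lm_output.py (an illustrative filename). Every class name, constructor argument, and the process()/close() calls are assumptions based on this guide's terminology; treat it as a starting point rather than the exact SDK API. It also assumes python-dotenv is available for loading the .env file.

```python
"""stream_lm_output.py -- consolidated sketch; API names are assumptions."""

import asyncio

from dotenv import load_dotenv
from gllm_core.event import EventEmitter
from gllm_inference.lm_invoker import OpenAILMInvoker
from gllm_inference.prompt_builder import PromptBuilder
from gllm_inference.request_processor import LMRequestProcessor


class QueueEventHandler:
    """Hypothetical handler that forwards every emitted event into an asyncio queue."""

    def __init__(self, queue: asyncio.Queue):
        self.queue = queue

    async def handle(self, event):
        await self.queue.put(event)


async def main():
    load_dotenv()  # load OPENAI_API_KEY from .env

    # 1) Event handling components
    event_queue: asyncio.Queue = asyncio.Queue()
    event_emitter = EventEmitter(handlers=[QueueEventHandler(event_queue)])

    # 2) LMRP components
    lm_invoker = OpenAILMInvoker(model_name="gpt-4o-mini")
    prompt_builder = PromptBuilder(
        system_template="You are a helpful travel assistant.",
        user_template="Give me some great activities to do in {city}.",
    )
    processor = LMRequestProcessor(prompt_builder=prompt_builder, lm_invoker=lm_invoker)

    # 3) Concurrent streaming
    task = asyncio.create_task(
        processor.process(prompt_kwargs={"city": "Tokyo"}, event_emitter=event_emitter)
    )

    while not (task.done() and event_queue.empty()):
        try:
            event = await asyncio.wait_for(event_queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue
        if event.type == "response":
            print(event.value, end="", flush=True)

    await task                   # ensure the processor finished cleanly
    await event_emitter.close()  # assumed teardown hook
    print()                      # final newline after the streamed text


if __name__ == "__main__":
    asyncio.run(main())
```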
Run the Streaming Example
Execute the Script
Run your streaming LMRP script:
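Assuming you saved the consolidated sketch above as stream_lm_output.py (the filename is illustrative):

```bash
python stream_lm_output.py
```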
Observe Real-time Output
When running the streaming example, you'll see real-time token output:
Each streamed event includes the following fields (an illustrative example follows the list):
value: The token or text fragment
level: Log level ('INFO' for response tokens)
type: Event type ('response' for generated content)
timestamp: Precise generation timestamp
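For illustration only (not output captured from a real run), a single streamed event carrying these fields might look roughly like this:

```python
# Purely illustrative -- the real event class, level values, and timestamp
# format may differ from this shape.
{
    "value": "Tokyo",
    "level": "INFO",
    "type": "response",
    "timestamp": "2025-01-01T09:00:00.123456",
}
```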
Tips
Alternative Implementation Patterns
Pattern 1: Real-time Display
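A sketch of real-time display, reusing the queue-based event plumbing assumed earlier (the type and value field names are assumptions):

```python
import asyncio


async def display_realtime(event_queue: asyncio.Queue, task: asyncio.Task) -> None:
    """Print each response token to the console the moment it arrives."""
    while not (task.done() and event_queue.empty()):
        try:
            event = await asyncio.wait_for(event_queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue
        if event.type == "response":
            print(event.value, end="", flush=True)
```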
Pattern 2: Collecting Full Response
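A sketch of collecting the full response instead of printing it, under the same assumptions:

```python
import asyncio


async def collect_response(event_queue: asyncio.Queue, task: asyncio.Task) -> str:
    """Accumulate streamed tokens and return the assembled text once streaming ends."""
    chunks: list[str] = []
    while not (task.done() and event_queue.empty()):
        try:
            event = await asyncio.wait_for(event_queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue
        if event.type == "response":
            chunks.append(event.value)
    return "".join(chunks)
```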
Pattern 3: Conditional Processing
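A sketch of conditional processing, where you react to specific content while streaming; the keyword check is purely illustrative:

```python
import asyncio


async def process_conditionally(event_queue: asyncio.Queue, task: asyncio.Task) -> None:
    """Echo response tokens, flagging any token that contains a keyword of interest."""
    while not (task.done() and event_queue.empty()):
        try:
            event = await asyncio.wait_for(event_queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue
        if event.type != "response":
            continue  # ignore log/metadata events
        if "Tokyo" in event.value:
            print("\n[keyword matched]", flush=True)
        print(event.value, end="", flush=True)
```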
When to Use Streaming vs Standard Processing
Use Streaming When:
User experience is priority (immediate feedback)
Generating long responses (articles, explanations)
Building interactive applications
Need to process partial responses
Use Standard Processing When:
Simple, quick responses
Batch processing scenarios
When final response structure is needed before proceeding
Processing structured outputs that require complete validation