# Audio Interface
Add audio input/output to AIP agents using one interface that supports multiple implementations.
- One interface: `create_audio_session(...)` in `glaip-sdk`
- Many implementations: provider-specific session backends under the same API
- Current implementation: `livekit` (available now)
- Use-case docs: meeting-specific integrations are documented separately
{% hint style="info" %} Audio interface is beta and local-only. You must run the LiveKit server and client yourself. The CLI does not expose audio sessions yet. This page documents a design preview; APIs and behavior may change before release. {% endhint %}
## Interface First (Provider-Agnostic)
The entrypoint stays the same across providers: `create_audio_session(...)`. Implementation selection is explicit: pass `implementation="..."`. `config["provider"]` is a compatibility fallback when `implementation` is omitted. If both are omitted, session creation raises `ValueError`.
### Hypothetical provider example
This shows the interface shape before choosing a concrete transport:
```python
import asyncio

from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    session = agent.create_audio_session(
        implementation="my-provider",
        config={
            "io": {"input_enabled": True, "output_enabled": True},
            "my_provider": {"endpoint": "...", "token": "..."},
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```

## SDK Usage (Minimum)
Use this as the smallest working snippet in glaip-sdk:
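A minimal sketch of that snippet, assuming the `livekit` implementation; the server URL is a local-development placeholder, and credentials use the `${ENV_VAR}` reference style described later on this page:

```python
import asyncio

from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    # Explicit implementation selection; provider settings live under config["livekit"].
    session = agent.create_audio_session(
        implementation="livekit",
        config={
            "livekit": {
                "url": "ws://localhost:7880",  # placeholder local LiveKit server
                "api_key": "${LIVEKIT_API_KEY}",
                "api_secret": "${LIVEKIT_API_SECRET}",
            },
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```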
This call is intentionally explicit. The following fails because no implementation is provided:
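A sketch of the failing call: neither `implementation="..."` nor the `config["provider"]` fallback is given, so session creation raises `ValueError` per the interface contract above:

```python
from glaip_sdk import Agent

agent = Agent(name="my-agent", instruction="You are a helpful assistant.")

# Neither implementation= nor config["provider"] is set, so this raises ValueError.
session = agent.create_audio_session(config={"io": {"input_enabled": True}})
```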
### Custom implementation wiring (same interface)
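A sketch of registering a custom backend via `register_audio_session_implementation("name", factory)` (the extension hook noted later on this page); the import path and the factory signature shown here are assumptions, not confirmed API:

```python
from glaip_sdk import Agent
from glaip_sdk.audio import register_audio_session_implementation  # import path is an assumption


def my_provider_factory(agent, config):
    """Build and return an audio session for this agent.

    The returned object is assumed to expose the same run/start/stop
    lifecycle as the built-in session implementations.
    """
    ...


# Register once at startup; afterwards the standard interface selects it by name.
register_audio_session_implementation("my-provider", my_provider_factory)

agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
session = agent.create_audio_session(implementation="my-provider", config={})
```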
## Architecture Overview (Interface + Providers)
This diagram stays interface-level (provider architecture), not product-specific meeting workflows.
## How AIP Uses the Audio Interface (General)
For any provider, the runtime flow in AIP is:

1. Select the implementation with `implementation="..."`.
2. Pass provider-specific options under `config`.
3. Run the returned session via `await session.run()`.
`config["provider"]` is optional and mainly useful for compatibility paths that cannot pass `implementation` directly. When `implementation` is explicit, `config["provider"]` is redundant.
## Current AIP Implementation: LiveKit
LiveKit is the implementation available now in AIP. The section below covers LiveKit-specific knobs and precedence behavior.
### LiveKit Customization Surface
This table lists the configurable fields for the current AIP audio implementation (`implementation="livekit"`).
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `io.input_enabled` | No | `True` | Enable/disable microphone input processing. |
| `io.output_enabled` | No | `True` | Enable/disable spoken audio output. |
| `io.input_device` | No | `None` | Select a specific input device name/id. |
| `io.output_device` | No | `None` | Select a specific output device name/id. |
| `livekit.url` | Yes | (none) | LiveKit server URL used by the session. |
| `livekit.api_key` | Yes* | `LIVEKIT_API_KEY` env | LiveKit API key for token generation. |
| `livekit.api_secret` | Yes* | `LIVEKIT_API_SECRET` env | LiveKit API secret for token generation. |
| `livekit.room_name` | No | auto-generated room name | Target room for the audio session. |
| `livekit.identity` | No | `aip-agent` | Identity used by the agent participant in the room. |
| `livekit.openai_stt_model` | No | `gpt-4o-transcribe` | STT model for the OpenAI speech recognition plugin. |
| `livekit.openai_tts_model` | No | `gpt-4o-mini-tts` | TTS model for the OpenAI speech synthesis plugin. |
| `livekit.openai_voice` | No | `echo` | Voice preset for OpenAI TTS. |
| `livekit.openai_use_realtime_stt` | No | `True` | Toggle realtime STT mode in the OpenAI plugin. |
| `livekit.openai_api_key` | No | `OPENAI_API_KEY` env | Override API key for OpenAI STT/TTS plugin calls. |
| `livekit.openai_base_url` | No | provider default | Override the OpenAI base URL for plugin calls. |
Note: fields marked Yes* are required at runtime but can come from either `config` or an environment variable.
Optional fallback keys:

| Key | When it applies | Effect |
| --- | --- | --- |
| `model.provider` | Deciding whether OpenAI STT/TTS should be enabled | `"openai"` activates the OpenAI speech path even without `livekit.openai_*` fields. |
| `model.model` | `livekit.openai_tts_model` is not set | Used as the fallback TTS model value. |
| `model.voice` | `livekit.openai_voice` is not set | Used as the fallback voice value. |
Custom provider extension remains available via `register_audio_session_implementation("name", factory)`.
Meeting-specific actor integrations (for example Google Meet + Meemo + Attendee stream bridge) are documented in a separate use-case page.
## Key Terms
- Audio session: Runtime object returned by `create_audio_session(...)` that manages the start/stop/wait lifecycle for voice interaction.
- Provider / implementation: Transport backend selected by `implementation="..."` (for example `"livekit"`).
- LiveKit provider: Current AIP audio implementation using the LiveKit Python SDK and the LiveKit Agents runtime.
- AIP runtime: Agent reasoning and tool-calling execution path that handles transcript input and generates reply text.
## LiveKit Example (Explicit Selection)
Use explicit implementation selection and keep provider-specific settings under `config["livekit"]`:
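A hedged sketch of such a call; the server URL is a local-development placeholder, and the OpenAI model fields repeat the documented defaults so `config["model"]` can be omitted entirely:

```python
import asyncio

from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    session = agent.create_audio_session(
        implementation="livekit",
        config={
            "io": {"input_enabled": True, "output_enabled": True},
            "livekit": {
                "url": "ws://localhost:7880",  # placeholder local LiveKit server
                "api_key": "${LIVEKIT_API_KEY}",
                "api_secret": "${LIVEKIT_API_SECRET}",
                # Provider-specific STT/TTS selection; no config["model"] needed.
                "openai_stt_model": "gpt-4o-transcribe",
                "openai_tts_model": "gpt-4o-mini-tts",
                "openai_voice": "echo",
            },
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```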
For the LiveKit implementation, STT/TTS model selection is configured under `config["livekit"]` (`openai_stt_model`, `openai_tts_model`, `openai_voice`). `config["model"]` is optional and used only as a fallback when provider-specific fields are omitted.
## Model Config Precedence (`model` vs `livekit`)

`config["model"]` is not required for `implementation="livekit"`. OpenAI STT/TTS wiring activates when either:

- `config["model"]["provider"] == "openai"`, or
- any `config["livekit"]["openai_*"]` field is set.

Precedence for TTS model and voice:

- TTS model: `livekit.openai_tts_model` -> `model.model` -> default `gpt-4o-mini-tts`
- Voice: `livekit.openai_voice` -> `model.voice` -> default `echo`
Practical recommendation: if you already set `openai_stt_model`, `openai_tts_model`, and `openai_voice` under `livekit`, you can omit `config["model"]` to avoid duplication.
Use `${ENV_VAR}` references for secrets in runnable config (supported by `AudioSessionConfig` parsing). Use `<SECRET>` placeholders only in non-runnable documentation examples.
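For example, a config fragment using environment references (a sketch; key names follow the customization table above, and the URL is a placeholder):

```python
# Secrets stay out of the file; ${ENV_VAR} references are resolved
# from the environment when the config is parsed (per AudioSessionConfig).
livekit_config = {
    "livekit": {
        "url": "ws://localhost:7880",           # placeholder local server URL
        "api_key": "${LIVEKIT_API_KEY}",        # resolved from the environment
        "api_secret": "${LIVEKIT_API_SECRET}",  # resolved from the environment
    },
}
```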
LiveKit prerequisites, setup commands, and local test flow are currently captured in the repository's LiveKit local development runbook.
## Provider Model
The audio interface is provider-agnostic. Use `implementation="..."` to pick the backend (for example `"livekit"` today). Provider-specific settings are passed via `config`.
Current AIP implementation support: LiveKit `AgentSession`-based local audio sessions.
## Turn Sequence (Audio -> STT -> AIP -> TTS)
The AIP turn logic is consistent across providers. Transport and streaming APIs change by provider.
## Tool Call Visibility
Tool calls are handled by the underlying agent runtime (e.g. LangGraph) the same way they are for text-only runs.
For the demo workflow in this repo:
- run with `AIP_AUDIO_DEBUG=1` to print transcripts and final replies
- use the agent's standard streaming/logging to observe tool events
## Configuration Tips
- Audio input/output: Set `input_enabled` or `output_enabled` to `False` to run input-only or output-only sessions.
- Devices: Supply `input_device` or `output_device` when multiple audio devices are present.
- STT/TTS: Provider-specific. LiveKit handles audio transport; transcription and synthesis live in the LiveKit worker/agent. Providers that expose model selection use `AudioModelConfig` (see the GL SDK realtime session tutorial).
- Provider config: `LiveKitConfig` expects the server URL, `api_key`, and `api_secret`; `room_name` and `identity` are optional.
## Limitations
- Local-only; no AIP-hosted audio service yet.
- `livekit` is the currently documented AIP provider implementation.
- CLI support is intentionally deferred.
## Troubleshooting
| Error | Likely cause | Fix |
| --- | --- | --- |
| `AudioSessionUnavailableError` | LiveKit deps are missing | Install published extras: `pip install "glaip-sdk[audio]"`. Monorepo contributors can run `make -C python/aip-agents install-audio`. |
| `AudioConfigError` | URL/API key/secret missing | Check `LiveKitConfig` and env vars `LIVEKIT_API_KEY` / `LIVEKIT_API_SECRET`. |
| No audio / device error | Device not available | Disable audio output or set `input_device`/`output_device`. |
## Related Documentation
- Agents guide — manage agent configs and runtime overrides.
- Tools guide — inspect tool definitions and outputs.
- Security & privacy — handle credentials and sensitive data.
## External References