Audio Interface

Add audio input/output to AIP agents using one interface that supports multiple implementations.

  • One interface: create_audio_session(...) in glaip-sdk

  • Many implementations: provider-specific session backends under the same API

  • Current implementation: livekit (available now)

  • Use-case docs: meeting-specific integrations are documented separately

{% hint style="info" %} Audio interface is beta and local-only. You must run the LiveKit server and client yourself. The CLI does not expose audio sessions yet. This page documents a design preview; APIs and behavior may change before release. {% endhint %}

Interface First (Provider-Agnostic)

The entrypoint stays the same across providers: create_audio_session(...). Implementation selection is explicit: pass implementation="...". config["provider"] is a compatibility fallback when implementation is omitted. If both are omitted, session creation raises ValueError.

Hypothetical provider example

This shows the interface shape before choosing a concrete transport:

```python
import asyncio
from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    session = agent.create_audio_session(
        implementation="my-provider",
        config={
            "io": {"input_enabled": True, "output_enabled": True},
            "my_provider": {"endpoint": "...", "token": "..."},
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```

SDK Usage (Minimum)

Use this as the smallest working snippet in glaip-sdk:
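A minimal sketch, assuming the `livekit` implementation and a locally running LiveKit server; the URL is a placeholder, and the secrets use `${ENV_VAR}` references resolved by `AudioSessionConfig`:

```python
import asyncio
from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    session = agent.create_audio_session(
        implementation="livekit",
        config={
            "livekit": {
                "url": "ws://localhost:7880",          # local-dev placeholder
                "api_key": "${LIVEKIT_API_KEY}",       # resolved from the environment
                "api_secret": "${LIVEKIT_API_SECRET}",
            },
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```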

Implementation selection is intentionally explicit. The following fails because neither implementation nor config["provider"] is provided:
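For example (a sketch; the agent construction mirrors the snippets above):

```python
from glaip_sdk import Agent

agent = Agent(name="my-agent", instruction="You are a helpful assistant.")

try:
    # Neither `implementation` nor config["provider"] is given,
    # so session creation raises ValueError.
    agent.create_audio_session(config={"io": {"input_enabled": True}})
except ValueError as exc:
    print(f"session creation failed: {exc}")
```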

Custom implementation wiring (same interface)
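A sketch of wiring a custom backend through the same interface via register_audio_session_implementation; the import path and the factory signature shown here are assumptions:

```python
from glaip_sdk import Agent
# Import path is an assumption; check your SDK version for the actual location.
from glaip_sdk.audio import register_audio_session_implementation


def my_provider_factory(agent, config):
    """Hypothetical factory: build and return an audio session object."""
    raise NotImplementedError


# Register the backend under a name, then select it like any other provider.
register_audio_session_implementation("my-provider", my_provider_factory)

agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
session = agent.create_audio_session(implementation="my-provider", config={})
```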

Architecture Overview (Interface + Providers)

This diagram stays interface-level (provider architecture), not product-specific meeting workflows.

(Diagram: the create_audio_session interface dispatching to provider implementations such as livekit.)

How AIP Uses the Audio Interface (General)

For any provider, the runtime flow in AIP is:

  1. Select implementation with implementation="...".

  2. Pass provider-specific options under config.

  3. Run the returned session via await session.run().

config["provider"] is optional and mainly useful for compatibility paths that cannot pass implementation directly. When implementation is explicit, config["provider"] is redundant.
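The two selection paths are equivalent in effect. A sketch (the URL is a local-dev placeholder; credentials fall back to environment variables):

```python
from glaip_sdk import Agent

agent = Agent(name="my-agent", instruction="You are a helpful assistant.")

# Explicit selection (preferred):
session = agent.create_audio_session(
    implementation="livekit",
    config={"livekit": {"url": "ws://localhost:7880"}},
)

# Compatibility fallback when `implementation` cannot be passed directly:
session = agent.create_audio_session(
    config={"provider": "livekit", "livekit": {"url": "ws://localhost:7880"}},
)
```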

Current AIP Implementation: LiveKit

LiveKit is the implementation available now in AIP. The section below covers LiveKit-specific knobs and precedence behavior.

LiveKit Customization Surface

This table lists the configurable fields for the current AIP audio implementation (implementation="livekit").

| Config key | Required | Default / fallback | What it customizes |
| --- | --- | --- | --- |
| `io.input_enabled` | No | `True` | Enable/disable microphone input processing. |
| `io.output_enabled` | No | `True` | Enable/disable spoken audio output. |
| `io.input_device` | No | `None` | Select a specific input device name/id. |
| `io.output_device` | No | `None` | Select a specific output device name/id. |
| `livekit.url` | Yes | none | LiveKit server URL used by the session. |
| `livekit.api_key` | Yes* | `LIVEKIT_API_KEY` env | LiveKit API key for token generation. |
| `livekit.api_secret` | Yes* | `LIVEKIT_API_SECRET` env | LiveKit API secret for token generation. |
| `livekit.room_name` | No | auto-generated room name | Target room for the audio session. |
| `livekit.identity` | No | `aip-agent` | Identity used by the agent participant in the room. |
| `livekit.openai_stt_model` | No | `gpt-4o-transcribe` | STT model for the OpenAI speech recognition plugin. |
| `livekit.openai_tts_model` | No | `gpt-4o-mini-tts` | TTS model for the OpenAI speech synthesis plugin. |
| `livekit.openai_voice` | No | `echo` | Voice preset for OpenAI TTS. |
| `livekit.openai_use_realtime_stt` | No | `True` | Toggle realtime STT mode in the OpenAI plugin. |
| `livekit.openai_api_key` | No | `OPENAI_API_KEY` env | Override API key for OpenAI STT/TTS plugin calls. |
| `livekit.openai_base_url` | No | provider default | Override the OpenAI base URL for plugin calls. |

Note: fields marked Yes* are required at runtime but can come from either the config or an environment variable.

Optional fallback keys:

| Fallback key | Used when | Notes |
| --- | --- | --- |
| `model.provider` | Deciding whether OpenAI STT/TTS should be enabled | `"openai"` activates the OpenAI speech path even without `livekit.openai_*` fields. |
| `model.model` | `livekit.openai_tts_model` is not set | Used as the fallback TTS model value. |
| `model.voice` | `livekit.openai_voice` is not set | Used as the fallback voice value. |

Custom provider extension remains available via register_audio_session_implementation("name", factory).

Meeting-specific actor integrations (for example Google Meet + Meemo + Attendee stream bridge) are documented in a separate use-case page.

Key Terms

  • Audio session: Runtime object returned by create_audio_session(...) that manages start/stop/wait lifecycle for voice interaction.

  • Provider / implementation: Transport backend selected by implementation="..." (for example "livekit").

  • LiveKit provider: Current AIP audio implementation using LiveKit Python SDK and LiveKit Agents runtime.

  • AIP runtime: Agent reasoning + tool-calling execution path that handles transcript input and generates reply text.

LiveKit Example (Explicit Selection)

Use explicit implementation selection and keep provider-specific settings under config["livekit"]:
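A sketch combining the fields from the customization table; the URL is a local-dev placeholder, the room name and identity are optional, and the secrets use `${ENV_VAR}` references:

```python
import asyncio
from glaip_sdk import Agent


async def main() -> None:
    agent = Agent(name="my-agent", instruction="You are a helpful assistant.")
    session = agent.create_audio_session(
        implementation="livekit",
        config={
            "io": {"input_enabled": True, "output_enabled": True},
            "livekit": {
                "url": "ws://localhost:7880",          # local-dev placeholder
                "api_key": "${LIVEKIT_API_KEY}",       # resolved from the environment
                "api_secret": "${LIVEKIT_API_SECRET}",
                "room_name": "demo-room",              # optional; auto-generated if omitted
                "identity": "aip-agent",               # optional
                "openai_stt_model": "gpt-4o-transcribe",
                "openai_tts_model": "gpt-4o-mini-tts",
                "openai_voice": "echo",
            },
        },
    )
    await session.run()


if __name__ == "__main__":
    asyncio.run(main())
```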

For the LiveKit implementation, STT/TTS model selection is configured under config["livekit"] (openai_stt_model, openai_tts_model, openai_voice). config["model"] is optional and used only as a fallback when provider-specific fields are omitted.

Model Config Precedence (model vs livekit)

  • config["model"] is not required for implementation="livekit".

  • OpenAI STT/TTS wiring activates when either:

    • config["model"]["provider"] == "openai", or

    • any config["livekit"]["openai_*"] field is set.

  • Precedence for TTS model and voice:

    • TTS model: livekit.openai_tts_model -> model.model -> default gpt-4o-mini-tts

    • Voice: livekit.openai_voice -> model.voice -> default echo

Practical recommendation: if you already set openai_stt_model, openai_tts_model, and openai_voice under livekit, you can omit config["model"] to avoid duplication.
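The precedence above amounts to plain dictionary lookups. The following resolve_tts is a hypothetical helper for illustration, not an SDK function:

```python
def resolve_tts(config: dict) -> tuple[str, str]:
    """Resolve TTS model and voice: livekit.* -> model.* -> built-in defaults."""
    lk = config.get("livekit", {})
    model = config.get("model", {})
    tts_model = lk.get("openai_tts_model") or model.get("model") or "gpt-4o-mini-tts"
    voice = lk.get("openai_voice") or model.get("voice") or "echo"
    return tts_model, voice


# livekit.* wins over model.*; unset fields fall through to the next level.
print(resolve_tts({"livekit": {"openai_voice": "alloy"}, "model": {"voice": "nova"}}))
# -> ('gpt-4o-mini-tts', 'alloy')
print(resolve_tts({}))
# -> ('gpt-4o-mini-tts', 'echo')
```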

Use ${ENV_VAR} references for secrets in runnable config (supported by AudioSessionConfig parsing). Use <SECRET> placeholders only in non-runnable documentation examples.

LiveKit prerequisites, setup commands, and local test flow are currently captured in the repository's LiveKit local development runbook.

Provider Model

The audio interface is provider-agnostic. Use implementation="..." to pick the backend (for example "livekit" today). Provider-specific settings are passed via config.

Current AIP implementation support: LiveKit AgentSession-based local audio sessions.

Turn Sequence (Audio -> STT -> AIP -> TTS)

The AIP turn logic is consistent across providers. Transport and streaming APIs change by provider.

(Diagram: audio input -> STT -> AIP agent turn -> reply text -> TTS -> audio output.)

Tool Call Visibility

Tool calls are handled by the underlying agent runtime (e.g. LangGraph) the same way they are for text-only runs.

For the demo workflow in this repo:

  • run with AIP_AUDIO_DEBUG=1 to print transcripts and final replies

  • use the agent's standard streaming/logging to observe tool events

Configuration Tips

  • Audio input/output: Set input_enabled or output_enabled to False to run input-only or output-only sessions.

  • Devices: Supply input_device or output_device when multiple audio devices are present.

  • STT/TTS: Provider-specific. LiveKit handles audio transport; transcription and synthesis live in the LiveKit worker/agent. Providers that expose model selection use AudioModelConfig (see the GL SDK realtime session tutorial).

  • Provider config: LiveKitConfig expects the server URL, api_key, and api_secret; room_name and identity are optional.
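For example, an input-only session config with an explicit capture device (the device name is a placeholder, and secrets use `${ENV_VAR}` references):

```python
# Input-only: transcribe the microphone, produce no spoken replies.
config = {
    "io": {
        "input_enabled": True,
        "output_enabled": False,
        "input_device": "USB Microphone",  # placeholder device name
    },
    "livekit": {
        "url": "ws://localhost:7880",
        "api_key": "${LIVEKIT_API_KEY}",
        "api_secret": "${LIVEKIT_API_SECRET}",
    },
}
```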

Limitations

  • Local-only; no AIP-hosted audio service yet.

  • livekit is the currently documented AIP provider implementation.

  • CLI support is intentionally deferred.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `AudioSessionUnavailableError` | LiveKit deps are missing | Install published extras: `pip install "glaip-sdk[audio]"`. Monorepo contributors can run `make -C python/aip-agents install-audio`. |
| `AudioConfigError` | URL/API key/secret missing | Check `LiveKitConfig` and the env vars `LIVEKIT_API_KEY` / `LIVEKIT_API_SECRET`. |
| No audio / device error | Device not available | Disable audio output or set `input_device`/`output_device`. |
