Audio to Text Converter

Introduction

The Audio to Text Converter module provides a unified interface for transcribing audio from multiple sources using different speech-to-text (STT) providers and VLMs. Supported inputs include local files, URLs, base64-encoded strings, and YouTube videos.
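
To make the four input kinds concrete, here is a minimal sketch of how a caller might tell them apart before handing a source string to a converter. This is not the module's own detection logic, which is not shown on this page. The function name `classify_audio_source`, the `YOUTUBE_HOSTS` set, and the 64-character threshold are all hypothetical choices made for illustration.

```python
import base64
import binascii
from urllib.parse import urlparse

# Hypothetical host list for illustration; the library may recognise others.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_audio_source(source: str) -> str:
    """Return 'youtube', 'url', 'base64', or 'file' for a source string."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else "url"

    # Heuristic: a long string that decodes cleanly as base64 is treated
    # as inline audio. The 64-character threshold is an arbitrary assumption.
    if len(source) >= 64:
        try:
            base64.b64decode(source, validate=True)
            return "base64"
        except (binascii.Error, ValueError):
            pass

    # Anything else is assumed to be a local file path.
    return "file"
```

The base64 check is only a heuristic: a long path made up entirely of base64-alphabet characters could be misclassified, so a real implementation would likely want an explicit flag or a stricter check.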

Installation

# Optionally, run this inside a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-multimodal" 

Quickstart

The following example transcribes an audio file with GeminiAudioToText:

import asyncio

from gllm_multimodal.modality_converter.audio_to_text import GeminiAudioToText

converter = GeminiAudioToText(
    api_key="your-gemini-api-key",
    model="gemini-2.5-flash",  # Default model
    max_retries=3,  # Number of retry attempts
    timeout=300,  # Timeout in seconds
)

# convert() is awaited via asyncio.run(), so it needs an event loop
transcripts = asyncio.run(converter.convert("path/to/audio/file"))

for transcript in transcripts:
    print(
        f"[{transcript.start_time:.2f}s - {transcript.end_time:.2f}s] ({transcript.lang_id or 'unknown'}) {transcript.text}"
    )

Output:

How to Customize Your Prompt

GeminiAudioToText uses default system and user prompts for transcription. Here is how to use a custom prompt:
