Audio to Text Converter
Introduction
The Audio to Text Converter module provides a unified interface for transcribing audio content from multiple sources using different STT providers and VLMs. It supports various input formats including local files, URLs, base64-encoded strings, and YouTube videos.
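To make the four input kinds concrete, here is a minimal sketch of how a single source string could be told apart before transcription. The helper `classify_audio_source` and the `YOUTUBE_HOSTS` set are hypothetical names used only for illustration. They are not part of gllm-multimodal, and the module's own detection logic may differ.

```python
import base64
import binascii
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical helper for illustration; not part of gllm-multimodal.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_audio_source(source: str) -> str:
    """Return 'youtube', 'url', 'file', or 'base64' for a source string."""
    if not source:
        raise ValueError("Empty audio source")

    # http(s) sources are either YouTube links or plain URLs.
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else "url"

    # Check the filesystem before base64: short names like "song"
    # are also valid base64, so an existing file should win.
    try:
        if Path(source).is_file():
            return "file"
    except (OSError, ValueError):
        pass  # e.g. a long base64 payload is not a usable path

    # validate=True rejects characters outside the base64 alphabet.
    try:
        base64.b64decode(source, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError(f"Unrecognized audio source: {source[:40]!r}") from None
    return "base64"


print(classify_audio_source("https://youtu.be/abc123"))  # prints: youtube
```

The order of the checks is a design choice in this sketch: URLs are matched first because they are unambiguous, and the base64 check comes last because many ordinary strings happen to decode cleanly.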
Installation
# You can use a Conda environment.

# Linux/macOS (bash)
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-multimodal"

# Windows PowerShell
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-multimodal"

REM Windows Command Prompt
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-multimodal"

Quickstart
Here is how to transcribe an audio file with GeminiAudioToText:
import asyncio

from gllm_multimodal.modality_converter.audio_to_text import GeminiAudioToText

converter = GeminiAudioToText(
    api_key="your-gemini-api-key",
    model="gemini-2.5-flash",  # Default model
    max_retries=3,  # Number of retry attempts
    timeout=300,  # Timeout in seconds
)

# convert() is awaited through asyncio.run(), so it needs an event loop
transcripts = asyncio.run(converter.convert("path/to/audio/file"))

for transcript in transcripts:
    print(
        f"[{transcript.start_time:.2f}s - {transcript.end_time:.2f}s] "
        f"({transcript.lang_id or 'unknown'}) {transcript.text}"
    )

Output: one line per transcript, showing its start and end times in seconds, its language ID (or "unknown"), and its text.
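The line format used by the print loop can be tried without an API key. The sketch below assumes a stand-in `Segment` dataclass carrying only the four attributes the Quickstart reads (`start_time`, `end_time`, `lang_id`, `text`). `Segment` and `format_segment` are hypothetical names, the sample data is made up, and the real transcript objects returned by `convert` may carry more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Stand-in for a transcript object; not the library's class."""
    start_time: float
    end_time: float
    text: str
    lang_id: Optional[str] = None


def format_segment(seg: Segment) -> str:
    """Render one segment in the same layout as the Quickstart loop."""
    lang = seg.lang_id or "unknown"  # fall back when no language is set
    return f"[{seg.start_time:.2f}s - {seg.end_time:.2f}s] ({lang}) {seg.text}"


# Made-up sample data, not real transcription output.
segments = [
    Segment(0.0, 2.5, "Hello there.", "en"),
    Segment(2.5, 4.0, "Selamat pagi."),
]
lines = [format_segment(s) for s in segments]
print("\n".join(lines))
# [0.00s - 2.50s] (en) Hello there.
# [2.50s - 4.00s] (unknown) Selamat pagi.
```

Keeping the formatting in a small function like this makes it easy to redirect the same lines to a file or a log instead of standard output.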
How to Customize Your Prompt
GeminiAudioToText uses a default system and user prompt for transcription. Here is how to use a custom prompt: