Video to Caption

Introduction

The Video to Caption component converts videos into natural language captions using multimodal language models. It is built on top of the LMBasedVideoToCaption converter, which uses an LMRequestProcessor with multimodal LM invokers to understand both the visual and temporal aspects of a video and return structured captions.

Typical use cases include:

  1. Generating captions for long-form videos to power downstream search or retrieval.

  2. Creating short highlight captions for clips in social feeds or internal video libraries.

  3. Producing textual context that can be fed into RAG pipelines or evaluation workflows.

Installation


# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-multimodal gllm-inference

Quickstart

The simplest way to initialize the Video to Caption component is to use the built-in preset.

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.video_to_text.video_to_caption import LMBasedVideoToCaption

video = Attachment.from_path("./sample_video.mp4")
converter = LMBasedVideoToCaption.from_preset("default")

# The converter expects raw bytes for the video input
result = asyncio.run(converter.convert(video.data))

# The result is a TextResult object
print(f"Video Summary: {result.result}")

# Access detailed segments from metadata
for segment in result.metadata["segments"]:
    print(f"Segment ({segment['start_time']}s - {segment['end_time']}s):")
    for caption in segment.get("segment_caption", []):
        print(f"  - {caption}")

Expected output format

The LMBasedVideoToCaption converter returns a TextResult object with the following structure:

  • result (str): A high-level summary of the entire video.

  • tag (str): The tag identifying the result type (always "caption").

  • metadata (dict): Contains detailed captioning information:

    • video_summary (str): Same as result, the video summary.

    • segments (list): List of video segments, each containing:

      • start_time (float): Segment start time in seconds.

      • end_time (float): Segment end time in seconds.

      • segment_caption (list[str]): Captions for the segment.

      • keyframes (list): Keyframe descriptions with time_offset and caption.

      • transcripts (list): Transcript entries with text, start_time, end_time, and lang_id.
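
As a reference for navigating this structure, the snippet below extends the Quickstart loop to print the keyframes and transcripts of each segment. It only relies on the metadata keys documented above.

# Continuing from the Quickstart example: walk the segment metadata to print
# keyframe descriptions and transcript entries alongside the segment captions.
for segment in result.metadata["segments"]:
    print(f"Segment ({segment['start_time']}s - {segment['end_time']}s):")

    for keyframe in segment.get("keyframes", []):
        print(f"  keyframe @ {keyframe['time_offset']}s: {keyframe['caption']}")

    for transcript in segment.get("transcripts", []):
        print(
            f"  transcript [{transcript['start_time']}s - {transcript['end_time']}s]"
            f" ({transcript['lang_id']}): {transcript['text']}"
        )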

Contextual video captioning

Sometimes the raw video alone does not provide enough context. Video to Caption supports passing additional metadata to help the model generate more relevant and domain-specific captions.

The supported fields are defined by the Caption schema and include:

  1. image_one_liner (str, optional): Brief one-line summary or title of the video.

  2. image_description (str, optional): Longer free-form description of what the video is about.

  3. domain_knowledge (str, optional): Domain-specific hints that are not present in the description.

  4. image_metadata (dict, optional): Arbitrary metadata, such as duration or frame rate.

  5. number_of_captions (int, optional): Number of captions to generate.

Using image one liner and description
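
The sketch below shows one way to pass image_one_liner and image_description alongside the video. It assumes these Caption schema fields are accepted as keyword arguments by convert; the exact parameter-passing mechanism may differ in your SDK version, so check the convert signature. The video file and field values are illustrative.

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.video_to_text.video_to_caption import LMBasedVideoToCaption

video = Attachment.from_path("./product_demo.mp4")  # illustrative file name
converter = LMBasedVideoToCaption.from_preset("default")

# Assumption: the Caption schema fields are forwarded as keyword arguments to convert.
result = asyncio.run(
    converter.convert(
        video.data,
        image_one_liner="Product demo of the new analytics dashboard",
        image_description=(
            "A five-minute walkthrough of the analytics dashboard covering "
            "filtering, chart drill-downs, and report export."
        ),
    )
)

print(result.result)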

Adding domain knowledge and metadata
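
Building on the previous sketch, the example below adds domain_knowledge, image_metadata, and number_of_captions. As above, passing these fields as keyword arguments is an assumption, and the values themselves are illustrative.

# Assumption: domain_knowledge, image_metadata, and number_of_captions are passed
# the same way as the other Caption schema fields.
result = asyncio.run(
    converter.convert(
        video.data,
        domain_knowledge=(
            "The dashboard belongs to an internal BI tool; 'drill-down' means "
            "switching from an aggregated chart to the underlying rows."
        ),
        image_metadata={"duration_seconds": 312, "frame_rate": 30},
        number_of_captions=3,
    )
)

for caption in result.metadata["segments"][0].get("segment_caption", []):
    print(caption)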

Using attachment context

Video to Caption also supports attachment context, which allows you to pass supporting attachments such as slides, images, or transcripts alongside the main video. These attachments are exposed to the LVLM as additional context to improve caption quality.
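
The sketch below illustrates the idea with supporting slide and note files, reusing the converter and video from the earlier examples. The attachments keyword argument and the file names are assumptions for illustration; consult the converter's signature for how attachment context is actually supplied.

from gllm_inference.schema import Attachment

slides = Attachment.from_path("./talk_slides.pdf")    # hypothetical supporting file
notes = Attachment.from_path("./speaker_notes.txt")   # hypothetical supporting file

# Assumption: supporting attachments are passed via an `attachments` keyword argument.
result = asyncio.run(
    converter.convert(video.data, attachments=[slides, notes])
)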

Customize model and prompt

By default, the preset uses the configured LVLM from the multimodality presets. For advanced use cases, you can provide your own LMRequestProcessor configuration to fully control model, prompt, and parsing behavior.
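
The sketch below outlines the idea: build an LMRequestProcessor from a prompt builder and a multimodal LM invoker, then hand it to LMBasedVideoToCaption. The module paths, the class names other than those mentioned above, and the lm_request_processor constructor argument are assumptions; check the gllm_inference and gllm_multimodal APIs for the actual names.

import asyncio

from gllm_inference.lm_invoker import OpenAILMInvoker            # assumed invoker class and path
from gllm_inference.prompt_builder import PromptBuilder          # assumed prompt builder class and path
from gllm_inference.request_processor import LMRequestProcessor  # assumed module path
from gllm_multimodal.modality_converter.video_to_text.video_to_caption import LMBasedVideoToCaption

# Assumption: LMRequestProcessor wires a prompt builder to a multimodal LM invoker,
# and LMBasedVideoToCaption accepts it via an `lm_request_processor` argument.
prompt_builder = PromptBuilder(
    system_template="You caption surgical training videos with precise clinical terminology.",
    user_template="Caption the attached video, segment by segment.",
)
lm_invoker = OpenAILMInvoker(model_name="gpt-4o")
request_processor = LMRequestProcessor(prompt_builder=prompt_builder, lm_invoker=lm_invoker)

converter = LMBasedVideoToCaption(lm_request_processor=request_processor)
result = asyncio.run(converter.convert(video.data))  # `video` as defined in the Quickstart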
