Image to Caption

Introduction

The Image to Caption component converts images into natural language captions for multimodal AI workflows. It generates multiple captions using multimodal LLMs (e.g., Gemini) and can incorporate context such as image metadata, domain knowledge, and reference attachments to produce contextual captions.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-multimodal" 

Quickstart

The simplest way to initialize the Image to Caption component is to use the built-in preset.

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

# Load the image, build the converter from the default preset, and generate captions.
image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")
captions = asyncio.run(converter.convert(image.data))
print(f"Captions: {captions.result}")

Output:

Contextual Image Captioning

Sometimes the image alone doesn't tell the whole story. Image to Caption supports passing additional context for more contextual image captioning.

Image One Liner

image_one_liner is a brief, one-line summary or title of the image.
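
For example, a minimal sketch building on the quickstart above (this assumes convert() accepts image_one_liner as a keyword argument; the one-liner text is illustrative):

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")

# Provide a brief one-line summary of the image as extra context.
captions = asyncio.run(
    converter.convert(image.data, image_one_liner="Front packaging of an over-the-counter medicine")
)
print(f"Captions: {captions.result}")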

Output:

Image Description

image_description adds a detailed description of the image's content drawn from relevant sources (e.g., an article or a PDF page).
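
A minimal sketch (this assumes convert() accepts image_description as a keyword argument; the description text is illustrative):

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")

# Provide a detailed description taken from the source where the image appears.
captions = asyncio.run(
    converter.convert(
        image.data,
        image_description="Product photo from an article comparing common fever medications.",
    )
)
print(f"Captions: {captions.result}")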

Domain Knowledge

domain_knowledge provides relevant, keyword-based, domain-specific information that is not present in image_description.
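
A minimal sketch (this assumes convert() accepts domain_knowledge as a keyword argument; the knowledge text is illustrative):

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")

# Supply keyword-based domain knowledge that is not visible in the image itself.
captions = asyncio.run(
    converter.convert(
        image.data,
        domain_knowledge="Paracetamol: analgesic and antipyretic, commonly sold as 500 mg tablets.",
    )
)
print(f"Captions: {captions.result}")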

Attachment Context

Beyond textual context, you can also pass supporting images. Their main benefit is conveying spatial or structural information about the primary image (including positional data), which substantially improves the LLM's contextual understanding.
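
A minimal sketch (the attachments keyword and the use of .data here are assumptions; check the converter's API for the exact parameter name and expected type):

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

image = Attachment.from_path("./obat.webp")
# A supporting image, e.g., the full page on which the primary image appears (hypothetical file).
page = Attachment.from_path("./article_page.webp")

converter = LMBasedImageToCaption.from_preset("default")
captions = asyncio.run(
    converter.convert(image.data, attachments=[page.data])
)
print(f"Captions: {captions.result}")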

Combined

You can also combine image_one_liner, image_description, and domain_knowledge for fully contextual captioning.
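
A minimal sketch combining all three (the keyword arguments follow the parameter names above; all context strings are illustrative):

import asyncio

from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")

captions = asyncio.run(
    converter.convert(
        image.data,
        image_one_liner="Front packaging of an over-the-counter medicine",
        image_description="Product photo from an article comparing common fever medications.",
        domain_knowledge="Paracetamol: analgesic and antipyretic, commonly sold as 500 mg tablets.",
    )
)
print(f"Captions: {captions.result}")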

Customize Model

When using a preset, the captioning model can be changed via the DEFAULT_IMAGE_CAPTIONING_MODEL_ID environment variable.
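
For example (the model identifier below is illustrative, and this assumes the variable is read when the preset is constructed, so set it beforehand):

import os

from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

# Override the preset's captioning model (illustrative model id).
os.environ["DEFAULT_IMAGE_CAPTIONING_MODEL_ID"] = "gemini-1.5-pro"

converter = LMBasedImageToCaption.from_preset("default")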

Customize Model and Prompt

Using a custom LM Request Processor allows you to customize the model and/or the prompt.
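
The snippet below is a purely hypothetical sketch: the build_lm_request_processor import path and arguments, the prompt text, and the lm_request_processor constructor parameter are assumptions rather than confirmed gllm_inference or gllm_multimodal API.

import asyncio

from gllm_inference.builder import build_lm_request_processor  # hypothetical import path
from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption

# Build a request processor with a custom model and prompt (names and arguments are assumptions).
lm_request_processor = build_lm_request_processor(
    model_id="gemini-1.5-pro",
    system_template="Describe the image in one concise, factual caption.",
)

converter = LMBasedImageToCaption(lm_request_processor=lm_request_processor)

image = Attachment.from_path("./obat.webp")
captions = asyncio.run(converter.convert(image.data))
print(f"Captions: {captions.result}")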
