Image to Caption
Introduction
The Image to Caption component converts images into natural-language captions for multimodal AI workflows. It can generate multiple captions using multimodal LLMs (e.g., Gemini) and incorporates context such as image metadata, domain knowledge, and reference attachments to produce contextual captions.
Installation
# macOS / Linux (you can use a Conda environment)
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-multimodal"
# Windows PowerShell
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-multimodal"
:: Windows Command Prompt
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-multimodal"

Quickstart
The simplest way to initialize the Image to Caption component is to use the built-in preset.
import asyncio
from gllm_inference.schema import Attachment
from gllm_multimodal.modality_converter.image_to_text.image_to_caption import LMBasedImageToCaption
image = Attachment.from_path("./obat.webp")
converter = LMBasedImageToCaption.from_preset("default")
captions = asyncio.run(converter.convert(image.data))
print(f"Captions: {captions.result}")
Output:
Contextual Image Captioning
Sometimes the image alone doesn't tell the whole story. Image to Caption supports passing additional context to produce more contextual captions.
Image One Liner
image_one_liner is a brief, one-line summary or title of the image.
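Building on the quickstart above, a one-liner can be supplied alongside the image. This is a minimal sketch: it assumes `convert` accepts the context as a keyword argument of the same name, and the summary string is a hypothetical example; check your SDK version for the exact signature.

```python
# Sketch only: assumes `convert` takes an `image_one_liner` keyword argument.
captions = asyncio.run(
    converter.convert(
        image.data,
        image_one_liner="A box of paracetamol 500 mg tablets",  # hypothetical summary
    )
)
print(captions.result)
```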
Image Description
image_description adds a detailed description of the image's content from relevant sources (e.g., an article or a PDF page).
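For instance, a passage pulled from the source document can be passed along. This is a sketch under the same assumption as above (context passed as a keyword argument to `convert`), and the description text is hypothetical.

```python
# Sketch only: assumes an `image_description` keyword argument.
description = "The article describes a strip of ten film-coated tablets..."  # hypothetical
captions = asyncio.run(converter.convert(image.data, image_description=description))
```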
Domain Knowledge
domain_knowledge provides relevant, keyword-based, domain-specific information that is not present in image_description.
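A sketch of supplying domain facts, again assuming `convert` accepts the context as a keyword argument of the same name; the knowledge string is a hypothetical example.

```python
# Sketch only: assumes a `domain_knowledge` keyword argument.
knowledge = "Paracetamol: analgesic and antipyretic; common adult dose 500 mg."  # hypothetical
captions = asyncio.run(converter.convert(image.data, domain_knowledge=knowledge))
```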
Attachment Context
Beyond textual context, supporting images can also be supplied. Their main strength is conveying spatial or structural information about the primary image (including positional data), which substantially improves the LLM's contextual understanding.
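A sketch of supplying a supporting image: the parameter name `attachment_contexts` and the reference file below are assumptions for illustration, not the documented API.

```python
# Sketch only: `attachment_contexts` is an assumed parameter name.
reference = Attachment.from_path("./shelf_photo.webp")  # hypothetical supporting image
captions = asyncio.run(
    converter.convert(image.data, attachment_contexts=[reference.data])
)
```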
Combined
You can also combine image_one_liner, image_description, and domain_knowledge together for fully contextual captioning.
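A combined call might look like the following sketch. The keyword-argument names mirror the context names above, but the exact `convert` signature may differ; the placeholder strings are to be replaced with your own context.

```python
# Sketch only: assumes `convert` accepts these keyword arguments directly.
captions = asyncio.run(
    converter.convert(
        image.data,
        image_one_liner="...",    # brief title of the image
        image_description="...",  # detailed description from a source document
        domain_knowledge="...",   # keyword-based domain facts
    )
)
```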
Customize Model
When using a preset, the captioning model can be changed via the DEFAULT_IMAGE_CAPTIONING_MODEL_ID environment variable.
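For example, the variable can be set in-process before the preset is constructed. The model id below is a placeholder; use one supported by your deployment.

```python
import os

# Must be set before LMBasedImageToCaption.from_preset() is called.
# The model id is a placeholder, not a documented default.
os.environ["DEFAULT_IMAGE_CAPTIONING_MODEL_ID"] = "gemini/gemini-2.0-flash"

print(os.environ["DEFAULT_IMAGE_CAPTIONING_MODEL_ID"])
```

Setting the variable in your shell or deployment configuration works equally well.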
Customize Model and Prompt
Using a custom LM Request Processor allows you to customize the model and/or the prompt.
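A heavily hedged sketch of that pattern: the builder function, its parameters, and the constructor argument below are assumptions for illustration; consult the gllm-inference documentation for the actual interfaces.

```python
# Sketch only: names below are assumptions, not the documented API.
lmrp = build_lm_request_processor(       # hypothetical helper from gllm-inference
    model_id="gemini/gemini-2.0-flash",  # placeholder model id
    system_template="Describe the image for a pharmacy catalog.",  # custom prompt
)
converter = LMBasedImageToCaption(lm_request_processor=lmrp)
```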