PPTX

gllm-docproc | Tutorial : PPTX Loader | Use Case: Advanced DPO Pipeline | API Reference

The PPTX Loader is a component that extracts information from a PPTX file and converts it into a standardized JSON format.

This page lists all supported PPTX Loaders in the Document Processing Orchestrator (DPO).

Prerequisites

Before running this example, complete all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pptx]"
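
To confirm that the package is visible to your interpreter before continuing, a quick check along these lines may help. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and `gllm_docproc` is the import name used by the script later on this page. It only checks that the top-level package can be found, not that the `pptx` extra is complete.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "gllm_docproc" is the import name used by the loader example on this page.
    status = "available" if is_installed("gllm_docproc") else "missing"
    print(f"gllm_docproc is {status}")
```

If the package is reported as missing, make sure the `pip install` command ran in the same environment (for example, the same Conda environment) as the interpreter you are using.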

You can use the following as a sample file: pptx-example.pptx.

PythonPPTXLoader

PythonPPTXLoader extracts text, tables, images, and charts from PPTX documents. Text is extracted per shape (paragraphs and runs), tables are normalized into markdown, images are base64-encoded, and charts are converted into minimal structured text with metadata.
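
To make two of these transformations concrete, the sketch below shows one plausible way to normalize a table into markdown and to base64-encode image bytes. It is an illustration only, not the loader's actual implementation: the function names `table_to_markdown` and `image_to_base64` are hypothetical, the first row is assumed to be the header, and cell values are assumed not to contain the `|` character.

```python
import base64


def table_to_markdown(rows: list[list[str]]) -> str:
    """Render a table as markdown, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def image_to_base64(data: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    print(table_to_markdown([["Name", "Qty"], ["Apple", "3"]]))
    print(image_to_base64(b"PNG"))
```

Base64 keeps binary image data safe to embed in a JSON document, at the cost of roughly a one-third increase in size.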

1. Create a script called main.py:

from gllm_docproc.loader.pptx import PythonPPTXLoader

source = "./data/source/pptx-example.pptx"

# initialize the PPTX Loader
loader = PythonPPTXLoader()

# load source
loaded_elements = loader.load(source)

2. Run the script:

python main.py

3. The loader will generate the following: output JSON.
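
The script above keeps the result in the `loaded_elements` variable and does not write anything to disk. If you want to inspect the result as a file, something like the following could be added. This is a minimal sketch that assumes `loaded_elements` is a list of JSON-serializable dictionaries. The helper name `save_elements`, the output path, and the sample keys in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_elements(elements: list[dict], path: str) -> Path:
    """Write the loaded elements to a UTF-8 JSON file and return its path."""
    out = Path(path)
    # Create the output directory if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(
        json.dumps(elements, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return out


if __name__ == "__main__":
    # Stand-in data; the real element keys may differ (assumption).
    sample = [{"text": "Quarterly results", "structure": "title"}]
    with tempfile.TemporaryDirectory() as tmp:
        written = save_elements(sample, str(Path(tmp) / "output" / "pptx-example.json"))
        print(written.read_text(encoding="utf-8"))
```

In main.py, the equivalent call would be `save_elements(loaded_elements, "./data/output/pptx-example.json")`, where the output path is only an example.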
