PPTX
gllm-docproc | Tutorial : PPTX Loader | Use Case: Advanced DPO Pipeline | API Reference
PPTX Loader is a component designed for extracting information from a PPTX file and converting it into a standardized JSON format.
This page lists all PPTX Loaders supported by the Document Processing Orchestrator (DPO).
Installation
Linux/macOS (bash):
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pptx]"
Windows (PowerShell):
# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pptx]"
Windows (Command Prompt):
REM you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pptx]"
You can use the following as a sample file: pptx-example.pptx.
PythonPPTX Loader
PythonPPTXLoader is responsible for extracting text, tables, images, and charts from PPTX documents.
Text is extracted per shape (paragraphs and runs), tables are normalized into Markdown, images are base64-encoded, and charts are converted into minimal structured text with metadata.
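To make two of these conversions concrete, here is a minimal standard-library sketch of what "normalizing a table into Markdown" and "base64-encoding an image" can look like. It is an illustration only, not the PythonPPTXLoader implementation: the function names table_to_markdown and image_to_base64 are hypothetical, and the loader's actual output formatting may differ.

```python
import base64


def table_to_markdown(rows):
    """Render a table as Markdown. Hypothetical helper, not part of gllm-docproc.

    `rows` is a list of rows (lists of cell values); the first row is the header.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes, flatten newlines inside cells, and pad short rows.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def image_to_base64(data: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    print(table_to_markdown([["Quarter", "Revenue"], ["Q1", "120"], ["Q2", "135"]]))
    print(image_to_base64(b"\x89PNG"))
```

Escaping the pipe character and flattening newlines keeps each table row on a single Markdown line, which is the main thing that can go wrong when slide tables contain free-form text.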
Create a script called main.py:
from gllm_docproc.loader.pptx import PythonPPTXLoader
source = "./data/source/pptx-example.pptx"
# initialize the PPTX Loader
loader = PythonPPTXLoader()
# load source
loaded_elements = loader.load(source)
Run the script:
python main.py
The loader will generate the following: output JSON.
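The script above keeps the result in memory only. If you want to inspect it on disk, one possible approach is sketched below. It assumes that loaded_elements is a JSON-serializable list (for example, a list of dictionaries); this page does not confirm that, so the sketch falls back to str() for any value that json cannot encode. The helper name save_elements and the sample field names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_elements(elements, output_path):
    """Write loaded elements to a JSON file. Hypothetical helper, not part of gllm-docproc."""
    path = Path(output_path)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # default=str is a fallback for values json cannot encode natively.
        json.dump(elements, f, ensure_ascii=False, indent=2, default=str)
    return path


if __name__ == "__main__":
    # Stand-in data with hypothetical field names; in main.py you would pass
    # the value returned by loader.load(source) instead.
    sample_elements = [{"text": "Title slide", "structure": "title"}]
    with tempfile.TemporaryDirectory() as tmp_dir:
        written = save_elements(sample_elements, Path(tmp_dir) / "pptx-example.json")
        print(written.read_text(encoding="utf-8"))
```

In main.py, this could be called as save_elements(loaded_elements, "./data/output/pptx-example.json"), where the output path is only an example.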