Simple DPO Pipeline (Loader)

This guide builds a simple Document Processing Orchestrator (DPO) pipeline that processes a PDF file using the DPO Loader.

The Loader extracts information from the provided source.

Prerequisites

This example specifically requires:

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pdf]"

You can use the following as a sample file: pdf-example.pdf.

Running the Pipeline

1. Create a script called main.py:

import json
from gllm_docproc.loader.pdf import PyMuPDFLoader

source = "pdf-example.pdf"

# initialize the PyMuPDF Loader
loader = PyMuPDFLoader()

# load source
loaded_elements = loader.load(source)

print(json.dumps(loaded_elements, indent=4))
2. Run the script:

python main.py
3. The loader generates output like the following (abridged):

[
    {
        "text": "[image_eb8a84830c46d6db]",
        "structure": "image",
        "metadata": {
            "source": "pdf-example.pdf",
            "source_type": "pdf",
            "loaded_datetime": "2025-07-13 19:17:54",
            "coordinates": [
                44,
                568,
                619,
                96
            ],
            "layout_width": 612,
            "layout_height": 792,
            "page_number": 1,
            "media": [
                {
                    "media_type": "image",
                    "media_content": "iVBORw0KGgoAAAANSUhEUgAACp4AAAFvCAIAAADW8b2gAAAACXBIWXMAAA7EAAAOxAGVKw4bAAALa0lEQVR4nO3BMQEAAADCoPVPbQo/ogatmAABZvIIPgAAAABJRU5ErkJggg==",
                    "media_content_type": "base64",
                    "media_id": "image_eb8a84830c46d6db"
                }
            ]
        }
    },
    {
        "text": "[Header] This is the Header of the Document",
        "structure": "uncategorized",
        "metadata": {
            "source": "pdf-example.pdf",
            "source_type": "pdf",
            "loaded_datetime": "2025-07-13 19:17:54",
            "font_size": 12,
            "font_family": "TimesNewRomanPSMT",
            "font_color": "#000000",
            "coordinates": [
                72,
                292,
                49,
                36
            ],
            "links": [],
            "layout_width": 612,
            "layout_height": 792,
            "page_number": 1,
            "sorted_element_format": [
                [
                    12,
                    "TimesNewRomanPSMT",
                    "#000000"
                ]
            ]
        }
    },
    ...
]
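Because image elements carry their full base64 payload in media_content, printing the raw result can flood the terminal. The following is a minimal sketch of one way to shorten those payloads before printing. It assumes only the dictionary layout shown in the output above; truncate_media and sample_elements are hypothetical names used for illustration and are not part of gllm-docproc.

```python
import copy
import json


def truncate_media(elements, max_chars=32):
    """Return a deep copy of elements with long media_content strings shortened.

    Hypothetical helper: it relies only on the "metadata" -> "media" ->
    "media_content" layout shown in the sample output.
    """
    trimmed = copy.deepcopy(elements)
    for element in trimmed:
        for media in element.get("metadata", {}).get("media", []):
            content = media.get("media_content", "")
            if len(content) > max_chars:
                media["media_content"] = content[:max_chars] + "..."
    return trimmed


# Stand-in data shaped like the loader output (the base64 payload is fake).
sample_elements = [
    {
        "text": "[image_eb8a84830c46d6db]",
        "structure": "image",
        "metadata": {
            "source": "pdf-example.pdf",
            "page_number": 1,
            "media": [
                {
                    "media_type": "image",
                    "media_content": "iVBORw0KGgo" + "A" * 200,
                    "media_content_type": "base64",
                    "media_id": "image_eb8a84830c46d6db",
                }
            ],
        },
    },
    {
        "text": "[Header] This is the Header of the Document",
        "structure": "uncategorized",
        "metadata": {"source": "pdf-example.pdf", "page_number": 1, "font_size": 12},
    },
]

preview = truncate_media(sample_elements)
print(json.dumps(preview, indent=4))
```

In main.py, the same helper could be applied to loaded_elements before the json.dumps call. The original list is left untouched because the helper works on a deep copy.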

You can use the extracted data (text, font size, font family, font color, coordinates, links, page number, etc.) to enrich each element before you embed it into a vector store (typically Elasticsearch).
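As a rough illustration of that enrichment step, the sketch below flattens each text element into a small record that keeps the text together with a few metadata fields. It assumes the element layout shown in the sample output; to_index_record, build_records, and the choice of fields are hypothetical, and the embedding and indexing steps themselves are out of scope here.

```python
def to_index_record(element):
    """Flatten one loaded element into a record ready for later enrichment.

    Hypothetical helper: the selected fields are an illustrative choice,
    not a schema required by gllm-docproc or by any vector store.
    """
    metadata = element.get("metadata", {})
    return {
        "text": element["text"],
        "structure": element["structure"],
        "source": metadata.get("source"),
        "page_number": metadata.get("page_number"),
        "font_size": metadata.get("font_size"),
    }


def build_records(elements):
    """Keep non-image elements only and flatten each of them."""
    return [to_index_record(e) for e in elements if e.get("structure") != "image"]


# Stand-in data shaped like the loader output.
elements = [
    {
        "text": "[image_eb8a84830c46d6db]",
        "structure": "image",
        "metadata": {"source": "pdf-example.pdf", "page_number": 1},
    },
    {
        "text": "[Header] This is the Header of the Document",
        "structure": "uncategorized",
        "metadata": {
            "source": "pdf-example.pdf",
            "page_number": 1,
            "font_size": 12,
            "font_family": "TimesNewRomanPSMT",
        },
    },
]

records = build_records(elements)
for record in records:
    print(record)
```

Skipping image elements here is a design choice for a text-only index. A pipeline that also indexes images would handle the media entries separately.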
