PDF

gllm-docproc | Tutorial : PDF Loader | Use Case: Advanced DPO Pipeline | API Reference

PDF Loader is a component designed for extracting information from PDF documents. PDF documents can vary significantly in terms of layout and structure.

This page lists all PDF Loaders supported in the Document Processing Orchestrator (DPO).

Prerequisites

This example specifically requires completion of all setup steps listed on the Prerequisites page.

Installation

Bash (Linux, macOS):

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pdf]"

Windows PowerShell:

# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"

Windows Command Prompt:

REM you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"
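After installation, a quick check can confirm that the package is importable before running the examples. The sketch below assumes the import name is `gllm_docproc`, as used in the import statements on this page; the helper `is_importable` is a hypothetical name introduced for illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "gllm_docproc" is the import name used by the examples on this page.
    name = "gllm_docproc"
    status = "found" if is_importable(name) else "not found; re-run the install step"
    print(f"{name}: {status}")
```

The check only locates the module and does not import it, so it stays fast and has no side effects.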

You can use the following as a sample file: pdf-example.pdf.

PyMuPDF Loader

PyMuPDFLoader extracts text and images (encoded in base64) from a PDF document. The text is extracted per paragraph, based on how the PyMuPDF library detects paragraphs.

Create a script called main.py:

from gllm_docproc.loader.pdf import PyMuPDFLoader

source = "./data/source/pdf-example.pdf"

# initialize the PyMuPDF Loader
loader = PyMuPDFLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
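The script keeps the result in memory only. To inspect it as a file, one option is to serialize it with the standard `json` module. This is a minimal sketch, assuming `loader.load` returns a JSON-serializable list of dictionaries; the helper `save_elements`, the output location, and the keys in the sample element are hypothetical stand-ins, not the library's documented schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_elements(elements: list[dict[str, Any]], output_path: Path) -> Path:
    """Write a list of element dictionaries to a UTF-8 JSON file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(elements, f, ensure_ascii=False, indent=2)
    return output_path


# Stand-in data so the sketch runs on its own; in main.py you would pass
# the loaded_elements returned by loader.load(source) instead.
sample_elements = [
    {"text": "Hello PDF", "structure": "paragraph"},  # hypothetical keys
]

# Written to the system temp directory here; any writable path works.
saved = save_elements(sample_elements, Path(tempfile.gettempdir()) / "pdf-example.json")
print(f"Wrote {len(sample_elements)} element(s) to {saved}")
```

Setting `ensure_ascii=False` keeps non-ASCII text from the PDF readable in the output file.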

PyMuPDF Span Loader

PyMuPDFSpanLoader extracts text and images (encoded in base64) from a PDF document. The text is extracted per span, where a span is a continuous run of characters that share identical formatting.

Create a script called main.py:

from gllm_docproc.loader.pdf import PyMuPDFSpanLoader

source = "./data/source/pdf-example.pdf"

# initialize the PyMuPDF Span Loader
loader = PyMuPDFSpanLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
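To make the notion of a span concrete, the sketch below groups consecutive characters that share the same formatting. It is a plain-Python illustration of the idea, not the actual implementation in PyMuPDF or gllm-docproc; the `Char` type and the `group_into_spans` function are hypothetical, and "formatting" is reduced here to font name and size.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Char:
    """A single character with simplified formatting (hypothetical model)."""
    text: str
    font: str
    size: float


def group_into_spans(chars: list[Char]) -> list[dict]:
    """Merge consecutive characters with identical (font, size) into spans."""
    spans = []
    # groupby only merges adjacent items, which matches the "continuous" requirement.
    for (font, size), run in groupby(chars, key=lambda c: (c.font, c.size)):
        spans.append({"text": "".join(c.text for c in run), "font": font, "size": size})
    return spans


chars = (
    [Char(ch, "Helvetica-Bold", 14.0) for ch in "Title"]
    + [Char(ch, "Helvetica", 10.0) for ch in " body "]
    + [Char(ch, "Helvetica-Oblique", 10.0) for ch in "note"]
)
spans = group_into_spans(chars)
for span in spans:
    print(span)
```

A single visual paragraph can therefore yield several spans whenever the font or size changes mid-line, which is why the span loader typically produces more, smaller elements than the paragraph-based loader.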


Last updated 5 months ago
