PDF


PDF Loader is a component designed for extracting information from PDF documents. PDF documents can vary significantly in terms of layout and structure.

This page lists all PDF loaders supported by the Document Processing Orchestrator (DPO).

Prerequisites

Complete all setup steps listed on the Prerequisites page before running the examples on this page.

Installation

# Linux or macOS shell (you can use a Conda environment)
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pdf]"

# Windows PowerShell (you can use a Conda environment)
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"

REM Windows Command Prompt (you can use a Conda environment)
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"

You can use the following as a sample file: pdf-example.pdf.
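
The scripts on this page read the sample from ./data/source/pdf-example.pdf. Before running a loader, it can help to confirm that the file is in place and really is a PDF. The following is a minimal sketch using only the Python standard library; the helper name looks_like_pdf is our own and is not part of gllm-docproc.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF header bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # A well-formed PDF starts with "%PDF-" followed by a version number.
        return handle.read(5) == b"%PDF-"
```

For example, looks_like_pdf("./data/source/pdf-example.pdf") returns False when the file is missing or was saved in another format, which is usually easier to diagnose than an error raised deep inside a loader.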

Recommendation

For the open-source version, we recommend using a combination of PyMuPDF and PDF Plumber. See Multi-Loader PDF Extraction.

For the SaaS version, we recommend using the Azure AI Document Intelligence Loader.

PyMuPDF Loader

PyMuPDFLoader extracts text and images (encoded in base64) from a PDF document. Text is extracted paragraph by paragraph, following the paragraph boundaries that the PyMuPDF library detects.

Create a script called main.py:

from gllm_docproc.loader.pdf import PyMuPDFLoader

source = "./data/source/pdf-example.pdf"

# initialize the PyMuPDF Loader
loader = PyMuPDFLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
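
Because images come back as base64 text, they need to be decoded before they can be opened in an image viewer. The sketch below shows the decoding step with the standard library only. It takes the base64 string directly, since the exact key under which the loader stores it is not described on this page; the helper name save_base64_image is our own.

```python
import base64
from pathlib import Path


def save_base64_image(encoded: str, target: str) -> int:
    """Decode a base64 string and write the raw bytes to target.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(encoded)
    Path(target).write_bytes(raw)
    return len(raw)
```

The sketch does not detect the image format, so choose the file extension for target based on what the PDF actually contains.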

PDF Plumber Loader

PDFPlumberLoader extracts tables from PDF documents. It identifies tables by their clear, well-defined borders, so tables with missing or incomplete borders won't be detected.

Create a script called main.py:

from gllm_docproc.loader.pdf import PDFPlumberLoader

source = "./data/source/pdf-example.pdf"

# initialize the PDF Plumber Loader
loader = PDFPlumberLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
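
The scripts on this page keep the result in loaded_elements without writing it anywhere. To inspect the output, one option is to dump it to a JSON file. The sketch below assumes that loaded_elements is a JSON-serializable list of dictionaries; that is our assumption based on the "output JSON" wording, and the sample element is a hypothetical stand-in, not real loader output.

```python
import json
from pathlib import Path


def dump_elements(elements: list, target: str) -> None:
    """Write loader output to target as indented UTF-8 JSON."""
    path = Path(target)
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(elements, handle, indent=2, ensure_ascii=False)


# Hypothetical stand-in for the list returned by loader.load(source).
sample_elements = [{"text": "Hello, PDF", "structure": "paragraph"}]
```

In main.py, this could be called as dump_elements(loaded_elements, "./data/output/pdf-example.json"), where the output path is only an example.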

Multi-Loader PDF Extraction

In certain cases, we might need to combine multiple loaders to improve information extraction. Below is a sample implementation that loads a PDF document using PyMuPDFLoader and PDFPlumberLoader.

Create a script called main.py:

from gllm_docproc.loader import PipelineLoader
from gllm_docproc.loader.pdf import PDFPlumberLoader, PyMuPDFLoader

source = "./data/source/pdf-example.pdf"

# initialize the pipeline loader
pipeline_loader = PipelineLoader()

# add Text Loader for PDF document (order matters, add PyMuPDFLoader first)
pipeline_loader.add_loader(PyMuPDFLoader())

# add Table Loader for PDF document
pipeline_loader.add_loader(PDFPlumberLoader())

# load source
loaded_elements = pipeline_loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
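
The "order matters" comment suggests that a later loader builds on what earlier loaders produced. The toy model below illustrates one way such a dependency can arise. It is not the PipelineLoader implementation: ToyPipeline, ToyTextLoader, ToyTableLoader, and the element dictionaries are all hypothetical stand-ins.

```python
class ToyTextLoader:
    """Hypothetical first-stage loader: starts a fresh element list."""

    def load(self, source: str, previous: list[dict]) -> list[dict]:
        # Ignores earlier output, so anything loaded before it is lost.
        return [{"type": "text", "content": f"text from {source}"}]


class ToyTableLoader:
    """Hypothetical second-stage loader: appends to earlier output."""

    def load(self, source: str, previous: list[dict]) -> list[dict]:
        return previous + [{"type": "table", "content": [["a", "b"]]}]


class ToyPipeline:
    """Runs loaders in the order they were added, threading elements through."""

    def __init__(self) -> None:
        self._loaders = []

    def add_loader(self, loader) -> None:
        self._loaders.append(loader)

    def load(self, source: str) -> list[dict]:
        elements: list[dict] = []
        for loader in self._loaders:
            # Each loader sees the elements produced by the loaders before it.
            elements = loader.load(source, elements)
        return elements
```

In this toy model, adding the text loader first yields both a text and a table element, while adding it second discards the table. Whether gllm-docproc behaves the same way internally is not stated here, so follow the order shown in the sample script.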

Azure AI Document Intelligence Loader

AzureAIDocumentIntelligenceLoader extracts text, tables, and images from PDF documents using Azure AI Document Intelligence, a cloud-based Azure AI service.

Create a script called main.py:

from gllm_docproc.loader.pdf import AzureAIDocumentIntelligenceLoader

source = "./data/source/pdf-example.pdf"

# initialize the Azure AI Document Intelligence Loader
# (replace the placeholder strings with your own endpoint and key)
loader = AzureAIDocumentIntelligenceLoader(
    endpoint="AZURE_AI_ENDPOINT",
    key="AZURE_AI_KEY",
)

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
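
Hardcoding the endpoint and key in main.py makes it easy to commit credentials by accident. A common alternative is to read them from environment variables. The sketch below assumes the variables are named AZURE_AI_ENDPOINT and AZURE_AI_KEY, mirroring the placeholders above; those names and the helper read_azure_credentials are our own choices, not something gllm-docproc requires.

```python
import os


def read_azure_credentials() -> tuple[str, str]:
    """Read the endpoint and key from the environment, failing early if unset."""
    endpoint = os.environ.get("AZURE_AI_ENDPOINT", "")
    key = os.environ.get("AZURE_AI_KEY", "")
    pairs = (("AZURE_AI_ENDPOINT", endpoint), ("AZURE_AI_KEY", key))
    missing = [name for name, value in pairs if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return endpoint, key
```

The returned values can then be passed to the loader as AzureAIDocumentIntelligenceLoader(endpoint=endpoint, key=key).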


