# PDF

[**`gllm-docproc`**](https://github.com/GDP-ADMIN/gl-sdk/tree/main/libs/gllm-docproc/gllm_docproc/loader/pdf) | **Tutorial**: [PDF Loader](https://gdplabs.gitbook.io/sdk/~/revisions/w6A7tUKJGDYFXuci5HcW/tutorials/document-processing-orchestrator/loader/pdf) | **Use Case**: [advanced-dpo-pipeline](https://gdplabs.gitbook.io/sdk/~/revisions/w6A7tUKJGDYFXuci5HcW/how-to-guides/build-document-processing-pipeline/advanced-dpo-pipeline) | [API Reference](https://api.python.docs.glair.ai/generative-internal/library/gllm_docproc/api/loader.html)

**PDF Loader** is a component designed for **extracting information from PDF documents**. PDF documents can vary significantly in terms of layout and structure.

This page lists all PDF Loaders supported in the Document Processing Orchestrator (DPO).

<details>

<summary>Prerequisites</summary>

The examples on this page require completing all setup steps listed on the [Prerequisites](https://gdplabs.gitbook.io/sdk/~/revisions/w6A7tUKJGDYFXuci5HcW/gen-ai-sdk/prerequisites) page.

</details>

## **Installation**

{% tabs %}
{% tab title="Linux, macOS, or Windows WSL" %}

```bash
# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pdf]"
```

{% endtab %}

{% tab title="Windows Powershell" %}

```powershell
# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"
```

{% endtab %}

{% tab title="Windows Command Prompt" %}

```batch
# you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[pdf]"
```

{% endtab %}
{% endtabs %}

You can use the following as a sample file: [pdf-example.pdf](https://assets.analytics.glair.ai/generative/pdf/pdf-example.pdf).

## **PyMuPDF Loader**

**PyMuPDFLoader** extracts **text** and **images** (in `base64` format) from a PDF document. The **text is extracted per paragraph**, based on how the `PyMuPDF` library detects paragraphs.

{% stepper %}
{% step %}
Create a script called `main.py`:

<pre class="language-python"><code class="lang-python">from gllm_docproc.loader.pdf import PyMuPDFLoader

source = "./data/source/pdf-example.pdf"

# initialize the PyMuPDF Loader
<strong>loader = PyMuPDFLoader()
</strong>
# load source
loaded_elements = loader.load(source)
</code></pre>

{% endstep %}

{% step %}
Run the script:

```bash
python main.py
```

{% endstep %}

{% step %}
The loader will generate the following: [output JSON](https://assets.analytics.glair.ai/generative/pdf/pymupdfloader-output.json).
{% endstep %}
{% endstepper %}
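
The script above keeps the result in `loaded_elements` in memory and exits without printing or saving it. Below is a minimal sketch for writing the elements to a JSON file for inspection. It assumes `loader.load` returns a list of JSON-like dictionaries (an assumption; check the API Reference), and `save_elements` is a hypothetical helper, not part of `gllm-docproc`:

```python
import json
from pathlib import Path
from typing import Any


def save_elements(elements: list[dict[str, Any]], path: str) -> Path:
    """Write loaded elements to a UTF-8 JSON file and return the file path.

    Assumption: each element is a dictionary. Values that the json module
    cannot serialize are converted to strings via str().
    """
    out_path = Path(path)
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(elements, f, ensure_ascii=False, indent=2, default=str)
    return out_path
```

In `main.py`, this could be called as `save_elements(loaded_elements, "./data/output/pymupdfloader-output.json")` after `loader.load(source)`. The output path is illustrative.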

## **PyMuPDF Span Loader**

**PyMuPDFSpanLoader** extracts **text** and **images** (in `base64` format) from a PDF document. The **text is extracted per span**: a span is a continuous run of characters that share identical formatting.

{% stepper %}
{% step %}
Create a script called `main.py`:

<pre class="language-python"><code class="lang-python">from gllm_docproc.loader.pdf import PyMuPDFSpanLoader

source = "./data/source/pdf-example.pdf"

# initialize the PyMuPDF Span Loader
<strong>loader = PyMuPDFSpanLoader()
</strong>
# load source
loaded_elements = loader.load(source)
</code></pre>

{% endstep %}

{% step %}
Run the script:

```bash
python main.py
```

{% endstep %}

{% step %}
The loader will generate the following: [output JSON](https://assets.analytics.glair.ai/generative/pdf/pymupdfspanloader-output.json).
{% endstep %}
{% endstepper %}
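
To make the difference between paragraph-level and span-level extraction concrete, the sketch below walks a dictionary shaped like the output of PyMuPDF's `page.get_text("dict")` (blocks that contain lines that contain spans). It only illustrates the two granularities and is not the actual implementation of `PyMuPDFLoader` or `PyMuPDFSpanLoader`; the helper names and the sample data are hypothetical.

```python
from typing import Any


def block_texts(page: dict[str, Any]) -> list[str]:
    """Return one string per text block (roughly one per paragraph)."""
    texts = []
    for block in page.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        lines = [
            "".join(span["text"] for span in line["spans"])
            for line in block["lines"]
        ]
        texts.append(" ".join(lines))
    return texts


def span_texts(page: dict[str, Any]) -> list[tuple[str, str, float]]:
    """Return (text, font, size) for every span in every text block."""
    spans = []
    for block in page.get("blocks", []):
        if block.get("type") != 0:
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                spans.append((span["text"], span["font"], span["size"]))
    return spans


# Hypothetical sample: one paragraph with a bold word, plus an image block.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Hello ", "font": "Helvetica", "size": 11.0},
                        {"text": "world", "font": "Helvetica-Bold", "size": 11.0},
                    ]
                }
            ],
        },
        {"type": 1},
    ]
}

print(block_texts(sample_page))  # ['Hello world']
print(span_texts(sample_page))
```

In this sample, the change of font splits a single paragraph into two spans, which is why span-level output is finer-grained and keeps formatting boundaries that paragraph-level output merges.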
