DOCX

gllm-docproc | Tutorial: DOCX Loader | Use Case: Advanced DPO Pipeline | API Reference

The DOCX Loader extracts information from a DOCX file and converts it into a standardized JSON format.

Prerequisites

Before running this example, complete all setup steps listed on the Prerequisites page.

Installation

# Linux, macOS, or WSL (bash): you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[docx]"

# Windows PowerShell: you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[docx]"

REM Windows Command Prompt: you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[docx]"
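After installing, a quick check can confirm that the package is importable from the active environment. This is a minimal sketch using only the Python standard library; `is_installed` is a hypothetical helper name, and `gllm_docproc` is the import name used by the scripts on this page.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # gllm_docproc is the top-level import name used in the loader examples.
    print("gllm_docproc installed:", is_installed("gllm_docproc"))
```

If this prints False, the pip command most likely ran in a different environment than the one running the script.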

You can use the following as a sample file: docx-example.docx.
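Before handing a file to a loader, it can help to confirm that it really is a DOCX package. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`, so a rough preflight check might look like the sketch below. `looks_like_docx` is a hypothetical helper and not part of gllm-docproc.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path) -> bool:
    """Rough check: the file exists, is a ZIP archive, and has a main document part."""
    p = Path(path)
    if not p.is_file() or not zipfile.is_zipfile(str(p)):
        return False
    with zipfile.ZipFile(str(p)) as archive:
        # word/document.xml holds the body of a WordprocessingML document.
        return "word/document.xml" in archive.namelist()


if __name__ == "__main__":
    # Path used by the examples on this page.
    print(looks_like_docx("./data/source/docx-example.docx"))
```

This only checks the container structure. It does not validate the XML inside the package.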

Recommendation

We recommend using DOCX2Python Loader.

DOCX2Python Loader

DOCX2PythonLoader extracts text, tables, and images from a DOCX document using the docx2python library.

Create a script called main.py:

from gllm_docproc.loader.docx import DOCX2PythonLoader

source = "./data/source/docx-example.docx"

# initialize DOCX2Python Loader
loader = DOCX2PythonLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
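The script above keeps the result in memory only. To inspect it, the elements can be written to a JSON file. The sketch below assumes that `loader.load()` returns JSON-serializable data, such as a list of dictionaries. `save_elements` and the output path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def save_elements(elements, output_path) -> Path:
    """Write loaded elements to a UTF-8 JSON file, creating parent folders as needed."""
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII document text readable in the file.
        json.dump(elements, f, indent=2, ensure_ascii=False)
    return out
```

In main.py, this could be called after loading, for example `save_elements(loaded_elements, "./data/output/docx-example.json")`.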

Python DOCX Loader

PythonDOCXLoader extracts text and tables from the body of a DOCX document using the python-docx library.

Create a script called main.py:

from gllm_docproc.loader.docx import PythonDOCXLoader

source = "./data/source/docx-example.docx"

# initialize Python DOCX Loader
loader = PythonDOCXLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.

Python DOCX Table Loader

PythonDOCXTableLoader extracts tables from the body of a DOCX document using the python-docx library.

Create a script called main.py:

from gllm_docproc.loader.docx import PythonDOCXTableLoader

source = "./data/source/docx-example.docx"

# initialize Python DOCX Table Loader
loader = PythonDOCXTableLoader()

# load source
loaded_elements = loader.load(source)

Run the script:

python main.py

The loader will generate the following: output JSON.
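To see how the three loaders differ on the same file, their outputs can be compared side by side. The sketch below only relies on the `load(source)` call shown on this page and assumes that the return value supports `len()`. `count_elements` is a hypothetical helper, and the loader imports are left as comments so the block runs without the package installed.

```python
from typing import Callable, Dict


def count_elements(load_functions: Dict[str, Callable], source: str) -> Dict[str, int]:
    """Call each load function on the same source and report how many elements it returns."""
    return {name: len(load(source)) for name, load in load_functions.items()}


# Possible usage with the loaders from this page:
# from gllm_docproc.loader.docx import (
#     DOCX2PythonLoader, PythonDOCXLoader, PythonDOCXTableLoader,
# )
# counts = count_elements(
#     {
#         "DOCX2PythonLoader": DOCX2PythonLoader().load,
#         "PythonDOCXLoader": PythonDOCXLoader().load,
#         "PythonDOCXTableLoader": PythonDOCXTableLoader().load,
#     },
#     "./data/source/docx-example.docx",
# )
# print(counts)
```

Element counts are only a rough signal. Inspecting the actual output remains the reliable way to judge which loader fits a given document.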
