DOCX


The DOCX Parser parses the text structure of DOCX documents. It maps the elements loaded by the DOCX Loader to structures such as header, title, footer, heading, and paragraph, based on their style names.
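To make the idea of style-based mapping concrete, here is a minimal sketch of how style names could be translated into structure labels. It is not the gllm-docproc implementation: the function names, the `style_name` and `structure` keys, and the matching rules are all assumptions made for illustration. It relies only on the built-in English Word style names ("Title", "Header", "Footer", "Heading 1", "Normal").

```python
# Illustrative only: a hypothetical style-name-to-structure mapping,
# NOT the actual logic used by gllm-docproc's DOCXParser.
STYLE_TO_STRUCTURE = {
    "title": "title",
    "header": "header",
    "footer": "footer",
}


def classify_style(style_name: str) -> str:
    """Return a structure label for a DOCX style name (assumed rules)."""
    normalized = style_name.strip().lower()
    if normalized in STYLE_TO_STRUCTURE:
        return STYLE_TO_STRUCTURE[normalized]
    # Built-in Word heading styles are named "Heading 1", "Heading 2", ...
    if normalized.startswith("heading"):
        return "heading"
    # Anything unrecognized is treated as body text.
    return "paragraph"


def assign_structures(elements):
    """Return copies of the elements with a 'structure' key added.

    The 'style_name' and 'structure' keys are hypothetical field names.
    """
    return [
        {**element, "structure": classify_style(element.get("style_name", ""))}
        for element in elements
    ]
```

For example, under these assumed rules `classify_style("Heading 2")` yields `"heading"`, while a body style such as `"Normal"` falls back to `"paragraph"`.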

This page provides a guide to using the DOCX Parser in the Document Processing Orchestrator (DPO).

Prerequisites

Before running this example, complete all setup steps listed on the Prerequisites page.

Installation

# Linux/macOS (bash): you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[docx]"

# Windows PowerShell: you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[docx]"

REM Windows Command Prompt: you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[docx]"

You can use the following as a sample file: loaded_elements.json.

Create a script called main.py:

import json

from gllm_docproc.parser.document import DOCXParser

# load the elements (input) that you want to parse
with open('./data/source/loaded_elements.json', 'r') as file:
    loaded_elements = json.load(file)

# initialize the DOCX Parser
parser = DOCXParser()

# parse loaded elements
parsed_elements = parser.parse(loaded_elements)

Run the script:

python main.py

The parser will generate the following: output JSON.
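The script above keeps the parsed elements in memory only. If you want to keep the result on disk or get a quick overview of it, something like the following sketch could be used. It assumes the parser returns a JSON-serializable list of dictionaries and that each element carries a `structure` key. Both are assumptions, and the helper names `save_elements` and `count_by_structure` are hypothetical, not part of gllm-docproc. Stand-in data is used so that the sketch runs on its own.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_elements(elements, output_path):
    """Write a list of element dicts to a JSON file and return its path."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as file:
        json.dump(elements, file, indent=2, ensure_ascii=False)
    return path


def count_by_structure(elements, key="structure"):
    """Count elements per structure label ('structure' is an assumed key)."""
    return Counter(element.get(key, "unknown") for element in elements)


# Stand-in for the list returned by parser.parse(loaded_elements);
# the field names here are hypothetical.
sample_elements = [
    {"text": "Quarterly Report", "structure": "title"},
    {"text": "1. Overview", "structure": "heading"},
    {"text": "Revenue grew in Q3.", "structure": "paragraph"},
    {"text": "Costs were flat.", "structure": "paragraph"},
]

# Write to a temporary directory and read the file back to check the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_elements(
        sample_elements, Path(tmp_dir) / "output" / "parsed_elements.json"
    )
    reloaded = json.loads(saved_path.read_text(encoding="utf-8"))

counts = count_by_structure(sample_elements)
print(dict(counts))
```

In main.py, the same helper could be called as `save_elements(parsed_elements, "./data/output/parsed_elements.json")`, where the output path is only an example.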


