XLSX

gllm-docproc | Tutorial: XLSX Parser | Use Case: Advanced DPO Pipeline | API Reference

The XLSX Parser parses table structures in XLSX documents. It maps the elements produced by the XLSX Loader into structures such as tables, converts raw table data to Markdown, and uses sheet names as table captions.
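
To make the table-to-Markdown idea concrete, here is a minimal sketch of how rows from one sheet could be rendered as a Markdown table with the sheet name as a caption. It is an illustration only, not the XLSXParser implementation: the function name `rows_to_markdown`, the caption layout, and the sample data are all assumptions.

```python
def rows_to_markdown(rows, caption=None):
    """Render rows (first row is the header) as a Markdown table.

    Hypothetical helper for illustration; not part of gllm-docproc.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]

    def fmt(row):
        # Empty cells become empty strings; pipes are escaped so they
        # do not break the table layout.
        cells = ["" if c is None else str(c).replace("|", "\\|") for c in row]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(padded[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in padded[1:]]
    table = "\n".join(lines)
    # Assumed layout: the caption (sheet name) sits above the table.
    return f"{caption}\n\n{table}" if caption else table


sheet_name = "Sales"  # hypothetical sheet name
rows = [["Region", "Revenue"], ["North", 1200], ["South", None]]
markdown = rows_to_markdown(rows, caption=sheet_name)
print(markdown)
```

The actual parser output may differ in element structure and caption placement; see the output JSON in the steps that follow for the real format.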

This page is a guide to using the XLSX Parser in the Document Processing Orchestrator (DPO).

Prerequisites

This example specifically requires completion of all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[xlsx]"

You can use the following as a sample file: loaded_elements.json.

1. Create a script called main.py:

import json

from gllm_docproc.parser.document import XLSXParser

# load the elements produced by the XLSX Loader (the parser's input)
with open('./data/source/loaded_elements.json', 'r') as file:
    loaded_elements = json.load(file)

# initialize the XLSX Parser
parser = XLSXParser()

# parse loaded elements
parsed_elements = parser.parse(loaded_elements)

2. Run the script:

python main.py

3. The parser will generate the following: output JSON.
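
The script in main.py keeps the parsed elements in memory only. If you want to inspect them on disk, a minimal sketch along these lines may help. It assumes the parsed elements are a JSON-serializable list of dictionaries, which the source does not confirm; the helper name `save_elements`, the element fields, and the output location are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_elements(elements, path):
    """Write elements to a JSON file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as file:
        json.dump(elements, file, indent=2, ensure_ascii=False)
    return path


# Stand-in for the result of parser.parse(loaded_elements).
# The field names here are hypothetical, not the library's schema.
parsed_elements = [
    {"text": "| A | B |\n| --- | --- |\n| 1 | 2 |", "structure": "table"}
]

# A temporary folder is used so the sketch runs anywhere;
# swap in your own output path.
output_dir = Path(tempfile.mkdtemp())
out_path = save_elements(parsed_elements, output_dir / "parsed_elements.json")
print(f"Wrote {len(parsed_elements)} element(s) to {out_path}")
```

In main.py, you would pass the real `parsed_elements` to the helper instead of the stand-in list.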
