Structured Element Chunker

The Structured Element Chunker segments elements according to their structural information while preserving the hierarchy in both the element metadata and the element text. Each chunked element therefore retains its context and its relationship to the overall document structure, which supports more accurate downstream analysis.

Prerequisites

To try the code snippets on this page:

  • Complete all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc"

You can use the following as a sample file: input.

1. Create a script called main.py:

import json
from gllm_docproc.chunker.structured_element import StructuredElementChunker

# elements (input) that you want to Chunk
with open('./data/source/parsed_elements.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker
chunker = StructuredElementChunker()

# chunk elements
chunked_elements = chunker.chunk(elements)
print(chunked_elements)

2. Run the script:

python main.py

3. The chunker will generate the following: output JSON.

Smart Bypass Feature

When enabled, the Smart Bypass Feature automatically adapts the Structured Element Chunker's strategy based on document character length to prevent over-fragmentation and optimize processing. Depending on the document's total character count:

  1. Fewer than small_document_threshold_chars → Bypasses chunking and returns the document as a single chunk.

  2. Between small_document_threshold_chars and medium_document_threshold_chars → Applies heading-level chunking.

  3. More than medium_document_threshold_chars → Applies standard chunking.

import json
from gllm_docproc.chunker.structured_element import StructuredElementChunker

# elements (input) that you want to Chunk
with open('./source/structuredelementchunker-with-smartbypass-input.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker with Smart Bypass feature
chunker = StructuredElementChunker(
    enable_smart_bypass=True,
    small_document_threshold_chars=500,
    medium_document_threshold_chars=25000
)

# chunk elements
chunked_elements = chunker.chunk(elements, excluded_structures=[])
print(chunked_elements)

In the example above, the input JSON qualifies as a medium-sized document. As a result, the output JSON shows the elements split at the heading level.
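The three size tiers can be illustrated with a short standalone sketch. This is not the library's implementation: the function names, the strategy labels, the "text" key used to count characters, and the handling of values exactly equal to a threshold (placed in the middle tier here) are all assumptions made for illustration.

```python
from typing import Any


def total_characters(elements: list[dict[str, Any]]) -> int:
    """Sum the text length of all elements.

    Assumption: each element stores its text under a "text" key.
    Elements without that key count as zero characters.
    """
    return sum(len(str(element.get("text", ""))) for element in elements)


def select_strategy(total_chars: int, small_threshold: int, medium_threshold: int) -> str:
    """Pick a chunking tier from the document's total character count.

    Assumption: a count equal to either threshold falls into the
    heading-level tier; the real boundary behavior may differ.
    """
    if total_chars < small_threshold:
        return "bypass"  # return the document as a single chunk
    if total_chars <= medium_threshold:
        return "heading_level"  # split at headings only
    return "standard"  # regular structured chunking


# Thresholds taken from the example above.
sample = [{"text": "Introduction"}, {"text": "x" * 1000}]
print(select_strategy(total_characters(sample), 500, 25000))  # prints "heading_level"
```

With the thresholds from the example, the 1,012-character sample lands in the middle tier, which matches the heading-level split described for the medium-sized input.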

Customize Structured Element Chunker

You can customize Structured Element Chunker like so:

To better understand how the above code works, see the example input and output.
