Structured Element Chunker

The Structured Element Chunker segments elements according to their structural information while preserving the document hierarchy in both the metadata and the text of each element. Each chunk therefore retains its context and its relationship to the overall document structure, which supports more accurate downstream analysis.

Prerequisites

To try the code snippets on this page, first complete all setup steps listed on the Prerequisites page.

Installation

# Linux, macOS, or Windows WSL (you can use a Conda environment)
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc"

# Windows PowerShell (you can use a Conda environment)
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc"

REM Windows Command Prompt (you can use a Conda environment)
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc"

You can use the following as a sample file: input.

Create a script called main.py:

import json
from gllm_docproc.chunker.structured_element import StructuredElementChunker

# load the elements (input) that you want to chunk
with open('./data/source/parsed_elements.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker
chunker = StructuredElementChunker()

# chunk elements
chunked_elements = chunker.chunk(elements)
print(chunked_elements)

Run the script:

python main.py

The chunker will generate the following: output JSON.

Smart Bypass Feature

When enabled, the Smart Bypass feature selects the Structured Element Chunker's strategy from the document's total character count, so that short documents are not over-fragmented and unnecessary processing is skipped. The strategy depends on the count as follows:

< small_document_threshold_chars → Bypasses processing and returns a single chunk.
Between small_document_threshold_chars and medium_document_threshold_chars → Applies heading-level chunking.
> medium_document_threshold_chars → Applies standard chunking.

import json
from gllm_docproc.chunker.structured_element import StructuredElementChunker

# load the elements (input) that you want to chunk
with open('./source/structuredelementchunker-with-smartbypass-input.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker with Smart Bypass feature
chunker = StructuredElementChunker(
    enable_smart_bypass=True,
    small_document_threshold_chars=500,
    medium_document_threshold_chars=25000
)

# chunk elements
chunked_elements = chunker.chunk(elements, excluded_structures=[])
print(chunked_elements)

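In the example above, the input JSON qualifies as a medium-sized document: its total character count falls between the configured thresholds of 500 and 25,000 characters. As a result, the output JSON shows that the elements are split at the heading level.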
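
The threshold rules described in this section can be sketched as a small selection function. This is only an illustration, not the library's implementation: the function name select_strategy and the strategy labels are hypothetical, and treating a count that equals either threshold as medium (reading the "<" and ">" bounds literally) is an assumption.

```python
def select_strategy(total_chars: int, small_threshold: int, medium_threshold: int) -> str:
    """Pick a chunking strategy from a document's total character count.

    Hypothetical helper for illustration; the labels returned here are not
    names used by gllm-docproc.
    """
    if small_threshold > medium_threshold:
        raise ValueError("small_threshold must not exceed medium_threshold")
    if total_chars < small_threshold:
        # Small document: skip chunking and keep everything in one chunk.
        return "single_chunk"
    if total_chars <= medium_threshold:
        # Medium document (assumed to include both boundary values).
        return "heading_level"
    # Large document: fall through to the standard strategy.
    return "standard"


if __name__ == "__main__":
    # Thresholds taken from the Smart Bypass example on this page.
    for count in (120, 4_000, 80_000):
        print(count, select_strategy(count, 500, 25_000))
```

With the thresholds from the example, a 4,000-character document lands in the heading-level branch, which matches the behavior described for the medium-sized input.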

Customize Structured Element Chunker

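You can customize the Structured Element Chunker by supplying your own text splitter, table chunker, list of excluded structures, and chunk-enrichment function: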

import json
from typing import Any

from langchain_text_splitters import RecursiveCharacterTextSplitter

from gllm_docproc.chunker.structured_element import StructuredElementChunker
from gllm_docproc.chunker.table import MARKDOWN, TableChunker
from gllm_docproc.model.element import AUDIO, FOOTER, FOOTNOTE, HEADER, IMAGE, VIDEO, Element

# load the elements (input) that you want to chunk
with open("./data/source/parsed_elements.json", "r") as file:
    parsed_elements = json.load(file)

# initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n#", "\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""], chunk_size=1800, chunk_overlap=360
)

# initialize table chunker
table_chunker = TableChunker(chunk_size=4000, chunk_overlap=0, table_format=MARKDOWN)

# initialize StructuredElementChunker
chunker = StructuredElementChunker(
    text_splitter=text_splitter, table_chunker=table_chunker, is_parent_structure_info_included=True
)

# initialize excluded structures
excluded_structures = [HEADER, FOOTER, FOOTNOTE, IMAGE, VIDEO, AUDIO]


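# define the enrich chunk function: it records the coordinates and page number
# of every source element that has both on the chunk's metadata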
def enrich_chunk(chunk: Element, elements: list[Element]) -> Element:
    position: list[dict[str, Any]] = [
        {
            "coordinates": element.metadata.coordinates,
            "page_number": element.metadata.page_number,
        }
        for element in elements
        if hasattr(element.metadata, "coordinates") and hasattr(element.metadata, "page_number")
    ]
    if position:
        chunk.metadata.position = position
    return chunk


# chunk elements
chunked_elements = chunker.chunk(parsed_elements, excluded_structures=excluded_structures, enrich_chunk=enrich_chunk)

For a better understanding of how the code above works, see the example input and output.
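
To see what enrich_chunk attaches to a chunk without installing anything, the sketch below runs the same logic on stand-in objects. It does not use gllm_docproc: SimpleNamespace objects take the place of Element, the make_element helper is hypothetical, and the coordinates and page numbers are invented for illustration.

```python
from types import SimpleNamespace
from typing import Any


def enrich_chunk(chunk: Any, elements: list[Any]) -> Any:
    # Same logic as the enrich_chunk function above, typed loosely so that
    # stand-in objects can be passed instead of Element instances.
    position = [
        {
            "coordinates": element.metadata.coordinates,
            "page_number": element.metadata.page_number,
        }
        for element in elements
        if hasattr(element.metadata, "coordinates") and hasattr(element.metadata, "page_number")
    ]
    if position:
        chunk.metadata.position = position
    return chunk


def make_element(text: str, **metadata: Any) -> SimpleNamespace:
    # Hypothetical stand-in for an Element: a text plus a metadata namespace.
    return SimpleNamespace(text=text, metadata=SimpleNamespace(**metadata))


# Made-up source elements; the second one has no layout information.
source_elements = [
    make_element("Introduction", coordinates=[72, 90, 300, 110], page_number=1),
    make_element("No layout info"),
    make_element("Body text", coordinates=[72, 120, 540, 400], page_number=2),
]

chunk = make_element("Introduction\nNo layout info\nBody text")
enriched = enrich_chunk(chunk, source_elements)
print(enriched.metadata.position)
```

Elements that lack either attribute are skipped, and a chunk whose source elements carry no position data is returned without a position field.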
