Structured Element Chunker
The Structured Element Chunker segments elements based on their structural information while preserving the hierarchical information held in both the element metadata and the element text. Each chunk therefore retains its context and its relationship to the overall document structure, enabling more nuanced and accurate analysis.
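To illustrate the idea of structure-aware chunking, here is a toy sketch. The element shape used below (dicts with `structure`, `level`, and `text` keys) is purely hypothetical and does not match the library's real Element model:

```python
# Illustration only: a toy sketch of structure-aware chunking.
# The element schema here is hypothetical, not the library's.

def chunk_with_structure(elements, max_chars=100):
    """Group consecutive elements into chunks, prefixing each chunk
    with the current heading path so hierarchical context survives."""
    chunks = []
    current_headings = []
    buffer = []
    size = 0

    def flush():
        nonlocal buffer, size
        if buffer:
            chunks.append({
                "context": " > ".join(current_headings),
                "text": "\n".join(buffer),
            })
            buffer, size = [], 0

    for el in elements:
        if el["structure"] == "heading":
            # A heading changes the hierarchy, so close the open chunk
            # and update the heading path at the heading's level.
            flush()
            current_headings = current_headings[: el["level"] - 1] + [el["text"]]
        else:
            if size + len(el["text"]) > max_chars:
                flush()
            buffer.append(el["text"])
            size += len(el["text"])
    flush()
    return chunks

elements = [
    {"structure": "heading", "level": 1, "text": "Intro"},
    {"structure": "paragraph", "text": "First paragraph."},
    {"structure": "heading", "level": 2, "text": "Details"},
    {"structure": "paragraph", "text": "Second paragraph."},
]
print(chunk_with_structure(elements))
```

Each chunk carries a `context` field with its heading path (for example `"Intro > Details"`), which is the kind of hierarchical information the real chunker keeps in element metadata and text.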
Installation
```bash
# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc"
```
You can use the following as a sample file: input.
1. Create a script called `main.py`:
```python
import json

from gllm_docproc.chunker.structured_element import StructuredElementChunker

# elements (input) that you want to chunk
with open('./data/source/parsed_elements.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker
chunker = StructuredElementChunker()

# chunk elements
chunked_elements = chunker.chunk(elements)
print(chunked_elements)
```
2. Run the script:

```bash
python main.py
```
3. The chunker will generate the following: output JSON.
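If the sample file is not at hand, you can write a stand-in `./data/source/parsed_elements.json` to experiment with. The element fields below are illustrative assumptions only and may not match the schema the library's loaders actually produce:

```python
import json
import os

# Hypothetical parsed elements; the real schema produced by the
# gllm_docproc loaders may differ.
elements = [
    {"text": "Introduction", "structure": "heading", "metadata": {"page_number": 1}},
    {"text": "This is the first paragraph.", "structure": "paragraph", "metadata": {"page_number": 1}},
]

os.makedirs("./data/source", exist_ok=True)
with open("./data/source/parsed_elements.json", "w") as file:
    json.dump(elements, file, indent=2)
```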
Customize Structured Element Chunker
You can customize the Structured Element Chunker like so:
```python
import json
from typing import Any

from langchain_text_splitters import RecursiveCharacterTextSplitter

from gllm_docproc.chunker.structured_element import StructuredElementChunker
from gllm_docproc.chunker.table import MARKDOWN, TableChunker
from gllm_docproc.model.element import AUDIO, FOOTER, FOOTNOTE, HEADER, IMAGE, VIDEO, Element

# elements (input) that you want to chunk
with open("./data/source/parsed_elements.json", "r") as file:
    parsed_elements = json.load(file)

# initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n#", "\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""], chunk_size=1800, chunk_overlap=360
)

# initialize table chunker
table_chunker = TableChunker(chunk_size=4000, chunk_overlap=0, table_format=MARKDOWN)

# initialize StructuredElementChunker
chunker = StructuredElementChunker(
    text_splitter=text_splitter, table_chunker=table_chunker, is_parent_structure_info_included=True
)

# structures to exclude from chunking
excluded_structures = [HEADER, FOOTER, FOOTNOTE, IMAGE, VIDEO, AUDIO]

# define a function to enrich each chunk with positional metadata
def enrich_chunk(chunk: Element, elements: list[Element]) -> Element:
    position: list[dict[str, Any]] = [
        {
            "coordinates": element.metadata.coordinates,
            "page_number": element.metadata.page_number,
        }
        for element in elements
        if hasattr(element.metadata, "coordinates") and hasattr(element.metadata, "page_number")
    ]
    if position:
        chunk.metadata.position = position
    return chunk

# chunk elements
chunked_elements = chunker.chunk(parsed_elements, excluded_structures=excluded_structures, enrich_chunk=enrich_chunk)
```
To get a better understanding of how the above code works, you can access the example input and output here.
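The `chunk_size` and `chunk_overlap` parameters used above control how long each chunk is and how much adjacent chunks share. A minimal fixed-window sketch of the overlap idea (not the recursive, separator-aware algorithm `RecursiveCharacterTextSplitter` actually uses):

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where consecutive windows
    share chunk_overlap characters. Requires chunk_overlap < chunk_size."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

print(sliding_chunks("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Overlap trades storage for context: with `chunk_size=1800` and `chunk_overlap=360` as configured above, each chunk repeats roughly the last 20% of its predecessor so sentences cut at a boundary remain intact in at least one chunk.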