Structured Element Chunker
Installation
Linux/macOS:
# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc"

Windows PowerShell:
# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc"

Windows Command Prompt:
REM you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc"
import json

from gllm_docproc.chunker.structured_element import StructuredElementChunker

# load the parsed elements (input) that you want to chunk
with open('./data/source/parsed_elements.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker
chunker = StructuredElementChunker()

# chunk elements
chunked_elements = chunker.chunk(elements)
print(chunked_elements)
python main.py
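Printing the result is enough for a quick check, but you may want to keep the chunks for a later pipeline step. The helper below is a minimal sketch of one way to do that. `save_chunks` is a hypothetical name, not part of `gllm-docproc`, and it assumes the value returned by `chunker.chunk(...)` is JSON-serializable (for example, a list of dictionaries like the input loaded with `json.load`).

```python
import json
from pathlib import Path


def save_chunks(chunks, output_path):
    """Write chunked elements to a JSON file and return the path written.

    Hypothetical helper, not part of gllm-docproc. Assumes `chunks` is
    JSON-serializable (e.g. a list of dicts).
    """
    path = Path(output_path)
    # Create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as file:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(chunks, file, ensure_ascii=False, indent=2)
    return path
```

In the script above, you could call `save_chunks(chunked_elements, "./data/output/chunked_elements.json")` after `chunker.chunk(elements)`. The output path here is only an example.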
Smart Bypass Feature
import json

from gllm_docproc.chunker.structured_element import StructuredElementChunker

# load the parsed elements (input) that you want to chunk
with open('./source/structuredelementchunker-with-smartbypass-input.json', 'r') as file:
    elements = json.load(file)

# initialize StructuredElementChunker with the Smart Bypass feature
chunker = StructuredElementChunker(
    enable_smart_bypass=True,
    small_document_threshold_chars=500,
    medium_document_threshold_chars=25000
)

# chunk elements
chunked_elements = chunker.chunk(elements, excluded_structures=[])
print(chunked_elements)
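To get a feel for which size tier a document would fall into with the thresholds above, a rough estimate can be computed before chunking. The sketch below is illustrative only and is not the library's implementation. It assumes each element is a dictionary with a `text` key, that document size is the total character count of those texts, and that thresholds are inclusive upper bounds. None of these assumptions is confirmed by this page, and the function names are hypothetical.

```python
def total_characters(elements, text_key="text"):
    """Sum the length of the text field across all elements.

    Assumption: each element is a dict and its text lives under `text_key`.
    Elements without that key count as zero characters.
    """
    return sum(len(str(element.get(text_key, ""))) for element in elements)


def classify_document(elements, small_threshold=500, medium_threshold=25000):
    """Return 'small', 'medium', or 'large' based on total character count.

    Assumption: thresholds are inclusive upper bounds. The real boundary
    handling in StructuredElementChunker may differ.
    """
    size = total_characters(elements)
    if size <= small_threshold:
        return "small"
    if size <= medium_threshold:
        return "medium"
    return "large"
```

Calling `classify_document(elements)` on the loaded input gives a quick way to pick sensible values for `small_document_threshold_chars` and `medium_document_threshold_chars` for your own corpus.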