Chunk Processor
What’s a Chunk Processor?
Installation
# Linux/macOS: you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-retrieval"
REM Windows (Command Prompt): you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-retrieval"
Deduplicating Chunks
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import DedupeChunkProcessor

chunks = [
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-2", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
    Chunk(id="chunk-3", content="Bangkok, Thailand", metadata={"source": "source-3"}),
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),  # Duplicate id with chunk-1
    Chunk(id="chunk-4", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),  # Duplicate content with chunk-2
]

processor = DedupeChunkProcessor()
result = asyncio.run(processor.process_chunks(chunks))
print(result)
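
To make the idea behind the example concrete, here is a minimal, library-free sketch of one way deduplication by id and by content could work. It is an illustration under stated assumptions, not the actual implementation of `DedupeChunkProcessor`: the `SimpleChunk` class and the `dedupe_chunks` function are hypothetical names introduced here, and the "keep the first occurrence" rule is an assumption that the source does not confirm.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for gllm_core.schema.Chunk (not the real class)."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedupe_chunks(chunks: list[SimpleChunk]) -> list[SimpleChunk]:
    """Drop any chunk whose id or content was already seen, preserving input order.

    Assumption: the first occurrence wins. The real processor may behave differently.
    """
    seen_ids: set[str] = set()
    seen_contents: set[str] = set()
    unique: list[SimpleChunk] = []
    for chunk in chunks:
        if chunk.id in seen_ids or chunk.content in seen_contents:
            continue  # duplicate by id or by content
        seen_ids.add(chunk.id)
        seen_contents.add(chunk.content)
        unique.append(chunk)
    return unique


sample = [
    SimpleChunk("chunk-1", "Jakarta, Indonesia", {"source": "source-1"}),
    SimpleChunk("chunk-2", "Kuala Lumpur, Malaysia", {"source": "source-2"}),
    SimpleChunk("chunk-3", "Bangkok, Thailand", {"source": "source-3"}),
    SimpleChunk("chunk-1", "Jakarta, Indonesia", {"source": "source-1"}),
    SimpleChunk("chunk-4", "Kuala Lumpur, Malaysia", {"source": "source-2"}),
]

deduped = dedupe_chunks(sample)
print([chunk.id for chunk in deduped])  # ['chunk-1', 'chunk-2', 'chunk-3']
```

In this sketch the second `chunk-1` is dropped because its id was already seen, and `chunk-4` is dropped because its content matches `chunk-2`. Tracking ids and contents in separate sets keeps each lookup constant-time, so the whole pass is linear in the number of chunks.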