Ingestion Pipeline: Index to Vector Database


Overview

Now that you have successfully read a PDF, the next step is to store its contents in a vector database so that you can later retrieve from it. This guide assumes you already have an Elasticsearch instance up and running.

Prerequisites

This example specifically requires:

  • Completion of all setup steps listed on the Prerequisites page.

  • A working Elasticsearch instance and credentials to write to it.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[pdf]"

You can use the following as a sample file: pdf-example.pdf.

Ingestion Pipeline: Index to Vector Database

1

Continuing from the previous section, add the following to main.py:

from gllm_docproc.loader.pdf import PDFPlumberLoader, PyMuPDFLoader
from gllm_docproc.loader.pipeline_loader import PipelineLoader

source = "pdf-example.pdf"

# load source: chain multiple loaders in a single pipeline
pipeline_loader = PipelineLoader()
pipeline_loader.add_loader(PyMuPDFLoader())
pipeline_loader.add_loader(PDFPlumberLoader())

loaded_elements = pipeline_loader.load(source)

# parse
from gllm_docproc.parser.document import PDFParser
from gllm_docproc.parser.table import TableCaptionParser
from gllm_docproc.parser.pipeline_parser import PipelineParser

pipeline = PipelineParser()
pipeline.add_parser(PDFParser())
pipeline.add_parser(TableCaptionParser())

parsed_elements = pipeline.parse(loaded_elements)

# chunk
from gllm_docproc.chunker.structured_element import StructuredElementChunker
chunker = StructuredElementChunker()
chunked_elements = chunker.chunk(parsed_elements)

# index to vector database
from gllm_docproc.indexer.vector.vector_db_indexer import VectorDBIndexer
indexer = VectorDBIndexer()

result = indexer.index(
    elements=chunked_elements,
    file_id="file_001",
    vectorizer_kwargs={
        "model": "openai/text-embedding-3-small",  # Format: "provider/model_name"
        "api_key": "<OPENAI_API_KEY>",
    },
    db_engine="elasticsearch",  # Supported: "chroma", "elasticsearch", "opensearch"
    db_config={
        "url": "http://localhost:9200", # change to your Elasticsearch URL
        "index_name": "my_index", # change to your index name
    },
)
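To build intuition for the chunking step above, the sketch below shows the basic idea of packing consecutive text elements under a size budget. This is illustrative only: it is not the StructuredElementChunker implementation (which also respects document structure such as headings and tables), and the character budget is an arbitrary assumption.

```python
def chunk_elements(elements, max_chars=200):
    """Group consecutive text elements into chunks of at most max_chars.

    Illustrative sketch only; the real StructuredElementChunker is
    structure-aware and its internals are library-specific.
    """
    chunks, current = [], ""
    for text in elements:
        # Start a new chunk if adding this element would exceed the budget.
        if current and len(current) + len(text) + 1 > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunks bounded in size matters because each chunk is embedded as a single vector, and overly long chunks dilute the embedding.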
2

Run the script:

python main.py
3

The pipeline loads the PDF, parses it, chunks it, and finally indexes it into the vector database (Elasticsearch). You can then perform retrieval against Elasticsearch.
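For the retrieval step, Elasticsearch supports approximate kNN search over dense vector fields. Below is a minimal sketch of building such a query body; the field names `embedding`, `text`, and `file_id` are assumptions, so check the mapping your indexer actually created.

```python
def build_knn_query(query_vector, k=5, vector_field="embedding"):
    """Build an Elasticsearch kNN search body.

    Field names here are assumptions; inspect your index mapping
    for the actual field names produced by the indexer.
    """
    return {
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,  # must come from the same embedding model used at index time
            "k": k,
            "num_candidates": max(k * 10, 100),  # candidates per shard before ranking
        },
        "_source": ["text", "file_id"],  # return only the text and its source file
    }
```

You would pass this body to a `search()` call on the official `elasticsearch` Python client, after embedding your query text with the same model used during indexing (here, `openai/text-embedding-3-small`).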


Check out gllm-docproc for more reusable building blocks for implementing ingestion pipelines.
