Vector DB

gllm-docproc | Tutorial: Vector DB Indexer | Use Case: Advanced DPO Pipeline | API Reference

The Vector DB Indexer is a component that indexes parsed document elements into a vector database through a vector capability implementation, for use in Retrieval-Augmented Generation (RAG) applications.

Prerequisites

Complete all setup steps listed on the Prerequisites page before running this example.

You should be familiar with these concepts and components:

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-docproc

You can use the following as a sample file: structuredelementchunker-output.json.
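Before indexing, it can help to confirm that the sample file loads and looks the way you expect. The sketch below is not part of gllm-docproc. It assumes the file is a JSON array of objects (this page does not specify the element structure), and the helper names `load_elements` and `summarize_elements` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_elements(file_path):
    """Load parsed elements from a JSON file."""
    with Path(file_path).open("r", encoding="utf-8") as f:
        return json.load(f)


def summarize_elements(elements):
    """Return the element count and how often each top-level key appears.

    Assumption: elements is a JSON array; non-dict entries are counted
    but contribute no keys.
    """
    if not isinstance(elements, list):
        raise TypeError(
            f"expected a JSON array of elements, got {type(elements).__name__}"
        )
    key_counts = Counter()
    for element in elements:
        if isinstance(element, dict):
            key_counts.update(element.keys())
    return len(elements), dict(key_counts)


if __name__ == "__main__":
    elements = load_elements("./structuredelementchunker-output.json")
    count, keys = summarize_elements(elements)
    print(f"{count} elements; keys seen: {keys}")
```

If the summary shows zero elements or raises a TypeError, the file is probably not the chunker output this example expects.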

1. Create a script called main.py:

import json

from gllm_docproc.indexer.vector.vector_db_indexer import VectorDBIndexer

# Read elements from JSON file
file_path = "./structuredelementchunker-output.json"

with open(file_path, "r", encoding="utf-8") as f:
    elements = json.load(f)

indexer = VectorDBIndexer()

# Index the elements with required configuration
result = indexer.index(
    elements=elements,
    file_id="file_001",
    vectorizer_kwargs={
        "model": "openai/text-embedding-3-small",  # Format: "provider/model_name"
        "api_key": "<OPENAI_API_KEY>",
    },
    db_engine="elasticsearch",  # Supported: "chroma", "elasticsearch", "opensearch"
    db_config={
        "url": "http://localhost:9200",
        "index_name": "my_index",
    },
)

2. Run the script:

python main.py
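The script above passes the API key as a literal placeholder. One possible way to avoid hard-coding it is to build the `vectorizer_kwargs` dictionary from an environment variable. The sketch below uses only the standard library. The helper name `build_vectorizer_kwargs` and the `OPENAI_API_KEY` variable name are assumptions for illustration, not part of gllm-docproc.

```python
import os


def build_vectorizer_kwargs(model, env_var="OPENAI_API_KEY"):
    """Build a vectorizer_kwargs dict, reading the API key from the environment.

    Hypothetical helper: it only checks that model follows the
    "provider/model_name" format and that the key is set.
    """
    provider, sep, model_name = model.partition("/")
    if not sep or not provider or not model_name:
        raise ValueError(
            f"model must look like 'provider/model_name', got {model!r}"
        )
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return {"model": model, "api_key": api_key}
```

In main.py, the result could then be passed as `vectorizer_kwargs=build_vectorizer_kwargs("openai/text-embedding-3-small")`, so a missing key or a malformed model string fails before any indexing starts.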

Vector Store Support: The Vector DB Indexer works with any implementation of VectorCapability, including Elasticsearch, ChromaDB, Redis, and In-Memory stores. See Supported Vector Data Store for a complete list.
