Index Your Data with Vector Data Store

This guide walks you through setting up a vector data store and indexing your local data into it.

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

You should be familiar with these concepts and components:


Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-datastore

You can either:

  1. Refer to the guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options work: choose based on whether you prefer speed or learning by doing!

Initialize Vector Data Store

When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

You can safely ignore this, since we do not use Chroma's telemetry feature. Your data store will still work.

First, we need to set up a vector data store. In this example, we will use an in-memory Chroma Vector Data Store. To initialize it, we need two components: an EM Invoker, which converts text into embedding vectors, and the Vector Data Store itself.
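The snippet below is a minimal sketch of that initialization. The import paths, class names, collection name, and model name are assumptions for illustration; check the SDK reference for the exact API.

# Minimal sketch: import paths and class names below are assumptions.
from gllm_inference.em_invoker import OpenAIEMInvoker  # hypothetical import path
from gllm_datastore.vector_data_store import ChromaVectorDataStore  # hypothetical import path

# The EM Invoker converts text into embedding vectors.
em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")

# An in-memory Chroma data store; nothing is persisted to disk.
data_store = ChromaVectorDataStore(
    collection_name="animals",
    em_invoker=em_invoker,
)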

Option 1: Directly from a Chunk

All data stores support storing data in a structured format using the Chunk schema. Think of chunks as standardized containers for your data: they provide a consistent way to represent information across different storage types, making it easy to switch between data stores or combine them in your application.

After that, we can simply use the add_chunks() method provided by the Vector Data Store.

To load the data, you can run the script below:
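The full script lives in the GitHub repository; the version below is a sketch that continues the assumptions from the initialization snippet (Chunk's import path and its content/metadata fields are illustrative).

# Sketch: indexing a handful of Chunk objects directly.
from gllm_datastore.schema import Chunk  # hypothetical import path

# A few chunks with content plus optional metadata.
chunks = [
    Chunk(content="Cats sleep for up to 16 hours a day.", metadata={"animal": "cat"}),
    Chunk(content="Octopuses have three hearts.", metadata={"animal": "octopus"}),
]

# add_chunks() embeds each chunk's content via the EM Invoker and stores it.
data_store.add_chunks(chunks)  # await this call if your SDK version exposes it as a coroutine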

Option 2: Loading Data from CSV Files

For real-world applications, you'll often need to load data from structured files like CSV. Suppose your project has a structure like the following (the file names here are placeholders for illustration):
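your-project/
├── load_csv.py
└── data/
    └── animals.csv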

To load the data, you can run the script below:
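A sketch of such a script, under the same assumed class names and Chunk fields as before (the persist_directory parameter and the CSV column names are also illustrative):

# Sketch: loading CSV rows into a persistent Chroma data store.
import csv

from gllm_inference.em_invoker import OpenAIEMInvoker  # hypothetical import path
from gllm_datastore.vector_data_store import ChromaVectorDataStore  # hypothetical import path
from gllm_datastore.schema import Chunk  # hypothetical import path

em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")

# client_type="persistent" saves the data to disk instead of keeping it in memory.
data_store = ChromaVectorDataStore(
    collection_name="animals",
    em_invoker=em_invoker,
    client_type="persistent",
    persist_directory="./chroma_db",  # hypothetical parameter
)

# Convert each CSV row into a Chunk, keeping the animal name in the metadata.
with open("data/animals.csv", newline="", encoding="utf-8") as f:
    chunks = [
        Chunk(content=row["fact"], metadata={"animal": row["name"]})
        for row in csv.DictReader(f)
    ]

# Batch-load all rows in a single call.
data_store.add_chunks(chunks)  # await this call if it is a coroutine in your SDK
print(f"Indexed {len(chunks)} chunks.")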

Key features of this approach:

  • Persistent Storage: Uses client_type="persistent" to save data to disk

  • Metadata Support: Stores additional information (like animal names) in chunk metadata

  • Batch Loading: Efficiently loads all CSV rows at once

  • Structured Data: Converts CSV rows into standardized Chunk objects

After running this script, you'll see a SQLite database (typically a chroma.sqlite3 file) created in your project directory.

CSV File Format Example:
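Assuming the hypothetical data/animals.csv used above, the file might look like this:

name,fact
cat,Cats sleep for up to 16 hours a day.
octopus,Octopuses have three hearts.
penguin,Penguins can drink salt water.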

Querying Data

To query data using semantic search, we use the query() method, which returns a list[Chunk].
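A sketch, continuing the example above (the top_k parameter name is an assumption):

# Semantic search over the indexed chunks.
results = data_store.query("Which animal sleeps the most?", top_k=3)  # top_k is an assumed parameter name

for chunk in results:
    print(chunk.content)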

When querying data loaded from CSV, you can access both content and metadata:
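For instance, under the same assumed Chunk fields:

results = data_store.query("animals with unusual anatomy", top_k=3)

for chunk in results:
    # content and metadata assume the Chunk schema sketched earlier.
    print(f"{chunk.metadata['animal']}: {chunk.content}")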

📂 Complete Guide Files

For the complete code, please visit our GitHub Cookbook Repository.
