Index Your Data with Vector Data Store

This guide walks you through setting up a vector data store and indexing your local data into it.

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

You should be familiar with these concepts and components:


Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" gllm-inference gllm-datastore

You can either:

  1. Refer to the guide whenever you need an explanation or want to clarify how each part works.

  2. Follow along with each step to recreate the files yourself while learning about the components and how to integrate them.

Both options work: choose based on whether you prefer speed or learning by doing!

Initialize Vector Data Store

When running the pipeline, you may encounter an error like this:

[2025-08-26T14:36:10+0700.550 chromadb.telemetry.product.posthog ERROR] Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

You can safely ignore this, since we do not use Chroma's telemetry feature. Your data store will still work.

First, we need to set up a vector data store. In this example, we will use an in-memory Chroma Vector Data Store. To initialize it, we need two components: an EM Invoker, which converts text into embedding vectors, and the Vector Data Store itself.
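The snippet below is a minimal sketch of that initialization. The import paths, class names, collection name, and model name are assumptions for illustration; check the SDK reference for the exact API.

# Minimal sketch: import paths and class names below are assumptions.
from gllm_inference.em_invoker import OpenAIEMInvoker  # hypothetical import path
from gllm_datastore.vector_data_store import ChromaVectorDataStore  # hypothetical import path

# The EM Invoker converts text into embedding vectors.
em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")

# An in-memory Chroma data store; nothing is persisted to disk.
data_store = ChromaVectorDataStore(
    collection_name="animals",
    em_invoker=em_invoker,
)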

Option 1: Directly from a Chunk

All data stores support storing data in a structured format using the Chunk schema. Think of chunks as standardized containers for your data: they provide a consistent way to represent information across different storage types, making it easy to switch between data stores or combine them in your application.

After that, we can simply use the add_chunks() method provided by the Vector Data Store.

To load the data, you can run the script below:
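The full script lives in the GitHub repository; the version below is a sketch that continues the assumptions from the initialization snippet (Chunk's import path and its content/metadata fields are illustrative).

# Sketch: indexing a handful of Chunk objects directly.
from gllm_datastore.schema import Chunk  # hypothetical import path

# A few chunks with content plus optional metadata.
chunks = [
    Chunk(content="Cats sleep for up to 16 hours a day.", metadata={"animal": "cat"}),
    Chunk(content="Octopuses have three hearts.", metadata={"animal": "octopus"}),
]

# add_chunks() embeds each chunk's content via the EM Invoker and stores it.
data_store.add_chunks(chunks)  # await this call if your SDK version exposes it as a coroutine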

Option 2: Loading Data from CSV Files

For real-world applications, you'll often need to load data from structured files like CSV. Suppose your project has a structure like the following (the file names here are placeholders for illustration):
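your-project/
├── load_csv.py
└── data/
    └── animals.csv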

To load the data, you can run the script below:
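A sketch of such a script, under the same assumed class names and Chunk fields as before (the persist_directory parameter and the CSV column names are also illustrative):

# Sketch: loading CSV rows into a persistent Chroma data store.
import csv

from gllm_inference.em_invoker import OpenAIEMInvoker  # hypothetical import path
from gllm_datastore.vector_data_store import ChromaVectorDataStore  # hypothetical import path
from gllm_datastore.schema import Chunk  # hypothetical import path

em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")

# client_type="persistent" saves the data to disk instead of keeping it in memory.
data_store = ChromaVectorDataStore(
    collection_name="animals",
    em_invoker=em_invoker,
    client_type="persistent",
    persist_directory="./chroma_db",  # hypothetical parameter
)

# Convert each CSV row into a Chunk, keeping the animal name in the metadata.
with open("data/animals.csv", newline="", encoding="utf-8") as f:
    chunks = [
        Chunk(content=row["fact"], metadata={"animal": row["name"]})
        for row in csv.DictReader(f)
    ]

# Batch-load all rows in a single call.
data_store.add_chunks(chunks)  # await this call if it is a coroutine in your SDK
print(f"Indexed {len(chunks)} chunks.")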

Key features of this approach:

  • Persistent Storage: Uses client_type="persistent" to save data to disk

  • Metadata Support: Stores additional information (like animal names) in chunk metadata

  • Batch Loading: Efficiently loads all CSV rows at once

  • Structured Data: Converts CSV rows into standardized Chunk objects

After running this script, you'll see a SQLite database (typically a chroma.sqlite3 file) created in your project directory.

CSV File Format Example:
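Assuming the hypothetical data/animals.csv used above, the file might look like this:

name,fact
cat,Cats sleep for up to 16 hours a day.
octopus,Octopuses have three hearts.
penguin,Penguins can drink salt water.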

Querying Data

To query data using semantic search, we use the query() method, which returns a list[Chunk].
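A sketch, continuing the example above (the top_k parameter name is an assumption):

# Semantic search over the indexed chunks.
results = data_store.query("Which animal sleeps the most?", top_k=3)  # top_k is an assumed parameter name

for chunk in results:
    print(chunk.content)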

When querying data loaded from CSV, you can access both content and metadata:
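For instance, under the same assumed Chunk fields:

results = data_store.query("animals with unusual anatomy", top_k=3)

for chunk in results:
    # content and metadata assume the Chunk schema sketched earlier.
    print(f"{chunk.metadata['animal']}: {chunk.content}")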

📂 Complete Guide Files

For the complete code, please visit our GitHub Cookbook Repository.
