Data Store

What's a Data Store?

A Data Store is a flexible, capability-based abstraction for storing and querying text chunks. It acts as a lightweight shell where you plug in only the features you need—fulltext search, vector search, or both.

Because all backends inherit from the same base class, the public API stays consistent. For example, switching from Chroma to Elasticsearch (or any other backend) means changing only the constructor; your code that interacts with store.fulltext or store.vector stays the same.

This design gives you a single entry point — one store, one set of handlers — regardless of how or where your data is persisted.

Prerequisites

Complete all setup steps listed on the Prerequisites page before running this example.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-datastore

Quick start

from gllm_datastore.data_store.chroma.data_store import ChromaDataStore, ChromaClientType
from gllm_datastore.core.filters import filter as F
from gllm_inference.em_invoker.openai_em_invoker import OpenAIEMInvoker

em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")
store = (
    ChromaDataStore(
        collection_name="customer-notes",
        client_type=ChromaClientType.MEMORY,
    )
    .with_fulltext()
    .with_vector(em_invoker=em_invoker)
)

Now store.fulltext and store.vector are ready. Every capability exposes async CRUD helpers, so call them inside an async context, as sketched below.
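A minimal sketch of such a call, assuming add_chunks and retrieve as the helper names and keyword construction of Chunk (the exact CRUD method names may differ; see the API Reference):

import asyncio

from gllm_core.schema import Chunk

async def main():
    # Assumed helper names for illustration; check the API Reference for the real ones.
    await store.fulltext.add_chunks([Chunk(id="1", content="Customer asked about refunds.")])
    results = await store.fulltext.retrieve("refunds")
    print(results)

asyncio.run(main())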

Capability menu

Fulltext capability

  1. Reads and writes plain text chunks plus metadata.

  2. Supports exact filters through QueryFilter or the helper filter API.

  3. Offers fuzzy search via retrieve_fuzzy (see the sketch after this list).

  4. Needed when you want to turn the data store into a cache (store.as_cache(...) requires fulltext).
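
A sketch of the fuzzy search and filter helpers mentioned above; retrieve_fuzzy comes from this page, while the retrieve call, its filters argument, and the F.eq helper are assumptions to adapt to the actual filter API:

from gllm_datastore.core.filters import filter as F

async def search_fulltext():
    # Fuzzy lookup; num_candidates is the common extra for fuzzy search.
    fuzzy_hits = await store.fulltext.retrieve_fuzzy("refund policy", num_candidates=20)
    # Exact metadata filter; F.eq is an assumed equality helper.
    filtered = await store.fulltext.retrieve(filters=F.eq("customer_id", "42"))
    return fuzzy_hits, filtered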

Vector capability

  1. Stores embeddings and enables semantic search.

  2. Needs an embedding model invoker (BaseEMInvoker) when you register it.

  3. Lets you mix semantic and metadata filters.
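
A sketch of a semantic query mixed with a metadata filter; the retrieve method name, its filters and top_k arguments, and the F.eq helper are assumptions:

from gllm_datastore.core.filters import filter as F

async def search_vector():
    # The query text is embedded with the registered em_invoker, then matched
    # against stored embeddings; the filter narrows results by metadata.
    return await store.vector.retrieve(
        "customers unhappy with shipping times",
        filters=F.eq("region", "apac"),
        top_k=5,
    )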

Registering capabilities

Each backend inherits from BaseDataStore, so the registration keywords are always the same.

Capability | Register with               | Required arguments                                       | Common extras
Fulltext   | with_fulltext(**kwargs)     | Depends on backend (for Chroma: collection_name, client) | num_candidates for fuzzy search
Vector     | with_vector(em_invoker=...) | em_invoker is mandatory                                  | num_candidates, backend-specific

Registration returns the same store, so you can chain calls. If a capability is missing, you will get a NotRegisteredException the moment you access store.vector or store.fulltext.
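
For example, a store registered with only the fulltext capability will raise on the first access to store.vector:

store = ChromaDataStore(
    collection_name="customer-notes",
    client_type=ChromaClientType.MEMORY,
).with_fulltext()  # vector capability deliberately not registered

store.fulltext  # fine
store.vector    # raises NotRegisteredException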

Using the store end to end

1. Prepare chunks

Use gllm_core.schema.Chunk. Each chunk must have an id and content; metadata is optional.
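
For example, assuming keyword construction of Chunk from the fields listed above:

from gllm_core.schema import Chunk

chunks = [
    Chunk(id="note-1", content="Customer asked about refunds.", metadata={"topic": "billing"}),
    Chunk(id="note-2", content="Shipping was delayed by two days.", metadata={"topic": "logistics"}),
]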

2. Write data

Write the chunks through each registered handler. Call both only when you registered both capabilities; otherwise skip the missing one.
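
A sketch of the write step, assuming an add_chunks helper on each handler (the exact method names may differ; check the API Reference):

async def write_chunks():
    await store.fulltext.add_chunks(chunks)  # plain text plus metadata
    await store.vector.add_chunks(chunks)    # embeds content via the registered em_invoker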

3. Query data
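
And a sketch of the query step, reusing the filter helper imported in the quick start; retrieve_fuzzy appears earlier on this page, while retrieve and its arguments are assumptions:

async def query_chunks():
    # Keyword / fuzzy lookup against the fulltext index.
    keyword_hits = await store.fulltext.retrieve_fuzzy("refund", num_candidates=10)
    # Semantic lookup with a metadata filter against the vector index.
    semantic_hits = await store.vector.retrieve(
        "delivery problems",
        filters=F.eq("topic", "logistics"),
    )
    return keyword_hits, semantic_hits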

Takeaways

  • Register only the capabilities you plan to use.

  • Interact with capabilities through the handler properties (store.fulltext, store.vector).

  • Backends differ in setup but stay compatible at the capability level.

API Reference

For more information about the data store, please take a look at our API Reference page.
