Data Store

What's a Data Store?

A Data Store is a flexible, capability-based abstraction for storing and querying text chunks. It acts as a lightweight shell where you plug in only the features you need—fulltext search, vector search, hybrid search (fulltext + vector in one call), or a combination.

Because all backends inherit from the same base class, the public API stays consistent. For example, switching from Chroma to Elasticsearch (or any other backend) means changing only the constructor; your code that interacts with store.fulltext, store.vector, or store.hybrid stays the same.
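To illustrate why a shared base class keeps calling code stable, here is a minimal sketch. The classes below (ToyBaseStore, ListBackend, DictBackend) are hypothetical stand-ins written for this example, not the gllm-datastore classes. Two toy backends implement the same abstract method, and the caller changes only the constructor.

```python
from abc import ABC, abstractmethod


class ToyBaseStore(ABC):
    """Hypothetical stand-in for a shared base class; not the library's BaseDataStore."""

    @abstractmethod
    def search(self, query: str) -> list[str]:
        """Return stored texts that contain the query."""


class ListBackend(ToyBaseStore):
    """Toy backend that keeps texts in a list."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = list(texts)

    def search(self, query: str) -> list[str]:
        return [t for t in self._texts if query.lower() in t.lower()]


class DictBackend(ToyBaseStore):
    """Toy backend that keeps texts in a dict keyed by position."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = dict(enumerate(texts))

    def search(self, query: str) -> list[str]:
        return [t for t in self._texts.values() if query.lower() in t.lower()]


def find_refunds(store: ToyBaseStore) -> list[str]:
    """Caller code depends only on the base-class interface."""
    return store.search("refund")


notes = ["Customer asked for a refund", "Shipping delayed"]
# Swapping the backend changes only the constructor call.
assert find_refunds(ListBackend(notes)) == find_refunds(DictBackend(notes))
```

The function find_refunds never names a concrete backend, which is the property the paragraph above describes.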

This design gives you a single entry point — one store, one set of handlers — regardless of how or where your data is persisted. See Supported Datastores for a comprehensive list of backends and their capabilities.

Prerequisites

This example specifically requires completion of all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-datastore

Quick start

from gllm_core.schema import Chunk
from gllm_datastore.data_store import ChromaDataStore
from gllm_datastore.data_store.chroma.data_store import ChromaClientType
from gllm_datastore.core.filters import filter as F
from gllm_inference.em_invoker import OpenAIEMInvoker

em_invoker = OpenAIEMInvoker(model_name="text-embedding-3-small")
store = (
    ChromaDataStore(
        collection_name="customer-notes",
        client_type=ChromaClientType.MEMORY,
    )
    .with_fulltext()
    .with_vector(em_invoker=em_invoker)
)

Now store.fulltext and store.vector are ready. Every capability exposes async CRUD helpers, so call them from inside a coroutine and await each call.
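The snippet below sketches the async calling pattern only. ToyHandler and its create and retrieve methods are hypothetical stand-ins defined inline so the example runs without a backend; check the API reference for the real handler methods and their signatures.

```python
import asyncio


class ToyHandler:
    """Hypothetical async handler; real capability handlers may differ."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    async def create(self, rows: dict[str, str]) -> None:
        await asyncio.sleep(0)  # yield control, as real I/O would
        self._rows.update(rows)

    async def retrieve(self, query: str) -> list[str]:
        await asyncio.sleep(0)
        return sorted(i for i, text in self._rows.items() if query in text)


async def main() -> list[str]:
    handler = ToyHandler()
    # Async helpers must be awaited inside a coroutine.
    await handler.create({"n1": "late delivery", "n2": "refund issued"})
    return await handler.retrieve("refund")


# asyncio.run starts an event loop, runs main() to completion, then closes the loop.
matches = asyncio.run(main())
```

In a script, asyncio.run is the usual entry point. Inside a framework that already runs an event loop, await the helpers directly instead.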

Capability menu

Fulltext capability

  1. Reads and writes plain text chunks plus metadata.

  2. Supports exact filters through QueryFilter or the helper filter API.

  3. Offers fuzzy search via retrieve_fuzzy.

  4. Needed when you want to turn the data store into a cache (store.as_cache(...) requires fulltext).

Vector capability

  1. Stores embeddings and enables semantic search.

  2. Needs an embedding model invoker (BaseEMInvoker) when you register it.

  3. Lets you mix semantic and metadata filters.
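As a rough illustration of mixing semantic ranking with a metadata filter, the sketch below filters toy records by a metadata field and then ranks the survivors by cosine similarity. The records, vectors, and the semantic_search helper are made up for this example and say nothing about how a given backend implements the vector capability.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy records: (id, embedding, metadata). Real embeddings come from an embedding model.
records = [
    ("n1", [1.0, 0.0], {"team": "billing"}),
    ("n2", [0.9, 0.1], {"team": "support"}),
    ("n3", [0.0, 1.0], {"team": "billing"}),
]


def semantic_search(query_vec, team, top_k=2):
    """Apply the metadata filter, then rank the survivors by similarity."""
    candidates = [r for r in records if r[2]["team"] == team]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [r[0] for r in ranked[:top_k]]


billing_hits = semantic_search([1.0, 0.0], team="billing")
```

Record n2 is close to the query vector but is excluded because its metadata does not match the filter.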

Hybrid capability

  1. Combines fulltext (e.g. BM25) and vector search in a single query with configurable weights.

  2. Configure via a list of SearchConfig (FULLTEXT and/or VECTOR); each VECTOR entry requires an embedding model invoker.

  3. Use store.hybrid.create(), store.hybrid.retrieve(), and store.hybrid.retrieve_by_vector() for unified indexing and retrieval.
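The idea of combining two rankings with configurable weights can be sketched as follows. This page does not say how a backend fuses scores, so the approach here (a weighted sum of min-max normalized scores) is an assumption chosen for illustration, and the fuse helper and the score values are invented.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two rankers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(fulltext, vector, w_fulltext=0.5, w_vector=0.5):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    ft, vec = min_max(fulltext), min_max(vector)
    ids = set(ft) | set(vec)
    combined = {i: w_fulltext * ft.get(i, 0.0) + w_vector * vec.get(i, 0.0) for i in ids}
    # Highest combined score first; ties broken by id for a stable order.
    return sorted(combined, key=lambda i: (-combined[i], i))


bm25_scores = {"n1": 12.0, "n2": 3.0, "n3": 0.0}    # made-up keyword scores
cosine_scores = {"n1": 0.2, "n2": 0.9, "n3": 0.1}   # made-up similarity scores

keyword_heavy = fuse(bm25_scores, cosine_scores, w_fulltext=0.8, w_vector=0.2)
semantic_heavy = fuse(bm25_scores, cosine_scores, w_fulltext=0.2, w_vector=0.8)
```

Shifting the weights changes which chunk ranks first: the keyword-heavy setting favors n1, the semantic-heavy setting favors n2.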

Encryption capability

  1. Provides transparent field-level encryption for chunk content and metadata.

  2. Works seamlessly with fulltext and vector capabilities.

  3. Encrypts data during write operations and decrypts during read operations.

  4. See Encryption for detailed usage and configuration.

Registering capabilities

Each backend inherits from BaseDataStore, so the registration keywords are always the same for datastore capabilities.

| Capability | Register with | Required arguments | Common extras |
| --- | --- | --- | --- |
| Fulltext | with_fulltext(**kwargs) | Depends on backend (for Chroma: collection_name, client) | num_candidates for fuzzy search |
| Vector | with_vector(em_invoker=...) | em_invoker is mandatory | num_candidates, backend-specific |
| Hybrid | with_hybrid(config=...) | config (list of SearchConfig) is mandatory | Backend-specific |
| Encryption | with_encryption(encryptor=...) | encryptor and fields are mandatory | - |

Registration returns the same store, so you can chain calls. Accessing an unregistered capability raises NotRegisteredException. Accessing a capability that the backend does not support raises NotSupportedException.
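The chaining and error behavior described above can be sketched with a toy class. ToyStore and NotRegisteredError are hypothetical names written for this example; they mimic the pattern, not the library's BaseDataStore or its NotRegisteredException.

```python
class NotRegisteredError(Exception):
    """Toy counterpart of the NotRegisteredException mentioned above."""


class ToyStore:
    """Hypothetical store illustrating chained registration; not the library class."""

    def __init__(self) -> None:
        self._handlers: dict[str, object] = {}

    def with_fulltext(self) -> "ToyStore":
        self._handlers["fulltext"] = object()  # placeholder handler
        return self  # returning self is what makes chaining possible

    def with_vector(self, em_invoker: object) -> "ToyStore":
        self._handlers["vector"] = em_invoker
        return self

    def _get(self, name: str) -> object:
        if name not in self._handlers:
            raise NotRegisteredError(f"{name} capability is not registered")
        return self._handlers[name]

    @property
    def fulltext(self) -> object:
        return self._get("fulltext")

    @property
    def vector(self) -> object:
        return self._get("vector")


store = ToyStore().with_fulltext()
try:
    store.vector  # vector was never registered on this store
    vector_available = True
except NotRegisteredError:
    vector_available = False
```

Catching the exception is one way to degrade gracefully when code must run against stores with different registered capabilities.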

Using the store end to end

1. Prepare chunks

Use gllm_core.schema.Chunk. Each chunk has an id and content; metadata is optional.

2. Write data

Write the chunks through each capability you registered. Write through both store.fulltext and store.vector only when both are registered; otherwise skip the missing one.

3. Query data

When the backend supports hybrid capability, register it with with_hybrid(config=...) and use store.hybrid for create and retrieve. Hybrid combines fulltext and vector scores in one call with configurable weights.

Advanced Features

  • Batching: Handle large datasets efficiently with automatic or manual batching.

  • Query Filter: Use the unified DSL for portable metadata filtering.
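Manual batching, as mentioned in the list above, can be sketched with a small helper that slices a list of items into fixed-size groups before writing. The batched helper and the chunk ids are illustrative assumptions; the library's own batching options are described on the Batching page.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunk_ids = [f"chunk-{i}" for i in range(5)]
# Each batch would be passed to one write call instead of sending everything at once.
batches = [list(b) for b in batched(chunk_ids, batch_size=2)]
```

The last batch may be smaller than batch_size, so downstream code should not assume a fixed length.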

Takeaways

  • Register only the capabilities you plan to use.

  • Interact with capabilities through the handler properties (store.fulltext, store.vector, store.hybrid when registered).

  • Backends differ in setup but stay compatible at the capability level.

API Reference

For more information about the data store, see the API Reference page.
