Ingestion

Overview

Language models (LMs) are powerful, but they have no knowledge of your private documents. Ingestion (also called the Document Processing Orchestrator, or DPO) lets you process documents and store them in a retrieval source (e.g., a vector database, graph database, or SQL database) so they can be used later during Retrieval.

At a high level, ingestion enables you to:

  1. Extract data from documents (e.g., PDF, DOCX, HTML)

  2. Chunk the data

  3. Enrich the data with additional metadata

  4. Index the data into a retrieval source
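The four steps above can be sketched as plain functions passing artifacts along. Everything here (function names, the in-memory dict standing in for a retrieval source) is illustrative, not a real API:

```python
def extract(raw_bytes: bytes) -> str:
    """Extract: turn a raw document into plain text."""
    return raw_bytes.decode("utf-8")

def chunk(text: str, size: int = 40) -> list[str]:
    """Chunk: split text into fixed-size pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def enrich(chunks: list[str], source: str) -> list[dict]:
    """Enrich: attach metadata to each chunk."""
    return [{"text": c, "source": source, "chunk_id": i}
            for i, c in enumerate(chunks)]

def index(records: list[dict], store: dict) -> None:
    """Index: write records into a retrieval source (here, a dict)."""
    for r in records:
        store[(r["source"], r["chunk_id"])] = r

store: dict = {}
text = extract(b"Hello ingestion world. " * 5)
records = enrich(chunk(text), source="demo.txt")
index(records, store)
```

A real pipeline would replace each function with a proper component (parser library, embedding model, vector DB client), but the data flow stays the same.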

Core Design Pattern

Structure

At its core, an ingestion pipeline is a staged, modular flow. Each stage has a single responsibility and produces an explicit artifact (usually a JSON-like intermediate), so you can swap components without rewriting everything. Each stage is also optional; you can mix and match stages based on your needs.

Typical stages:

  1. Router – chooses the correct sub-pipeline/component variants (e.g., pick Downloader/Loader/Parser based on file type).

  2. Downloader – downloads resources from a given source and saves them into files.

  3. Loader – extracts raw elements/text from a file/source into a standardized intermediate format (text + structure + metadata).

  4. Parser – refines/derives element structures based on Loader output (e.g., classify into header/title/paragraph/table/image, etc.).

  5. Chunker – splits structured elements into retrieval-friendly chunks.

  6. Data Generator – enriches data with more information (e.g., metadata augmentation).

  7. Indexer – indexes processed data into a datastore (e.g., vector DB, graph RAG).
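As a rough sketch, the staged design can be expressed as function composition, where any stage honoring the same contract can be swapped in. The `pipeline` helper and both parsers below are illustrative, not part of any library:

```python
from typing import Any, Callable

# A stage is any callable that takes one artifact and returns the next.
Stage = Callable[[Any], Any]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left-to-right into a single callable."""
    def run(payload: Any) -> Any:
        for stage in stages:
            payload = stage(payload)
        return payload
    return run

# Two interchangeable "parsers" with the same contract (list[str] -> list[dict]):
def naive_parser(elements: list) -> list[dict]:
    return [{"type": "paragraph", "text": e} for e in elements]

def heading_parser(elements: list) -> list[dict]:
    return [{"type": "heading" if e.isupper() else "paragraph", "text": e}
            for e in elements]

ingest_a = pipeline(str.splitlines, naive_parser)
ingest_b = pipeline(str.splitlines, heading_parser)  # swapped parser, same flow
result = ingest_b("TITLE\nbody text")
```

Because each stage only sees the previous stage's artifact, swapping `naive_parser` for `heading_parser` requires no change to the rest of the pipeline.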

Capabilities

In typical ingestion implementations, you’ll often need these capabilities:

  • Multi-source ingestion: DPO can ingest from different upstream sources via Downloaders (e.g., direct URL, Google Drive, HTML crawlers).

  • Standardized intermediate representation: Loader outputs a consistent schema (text, structure, metadata) to make downstream processing uniform.

  • Structure-aware parsing: Parser assigns/normalizes structures (header/title/heading/table/image/etc.), improving chunking and indexing quality.

  • Chunking strategies: Chunker supports different chunking approaches (e.g., structured element chunking, table chunking).

  • Metadata enrichment: Data Generator enriches extracted/chunked content with more information.

  • Multi-store indexing: Indexer supports indexing into different data store targets (e.g., Vector DB, Graph RAG).
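One way to picture the multi-store indexing capability is a single indexer interface with backend-specific implementations. The class names and in-memory stores below are assumptions for illustration, not a real API:

```python
from abc import ABC, abstractmethod

class Indexer(ABC):
    """Common contract: every backend indexes a batch of chunks."""
    @abstractmethod
    def index(self, chunks: list[dict]) -> int: ...

class VectorDBIndexer(Indexer):
    def __init__(self):
        self.rows: list[dict] = []
    def index(self, chunks: list[dict]) -> int:
        # A real implementation would embed each chunk and upsert vectors.
        self.rows.extend(chunks)
        return len(chunks)

class GraphIndexer(Indexer):
    def __init__(self):
        self.edges: list[tuple] = []
    def index(self, chunks: list[dict]) -> int:
        # A real implementation would extract entities/relations per chunk.
        self.edges.extend((c["doc_id"], c["text"]) for c in chunks)
        return len(chunks)

chunks = [{"doc_id": "d1", "text": "alpha"}, {"doc_id": "d1", "text": "beta"}]
targets: list[Indexer] = [VectorDBIndexer(), GraphIndexer()]
counts = [t.index(chunks) for t in targets]
```

Fanning the same chunks out to several `Indexer` targets is what lets one ingestion run feed both a vector DB and a graph store.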

Implementation Patterns

Five implementation patterns are provided below, sorted by customization level (low → high). Use these as “reference architectures” for deciding how much to build vs. reuse.

1. Minimal Ingestion Pipeline (Fastest to ship)

Goal: Quick ingestion for a small number of files, without indexing.
Data sources: Local files / direct URLs / Google Drive
Customization: Low

Composition:

  • Downloader (optional) → Loader

Notes:

  • Best for prototypes, demos, and low-volume ingestion where nothing needs to be stored in a data store.

  • Keep defaults; avoid branching logic.
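A minimal sketch of this pattern using only the standard library. The `download` step is stubbed (a real Downloader would fetch over the network, e.g., via `urllib.request`), and the paths are illustrative:

```python
import pathlib
import tempfile

def download(url: str, dest_dir: str) -> str:
    """Optional Downloader: fetch a resource to a local file (stubbed here)."""
    path = pathlib.Path(dest_dir) / "page.txt"
    path.write_text(f"content fetched from {url}")
    return str(path)

def load(path: str) -> dict:
    """Loader: produce the standardized intermediate (text + metadata)."""
    text = pathlib.Path(path).read_text()
    return {"text": text, "metadata": {"source": path}}

with tempfile.TemporaryDirectory() as d:
    doc = load(download("https://example.com/page", d))
```

For local files, skip `download` entirely and call `load` directly, matching the "Downloader (optional) → Loader" composition above.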

Implementation references:

2. Enriched Ingestion with Metadata

Goal: Improve retrieval quality by enriching content (metadata, tags, derived fields).
Data sources: Same as (1)
Customization: Medium–High (Data Generator logic)

Composition:

  • Loader/Parser produce clean structure → Chunker → Data Generator enriches (e.g., additional metadata fields) → Indexer

When to use:

  • You want to enrich the data (e.g., add image captions using an LLM).

  • You want better filtering, grouping, UI facets, or downstream policy enforcement.

  • You want consistent “document identity” signals (file_id/source link mapping) across sources.
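As a hedged sketch, a Data Generator for this pattern might derive identity and facet fields per chunk. The field names (`file_id`, `source_link`, `word_count`) follow the signals mentioned above, but the exact schema is an assumption:

```python
import hashlib

def enrich(chunk: dict, source_url: str) -> dict:
    """Data Generator: derive stable identity and facet fields for a chunk."""
    # Hash the source URL so the same document gets the same id on re-ingestion.
    file_id = hashlib.sha256(source_url.encode()).hexdigest()[:12]
    return {
        **chunk,
        "file_id": file_id,         # consistent "document identity" signal
        "source_link": source_url,  # for UI facets / click-through
        "word_count": len(chunk["text"].split()),  # example derived field
    }

enriched = enrich({"text": "annual report summary"},
                  "https://example.com/report.pdf")
```

Richer generators (LLM-based captioning, tagging, classification) slot into the same place in the pipeline: after chunking, before indexing.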

Implementation references:

3. [DRAFT] Web Ingestion

Goal: Ingest a single URL or a list of URLs (with optional keywords), crawl or search the web as needed, extract content, and index the results into a data store for retrieval.
Data sources: End users / other services
Customization: Medium–High (Data Generator logic)


External services:

  • Smart Search:

    • Input: search query / seed URLs

    • Output: list of canonical URLs

  • Smart Crawl:

    • Input: URL(s)

    • Output: fetched pages + metadata (HTML, title, headers, timestamps, source URL, etc.)

Composition:

  • Create web ingestion record in database → if URL is supported by Smart Crawl (SC), invoke SC → else invoke Smart Search → (optional) Data Generator enriches → Indexer
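The routing in this composition might look roughly like the following. Smart Crawl and Smart Search are external services and are stubbed here, and the scheme-based "supported" check is a placeholder assumption (a real service would expose its own capability check):

```python
SUPPORTED_SCHEMES = ("https://", "http://")  # stand-in for a real SC capability check

def smart_crawl(url: str) -> list[dict]:
    """Stub for the Smart Crawl service: URL -> fetched pages + metadata."""
    return [{"url": url, "html": "<html>…</html>", "title": "stub page"}]

def smart_search(query: str) -> list[str]:
    """Stub for the Smart Search service: query -> list of canonical URLs."""
    return [f"https://example.com/result-for-{query}"]

def ingest_web(target: str) -> list[dict]:
    record = {"target": target, "status": "created"}  # web ingestion record
    if target.startswith(SUPPORTED_SCHEMES):
        # URL supported by Smart Crawl: crawl it directly.
        pages = smart_crawl(target)
    else:
        # Otherwise treat the input as a query: search, then crawl each hit.
        pages = [p for url in smart_search(target) for p in smart_crawl(url)]
    record["status"] = "fetched"
    return pages

pages = ingest_web("vector databases overview")
```

From here, the pages would flow into the optional Data Generator and then the Indexer, as in patterns (1) and (2).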

When to use:

  • Your users want to “ingest this website / page / docs site into my knowledge base”

  • Your users want to “ingest all pages returned by this search query”

Implementation references:
