Ingestion
Overview
Language models (LMs) are powerful, but they have no knowledge of your private documents. Ingestion (also called the Document Processing Orchestrator, or DPO) lets you process documents and store them in a retrieval source (e.g., vector database, graph database, SQL database) so they can be used later in the Retrieval process.
At a high level, ingestion enables you to:
Extract data from documents (e.g., PDF, DOCX, HTML, etc.)
Chunk the data
Enrich the data with additional metadata
Index the data into a retrieval source
Core Design Pattern
Structure

At its core, an ingestion pipeline is a staged, modular flow. Each stage has a single responsibility and produces an explicit artifact (usually a JSON-like intermediate), so you can swap components without rewriting everything. Each stage is also optional; you can mix and match stages based on your needs.
Typical stages:
Router – chooses the correct sub-pipeline/component variants (e.g., pick Downloader/Loader/Parser based on file type).
Downloader – downloads resources from a given source and saves them into files.
Loader – extracts raw elements/text from a file/source into a standardized intermediate format (text + structure + metadata).
Parser – refines/derives element structures based on Loader output (e.g., classify into header/title/paragraph/table/image, etc.).
Chunker – splits structured elements into retrieval-friendly chunks.
Data Generator – enriches data with more information (e.g., metadata augmentation).
Indexer – indexes processed data into a datastore (e.g., vector DB, graph RAG).
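The staged flow above can be sketched as a list of interchangeable callables that each consume and emit a JSON-like artifact. This is a minimal illustration, not the actual DPO API; the stage names and artifact keys here are assumptions.

```python
# Minimal sketch of a staged, modular ingestion flow. Each stage is a callable
# that takes a JSON-like artifact and returns a new one, so stages can be
# swapped or omitted without rewriting the pipeline.
from typing import Callable

Artifact = dict  # JSON-like intermediate passed between stages
Stage = Callable[[Artifact], Artifact]

def run_pipeline(artifact: Artifact, stages: list[Stage]) -> Artifact:
    """Run each (optional) stage in order; every stage emits an explicit artifact."""
    for stage in stages:
        artifact = stage(artifact)
    return artifact

# Illustrative stages (assumed shapes, not a fixed schema):
def loader(a: Artifact) -> Artifact:
    return {**a, "text": "raw text", "structure": [], "metadata": {"source": a["source"]}}

def chunker(a: Artifact) -> Artifact:
    return {**a, "chunks": [a["text"][i:i + 512] for i in range(0, len(a["text"]), 512)]}

result = run_pipeline({"source": "report.pdf"}, [loader, chunker])
```

Because every stage shares the same signature, adding a Parser or Data Generator is just inserting another callable into the list.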
Capabilities
In typical ingestion implementations, you’ll often need these capabilities:
Multi-source ingestion: DPO can ingest from different upstream sources via Downloaders (e.g., direct URL, Google Drive, HTML crawlers).
Standardized intermediate representation: Loader outputs a consistent schema (text, structure, metadata) to make downstream processing uniform.
Structure-aware parsing: Parser assigns/normalizes structures (header/title/heading/table/image/etc.), improving chunking and indexing quality.
Chunking strategies: Chunker supports different chunking approaches (e.g., structured element chunking, table chunking).
Metadata enrichment: Data Generator enriches extracted/chunked content with more information.
Multi-store indexing: Indexer supports indexing into different data store targets (e.g., Vector DB, Graph RAG).
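One way to picture the standardized intermediate representation is as a small typed document. The field names (text, structure, metadata) follow this page; the element types and metadata keys below are illustrative assumptions.

```python
# Hedged sketch of the Loader's standardized intermediate representation.
from dataclasses import dataclass, field

@dataclass
class Element:
    type: str   # e.g., "header", "title", "paragraph", "table", "image"
    text: str

@dataclass
class Document:
    text: str                                      # full extracted text
    structure: list[Element] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)   # e.g., file_id, source URL

doc = Document(
    text="Q3 Report\nRevenue grew 12%.",
    structure=[Element("title", "Q3 Report"), Element("paragraph", "Revenue grew 12%.")],
    metadata={"file_id": "abc123", "source": "https://example.com/q3.pdf"},
)
```

A consistent shape like this is what lets the Parser, Chunker, and Indexer stay agnostic about which Loader produced the document.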
Implementation Patterns
Three implementation patterns are provided below, sorted by customization level (low → high). Use these as “reference architectures” for deciding how much to build vs. reuse.
1. Minimal Ingestion Pipeline (Fastest to ship)
Goal: Quick ingestion for a small number of files without indexing.
Data sources: Local files / direct URLs / Google Drive
Customization: Low
Composition:
Downloader (optional) → Loader
Notes:
Best for prototypes, demos, and low-volume ingestion where you don’t need to store results in a data store.
Keep defaults; avoid branching logic.
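A minimal sketch of this pattern, assuming a simple extension-based loader and a plain URL downloader (both hypothetical, not a fixed API):

```python
# Pattern 1 sketch: optional Downloader followed by a Loader, nothing else.
import pathlib
import urllib.request

def download(url: str, dest_dir: str = ".") -> pathlib.Path:
    """Downloader: fetch a resource from a URL and save it to a local file."""
    path = pathlib.Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, path)
    return path

def load(path: pathlib.Path) -> dict:
    """Loader: extract raw text into the standardized intermediate format."""
    if path.suffix not in {".txt", ".md", ".html"}:
        raise ValueError(f"No loader registered for {path.suffix}")
    return {"text": path.read_text(), "structure": [], "metadata": {"source": str(path)}}

# Local file: skip the Downloader entirely (it is optional).
# doc = load(pathlib.Path("notes.md"))
```

Per the notes above, keep the defaults and avoid branching: a single loader with a hard allowlist of extensions is enough at this stage.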
Implementation references:
2. Enriched Ingestion with Metadata
Goal: Improve retrieval quality by enriching content (metadata, tags, derived fields).
Data sources: Same as (1)
Customization: Medium–High (Data Generator logic)
Composition:
Loader/Parser produce clean structure → Chunker → Data Generator enriches (e.g., additional metadata fields) → Indexer
When to use:
You want to enrich the data (e.g., add image captions using an LLM).
You want better filtering, grouping, UI facets, or downstream policy enforcement.
You want consistent “document identity” signals (file_id/source link mapping) across sources.
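A Data Generator for this pattern could look like the sketch below. The enrichment fields are illustrative assumptions; a real implementation might call an LLM for captions or summaries. The stable file_id shows one way to get consistent document-identity signals across sources.

```python
# Sketch of a Data Generator stage: takes chunks and attaches derived metadata.
import hashlib

def enrich(chunks: list[dict], source: str) -> list[dict]:
    """Attach document-identity signals and simple derived fields to each chunk."""
    # Hash the source so the same document gets the same id on every re-ingestion.
    file_id = hashlib.sha256(source.encode()).hexdigest()[:12]
    for i, chunk in enumerate(chunks):
        chunk["metadata"] = {
            "file_id": file_id,
            "source": source,
            "chunk_index": i,
            "word_count": len(chunk["text"].split()),
        }
    return chunks

chunks = enrich([{"text": "Revenue grew 12% in Q3."}], "https://example.com/q3.pdf")
```

Fields like word_count and chunk_index are cheap to compute here and enable the filtering, grouping, and UI facets mentioned above.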
Implementation references:
3. [DRAFT] Web Ingestion
Goal: Ingest a single URL or a list of URLs with optional keywords, crawl/search the web as needed, extract content, and index the results into a data store for retrieval.
Data sources: URLs/queries from end-users / other services
Customization: Medium–High (Data Generator logic)
The Web Ingestion pattern is still a draft. This section is shared to give you a preview of the intended approach; the exact design and implementation details are subject to change.
External services:
Smart Search:
Input: search query / seed URLs
Output: list of canonical URLs
Smart Crawl:
Input: URL(s)
Output: fetched pages + metadata (HTML, title, headers, timestamps, source URL, etc.)
Composition:
Create web ingestion record in database → if URL is supported by Smart Crawl (SC), invoke SC → else invoke Smart Search → (optional) Data Generator enriches → Indexer
When to use:
Your users want to “ingest this website / page / docs site into my knowledge base”
Your users want to “ingest all pages returned by this search query”
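Since this pattern is still a draft, the routing step can only be sketched. The code below mirrors the composition above (create record → Smart Crawl if supported, else Smart Search); the allowlist check and record format are placeholders, not the real service interfaces.

```python
# Draft sketch of the web-ingestion routing decision.
import hashlib
from urllib.parse import urlparse

SMART_CRAWL_DOMAINS = {"docs.example.com", "example.com"}  # assumed SC allowlist

def create_record(source: str) -> str:
    """Placeholder for 'create web ingestion record in database'."""
    return "rec-" + hashlib.md5(source.encode()).hexdigest()[:6]

def ingest_web(url_or_query: str) -> str:
    """Create a record, then route to Smart Crawl (SC) or Smart Search."""
    record_id = create_record(url_or_query)
    host = urlparse(url_or_query).netloc
    if host in SMART_CRAWL_DOMAINS:
        return f"{record_id}: smart_crawl"   # SC fetches pages + metadata
    return f"{record_id}: smart_search"      # search resolves to canonical URLs
```

Plain search queries have no netloc, so they naturally fall through to Smart Search in this sketch.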
Implementation references: