Supported Documents
This page lists the document types that the document processing orchestrator supports.
Document
DOCX
PDF
PPTX
XLSX
Google Docs (from URL)
Google Slides (from URL)
Google Sheets (from URL)
Plain Text
JSON
CSV
HTML
Java
JavaScript (JS)
JSX
Log files (.log)
Markdown
Python
TypeScript (TS)
TSX
Plain text files (.txt)
URL
Any public URL that is not behind protection such as IP blocking or anti-bot measures.
YouTube URLs: ingested as text; ingestion may occasionally fail because of limitations imposed by Google.
Image
These will be ingested as text.
HEIC
HEIF
JPEG (.jpg/.jpeg)
PNG
WEBP
TIFF
Audio
These will be ingested as text.
FLAC
MP3
OGG
WAV
Video
These will be ingested as text.
MP4
MPEG
MOV
AVI
MKV
WEBM
FLV
WMV
3GP
OGV
ASF
MP2T
OGG
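As a rough illustration of how the lists above could drive a pre-flight check before submitting a file, here is a minimal sketch. It is not part of the gllm-docproc API: the `CATEGORY_BY_EXTENSION` table and the `categorize` function are hypothetical, and the file extensions are assumptions, since this page names formats rather than extensions. MP2T is left out because its usual `.ts` extension collides with TypeScript, and `.ogg` is treated as audio for simplicity.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical lookup table derived from the lists on this page.
# The extensions are assumptions; the orchestrator may detect types differently.
CATEGORY_BY_EXTENSION = {
    **dict.fromkeys([".docx", ".pdf", ".pptx", ".xlsx"], "document"),
    **dict.fromkeys(
        [".json", ".csv", ".html", ".java", ".js", ".jsx",
         ".log", ".md", ".py", ".ts", ".tsx", ".txt"],
        "plain_text",
    ),
    **dict.fromkeys(
        [".heic", ".heif", ".jpg", ".jpeg", ".png", ".webp", ".tiff"], "image"
    ),
    **dict.fromkeys([".flac", ".mp3", ".ogg", ".wav"], "audio"),
    **dict.fromkeys(
        [".mp4", ".mpeg", ".mov", ".avi", ".mkv", ".webm",
         ".flv", ".wmv", ".3gp", ".ogv", ".asf"],
        "video",
    ),
}


def categorize(filename: str) -> Optional[str]:
    """Return the assumed category for a file name, or None if it is not listed."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive match
    return CATEGORY_BY_EXTENSION.get(suffix)
```

A caller could skip or report any file for which `categorize` returns `None`, for example archives such as `backup.zip`.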
Output
The document processing orchestrator can save its output to the following data stores:
Vector database
Tabular database
Knowledge Graph
Limitations
There are no plans to address the following limitations:
PDF: Cannot extract advanced math equations.
DOCX: Cannot extract math equations.
URL:
Cannot bypass protected URLs.
This might be solved by using FirecrawlDownloader, which leverages Firecrawl.
Cannot access social media (e.g. Facebook, Instagram, X, TikTok).
Cannot extract a specific part of an HTML page.
Customization is available by extending the gllm-docproc library.
Cannot process executable or package files (e.g. DMG, EXE, GZ, TAR, ZIP).
Cannot process files with proprietary extensions (e.g. AI, PSD, DLL).
Cannot crawl or scrape URLs periodically.
Projects are responsible for managing their own scheduler or cron jobs.
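Because periodic crawling is left to each project, a project-side loop might look like the sketch below. Everything here is an assumption for illustration: `run_periodically` is a hypothetical helper, and `ingest` stands in for whatever call a project uses to submit a URL to the orchestrator. In production a cron job or a dedicated scheduler would usually replace this loop.

```python
import time
from typing import Callable, Iterable


def run_periodically(
    ingest: Callable[[str], None],
    urls: Iterable[str],
    interval_seconds: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `ingest` on every URL once per round and return the total call count.

    `ingest` is a placeholder for the project's own submission function.
    """
    url_list = list(urls)  # materialize so the URLs can be replayed each round
    calls = 0
    for round_index in range(max_rounds):
        for url in url_list:
            ingest(url)
            calls += 1
        # Wait between rounds, but not after the final one.
        if round_index < max_rounds - 1:
            sleep(interval_seconds)
    return calls
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a project would normally leave it at the default.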