# Supported Documents

This page lists the types of documents that are supported by the [document processing orchestrator](https://gdplabs.gitbook.io/sdk/~/revisions/beykCxz0UanaEX0sPJJu/tutorials/document-processing-orchestrator).

## Document

1. DOCX
2. PDF
3. PPTX
4. XLSX
5. Google Docs (from URL)
6. Google Slides (from URL)
7. Google Sheets (from URL)

## Plain Text

1. CSV
2. HTML
3. Java
4. JavaScript (JS)
5. JSX
6. Log files (.log)
7. Markdown
8. Python
9. TypeScript (TS)
10. TSX
11. Plain text files (.txt)

## URL

1. Any public URL that is not behind access protection such as IP blocking or anti-bot measures.
2. YouTube URLs: ingested as text. Ingestion may occasionally fail due to limitations imposed by Google.

## Image

*These will be ingested as text.*

1. HEIC
2. HEIF
3. JPEG (.jpg/.jpeg)
4. PNG
5. WEBP
6. TIFF

## Audio

*These will be ingested as text.*

1. FLAC
2. MP3
3. OGG
4. WAV

## Video

*These will be ingested as text.*

1. MP4
2. MPEG
3. MOV
4. AVI
5. MKV
6. WEBM
7. FLV
8. WMV
9. 3GP
10. OGV
11. ASF
12. MP2T
13. OGG
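
As a quick pre-check before submitting a file, the categories above can be expressed as an extension lookup. The following is a minimal sketch and not part of the orchestrator's API: the `SUPPORTED_EXTENSIONS` table and the `categories_for` helper are hypothetical, and the extension strings are inferred from the format names on this page (for example `.py` for Python and `.md` for Markdown). MP2T is left out because its usual extension, `.ts`, collides with TypeScript.

```python
from pathlib import PurePath

# Hypothetical lookup table: extensions are inferred from the format names
# listed on this page, not taken from the orchestrator itself.
SUPPORTED_EXTENSIONS = {
    "document": {".docx", ".pdf", ".pptx", ".xlsx"},
    "plain_text": {
        ".csv", ".html", ".java", ".js", ".jsx", ".log",
        ".md", ".py", ".ts", ".tsx", ".txt",
    },
    "image": {".heic", ".heif", ".jpg", ".jpeg", ".png", ".webp", ".tiff"},
    "audio": {".flac", ".mp3", ".ogg", ".wav"},
    # OGG is listed under both Audio and Video, so ".ogg" appears twice.
    "video": {
        ".mp4", ".mpeg", ".mov", ".avi", ".mkv", ".webm",
        ".flv", ".wmv", ".3gp", ".ogv", ".asf", ".ogg",
    },
}


def categories_for(filename: str) -> list[str]:
    """Return every category whose extension set contains the file's suffix."""
    suffix = PurePath(filename).suffix.lower()
    return sorted(
        category
        for category, extensions in SUPPORTED_EXTENSIONS.items()
        if suffix in extensions
    )


if __name__ == "__main__":
    for name in ("report.PDF", "clip.ogg", "archive.zip"):
        print(name, categories_for(name))
```

An empty result means the extension is not in the table, which is only a hint: the orchestrator itself remains the source of truth for what it accepts.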

## Output

The document processing orchestrator can save its output to the following data stores:

1. Vector database
2. Tabular database
3. Knowledge Graph

## Limitations

### Future Support

These limitations are planned to be addressed in the future:

1. Transient processing (extract-only, instead of extract and ingest).

### Unsupported

These limitations are not planned to be addressed:

1. PDF: Cannot extract advanced math equations.
2. DOCX: Cannot extract math equations.
3. URL:
   1. Cannot bypass protected URLs.
      1. Might be solved using FirecrawlDownloader (leverages [Firecrawl](https://www.firecrawl.dev/)).
   2. Cannot access social media (e.g. Facebook, Instagram, X, TikTok).
   3. Cannot get a specific part from HTML.
      1. Customization available by extending the [`gllm-docproc`](https://github.com/GDP-ADMIN/gl-sdk/blob/main/libs/gllm-docproc/gllm_docproc) library.
4. Cannot process executable or package files (e.g. DMG, EXE, GZ, TAR, ZIP).
5. Cannot process files with proprietary extensions (e.g. AI, PSD, DLL).
6. Cannot crawl or scrape URLs periodically.
   1. Projects are responsible for managing their own scheduler or cron jobs.
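
Because periodic crawling is left to each project, a scheduler has to live on the project side. The sketch below shows one simple way to do that with a plain loop. It is an illustration under stated assumptions: `run_periodically` is a hypothetical helper, and `ingest` stands in for whatever call the project already uses to submit a URL to the orchestrator.

```python
import time
from typing import Callable, Optional, Sequence


def run_periodically(
    urls: Sequence[str],
    ingest: Callable[[str], object],
    interval_seconds: float,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Call `ingest` for every URL, wait, and repeat.

    `ingest` is a placeholder for the project's own submission call.
    `max_rounds=None` runs until the process is stopped.
    Returns the number of completed rounds.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for url in urls:
            try:
                ingest(url)
            except Exception as exc:
                # One failing URL should not stop the remaining ones.
                print(f"ingest failed for {url}: {exc}")
        rounds += 1
        # Skip the wait after the final round.
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return rounds


if __name__ == "__main__":
    # Demo with a stand-in ingest function and a short interval.
    run_periodically(["https://example.com"], print, interval_seconds=1, max_rounds=2)
```

For production use, a cron job or an existing job scheduler that invokes a one-shot ingestion script is usually a more robust choice than a long-running loop.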
