Downloader
Downloader is designed for downloading resources from a given source and saving the output to a file.
This page lists all Downloaders supported in Document Processing Orchestrator (DPO).
Installation
# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"
HTML Downloader
HTML Downloader allows you to download an HTML document from a URL and save it as a JSON file.
Create a script called main.py:
from gllm_docproc.downloader.html import HTMLDownloader
source = "https://books.toscrape.com/"
output_path = "downloader/output/download"
# Initialize downloader
downloader = HTMLDownloader()
# Download input
downloader.download(source, output_path)
Run the script:
python main.py
The downloader will generate an output JSON file at the specified location.
HTML Downloader also allows you to crawl a URL and save the results as JSON files.
Create a script called main.py:
from gllm_docproc.downloader.html import HTMLDownloader
source = "https://quotes.toscrape.com/"
output_path = "downloader/output/crawl"
# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["quotes.toscrape.com"])
# Download input
downloader.download_crawl(source, output_path)
Run the script:
python main.py
The downloader will generate JSON files at the specified output location.
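To inspect what a crawl produced, you can list the generated files with the standard library. This is a minimal sketch, assuming the downloader writes one .json file per crawled page into the output directory (the exact file layout is not documented here and may differ by version):

```python
from pathlib import Path

# Output directory used in the crawl example above.
output_dir = Path("downloader/output/crawl")

# List the JSON files produced by the crawl (yields nothing
# if the directory does not exist yet).
for json_file in sorted(output_dir.glob("*.json")):
    print(json_file.name)
```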
HTML Downloader also allows you to download the pages listed in a sitemap and save them as JSON files.
Create a script called main.py:
from gllm_docproc.downloader.html import HTMLDownloader
source = "https://indonesiakaya.com/pustaka_cat-sitemap.xml"
output_path = "downloader/output/crawl_sitemap"
# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["indonesiakaya.com"])
# Download input
downloader.download_sitemap(source, output_path)
Run the script:
python main.py
The downloader will generate JSON files at the specified output location.
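If you want to preview which URLs a sitemap contains before running the downloader, you can parse it yourself; sitemaps are plain XML, so this does not use the downloader API at all. A minimal sketch with an inline example sitemap (the real one would be fetched from the URL above):

```python
import xml.etree.ElementTree as ET

# An inline example sitemap; each <url><loc> entry holds a page URL.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>"""

# Sitemap elements live in the standard sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```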
Firecrawl Downloader
Firecrawl Downloader allows you to download an HTML document from a URL using the Firecrawl service and save it as a JSON file. You need a Firecrawl API key to use this downloader.
Create a script called main.py:
from gllm_docproc.downloader.html.firecrawl_downloader import HTMLFirecrawlDownloader
source = "https://books.toscrape.com/"
output_path = "downloader/output/download"
# Initialize downloader
downloader = HTMLFirecrawlDownloader(api_key="<YOUR_API_KEY>")
# Download input
downloader.download(source, output_path)
Run the script:
python main.py
The downloader will generate output similar to the following:
{
"metadata": {
"favicon": "https://books.toscrape.com/static/oscar/favicon.ico",
"viewport": "width=device-width",
"language": "en-us",
"title": "\n All products | Books to Scrape - Sandbox\n",
"robots": "NOARCHIVE,NOCACHE",
"description": "",
"created": "24th Jun 2016 09:29",
"scrapeId": "4af45949-b185-4690-8938-ca22dcd0409e",
"sourceURL": "https://books.toscrape.com/",
"url": "https://books.toscrape.com/",
"statusCode": 200,
"contentType": "text/html",
"proxyUsed": "basic",
"creditsUsed": 1
},
"success": true,
"element_metadata": {
"source": "https://books.toscrape.com/",
"source_type": "html",
"loaded_datetime": "2025-07-29 18:29:18"
},
"content": "<!DOCTYPE html> ..."
}
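Once saved, the result can be read back with plain `json` handling. A minimal sketch, assuming the downloader wrote a single JSON object shaped like the sample above (the literal below is an abbreviated stand-in for the saved file's contents):

```python
import json

# Abbreviated stand-in for the JSON the downloader saved above.
raw = """{
    "metadata": {"title": "All products | Books to Scrape - Sandbox",
                 "statusCode": 200},
    "success": true,
    "content": "<!DOCTYPE html> ..."
}"""

record = json.loads(raw)

# Only use the content if the scrape actually succeeded.
if record["success"] and record["metadata"]["statusCode"] == 200:
    print(record["metadata"]["title"])
```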
You can also access the underlying Firecrawl instance if you need to call Firecrawl methods directly:
from gllm_docproc.downloader.html.firecrawl_downloader import HTMLFirecrawlDownloader
source = "https://books.toscrape.com/"
# Initialize downloader
downloader = HTMLFirecrawlDownloader(api_key="<YOUR_API_KEY>")
scrape_result = downloader.firecrawl_instance.scrape_url(source, formats=['markdown', 'html'])
print(scrape_result.markdown)