Downloader


Downloader is designed for downloading resources from a given source and saving the output to a file.

This page provides a list of all supported Downloaders in the Document Processing Orchestrator (DPO).

Prerequisites

The examples on this page require completing all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

HTML Downloader

HTML Downloader allows you to download an HTML document from a URL and save it as a JSON file.

1. Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://books.toscrape.com/"
output_path = "downloader/output/download"

# Initialize downloader
downloader = HTMLDownloader()

# Download input
downloader.download(source, output_path)
2. Run the script:

python main.py

3. The downloader will generate the output JSON file at the specified output path.

HTML Downloader also allows you to crawl a website starting from a URL and save the crawled pages as JSON files.

1. Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://quotes.toscrape.com/"
output_path = "downloader/output/crawl"

# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["quotes.toscrape.com"])

# Download input
downloader.download_crawl(source, output_path)
2. Run the script:

python main.py

3. The downloader will generate JSON files at the specified output location.

HTML Downloader also allows you to download the pages listed in a sitemap and save them as JSON files.

1. Create a script called main.py:
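The sitemap download code is not shown on this page. The sketch below assumes a download_sitemap method analogous to download_crawl and uses a placeholder sitemap URL; check the API Reference for the exact method name and signature.

from gllm_docproc.downloader.html import HTMLDownloader

# NOTE: placeholder sitemap URL; replace with your own sitemap link
source = "https://example.com/sitemap.xml"
output_path = "downloader/output/sitemap"

# Initialize the downloader
downloader = HTMLDownloader()

# Download every page listed in the sitemap
# NOTE: download_sitemap is an assumed method name; see the API Reference
downloader.download_sitemap(source, output_path)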

2. Run the script:

python main.py

3. The downloader will generate JSON files at the specified output location.

Firecrawl Downloader

Firecrawl Downloader allows you to download an HTML document from a URL using the Firecrawl service and save it as a JSON file. You need a Firecrawl API key to use this downloader.

1. Create a script called main.py:
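The Firecrawl example code is not shown on this page. The sketch below assumes the class is importable as FirecrawlDownloader from gllm_docproc.downloader.firecrawl and accepts the API key via an api_key parameter; both are assumptions, so check the API Reference for the exact import path and constructor arguments.

# NOTE: the import path below is an assumption; see the API Reference
from gllm_docproc.downloader.firecrawl import FirecrawlDownloader

source = "https://books.toscrape.com/"
output_path = "downloader/output/firecrawl"

# Initialize the downloader with your Firecrawl API key
# NOTE: the api_key parameter name is an assumption
downloader = FirecrawlDownloader(api_key="YOUR_FIRECRAWL_API_KEY")

# Download input
downloader.download(source, output_path)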

2. Run the script:

python main.py

3. The downloader will generate a JSON output file at the specified output location.

4. You can also access the underlying Firecrawl instance if you need to call Firecrawl methods directly:
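A hypothetical sketch follows: it assumes the underlying client is exposed as a firecrawl attribute on the downloader and that the client is the Firecrawl Python SDK, whose scrape_url method fetches a single page. Check the API Reference for the actual attribute name.

# NOTE: the firecrawl attribute name is an assumption; see the API Reference
firecrawl_app = downloader.firecrawl

# Call a Firecrawl SDK method directly, e.g. scraping a single URL
result = firecrawl_app.scrape_url("https://books.toscrape.com/")
print(result)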
