HTML Downloader


HTML Downloader lets you download an HTML document from a URL and save it as a JSON file.

This page is a guide to using HTML Downloader in the Document Processing Orchestrator (DPO).

Prerequisites

This example requires completing all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

Basic Usage

1. Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://books.toscrape.com/"
output_path = "downloader/output/download"

# Initialize downloader
downloader = HTMLDownloader()

# Download the page at the source URL and save it as JSON under output_path
downloader.download(source, output_path)

2. Run the script:

python main.py

3. The downloader will generate an output JSON file.
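
To check what was written, a short script can walk the output directory and load each JSON file. This is a minimal sketch using only the standard library. It assumes the result lands as one or more `.json` files under the `output_path` used in the script above, and it makes no assumption about the JSON schema. The helper name `summarize_json_outputs` is hypothetical and not part of gllm-docproc.

```python
import json
from pathlib import Path


def summarize_json_outputs(output_dir: str) -> dict[str, str]:
    """Map each .json file under output_dir to the type name of its top-level value."""
    summary = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        summary[str(path)] = type(data).__name__
    return summary


if __name__ == "__main__":
    # Assumed location: the output_path from the download script.
    for name, kind in summarize_json_outputs("downloader/output/download").items():
        print(f"{name}: top-level {kind}")
```

Reporting only the top-level type keeps the check independent of whatever structure the downloader actually emits.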

Crawl URL

HTML Downloader can also crawl a site starting from a URL and save the crawled pages as JSON files.

1. Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://quotes.toscrape.com/"
output_path = "downloader/output/crawl"

# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["quotes.toscrape.com"])

# Crawl from the source URL and save the pages as JSON under output_path
downloader.download_crawl(source, output_path)

2. Run the script:

python main.py

3. The downloader will generate JSON files at the specified output location.
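
The `allowed_domains` argument in the script above presumably limits which links the crawl follows. The sketch below illustrates the general idea of domain filtering with the standard library. It is not gllm-docproc's implementation: the helper `is_allowed` and its rule (exact host match or subdomain match) are assumptions made for illustration only.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == d.lower() or host.endswith("." + d.lower())
        for d in allowed_domains
    )


links = [
    "https://quotes.toscrape.com/page/2/",
    "https://books.toscrape.com/",
]
# Keep only the links that stay inside the allowed domain.
in_scope = [u for u in links if is_allowed(u, ["quotes.toscrape.com"])]
print(in_scope)
```

Under this rule, the link to `books.toscrape.com` is dropped because its host neither equals nor ends with `.quotes.toscrape.com`.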

Download from Sitemap

HTML Downloader can also download the pages listed in a sitemap and save them as JSON files.

1. Create a script called main.py:

2. Run the script:

python main.py

3. The downloader will generate JSON files at the specified output location.
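
A sitemap is an XML file that lists page URLs in `<loc>` elements, so it can help to preview which URLs a sitemap contains before downloading them. The following is a minimal sketch using the standard library, independent of gllm-docproc. The function name `extract_sitemap_urls` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap used only for demonstration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

The same function also works on a sitemap index, where each `<loc>` points to another sitemap rather than to a page.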
