HTML Downloader


HTML Downloader lets you download an HTML document from a URL and save it as a JSON file.

This page is a guide to using HTML Downloader in the Document Processing Orchestrator (DPO).

Prerequisites

Complete all setup steps listed on the Prerequisites page before running this example.

Installation

# Linux/macOS (bash): you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

# Windows PowerShell: you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[html]"

REM Windows Command Prompt: you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[html]"
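
To confirm that the install succeeded before writing any scripts, a quick check like the one below may help. It is only a sketch: the helper name is_installed is hypothetical, and the top-level module name gllm_docproc is taken from the import statements used in the examples on this page.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "gllm_docproc" is the module imported by the examples on this page.
    if is_installed("gllm_docproc"):
        print("gllm_docproc is available")
    else:
        print("gllm_docproc was not found; re-run the pip install command")
```

Run it with the same Python interpreter (or Conda environment) that you used for pip install.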

Basic Usage

Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://books.toscrape.com/"
output_path = "downloader/output/download"

# Initialize downloader
downloader = HTMLDownloader()

# Download the page at the source URL and save it under output_path
downloader.download(source, output_path)

Run the script:

python main.py

The downloader will generate an output JSON file at the specified output location.
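
To check what was written, you can list and load the generated files with the standard library. The sketch below makes no assumptions about the JSON schema, which this page does not describe; it only reports the type of each file's top-level value. The helper name summarize_json_outputs is hypothetical, and the directory is assumed to be the output_path used in main.py.

```python
import json
from pathlib import Path


def summarize_json_outputs(output_dir: str) -> dict[str, str]:
    """Map each JSON file under output_dir to the type name of its top-level value."""
    summary = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Record only the top-level type; the schema itself is not assumed.
        summary[str(path)] = type(data).__name__
    return summary


if __name__ == "__main__":
    # Assumed to match output_path in main.py.
    for name, kind in summarize_json_outputs("downloader/output/download").items():
        print(f"{name}: top-level {kind}")
```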

Crawl URL

HTML Downloader can also crawl a site starting from a URL and save the results as JSON files.

Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://quotes.toscrape.com/"
output_path = "downloader/output/crawl"

# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["quotes.toscrape.com"])

# Crawl starting from the source URL and save the results under output_path
downloader.download_crawl(source, output_path)

Run the script:

python main.py

The downloader will generate JSON files at the specified output location.
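
The allowed_domains argument restricts which hosts the crawl may visit. This page does not document the exact matching rule, so the sketch below is only an illustration of one common interpretation (exact host match or subdomain match), not the library's implementation. The helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # Require a dot boundary so "evilquotes.toscrape.com" does not match.
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    allowed = ["quotes.toscrape.com"]
    for url in ("https://quotes.toscrape.com/page/2/", "https://books.toscrape.com/"):
        print(url, is_allowed(url, allowed))
```

Under this interpretation, links that lead to other hosts would be skipped during the crawl.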

Download from Sitemap

HTML Downloader can also download from a sitemap URL and save the results as JSON files.

Create a script called main.py:

from gllm_docproc.downloader.html import HTMLDownloader

source = "https://indonesiakaya.com/pustaka_cat-sitemap.xml"
output_path = "downloader/output/crawl_sitemap"

# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["indonesiakaya.com"])

# Download from the sitemap URL and save the results under output_path
downloader.download_sitemap(source, output_path)

Run the script:

python main.py

The downloader will generate JSON files at the specified output location.
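
For context, a sitemap is an XML file that lists page URLs in loc elements under the sitemaps.org namespace. The sketch below shows how those URLs can be read with the standard library. It illustrates the file format only and is not how HTMLDownloader works internally; the helper name extract_sitemap_urls and the example.com URLs in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data for illustration only.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>
""".strip()


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```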


Last updated 23 days ago
