Downloader
A Downloader fetches resources from a given source and saves the output to a file.
This page lists all Downloaders supported by the Document Processing Orchestrator (DPO).
Installation
Linux or macOS shell:
# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

Windows PowerShell:
# you can use a Conda environment
$token = (gcloud auth print-access-token)
pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc"

Windows Command Prompt:
# you can use a Conda environment
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "gllm-docproc"

HTML Downloader
The HTML Downloader lets you download an HTML document from a URL and save it as a JSON file.
Create a script called main.py:
from gllm_docproc.downloader.html import HTMLDownloader
source = "https://books.toscrape.com/"
output_path = "downloader/output/download"
# Initialize downloader
downloader = HTMLDownloader()
# Download input
downloader.download(source, output_path)
Run the script:
python main.py
The downloader will generate a JSON file at the specified output location.
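To check what was written, a small helper can list the JSON files under the output path and summarize each one. This is a minimal sketch that uses only the Python standard library. It assumes the output lands as `.json` files somewhere under `output_path`, and it makes no assumption about the schema the downloader uses. The name `summarize_json_outputs` is hypothetical and not part of gllm-docproc.

```python
import json
from pathlib import Path


def summarize_json_outputs(output_dir: str) -> dict[str, str]:
    """Map each JSON file under output_dir to a short description of its top-level value."""
    summaries = {}
    # Search recursively, since the exact file layout under output_dir is an assumption.
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            summaries[str(path)] = f"object with keys {sorted(data)}"
        elif isinstance(data, list):
            summaries[str(path)] = f"array with {len(data)} items"
        else:
            summaries[str(path)] = type(data).__name__
    return summaries


if __name__ == "__main__":
    # Same output path as in main.py above.
    for path, summary in summarize_json_outputs("downloader/output/download").items():
        print(f"{path}: {summary}")
```

Running this after `main.py` prints one line per JSON file found, which is a quick way to confirm the download produced output before passing it to later processing steps.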
The HTML Downloader can also crawl a site starting from a URL and save the crawled pages as JSON files.
Create a script called main.py:
from gllm_docproc.downloader.html import HTMLDownloader
source = "https://quotes.toscrape.com/"
output_path = "downloader/output/crawl"
# Initialize the downloader and set the allowed domains
downloader = HTMLDownloader(allowed_domains=["quotes.toscrape.com"])
# Download input
downloader.download_crawl(source, output_path)
Run the script:
python main.py
The downloader will generate JSON files at the specified output location.
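The `allowed_domains` argument in the script above suggests that the crawl only follows links on the listed domains. The exact matching rules used by `HTMLDownloader` are not described on this page, so the sketch below shows one common interpretation: a URL is allowed when its host equals a listed domain or is a subdomain of one. The helper `is_allowed` is hypothetical and only illustrates the idea.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed = ["quotes.toscrape.com"]
print(is_allowed("https://quotes.toscrape.com/page/2/", allowed))  # True
print(is_allowed("https://books.toscrape.com/", allowed))  # False
```

Under this interpretation, a crawl that starts at `https://quotes.toscrape.com/` would skip links that lead to other sites, which keeps the number of generated JSON files bounded.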
The HTML Downloader can also download the pages listed in a sitemap link and save them as JSON files.
Create a script called main.py:
Run the script:
The downloader will generate JSON files at the specified output location.
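To get a feel for which pages a sitemap link covers, the sitemap XML can be inspected with the Python standard library. The sketch below extracts every `<loc>` entry from a sitemap that follows the sitemaps.org 0.9 schema. It is an illustration only, not how gllm-docproc processes sitemaps; the helper name `extract_sitemap_urls` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the URLs found in the <loc> elements of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_sitemap_urls(example))
# ['https://example.com/', 'https://example.com/about']
```

Each URL in the returned list corresponds to one page that a sitemap-based download would be expected to visit.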
Firecrawl Downloader
The Firecrawl Downloader lets you download an HTML document from a URL through the Firecrawl service and save it as a JSON file. You need a Firecrawl API key to use it.
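Because an API key is required, one option is to keep it out of `main.py` and read it from the environment instead. The sketch below is a small standard-library helper for that. The environment variable name `FIRECRAWL_API_KEY` and the helper `load_firecrawl_api_key` are assumptions made for illustration; how the key is passed to the downloader is not shown here.

```python
import os


def load_firecrawl_api_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running main.py"
        )
    return api_key
```

Failing early with a clear message avoids a less obvious authentication error later, when the request to the Firecrawl service is made.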
Create a script called main.py:
Run the script:
The downloader will generate output along these lines:
You can also access the Firecrawl instance if you need to call Firecrawl methods directly: