HTML
HTML Loader is a component that extracts information from an HTML document and converts it into a standardized JSON format.
Installation
Bash or a similar shell:

    # you can use a Conda environment
    pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

PowerShell:

    # you can use a Conda environment
    $token = (gcloud auth print-access-token)
    pip install --extra-index-url "https://oauth2accesstoken:$token@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[html]"

Windows Command Prompt:

    REM you can use a Conda environment
    FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO SET TOKEN=%T
    pip install --extra-index-url "https://oauth2accesstoken:%TOKEN%@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-docproc[html]"

You can use the following as a sample file: html-example.json.
HTML Flat Loader
HTMLFlatLoader extracts elements such as text, tables, hyperlinks, images, and videos from a website.
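To make the idea of a flat list of extracted elements concrete, here is a minimal sketch built on Python's standard html.parser module. It is not how HTMLFlatLoader is implemented, and it does not reproduce the library's output schema: the class name FlatElementCollector, the function extract_flat, and the "type" and "value" keys are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class FlatElementCollector(HTMLParser):
    """Collects text, hyperlinks, and images into one flat list, in document order.

    Illustration only: the element dictionaries use a hypothetical shape,
    not the schema produced by gllm-docproc.
    """

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.elements.append({"type": "hyperlink", "value": attributes["href"]})
        elif tag == "img" and attributes.get("src"):
            self.elements.append({"type": "image", "value": attributes["src"]})

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        text = data.strip()
        if text:
            self.elements.append({"type": "text", "value": text})


def extract_flat(html: str) -> list[dict[str, str]]:
    """Parse an HTML string and return its elements as a flat list."""
    collector = FlatElementCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return collector.elements
```

The nesting of the HTML is discarded on purpose: every element ends up at the same level of one list, which is what "flat" refers to in this sketch.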
1. Create a script called main.py:

    from gllm_docproc.loader.html import HTMLFlatLoader

    source = "[HTML CONTENT FROM html-example.json FILE]"

    # initialize the HTMLFlatLoader
    loader = HTMLFlatLoader()

    # load the source
    loaded_elements = loader.load(source)

2. Run the script:

    python main.py

3. The loader will generate the following: output JSON.
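The script keeps the result in memory and prints nothing, so it can help to write the loaded elements to a file for inspection. The following is a minimal sketch, assuming that loader.load returns a JSON-serializable list (this page describes the output as JSON, but the element schema is not shown here). The helper names save_elements and load_and_save and the file name output.json are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def save_elements(elements: list[Any], output_path: str) -> None:
    """Write loaded elements to a JSON file.

    Assumes the elements are JSON-serializable; json.dumps raises
    TypeError if they are not.
    """
    Path(output_path).write_text(
        json.dumps(elements, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def load_and_save(source: str, output_path: str = "output.json") -> list[Any]:
    """Load HTML content with HTMLFlatLoader and save the result as JSON."""
    # Imported here so save_elements stays usable without gllm-docproc installed.
    from gllm_docproc.loader.html import HTMLFlatLoader

    loader = HTMLFlatLoader()
    loaded_elements = loader.load(source)
    save_elements(loaded_elements, output_path)
    return loaded_elements
```

Calling load_and_save(source) from main.py, with source set as in step 1, would then leave an output.json file next to the script, provided the assumption about serializable elements holds.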