HTML

gllm-docproc | Tutorial: HTML Loader | Use Case: Advanced DPO Pipeline | API Reference

HTML Loader is a component that extracts information from HTML documents and converts it into a standardized JSON format.

Prerequisites

This example requires that you complete all setup steps listed on the Prerequisites page.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-docproc[html]"

You can use the following as a sample file: html-example.json.

HTML Flat Loader

HTMLFlatLoader is responsible for extracting elements such as Text, Table, Hyperlink, Image, and Video from a website.
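To make the idea of "flat" extraction concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not how HTMLFlatLoader is implemented, and the element fields (`type`, `text`, `href`, `src`) are hypothetical names chosen for illustration; they do not reflect the actual gllm-docproc output schema. It only shows what turning nested HTML into a flat list of elements can look like.

```python
from html.parser import HTMLParser


class FlatElementParser(HTMLParser):
    """Collect text, hyperlinks, and images as a flat list of dicts.

    Illustrative only: the dict keys are hypothetical, not the gllm-docproc schema.
    """

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.elements.append({"type": "hyperlink", "href": attrs["href"]})
        elif tag == "img" and attrs.get("src"):
            self.elements.append({"type": "image", "src": attrs["src"]})

    def handle_data(self, data):
        # Skip whitespace-only runs between tags
        text = data.strip()
        if text:
            self.elements.append({"type": "text", "text": text})


def flatten_html(html):
    """Parse an HTML string and return its elements in document order."""
    parser = FlatElementParser()
    parser.feed(html)
    parser.close()
    return parser.elements


sample = '<h1>Title</h1><p>See <a href="https://example.com">the docs</a>.</p><img src="logo.png">'
elements = flatten_html(sample)
for element in elements:
    print(element)
```

The nesting of the original markup is discarded and every element lands in a single ordered list, which is the sense of "flat" assumed here.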

1. Create a script called main.py:

from gllm_docproc.loader.html import HTMLFlatLoader

source = "[HTML CONTENT FROM html-example.json FILE]"

# initialize the HTMLFlatLoader
loader = HTMLFlatLoader()

# load the source
loaded_elements = loader.load(source)
2. Run the script:

python main.py
3. The loader will generate the following: output JSON.
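The script above keeps the result in `loaded_elements` without printing or saving it. One way to inspect it is to write it to a JSON file. The sketch below assumes that the loader returns JSON-serializable data, such as a list of dictionaries. `save_elements` is a hypothetical helper, and the stand-in data uses made-up field names so the example runs without gllm-docproc installed.

```python
import json
import tempfile
from pathlib import Path


def save_elements(elements, path):
    """Write loader output to a UTF-8 JSON file and return the path."""
    out = Path(path)
    out.write_text(json.dumps(elements, indent=2, ensure_ascii=False), encoding="utf-8")
    return out


# Stand-in for `loaded_elements`; the field names here are hypothetical
loaded_elements = [{"text": "Hello", "structure": "paragraph"}]

# Write to the system temp directory to avoid cluttering the working directory
saved = save_elements(loaded_elements, Path(tempfile.gettempdir()) / "output.json")
print(saved.read_text(encoding="utf-8"))
```

In main.py, you would pass the real `loaded_elements` returned by `loader.load(source)` in place of the stand-in list.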
