Parser

The Parser is responsible for assigning a structure to each element in the output from the Loader.

To give you an idea of what the Parser does, here is a snippet of sample JSON output:

{
    "text": "[Header] This is the Header of the Document",
    "structure": "header", // 👈 parser defines this as "header" (previous value from Loader is "uncategorized")
    "metadata": {
        "source": "pdf-example.pdf",
        "source_type": "pdf",
        ...
    }
}

You can compare this with the output from the Loader.
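
As a rough illustration of the step shown in the sample, the following Python sketch takes a Loader-style element whose structure is "uncategorized" and assigns "header" to it. The function name `parse_element` and the `[Header]` prefix rule are assumptions made up for this example; they are not the Parser's actual API or classification logic.

```python
from copy import deepcopy


def parse_element(element: dict) -> dict:
    """Return a copy of a Loader element with its structure assigned.

    Hypothetical rule, for illustration only: text that starts with
    "[Header]" is classified as "header"; anything else is left as is.
    """
    parsed = deepcopy(element)  # leave the Loader output untouched
    if parsed.get("text", "").startswith("[Header]"):
        parsed["structure"] = "header"
    return parsed


# Element as it might arrive from the Loader (fields taken from the sample above).
loader_element = {
    "text": "[Header] This is the Header of the Document",
    "structure": "uncategorized",
    "metadata": {"source": "pdf-example.pdf", "source_type": "pdf"},
}

parsed_element = parse_element(loader_element)
print(parsed_element["structure"])
```

Returning a copy keeps the Loader output available for comparison with the parsed result.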

Possible structure values:

PAGE
HEADER
TITLE
HEADING (HEADING 1 through HEADING 6)
PARAGRAPH
FOOTER
FOOTNOTE
TABLE
IMAGE
AUDIO
VIDEO
UNCATEGORIZED
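
If you consume parsed elements in your own code, it can help to model the values listed above as an enumeration. This is a minimal sketch, not part of the Parser itself: the class name `Structure`, the helper `is_known_structure`, and the lowercase string values (in particular the spelling of the heading levels, such as "heading 1") are assumptions based on the sample output, where the value appears as "header".

```python
from enum import Enum


class Structure(str, Enum):
    """Structure values listed above (string spellings are assumed)."""

    PAGE = "page"
    HEADER = "header"
    TITLE = "title"
    HEADING_1 = "heading 1"
    HEADING_2 = "heading 2"
    HEADING_3 = "heading 3"
    HEADING_4 = "heading 4"
    HEADING_5 = "heading 5"
    HEADING_6 = "heading 6"
    PARAGRAPH = "paragraph"
    FOOTER = "footer"
    FOOTNOTE = "footnote"
    TABLE = "table"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    UNCATEGORIZED = "uncategorized"


def is_known_structure(value: str) -> bool:
    """Check, case-insensitively, whether a string is one of the listed values."""
    return value.lower() in {member.value for member in Structure}


print(is_known_structure("header"))
print(is_known_structure("sidebar"))
```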

Our Parser has the following subcomponents:

Document
HTML Parser
1. HTML Flat Parser
Image Parser
1. Image MIME Normalizer Parser
2. Image Plain Small Filter Parser
Table Parser
1. Table Caption Parser
