Parser

The Parser is responsible for assigning a structure to each element in the output from the Loader.

To give you an idea of what a Parser does, here is a snippet of sample JSON output:

{
    "text": "[Header] This is the Header of the Document",
    "structure": "header", // 👈 parser defines this as "header" (previous value from Loader is "uncategorized")
    "metadata": {
        "source": "pdf-example.pdf",
        "source_type": "pdf",
        ...
    }
}

You can compare this with the output from the Loader, where the same element still has the structure "uncategorized".
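
As a rough illustration of the reclassification shown in the JSON above, here is a minimal Python sketch. The function name `classify_header` and the `[Header]` prefix rule are assumptions made up for this example; the Parser's actual detection logic is not described on this page and may differ.

```python
from copy import deepcopy


def classify_header(element: dict) -> dict:
    """Return a copy of the element, relabelled as "header" when it matches.

    Hypothetical rule for illustration only: an element that the Loader left
    as "uncategorized" and whose text starts with "[Header]" becomes "header".
    """
    parsed = deepcopy(element)  # leave the Loader output untouched
    is_uncategorized = parsed.get("structure") == "uncategorized"
    has_header_prefix = parsed.get("text", "").startswith("[Header]")
    if is_uncategorized and has_header_prefix:
        parsed["structure"] = "header"
    return parsed


# Element as it might look coming out of the Loader (metadata shortened).
loader_element = {
    "text": "[Header] This is the Header of the Document",
    "structure": "uncategorized",
    "metadata": {"source": "pdf-example.pdf", "source_type": "pdf"},
}

parsed_element = classify_header(loader_element)
print(parsed_element["structure"])
```

Working on a copy keeps the Loader output available for comparison after parsing.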

Possible structure values:

  1. PAGE

  2. HEADER

  3. TITLE

  4. HEADING (HEADING 1 through HEADING 6)

  5. PARAGRAPH

  6. FOOTER

  7. FOOTNOTE

  8. TABLE

  9. IMAGE

  10. AUDIO

  11. VIDEO

  12. UNCATEGORIZED
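
If you need to check structure values in your own code, the list above could be modelled as an enum. This is only a sketch: the class name `Structure` and the helper `is_known_structure` are hypothetical, and the lowercase spellings are assumed from the "header" and "uncategorized" values in the sample JSON. The heading levels (HEADING 1 through HEADING 6) are not modelled separately because their exact string form is not shown on this page.

```python
from enum import Enum


class Structure(str, Enum):
    """Hypothetical enum mirroring the structure values listed above."""

    PAGE = "page"
    HEADER = "header"
    TITLE = "title"
    HEADING = "heading"  # levels 1 to 6 are not modelled in this sketch
    PARAGRAPH = "paragraph"
    FOOTER = "footer"
    FOOTNOTE = "footnote"
    TABLE = "table"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    UNCATEGORIZED = "uncategorized"


def is_known_structure(value: str) -> bool:
    """Return True if the value matches one of the enum members."""
    try:
        Structure(value)
        return True
    except ValueError:
        return False


print(is_known_structure("header"))
print(is_known_structure("sidebar"))
```

Using a `str`-based enum means members compare equal to the plain strings found in the JSON, so `Structure.HEADER == "header"` holds.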

The Parser has the following subcomponents:

  1. HTML Parser

  2. Image Parser

    1. Image MIME Normalizer Parser

    2. Image Plain Small Filter Parser

  3. Table Parser

    1. Table Caption Parser
