Parser

The Parser assigns a structure to each element in the output from Loader.

To give you an idea of what a Parser does, here is a snippet of sample JSON output:

{
    "text": "[Header] This is the Header of the Document",
    "structure": "header", // 👈 parser defines this as "header" (previous value from Loader is "uncategorized")
    "metadata": {
        "source": "pdf-example.pdf",
        "source_type": "pdf",
        ...
    }
}

Compare this with the output from Loader, where the same element still has the structure "uncategorized".
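To make the before and after concrete, here is a minimal Python sketch of a parsing step. The function name `parse_element` and the rule it applies (text beginning with "[Header]" becomes a header) are assumptions made for illustration only; the actual Parser's rules and API are not described on this page.

```python
from copy import deepcopy


def parse_element(element: dict) -> dict:
    """Return a copy of a Loader element with its structure filled in.

    Hypothetical helper: the real Parser's classification logic is not shown here.
    """
    parsed = deepcopy(element)  # leave the Loader output untouched
    # Illustrative rule only: treat text starting with "[Header]" as a header.
    if parsed["text"].startswith("[Header]"):
        parsed["structure"] = "header"
    return parsed


# Element as it might arrive from Loader, still uncategorized.
loader_element = {
    "text": "[Header] This is the Header of the Document",
    "structure": "uncategorized",
    "metadata": {"source": "pdf-example.pdf", "source_type": "pdf"},
}

parsed_element = parse_element(loader_element)
```

Only the `structure` field changes in this sketch; `text` and `metadata` are carried over as they were.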

Possible structure values:

  1. PAGE

  2. HEADER

  3. TITLE

  4. HEADING (HEADING 1 through HEADING 6)

  5. PARAGRAPH

  6. FOOTER

  7. FOOTNOTE

  8. TABLE

  9. IMAGE

  10. AUDIO

  11. VIDEO

  12. UNCATEGORIZED
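If you consume Parser output in Python, the values listed above could be modeled as an enumeration. This is a sketch under assumptions: the class name `Structure` and the helper `is_heading` are hypothetical, the lowercase string values follow the "header" and "uncategorized" spellings seen in the sample JSON, and the spelling of the six heading levels ("heading_1" through "heading_6") is a guess that should be checked against real output.

```python
from enum import Enum


class Structure(str, Enum):
    """Hypothetical enumeration of the structure values listed above."""

    PAGE = "page"
    HEADER = "header"
    TITLE = "title"
    # Assumed spelling for HEADING 1 through HEADING 6.
    HEADING_1 = "heading_1"
    HEADING_2 = "heading_2"
    HEADING_3 = "heading_3"
    HEADING_4 = "heading_4"
    HEADING_5 = "heading_5"
    HEADING_6 = "heading_6"
    PARAGRAPH = "paragraph"
    FOOTER = "footer"
    FOOTNOTE = "footnote"
    TABLE = "table"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    UNCATEGORIZED = "uncategorized"


def is_heading(structure: Structure) -> bool:
    """True for any of the six heading levels."""
    return structure.name.startswith("HEADING_")


# Look up a member from the string found in a parsed element.
header = Structure("header")
heading_levels = [s for s in Structure if is_heading(s)]
```

Subclassing `str` keeps the members comparable to the raw strings in the JSON, so `Structure.HEADER == "header"` holds without conversion.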

Our Parser has the following subcomponents:

  1. Document

    1. DOCX Parser

    2. PDF Parser

    3. PPTX Parser

    4. TXT Parser

    5. XLSX Parser

  2. HTML Parser

    1. HTML Flat Parser

  3. Image Parser

    1. Image MIME Normalizer Parser

    2. Image Plain Small Filter Parser

  4. Table Parser

    1. Table Caption Parser
