Parser
The Parser defines the structure of each element based on the output from the Loader.
To give you an idea of what a Parser does, here is a snippet of sample JSON output:
{
"text": "[Header] This is the Header of the Document",
"structure": "header", // 👈 parser defines this as "header" (previous value from Loader is "uncategorized")
"metadata": {
"source": "pdf-example.pdf",
"source_type": "pdf",
...
}
}
You can compare this with the output from the Loader here.
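As a rough illustration of that step, the sketch below takes a Loader-style element whose structure is "uncategorized" and assigns it a structure. The function name `parse_element` and the "[Header]" prefix rule are hypothetical and exist only for this example; the actual Parser components may use very different logic.

```python
import re
from copy import deepcopy

# Hypothetical rule for this sketch: text starting with "[Header]" is a header.
HEADER_MARKER = re.compile(r"^\[Header\]", re.IGNORECASE)


def parse_element(element: dict) -> dict:
    """Return a copy of a Loader element with its "structure" field filled in."""
    parsed = deepcopy(element)  # leave the Loader output untouched
    if parsed.get("structure", "uncategorized") != "uncategorized":
        return parsed  # already categorized, nothing to do
    if HEADER_MARKER.match(parsed.get("text", "")):
        parsed["structure"] = "header"
    else:
        parsed["structure"] = "uncategorized"  # no rule matched
    return parsed


# Loader-style input; only the metadata fields shown in the sample are included.
loader_element = {
    "text": "[Header] This is the Header of the Document",
    "structure": "uncategorized",
    "metadata": {"source": "pdf-example.pdf", "source_type": "pdf"},
}

parsed_element = parse_element(loader_element)
print(parsed_element["structure"])  # header
```

The function returns a copy rather than mutating its input, so the Loader output stays available for comparison.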
Possible structure values:
PAGE
HEADER
TITLE
HEADING (HEADING 1 through HEADING 6)
PARAGRAPH
FOOTER
FOOTNOTE
TABLE
IMAGE
AUDIO
VIDEO
UNCATEGORIZED
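If you consume Parser output downstream, a small check against these values can catch unexpected structures early. The sketch below is only an assumption-based example: it assumes the list reads as PAGE, HEADER, TITLE, and so on, that values are lowercase in the JSON (as "header" is in the sample), and that heading levels are spelled like "heading 1". The names `BASE_STRUCTURES` and `is_valid_structure` are hypothetical.

```python
import re

# Assumed lowercase spellings of the structure values listed above.
BASE_STRUCTURES = {
    "page", "header", "title", "heading", "paragraph", "footer",
    "footnote", "table", "image", "audio", "video", "uncategorized",
}

# Assumed spelling for leveled headings: "heading 1" through "heading 6".
HEADING_LEVEL = re.compile(r"^heading [1-6]$")


def is_valid_structure(value: str) -> bool:
    """Check a structure value against the assumed set, ignoring case."""
    normalized = value.strip().lower()
    return normalized in BASE_STRUCTURES or bool(HEADING_LEVEL.match(normalized))
```

For example, `is_valid_structure("HEADING 3")` passes this check while `is_valid_structure("heading 7")` does not.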
Our Parser has the following subcomponents:
Document
DOCX Parser
PDF Parser
PPTX Parser
TXT Parser
XLSX Parser
HTML Parser
HTML Flat Parser
Image Parser
Image MIME Normalizer Parser
Image Plain Small Filter Parser
Table Parser
Table Caption Parser