PDF to JSON Text Extraction

The PDF to JSON Text Extraction component parses PDF documents and extracts text and tables page by page from unstructured data sources. It handles both digital PDFs, which are generated directly from electronic sources, and OCR (Optical Character Recognition) PDFs, which are created by converting scanned images of documents into editable and searchable formats. Document metadata is also extracted from the source PDF when it is available.

PDF to JSON Text Extraction is available as two separate components, chosen according to the speed and performance required: one uses a native extraction model, and the other uses the Gemini LLM. Both transform unstructured PDF documents into unified data that is ready for integration into various products through Datastreamer pipelines.

Example Use Cases

Financial Reports: Extract text from financial documents, such as balance sheets and executive summaries.
Technical Reports: Extract text from user manuals, product specifications, and technical guides.
Market Research: Extract text from surveys, consumer reviews, and industry analyses published as PDFs for business insights.

Here is a sample of the JSON output generated by the PDF to JSON Text Extraction component:

"results": [
  {
    "data": {
      "pdf_report": {
        "pdf_processing_time": 25,
        "total_pages": 14,
        "pdf_file_name": "Company ABC 2025 Q1 Financial summary.pdf"
      },
      "extracted_metadata": {
        "title": "",
        "author": "",
        "creation_date": "",
        "additional_metadata": {
          "subject": null,
          "description": null,
          "keywords": [],
          "last_modified_date": null,
          "version": null,
          "file_format": null,
          "language": null,
          "publisher": null,
          "document_id": null
        }
      },
      "extracted_content": [
        {
          "page_no": 1,
          "text": "Message to Stakeholders ...",
          "tables": [
            {
              "table_no": 1,
              "table_text": [
                ["Financial Results 1"]
              ],
              "table_metadata": null
            }
          ]
        },
        {
          "page_no": 2,
          "text": "Financial Statements and Product Description ..."
        },
        ...
      ]
    }
  }
]
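
As an illustration of how the sample output above might be consumed downstream, here is a minimal Python sketch. It is not part of the component itself. It assumes the "results" array arrives wrapped in an outer JSON object, and that the "tables" key can be absent on pages without tables, as on page 2 of the sample. The helper names summarize_pages and non_empty_metadata are hypothetical, and the embedded sample is trimmed to a few fields.

```python
import json

# Trimmed sample mirroring the output above. The outer braces are an
# assumption so that the fragment parses as a complete JSON document.
SAMPLE = """
{
  "results": [
    {
      "data": {
        "pdf_report": {"pdf_processing_time": 25, "total_pages": 14},
        "extracted_metadata": {
          "title": "", "author": "", "creation_date": "",
          "additional_metadata": {"subject": null, "keywords": []}
        },
        "extracted_content": [
          {"page_no": 1, "text": "Message to Stakeholders ...",
           "tables": [{"table_no": 1,
                       "table_text": [["Financial Results 1"]],
                       "table_metadata": null}]},
          {"page_no": 2,
           "text": "Financial Statements and Product Description ..."}
        ]
      }
    }
  ]
}
"""


def summarize_pages(payload):
    """Return one dict per page with its text and all table rows (hypothetical helper)."""
    pages = []
    for result in payload.get("results", []):
        data = result.get("data", {})
        for page in data.get("extracted_content", []):
            # "tables" may be missing or null on pages without tables.
            tables = page.get("tables") or []
            rows = [row for table in tables for row in (table.get("table_text") or [])]
            pages.append({
                "page_no": page["page_no"],
                "text": page.get("text", ""),
                "table_rows": rows,
            })
    return pages


def non_empty_metadata(data):
    """Flatten metadata and drop fields that are empty, null, or [] (hypothetical helper)."""
    meta = dict(data.get("extracted_metadata") or {})
    extra = meta.pop("additional_metadata", None) or {}
    meta.update(extra)
    return {key: value for key, value in meta.items() if value not in ("", None, [])}


payload = json.loads(SAMPLE)
for page in summarize_pages(payload):
    print(page["page_no"], len(page["table_rows"]), page["text"])
```

Reading "tables" defensively keeps the loop from failing on text-only pages, and filtering empty metadata reflects that these fields are populated only when the source PDF provides them.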