pdf2json

pdf2json is a Node.js module that converts binary PDF files to JSON and text. Built with pdf.js, it extracts text content and interactive form elements for server-side processing and command-line use.

Features

PDF text extraction: extracts textual content of PDF documents into structured JSON.
Form element handling: parses interactive form fields within PDFs for flexible data capture.
Server-side and command-line versatility: Integrate with web services for remote PDF processing or use as a standalone command-line tool for local file conversion.
Swift performance: fast parsing with zero dependencies (since v3.1.6).
Community driven: more than a decade of community-driven development ensures continuous improvement.
Zero dependencies: completely dependency-free since v3.1.6, pure JavaScript code only.

Install

npm i pdf2json

Or, install it globally:

npm i pdf2json -g

To update to the latest version:

npm update pdf2json -g

To Run in RESTful Web Service or as command line Utility

More details can be found at the bottom of this document.

Test

After install, run command line:

npm test

The pretest step builds bundles and source maps for both ES Module and CommonJS, output to the ./dist directory. The Jest test suite is defined in ./test/_test_.cjs using CommonJS; the test run also covers parse-r and parse-fd with ES Modules via the command line.

The default Jest test suite contains the essential tests for all PRs, but it covers only a portion of the test PDFs. For broader coverage, run:

npm run test:forms

It scans and parses 260 PDF AcroForm files under ./test/pdf, running with the -s -t -c -m command line options, and generates the primary output JSON, additional text content JSON, form fields JSON and merged text file for each PDF. It usually takes about 20s on my MacBook Pro to complete; check ./test/target/ for the outputs.

Update on 4/27/2024: parsing 260 PDFs with npm run test:forms on an M2 Mac takes 7~8s.

To run the Jest test suite with the CommonJS bundle only:

npm run test:jest

Test Exception Handlings

After install, run command line:

npm run test:misc

It scans and parses all PDF files under ./test/pdf/misc, also running with the -s -t -c -m command line options, and generates the primary output JSON, additional text content JSON, form fields JSON and merged text JSON file for 15 PDF files. 12 are expected to succeed, while the other three are expected to throw exceptions that are caught with a stack trace:

bad XRef entry for pdf/misc/i200_test.pdf
unsupported encryption algorithm for pdf/misc/i43_encrypted.pdf
Invalid XRef stream header for pdf/misc/i243_problem_file_anon.pdf

Test Streams

After install, run command line:

npm run parse-r

It scans 165 PDF files under ./test/pdf/fd/form/, parses them with the Stream API, then writes the output to ./test/target/fd/form/.

More test scripts with different command line options can be found at package.json.

Disabling Test logs

For CI/CD, you will probably want to disable unnecessary logs during unit testing.

The code has two types of logs:

The logs that consume the console.log and console.warn APIs;
And the logs that consume our own base/shared/util.js log function.

To disable the first type, you can mock the console.log and console.warn APIs. To disable the second type, you can set the environment variable PDF2JSON_DISABLE_LOGS to "1", pass -s (silent) on the command line, or pass a VERBOSITY_LEVEL of 0 when invoking PDFParser.loadPDF (e.g. src/cli/p2jcli.js).
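
The two approaches above can be sketched roughly as follows. This is a minimal illustration, not code from the project: muteConsole is a hypothetical helper, and setting the environment variable before pdf2json is loaded is an assumption made to be safe rather than a documented requirement.

```javascript
// Second type of log: the environment variable named above.
// Setting it before pdf2json is loaded is an assumption, not a documented rule.
process.env.PDF2JSON_DISABLE_LOGS = "1";

// First type of log: swap console.log / console.warn for no-ops and
// return a function that puts the originals back.
// muteConsole is a hypothetical helper, not part of pdf2json.
function muteConsole() {
  const original = { log: console.log, warn: console.warn };
  console.log = () => {};
  console.warn = () => {};
  return () => {
    console.log = original.log;
    console.warn = original.warn;
  };
}

// Usage inside a test:
//   const restore = muteConsole();
//   ... run the parser ...
//   restore();
```

With Jest, a spy-based mock on console.log and console.warn would serve the same purpose; the plain version here avoids assuming any test framework.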

Code Example

Parse a PDF file then write to a JSON file:

import fs from "fs";
import PDFParser from "pdf2json"; 

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
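pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.json",
  JSON.stringify(pdfData),
  // fs.writeFile passes an error (or null on success) to its callback
  (err) => (err ? console.error(err) : console.log("Done."))
 );
});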

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Or, call directly with buffer:

fs.readFile(pdfFilePath, (err, pdfBuffer) => {
 if (!err) {
  pdfParser.parseBuffer(pdfBuffer);
 }
});

Or, use more granular page-level parsing events (v2.0.0):

pdfParser.on("readable", (meta) => console.log("PDF Metadata", meta));
pdfParser.on("data", (page) =>
 console.log(page ? "One page parsed" : "All pages parsed", page)
);
pdfParser.on("error", (err) => console.error("Parser Error", err));
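
If you would rather await the whole document than react to each page, the page-level events can be gathered into a single Promise. The sketch below assumes only the three events shown above; collectPages is a hypothetical helper, not part of pdf2json, and it takes the parser as a parameter so it makes no assumption about how the parser was created.

```javascript
// collectPages is a hypothetical helper, not part of pdf2json.
// It resolves with the metadata and every parsed page once the
// "data" event delivers null (the end-of-pages marker).
function collectPages(pdfParser) {
  return new Promise((resolve, reject) => {
    let meta = null;
    const pages = [];

    pdfParser.on("readable", (m) => {
      meta = m;
    });
    pdfParser.on("data", (page) => {
      if (page) {
        pages.push(page);
      } else {
        resolve({ meta, pages });
      }
    });
    pdfParser.on("error", reject);
  });
}

// Usage: attach the listeners first, then start parsing.
//   const done = collectPages(pdfParser);
//   pdfParser.loadPDF(pdfFilePath);
//   const { meta, pages } = await done;
```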

Parse a PDF then write a .txt file (which only contains textual content of the PDF)

import fs from "fs";
import PDFParser from "pdf2json"; 

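// "this" is undefined at the top level of an ES module, so pass null as the context;
// the truthy second argument asks the parser to keep the raw text content
const pdfParser = new PDFParser(null, 1);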

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.content.txt",
  pdfParser.getRawTextContent(),
  () => {
   console.log("Done.");
  }
 );
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Parse a PDF then write a fields.json file that only contains interactive forms' fields information:

import fs from "fs";
import PDFParser from "pdf2json"; 

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.fields.json",
  JSON.stringify(pdfParser.getAllFieldsTypes()),
  () => {
   console.log("Done.");
  }
 );
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Alternatively, you can pipe input and output streams: (requires v1.1.4)

import fs from "fs";
import PDFParser from "pdf2json";

const inputStream = fs.createReadStream(
 "./pdf2json/test/pdf/fd/form/F1040EZ.pdf",
 // highWaterMark is the current Node.js name for the read buffer size option
 { highWaterMark: 64 * 1024 }
);
const outputStream = fs.createWriteStream(
 "./pdf2json/test/target/fd/form/F1040EZ.json"
);

inputStream
 .pipe(new PDFParser())
 .pipe(new StringifyStream())
 .pipe(outputStream);

With v2.0.0, the pipe chain above changes to:

inputStream
 .pipe(this.pdfParser.createParserStream())
 .pipe(new StringifyStream())
 .pipe(outputStream);

For additional output streams support:

// private methods
#generateMergedTextBlocksStream() {
  return new Promise((resolve, reject) => {
    const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".merged.json"), resolve, reject);
    this.pdfParser.getMergedTextBlocksStream().pipe(new StringifyStream()).pipe(outputStream);
  });
}

#generateRawTextContentStream() {
  return new Promise((resolve, reject) => {
    const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".content.txt"), resolve, reject);
    this.pdfParser.getRawTextContentStream().pipe(outputStream);
  });
}

#generateFieldsTypesStream() {
  return new Promise((resolve, reject) => {
    const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".fields.json"), resolve, reject);
    this.pdfParser.getAllFieldsTypesStream().pipe(new StringifyStream()).pipe(outputStream);
  });
}

#processAdditionalStreams() {
  const outputTasks = [];
  if (PROCESS_FIELDS_CONTENT) { // needs to generate fields.json file
    outputTasks.push(this.#generateFieldsTypesStream());
  }
  if (PROCESS_RAW_TEXT_CONTENT) { // needs to generate content.txt file
    outputTasks.push(this.#generateRawTextContentStream());
  }
  if (PROCESS_MERGE_BROKEN_TEXT_BLOCKS) { // needs to generate json file with merged broken text blocks
    outputTasks.push(this.#generateMergedTextBlocksStream());
  }
  return Promise.allSettled(outputTasks);
}

Note: if the primary JSON parsing throws an exception, none of the additional streams will be processed. See p2jcmd.js for more details.
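
Because Promise.allSettled never rejects, a caller of a method like the one above has to inspect each result to find out whether an additional output failed. A minimal sketch of that check follows; summarizeSettled and the labels array are hypothetical names introduced here for illustration and do not appear in p2jcmd.js.

```javascript
// summarizeSettled is a hypothetical helper, not part of pdf2json.
// results: the array resolved by Promise.allSettled
// labels:  a name for each task, in the same order the tasks were pushed
function summarizeSettled(results, labels) {
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === "rejected") {
      failed.push({ output: labels[i], reason: result.reason });
    }
  });
  return { ok: failed.length === 0, failed };
}

// Usage:
//   const results = await Promise.allSettled(outputTasks);
//   const { ok, failed } = summarizeSettled(results, ["fields", "content", "merged"]);
//   if (!ok) failed.forEach((f) => console.error(f.output, f.reason));
```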

API Reference

Events:
- pdfParser_dataError: raised when parsing fails
- pdfParser_dataReady: raised when parsing succeeds

Alternative events (v2.0.0):
- readable: first event dispatched after PDF file metadata is parsed and before processing any page
- data: one page parsed successfully; null means the last page has been processed and signals the end of the data stream
- error: an exception or error occurred

Start parsing a PDF file from the specified file path asynchronously:

    function loadPDF(pdfFilePath);

If parsing fails, the event "pdfParser_dataError" is raised with the error object {"parserError": errObj}. If it succeeds, the event "pdfParser_dataReady" is raised with the output data object {"formImage": parseOutput}, which can be saved as a JSON file (on the command line) or serialized to JSON when running in a web service. Note: "formImage" is removed as of v2.0.0; see breaking changes for details.
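
The success and failure events described above map naturally onto a Promise. The sketch below is one way to wrap them, assuming only the two event names and the loadPDF call documented here; parsePdf is a hypothetical helper, not part of pdf2json.

```javascript
// parsePdf is a hypothetical helper, not part of pdf2json.
// It resolves with the parsed output, or rejects with the parserError
// carried by the "pdfParser_dataError" event.
function parsePdf(pdfParser, pdfFilePath) {
  return new Promise((resolve, reject) => {
    pdfParser.on("pdfParser_dataError", (errData) =>
      reject(errData.parserError)
    );
    pdfParser.on("pdfParser_dataReady", (pdfData) => resolve(pdfData));
    pdfParser.loadPDF(pdfFilePath);
  });
}

// Usage:
//   const pdfData = await parsePdf(new PDFParser(), "./some/file.pdf");
```

Passing the parser in, rather than constructing it inside the helper, leaves constructor options up to the caller.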

Get all textual content from "pdfParser_dataReady" event handler:

    function getRawTextContent();

Returns the text as a string.

Get all input fields information from "pdfParser_dataReady" event handler:

    function getAllFieldsTypes();
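
Both getters can be read in the same "pdfParser_dataReady" handler. The sketch below shows that pattern; onTextAndFields is a hypothetical helper, and it assumes the parser was constructed with the raw text option enabled, as in the .txt example earlier.

```javascript
// onTextAndFields is a hypothetical helper, not part of pdf2json.
// It calls handler once parsing succeeds, passing the raw text and
// the form field information read from the parser.
function onTextAndFields(pdfParser, handler) {
  pdfParser.on("pdfParser_dataReady", () => {
    handler({
      text: pdfParser.getRawTextContent(),
      fields: pdfParser.getAllFieldsTypes(),
    });
  });
}

// Usage:
//   onTextAndFields(pdfParser, ({ text, fields }) => {
//     console.log(text.length, fields.length);
//   });
//   pdfParser.loadPDF(pdfFilePath);
```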
