Textricator is a tool to extract text from documents and generate structured data.

If you have a bunch of PDFs with the same format (or one big, consistently formatted PDF) and you want to extract the data to CSV, XML, or JSON, Textricator can help! It can even work on OCR'ed documents!

Textricator is released under the GNU Affero General Public License Version 3.

Textricator is deployed to Maven Central with GAV io.mfj:textricator.

This application is actively used and developed by Measures for Justice. We welcome feedback, bug reports, and contributions. Create an issue, send a pull request, or email us at textricator@mfj.io. If you use Textricator, please let us know. Send us your mailing address and we will mail you a sticker.

io.mfj.textricator.Textricator is the main entry point for library usage.

io.mfj.textricator.cli.TextricatorCli is the command-line interface.

The CLI has three subcommands, one for each of Textricator's main features:

text - Extract text from the PDF and generate JSON.
table - Parse the text that is in columns and rows. See Table section.
form - Parse the text with a configured finite state machine. See Form section.

Quick Start

Install Java (version 11+)
- Windows & macOS: Download from https://java.com and install.
- Linux: Use your package manager.
Download the latest build of Textricator from https://repo1.maven.org/maven2/io/mfj/textricator/ - click on the directory for the latest version and download textricator-VERSION-bin.tgz (or textricator-VERSION-bin.zip for Windows).
Extract it.
Run a shell
- Windows: Run Windows PowerShell (it should be in the Start menu)
  - The following examples start with ./textricator. On Windows, use .\textricator.bat.
- macOS: Run Terminal (type "terminal" in Spotlight)
Show help
- ./textricator --help
Download the example files to the textricator directory:
- https://github.com/measuresforjustice/textricator/blob/main/src/test/resources/io/mfj/textricator/examples/school-employee-list.pdf
- https://github.com/measuresforjustice/textricator/blob/main/src/test/resources/io/mfj/textricator/examples/school-employee-list.yml
Extract raw text from a PDF to standard out
- ./textricator text --input-format=pdf.pdfbox school-employee-list.pdf
Parse a PDF to CSV
- ./textricator form --config=school-employee-list.yml school-employee-list.pdf school-employee-list.csv
  - This uses the configuration file school-employee-list.yml to parse school-employee-list.pdf. To parse your own PDF form, you will need to write your own configuration file. See the Form section for details. If your PDF has a tabular layout, see the Table section.

Logging

Use the --debug flag to log everything. Logging is written to standard error.

Textricator uses SLF4J for logging, with the Logback implementation. If you are using Textricator as a library, you may want to exclude ch.qos.logback:logback-classic. Textricator does not include a /logback.xml, so it will not conflict with other logging configurations, so long as TextricatorCli.main() is not invoked.

Extracting text

To extract the text from a PDF, run textricator text --input-format=pdf.itext5 input.pdf input-text.csv for any input.pdf and then open input-text.csv in your favorite spreadsheet program. It shows every bit of text that Textricator sees, along with its position, size, and font information. This information is very useful for building the configuration to parse tables or forms using Textricator (see the following two sections).

Try --input-format=pdf.itext7 and --input-format=pdf.pdfbox to see how Textricator extracts the text using the different parser engines. Some engines work better than others for a given document.
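
When building a table configuration, it can help to see which left-edge x-coordinates occur most often in the extracted text, since those are likely column starts. The following is a minimal Python sketch of that idea, not part of Textricator. The column names ulx, uly, and content are assumptions made for illustration; check the header row of your own input-text.csv and adjust x_column accordingly. The sample rows are made up.

```python
import csv
import io
from collections import Counter


def common_left_edges(csv_text, x_column="ulx", top_n=5):
    """Return the most frequent left-edge x-coordinates, rounded to whole points."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(round(float(row[x_column])) for row in reader)
    return counts.most_common(top_n)


# Made-up sample rows standing in for input-text.csv.
# The header names here are hypothetical.
sample = """ulx,uly,content
0.2,90.0,first cell
132.1,90.0,second cell
0.0,104.0,first cell
131.9,104.0,second cell
"""

print(common_left_edges(sample))
```

For a real file, read its contents with open("input-text.csv", newline="").read() and pass the result to common_left_edges.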

Table

In table mode, the data is grouped into columns based on the x-coordinate of the text.
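
As a rough illustration of that grouping, here is a minimal Python sketch. It assumes a text segment belongs to the column with the largest start x-coordinate that is at or to the left of the segment's own x-coordinate. That rule is an assumption made for this sketch, not a description of Textricator's internals, and assign_column is a hypothetical helper name.

```python
from bisect import bisect_right


def assign_column(cols, x):
    """Return the column whose start is the largest one <= x, or None if x is left of all columns."""
    ordered = sorted(cols.items(), key=lambda item: item[1])
    starts = [start for _, start in ordered]
    index = bisect_right(starts, x) - 1
    return ordered[index][0] if index >= 0 else None


# Column starts in points from the left edge of the page.
cols = {"name": 0, "launched": 132, "speed": 235}

print(assign_column(cols, 140.5))
print(assign_column(cols, 235))
```

Text that starts exactly on a column boundary lands in that column in this sketch, so pick column starts slightly to the left of where the text actually begins.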

Example

This is an example for src/test/resources/io/mfj/textricator/examples/probes.pdf.

# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.

# Use the built-in pdfbox extractor
extractor: "pdf.pdfbox"

# Ignore everything above 88pt from the top
top: 88

# Ignore everything below 170pt from the top
bottom: 170

# If multiple text segments are within 2pt vertically, consider them in the same row.
maxRowDistance: 2

# Define the columns, based on the x-coordinate where the column starts:
cols:
  "name": 0
  "launched": 132
  "speed": 235
  "cospar": 249
  "power": 355
  "mass": 415

types:
  "name":
    label: "Name"

  "launched":
    label: "Launch Date"

  "speed":
    label: "Speed (km/s)"
    type: "number"

  "cospar":
    label: "COSPAR ID"

  "power":
    label: "Power (watts)"
    type: "number"

  "mass":
    label: "Mass (pounds)"
    # Add .0 to the end of mass
    replacements:
      -
        pattern: "^(.*)$"
        replacement: "$1.0"

# Omit if Power is less than 200
filter: 'power >= 200'
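
To make the effect of replacements and filter concrete, here is a minimal Python sketch of the same two steps applied to already-parsed rows. It is an illustration under assumptions, not Textricator's implementation: the rows are made up, keep_row is a hand-written stand-in for the filter expression, and the order in which Textricator applies the two steps is not modeled. Note that the configuration writes the group reference as $1, while Python's re module writes it as \g<1>.

```python
import re


def apply_replacements(value, replacements):
    """Apply each pattern/replacement rule to the value, in order."""
    for rule in replacements:
        value = re.sub(rule["pattern"], rule["replacement"], value)
    return value


def keep_row(row):
    """Stand-in for the filter expression 'power >= 200'."""
    return row["power"] >= 200


# Python equivalent of pattern "^(.*)$" with replacement "$1.0".
mass_rules = [{"pattern": r"^(.*)$", "replacement": r"\g<1>.0"}]

# Made-up rows; real values come from the parsed PDF.
rows = [
    {"name": "Probe A", "power": 470.0, "mass": "1592"},
    {"name": "Probe B", "power": 155.0, "mass": "569"},
]

kept = [
    {**row, "mass": apply_replacements(row["mass"], mass_rules)}
    for row in rows
    if keep_row(row)
]
print(kept)
```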

Form

In form mode, Textricator parses the data using a finite-state machine (FSM). The FSM, along with additional parsing and formatting parameters, is configured in a YAML file, which is passed with the command-line option --config.
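
To give a feel for how such a state machine consumes text, here is a minimal Python sketch. It is an illustration under assumptions, not Textricator's implementation. For each text segment it checks the current state's transitions in order and follows the first one whose condition matches. A state marked startRecord begins a new record, and a state with include set to false does not store its text. The sketch stores each value under the state's name and skips text that matches no transition; both choices are assumptions. The "Name:" label, the "name" state, and the sample texts are made up.

```python
import re


def run_fsm(states, initial_state, texts):
    """Walk the texts through the state machine and return the collected records."""
    records = []
    current = initial_state
    for text in texts:
        for condition, next_state in states[current]["transitions"]:
            if condition(text):
                current = next_state
                break
        else:
            # Assumption for this sketch: skip text that matches no transition.
            continue
        state = states[current]
        if state.get("startRecord", False):
            records.append({})
        if state.get("include", True) and records:
            records[-1][current] = text
    return records


def is_id(text):
    return re.fullmatch(r"ID-\d{4}", text) is not None


def is_name_label(text):
    return text == "Name:"


def is_value(text):
    return not is_id(text) and not is_name_label(text)


states = {
    "INIT": {"transitions": [(is_id, "employee")]},
    "employee": {"startRecord": True, "transitions": [(is_name_label, "namelabel")]},
    "namelabel": {"include": False, "transitions": [(is_value, "name")]},
    "name": {"transitions": [(is_id, "employee")]},
}

texts = ["ID-0001", "Name:", "Ada", "ID-0002", "Name:", "Grace"]
records = run_fsm(states, "INIT", texts)
print(records)
```

In a real configuration the conditions are expressions over the text's position, font, and content rather than Python functions, as described in the Conditions section.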

Conditions

State transitions are selected by evaluating conditions. Conditions are expressions parsed by Expr.

Available variables:

ulx - x coordinate of the upper-left corner of the text box
uly - y coordinate of the upper-left corner of the text box
lrx - x coordinate of the lower-right corner of the text box
lry - y coordinate of the lower-right corner of the text box
text - the text
page - page number
page_prev - page number of the previous text
fontSize - font size
font - font name
color - text color
bgcolor - background color
width - width of the text box
height - height of the text box
ulx_rel - difference in ulx between the previous and current texts
uly_rel - difference in uly between the previous and current texts
lrx_rel - difference in lrx between the previous and current texts
lry_rel - difference in lry between the previous and current texts
added Variables
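
The derived variables in this list follow from the four corner coordinates. The Python sketch below shows one way to compute them for a pair of consecutive text boxes. It is an illustration only: TextBox and condition_variables are hypothetical names, the relative values are assumed to be current minus previous, and height assumes y-coordinates measured from the top of the page, so lry is larger than uly. The coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    page: int
    ulx: float
    uly: float
    lrx: float
    lry: float


def condition_variables(current, previous):
    """Build the position-related variables for the current text box."""
    return {
        "text": current.text,
        "page": current.page,
        "page_prev": previous.page,
        "width": current.lrx - current.ulx,
        "height": current.lry - current.uly,
        # Assumption: relative values are current minus previous.
        "ulx_rel": current.ulx - previous.ulx,
        "uly_rel": current.uly - previous.uly,
        "lrx_rel": current.lrx - previous.lrx,
        "lry_rel": current.lry - previous.lry,
    }


previous = TextBox("ID-0001", 1, 50.0, 140.0, 100.0, 152.0)
current = TextBox("Name:", 1, 50.0, 160.0, 90.0, 172.0)
variables = condition_variables(current, previous)

# Example check: same left edge as the previous text, and below it on the page.
same_column_next_line = variables["ulx_rel"] == 0 and variables["uly_rel"] > 0
print(variables["width"], variables["height"], same_column_next_line)
```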

Example

This is an example for src/test/resources/io/mfj/textricator/examples/school-employee-list.pdf.

# Use the built-in pdfbox parser
extractor: "pdf.pdfbox"

# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.
header:
  # ignore anything less than this many points from the top, default and per-page
  default: 130
footer:
  # ignore anything more than this many points from the top, default and per-page
  default: 700

# Text segments are generally parsed in order, top to bottom, left to right.
# If two text segments have y-coordinates within this many points, consider them on the same line,
# and process the one further left first, even if it is 0.4pt lower on the page.
maxRowDistance: 2

# Define the output data record.
# Since the main record type we're collecting information on is our employees,
# we'll have that be the root type for our harvested information.
rootRecordType: employee
recordTypes:
  employee:
    label: "employee" # Labels are used when nested recordTypes come into play, like this document.
    valueTypes:
      # Not sure what to name a valueType? Just make something up!
      - employee
      - name
      - hiredate
      - occupation
      - showinfo
      - bool1
      - bool2
      - bool3
      - salary
    children:
      # In this example, there are multiple children nested under an employee,
      # so we'll treat it as a 'child' to the 'employee' recordType.
      - child
  child:
    label: "child"
    valueTypes:
      - child
      - grade

valueTypes:
  employee:
    # In the CSV, use "Employee ID" as the column header instead of "employee".
    label: "Employee ID"
  name:
    label: "Name"
  hiredate:
    label: "Hire Date"
  occupation:
    label: "Occupation"
  salary:
    label: "Salary"
  showinfo:
    label: "Important Info?"
  bool1:
    label: "Boolean 1"
  bool2:
    label: "Boolean 2"
  bool3:
    label: "Boolean 3"
  child:
    label: "Attending Child"
  grade:
    label: "Grade"

# Now we define the finite-state machine
# Let's name the state that our machine starts off with:
initialState: "INIT"

# When each text segment is encountered, each transition for the current state is checked.
states:
  INIT:
    transitions:
      # The first bit of text we reach is 'ID-0001', so we'll try the only transition that should work here.
      -
        # If this condition matches (which it should)
        condition: employee # Curious about the condition? Scroll further down to the conditions section of this YAML.
        # Then we'll switch to the 'employee' state!
        nextState: employee

  employee: # ID number with the format 'ID-####'
    startRecord: true # When we enter this state, we'll create a new "employee" record.
    transitions:
      - # Now we move on to the name label, once again by verifying the condition and then moving on.
        condition: namelabel
        nextState: namelabel

  namelabel:
    include: false # The label isn't important information in and of itself, so we can just not include it in the data.
    t
