<br> <br> <center> <h1>Extracting Semi-Structured Data from PDFs on a large scale</h1> </center> <br>

Towards a more general approach for extracting semi-structured data

Financial data is often contained in semi-structured PDFs. While many tools exist for data extraction, not all are suitable in every case. Here, semi-structured means that PDFs, in contrast to HTML, regularly present information in a varying structure: headlines may or may not exist, and the number of pages often varies along with the size and position of characters.

Building on insights from a blog post, the following sections show what the contained data looks like and consider a more general solution for extracting data from PDFs.

Technical Details

For reading PDF files, I use PDFQuery, while the layout is extracted with the help of pdfminer. PDFQuery turned out to be roughly five times faster at reading the document, while pdfminer provides the necessary tools to extract the layouts. At the scale of a few thousand multi-page documents, a combination of the two was the best choice.

The PDF layout we are dealing with comes in the form of an LTPage object. Each page in a PDF is described by an LTPage object and the hierarchical structure of lines, boxes, rectangles etc. which it contains. For the full hierarchy, see the pdfminer documentation.
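
To make the hierarchy idea concrete, here is a minimal sketch that walks a nested layout and counts how often each element type occurs. The classes below are hypothetical stand-ins that only mimic the nesting of pdfminer's layout objects (containers are iterable, leaves are not), and `count_layout_types` is a name introduced for this illustration, not part of pdfminer.

```python
from collections import Counter


def count_layout_types(element, counts=None):
    """Recursively count the class names of an element and all of its children."""
    if counts is None:
        counts = Counter()
    counts[type(element).__name__] += 1
    # Containers (pages, text boxes, text lines) are iterable,
    # leaves (characters, rectangles) are not.
    try:
        children = iter(element)
    except TypeError:
        return counts
    for child in children:
        count_layout_types(child, counts)
    return counts


# Hypothetical stand-ins that mimic the nesting of pdfminer's layout classes.
class Leaf:
    pass


class Container(list):
    pass


class LTChar(Leaf):
    pass


class LTRect(Leaf):
    pass


class LTTextLineHorizontal(Container):
    pass


class LTTextBoxHorizontal(Container):
    pass


class LTPage(Container):
    pass


page = LTPage([
    LTRect(),
    LTTextBoxHorizontal([
        LTTextLineHorizontal([LTChar(), LTChar()]),
        LTTextLineHorizontal([LTChar()]),
    ]),
])

type_counts = count_layout_types(page)
print(dict(type_counts))
```

Assuming real layout objects behave the same way with respect to iteration, the same function could be pointed at an actual page layout to get a quick overview of what a document contains.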

Extract Layout and Characters

The following code, mainly taken from the blog post mentioned above, extracts all LTPage objects from an example document. The contents of the document are anonymized.

from pdfquery import PDFQuery

import pdfminer
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

import matplotlib.pyplot as plt
from matplotlib import patches
%matplotlib inline

import pandas as pd

def extract_page_layouts(file):
    """
    Extracts LTPage objects from a pdf file.
    modified from: http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
    Tests show that using PDFQuery to extract the document is ~ 5 times faster than pdfminer.
    """
    laparams = LAParams()
    
    with open(file, mode='rb') as pdf_file:
        print("Open document %s" % pdf_file.name)
        document = PDFQuery(pdf_file).doc

        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        layouts = []
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layouts.append(device.get_result())
    
    return layouts

example_file = "data/example_anonymous.pdf"
page_layouts = extract_page_layouts(example_file)
print("Number of pages: %d" % len(page_layouts))

Open document data/example_anonymous.pdf
Number of pages: 1

The page consists merely of lines/rectangles and text contained in the LTTextBox objects.

current_page = page_layouts[0]
for obj in set(type(o) for o in current_page):
    print(obj)

<class 'pdfminer.layout.LTTextBoxHorizontal'>
<class 'pdfminer.layout.LTRect'>

The following code separates the text from the other objects and shows the first three LTTextBoxes.

texts = []
rects = []

# separate text and rectangle elements
for elem in current_page:
    if isinstance(elem, pdfminer.layout.LTTextBoxHorizontal):
        texts.append(elem)
    elif isinstance(elem, pdfminer.layout.LTRect):
        rects.append(elem)
texts[:3]

[<LTTextBoxHorizontal(0) 53.030,762.147,104.478,784.697 'Jane Doe\nFoo Bar Ltd.\n'>,
 <LTTextBoxHorizontal(1) 53.160,676.982,142.979,687.302 'Heading 1 is short\n'>,
 <LTTextBoxHorizontal(2) 92.640,637.577,146.067,646.927 'Segment 1-1\n'>]

We could already access the text in the LTTextBoxes, but we have no idea what the structure looks like yet. So let us break it down to each individual character and visualize the document's structure.

Visualize the PDF structure

TEXT_ELEMENTS = [
    pdfminer.layout.LTTextBox,
    pdfminer.layout.LTTextBoxHorizontal,
    pdfminer.layout.LTTextLine,
    pdfminer.layout.LTTextLineHorizontal
]

def flatten(lst):
    """Flattens a list of lists"""
    return [item for sublist in lst for item in sublist]

def extract_characters(element):
    """
    Recursively extracts individual characters from 
    text elements. 
    """
    if isinstance(element, pdfminer.layout.LTChar):
        return [element]

    if any(isinstance(element, i) for i in TEXT_ELEMENTS):
        return flatten([extract_characters(e) for e in element])

    if isinstance(element, list):
        return flatten([extract_characters(l) for l in element])

    return []

# extract characters from texts
characters = extract_characters(texts)

Here comes the neat trick that Iain uses in his post to give an understanding of what the page looks like: he takes the bounding boxes that describe each element in the PDF file and visualizes them.

def draw_rect_bbox(bbox, ax, color):
    """
    Draws an unfilled rectangle onto ax.
    """
    x0,y0,x1,y1 = tuple(bbox)
    ax.add_patch( 
        patches.Rectangle(
            (x0, y0),
            x1 - x0,
            y1 - y0,
            fill=False,
            color=color
        )    
    )
    
def draw_rect(rect, ax, color="black"):
    draw_rect_bbox(rect.bbox, ax, color)

xmin, ymin, xmax, ymax = current_page.bbox
size = 6
num_pages = 2

fig, axes = plt.subplots(1,num_pages, figsize = (num_pages*size, size * (ymax/xmax)), sharey=True, sharex=True)

# rects and chars
ax = axes[0]
for rect in rects:
    draw_rect(rect, ax)
    
for c in characters:
    draw_rect(c, ax, "red")

# chars and TextBoxes
ax = axes[1]
for c in characters:
    draw_rect(c, ax, "red")    

for textbox in texts:
    draw_rect(textbox, ax, "blue")
    

plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
plt.show()


On the left, I plotted all lines/rectangles and the bounding boxes of the characters. On the right, I plotted the bounding boxes of the characters and the TextBoxes. Here you can see why I talk about semi-structured data: the content of the PDF is arranged in rows and columns, but there are no real separators that mark where one logical entity ends and the next begins. The lines may indicate headlines, but this does not hold consistently throughout the document. We have to find another approach to bring this data into a structured form. Depending on the goal, there are several ways, each with its own advantages and disadvantages. Looking at the visualized document structure, I decided to approach the problem by structuring the text row- and column-wise as well.

Structuring the text data row-column-wise

We already extracted the LTChar objects. Now we arrange them row-wise and finally look at what we are doing this for: the text.

def arrange_text(characters):
    """
    For each row find the characters in the row
    and sort them horizontally.
    """
    
    # find unique y0 (rows) for character assignment
    rows = sorted(list(set(c.bbox[1] for c in characters)), reverse=True)
    
    sorted_rows = []
    for row in rows:
        sorted_row = sorted([c for c in characters if c.bbox[1] == row], key=lambda c: c.bbox[0])
        sorted_rows.append(sorted_row)
    return sorted_rows

sorted_rows = arrange_text(characters)
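
The function above assigns characters to a row only when their y0 values are exactly equal. In documents where characters on the same visual line differ slightly in their baseline, a tolerance may be more forgiving. The following is a minimal sketch of that variation, not the method used in the rest of this post; `arrange_text_tolerant`, `y_tolerance` and the `FakeChar` stand-in for LTChar are assumptions introduced for illustration.

```python
from collections import namedtuple


# Hypothetical stand-in for pdfminer's LTChar: only .bbox and get_text() are used.
class FakeChar(namedtuple("FakeChar", ["text", "bbox"])):
    def get_text(self):
        return self.text


def arrange_text_tolerant(characters, y_tolerance=1.0):
    """
    Group characters into rows, top to bottom. A character joins the
    current row if its y0 is within y_tolerance of the row's first
    character. Each row is sorted left to right.
    """
    rows = []
    # Visit characters from the top of the page downwards (largest y0 first).
    for char in sorted(characters, key=lambda c: -c.bbox[1]):
        if rows and abs(rows[-1][0].bbox[1] - char.bbox[1]) <= y_tolerance:
            rows[-1].append(char)
        else:
            rows.append([char])
    return [sorted(row, key=lambda c: c.bbox[0]) for row in rows]


chars = [
    FakeChar("b", (20.0, 700.02, 25.0, 710.0)),  # slightly shifted baseline
    FakeChar("a", (10.0, 700.00, 15.0, 710.0)),
    FakeChar("c", (10.0, 680.00, 15.0, 690.0)),
]

tolerant_rows = arrange_text_tolerant(chars)
row_strings = ["".join(c.get_text() for c in row) for row in tolerant_rows]
print(row_strings)  # ['ab', 'c']
```

With exact matching, "a" and "b" would end up in two separate rows. The price of the tolerance is that very tightly spaced lines could be merged, so the value would need tuning per document type.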

def extract_text(rows):
    row_texts = []
    for row in rows:
        row_text = ["".join([c.get_text() for c in row])]
        row_texts.append(row_text)
    return row_texts

row_texts = extract_text(sorted_rows)
row_texts[:18]

[['Jane Doe'],
 ['Foo Bar Ltd.'],
 ['Berechnung 2014'],
 ['(Calculation 2014)'],
 ['Heading 1 is short'],
 ['€'],
 ['Segment 1-1'],
 ['7PlatzhalterPlaceholder102.714,00'],
 ['/23BPlatzhalterPlaceholder505,00'],
 ['Segment 1-2'],
 ['/524PlatzhalterPlaceholder871,80'],
 ['3BPlatzhalterPlaceholder-103,34'],
 ['1AB9PlatzhalterPlaceholder1.234,83'],
 ['/XYZPlatzhalterPlaceholder-113,04'],
 ['D320PlatzhalterPlaceholder527,27'],
 ['0130PlatzhalterPlaceholder994,33'],
 ['8417PlatzhalterPlaceholder411,50'],
 ['X017PlatzhalterPlaceholder-602,50']]

This already looks readable. We appear to have some general information, a kind of identifier, descriptions in German and English, a monetary unit and an amount. However, this is not yet clean data that one can put into a table or any other useful data structure. What we need to do now is separate the text column-wise. I wrote a little piece of code that creates a rectangle which fills the space between each column.

# define a margin that separates two columns
col_margin = 0.5

def create_separators(sorted_rows, margin):
    """Creates bounding boxes to fill the space between columns"""
    separators = []
    for row in sorted_rows:
        for idx, c in enumerate(row[:-1]): 
            if (row[idx+1].bbox[0] - c.bbox[2]) > margin:
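                # (x0, y0, x1, y1): bottom of the next character first, top of the current one last, so that y0 <= y1
                bbox = (c.bbox[2], row[idx+1].bbox[1], row[idx+1].bbox[0], c.bbox[3])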
                separator = pdfminer.layout.LTRect(linewidth=2, bbox=bbox)
                separators.append(separator)
    return separators

separators = create_separators(sorted_rows, col_margin)
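
As a complement to the separator rectangles, the same gap test can also be used to cut each row directly into cell strings. This is a hedged sketch under the assumption that every gap wider than the margin marks a column boundary; `split_row_into_cells`, `make_row` and `FakeChar` are hypothetical names used only for this illustration, and the coordinates are made up.

```python
from collections import namedtuple


# Hypothetical stand-in for pdfminer's LTChar: only .bbox and get_text() are used.
class FakeChar(namedtuple("FakeChar", ["text", "bbox"])):
    def get_text(self):
        return self.text


def split_row_into_cells(row, margin):
    """
    Split one row of characters (sorted left to right) into cell strings
    wherever the horizontal gap between neighbours exceeds margin.
    """
    if not row:
        return []
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        if cur.bbox[0] - prev.bbox[2] > margin:
            cells.append([cur])      # gap is wide enough: start a new cell
        else:
            cells[-1].append(cur)    # same cell as the previous character
    return ["".join(c.get_text() for c in cell) for cell in cells]


def make_row(words_with_x, y0=100.0, char_width=5.0):
    """Hypothetical helper: build a row from (start_x, word) pairs with fixed-width characters."""
    row = []
    for start_x, word in words_with_x:
        for i, ch in enumerate(word):
            x0 = start_x + i * char_width
            row.append(FakeChar(ch, (x0, y0, x0 + char_width, y0 + 10.0)))
    return row


row = make_row([
    (50.0, "7"),
    (90.0, "Platzhalter"),
    (200.0, "Placeholder"),
    (400.0, "102.714,00"),
])

cells = split_row_into_cells(row, margin=0.5)
print(cells)  # ['7', 'Platzhalter', 'Placeholder', '102.714,00']
```

Applied to every entry of a list of sorted rows, this would yield a list of lists that could be handed to a table structure, provided the chosen margin really separates columns and not just widely spaced letters.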

Visualize the separators to check how well the code works.

xmin, ymin, xmax, y
