Parallel and/or LAzY Analyzer for PDF 🏖️

TL;DR

You can read this document, or just go look at some notebooks to get an idea of what this package does.

About

There are already too many PDF libraries, unfortunately none of which does everything that everybody wants it to do, and we probably don't need another one. It is not recommended that you use this library for anything at all, but if you were going to use it for something, it might be one of these things, which you may currently be doing with pdfminer.six, for instance:

Accessing the document catalog, page tree, structure tree, outline, content streams, cross-reference table, XObjects, fonts, images, annotations, and other low-level PDF metadata.
Obtaining the absolute position and attributes of every character, line, path, and image in every page of a PDF.

Note that while PLAYA Ain't a LAYout Analyzer, it does in fact implement the layout analysis algorithm from pdfminer.six anyways. See the documentation for more information on how to migrate your code. You may be interested to know that PLAYA's implementation is also up to 10x faster (benchmarks), depending on how many CPUs you use.

All that said, the primary purpose of PLAYA is to provide a parallel, parallelizable, pure-Python and Pythonic (for its author's definition of the term), lazy interface to the internals of PDF files.

But, it does more than that! It also includes a command-line interface which can dump out various types of PDF data and metadata quickly. For instance, you might want to dump out all the PDF operators in all the content streams on all the pages:

playa --content-streams my-awesome-document.pdf

Or you could look at the document outline or logical structure tree:

playa --outline some-interesting-stuff.pdf
playa --structure tagged-pdf-wow.pdf

And, yes, it does extract text, as well as text objects (with associated metadata):

playa --text fascinating-research-paper.pdf
playa --text-objects colorful-presentation.pdf

Or images, in JPEG and PNM (or sometimes TIFF) format (may not work for all images):

playa --images imagedir splashy-resume.pdf

Or fonts, in various esoteric formats (may not work for all fonts):

playa --fonts fontdir typographic-horror.pdf

If you just want to extract text from a PDF, there are better and/or faster tools and libraries out there, notably pypdfium2 and pypdf, among others. See these benchmarks for a comparison. Nonetheless, you will notice in this comparison that:

PLAYA (using 2 CPUs) is the fastest pure-Python PDF reader by far
PLAYA has no dependencies and no C++
PLAYA is MIT licensed

PLAYA is also very good at reading logical structure trees. On my town's 486-page zoning bylaw, extracting the entire tree with its text contents as JSON using playa --structure takes only 23 seconds, whereas pdfplumber --structure-text takes 69 seconds and pdfinfo -struct-text (which doesn't output JSON) takes 110 seconds.

I cannot stress this enough: text extraction is not the primary use case for PLAYA, because extracting text from PDFs is not fun, and I like fun. Do you like fun? Then read on.

Installation

Installing it should be really simple as long as you have Python 3.8 or newer:

pipx install playa-pdf

Yes, it's not just "playa". Sorry about that. If you wish to read certain encrypted PDFs then you will need the crypto add-on (quoted here so that shells such as zsh do not try to expand the brackets):

pipx install "playa-pdf[crypto]"

Usage

Do you want to get stuff out of a PDF? You have come to the right place! Let's open up a PDF and see what's in it:

import playa

pdf = playa.open("my_awesome_document.pdf")
raw_byte_stream = pdf.buffer
a_bunch_of_tokens = list(pdf.tokens)
a_bunch_of_indirect_object_ids = list(pdf.keys())
a_bunch_of_indirect_objects = list(pdf.values())
a_bunch_of_pages = list(pdf.pages)

Yes, a Document is fundamentally a Mapping of object IDs to objects, which are represented to the extent possible by native Python objects. These may not be terribly useful to you, but you might find them interesting. Note that these are "indirect objects" where the actual object is accompanied by an object number and "generation number". If you wish to find all the objects in a PDF file, then you will need to iterate over the objects property:

for indobj in pdf.objects:
    objid, genno, obj = indobj

It is possible you will encounter multiple objects with the same objid due to the "incremental updates" feature of PDF. As expected, you can subscript the document to access indirect objects by number (this will return the object with the most recent generation number):

a_particular_object = pdf[42]
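
If you are curious which object numbers have been reused by incremental updates, you can group the indirect objects yourself. This is a minimal sketch that assumes only that each item unpacks into (objid, genno, obj) as shown above. The helper names group_by_objid and repeated_objids are hypothetical and not part of PLAYA, and the tuples at the bottom are stand-ins for a real pdf.objects:

```python
from collections import defaultdict


def group_by_objid(indirect_objects):
    """Group (objid, genno, obj) triples by object number, keeping file order."""
    groups = defaultdict(list)
    for objid, genno, obj in indirect_objects:
        groups[objid].append((genno, obj))
    return dict(groups)


def repeated_objids(indirect_objects):
    """Return the sorted object numbers that occur more than once."""
    groups = group_by_objid(indirect_objects)
    return sorted(objid for objid, versions in groups.items() if len(versions) > 1)


# Stand-in data for illustration; with PLAYA you would pass pdf.objects instead.
fake_objects = [(1, 0, "first version"), (2, 0, "other"), (1, 0, "updated version")]
print(repeated_objids(fake_objects))
```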

Your PDF document probably has some pages. How many? What are their numbers/labels? They could be things like "xvi" (pronounced "gzvee"), "a", or "42", for instance!

npages = len(pdf.pages)
page_numbers = [page.label for page in pdf.pages]
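
Labels are not guaranteed to be unique (a document might restart its numbering, for instance), so a lookup table from label to page index can be handy. This is a minimal sketch using only the label attribute shown above. The function name indices_by_label is hypothetical, and the SimpleNamespace objects are stand-ins for real pages:

```python
from collections import defaultdict
from types import SimpleNamespace


def indices_by_label(pages):
    """Map each page label to the list of 0-based page indices that carry it."""
    table = defaultdict(list)
    for index, page in enumerate(pages):
        table[page.label].append(index)
    return dict(table)


# Stand-in pages for illustration; with PLAYA you would pass pdf.pages instead.
fake_pages = [SimpleNamespace(label=label) for label in ["i", "ii", "1", "2", "1"]]
for label, indices in indices_by_label(fake_pages).items():
    print(f"label {label!r} -> page indices {indices}")
```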

You can also subscript pages in various other ways, using a slice or an iterable of int, which will give you a new page list object that behaves similarly. Pages and page lists can refer back to their document (using weak reference magic to avoid memory leaks) with their doc property.

Some (by no means all) helpful metadata

A PDF often contains a "document outline" which is a sequence of trees representing the coarse-grained logical structure of the document, accessible via the outline property:

for entry in pdf.outline:
    entry.title, entry.destination, entry.action, entry.element
    for child in entry:
        child.title, child.destination, child.action, child.element
        ...
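
Since every entry can itself be iterated to reach its children, the nested loops above generalize to a recursive walk. This is a minimal sketch that relies only on the title attribute and on iteration over entries. The function walk_outline and the FakeEntry class are hypothetical, the latter standing in for real outline entries:

```python
def walk_outline(entries, depth=0):
    """Yield (depth, title) for every outline entry, depth-first."""
    for entry in entries:
        yield depth, entry.title
        # Iterating over an entry gives its children, as in the loop above.
        yield from walk_outline(entry, depth + 1)


class FakeEntry:
    """Stand-in for an outline entry: has a title and iterates over children."""

    def __init__(self, title, children=()):
        self.title = title
        self.children = list(children)

    def __iter__(self):
        return iter(self.children)


# Stand-in outline for illustration; with PLAYA you would pass pdf.outline.
fake_outline = [
    FakeEntry("Introduction", [FakeEntry("Background")]),
    FakeEntry("Methods"),
]
for depth, title in walk_outline(fake_outline):
    print("  " * depth + title)
```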

If you are lucky it has a "logical structure tree". The elements here might even be referenced from the outline above! (or, they might not... with PDF you never know).

structure = pdf.structure
for element in structure:
    for child in element:
        ...
sections = structure.find_all("Sect")
first_p = structure.find("P")

Now perhaps we want to look at a specific page. Okay! You can also look at its contents, more on that in a bit:

page = next(iter(pdf.pages)) # Fast and lazy way to get the first page
page = pdf.pages[0]          # they are numbered from 0
page = pdf.pages["xviii"]    # but you can get them by label (a string)
page = pdf.pages["42"]       # or "logical" page number (also a string)
print(f"Page {page.label} is {page.width} x {page.height}")

Since PDF is at heart a page-oriented presentation format, many types of metadata are mostly accessible via the page objects. For instance, you can access the fonts used on a page with, obviously, the fonts property, or the annotations via the annotations property.

For example, annotations (internal or external links) are defined on pages (since their position would not make any sense otherwise). There are umpteen zillion kinds of annotations (PDF 1.7 sect 12.5.6) but they all have at least these attributes in common:

for annot in page.annotations:
    annot.subtype, annot.rect, annot.props
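
Given how many kinds of annotations there are, a quick tally by subtype is often a useful first look at a page. This is a minimal sketch using only the subtype attribute shown above. It assumes subtype values are hashable (plain strings such as "Link", for example). The function name count_subtypes is hypothetical, and the SimpleNamespace objects are stand-ins for real annotations:

```python
from collections import Counter
from types import SimpleNamespace


def count_subtypes(annotations):
    """Count how many annotations of each subtype are present."""
    return Counter(annot.subtype for annot in annotations)


# Stand-in annotations for illustration; with PLAYA you would pass
# page.annotations instead.
fake_annotations = [
    SimpleNamespace(subtype="Link", rect=(0, 0, 10, 10), props={}),
    SimpleNamespace(subtype="Text", rect=(5, 5, 20, 20), props={}),
    SimpleNamespace(subtype="Link", rect=(30, 30, 40, 40), props={}),
]
for subtype, count in count_subtypes(fake_annotations).most_common():
    print(subtype, count)
```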

The set of possible entries in annotation dictionaries (PDF 1.7 sect 12.5.2) is vast and confusing and inconsistently implemented. You can access the raw annotation dictionary via props in the Annotation object.

If the document has logical structure, then the pages will also have a slightly different form of logical structure. You can use the find and find_all methods to get all of the enclosing structure elements of a given type (actually a role) for a page. So for instance if you wanted to get the text contents for all the cells in all the tables on a page, assuming the creator of said page was kind enough to check the "PDF/UA" box, you can do:

for table in page.structure.find_all("Table"):
    print(f"Table at {table.bbox}: {[x.text for x in table.contents]}")

Accessing content

What are these "contents" of which you speak, which were surely created by a Content Creator? Well, you can look at the stream of tokens or mysterious PDF objects:

for token in page.tokens:
    ...
for obj in page.contents:  # "obj" avoids shadowing the built-in "object"
    ...

But that isn't very useful, so you can also acce
