pdf2dataset


Converts a whole subdirectory with any volume (small or huge) of PDF documents to a dataset (pandas DataFrame). No need to set up any external service (no database, brokers, etc.). Just install and run it!

Main features

  • Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
  • Support for parallel and distributed processing through ray
  • Extractions are performed per page, making task distribution more uniform when documents differ widely in page count
  • Incremental writing of the resulting DataFrame, making it possible to process data bigger than memory
  • Error tracking of faulty documents
  • Resume interrupted processing
  • Extract text through pdftotext
  • Use OCR for extracting text through pytesseract
  • Extract images through pdf2image
  • Support for implementing custom feature extraction
  • Highly customizable behavior through parameters

Installation

Install Dependencies

Fedora

# "-por" for portuguese, use the documents language
$ sudo dnf install -y gcc-c++ poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por

Ubuntu (or other Debian-based distros)

$ sudo apt update

# "-por" for portuguese, use the documents language
$ sudo apt install -y build-essential poppler-utils libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por

Install pdf2dataset

For usage

$ pip3 install pdf2dataset --user  # Please, isolate the environment

For development

# First, install poetry, clone repository and cd into it
$ poetry install

Usage

Simple - CLI

# Note: path, page and error will always be present in resulting DataFrame

# Reads all PDFs from my_pdfs_dir and saves the resultant dataframe to my_df.parquet.gzip
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip  # Most basic, extract all possible features
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=text  # Extract just text
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=image  # Extract just image
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1  # Reduce parallelism to the minimum
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true  # For scanned PDFs
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng  # For scanned documents with english text

Resume processing

In case of any interruption, just run again with the same output path and processing will resume automatically. The --saving-interval flag (or the saving_interval param) controls how often the output path is updated, i.e., the processing "checkpoints".

Using as a library

Main functions

There are some helper functions to facilitate pdf2dataset usage:

  • extract: can be used analogously to the CLI
  • extract_text: extract wrapper with features=text
  • extract_image: extract wrapper with features=image
  • image_from_bytes: (pdf2image.utils) get a Pillow Image object given the image bytes
  • image_to_bytes: (pdf2image.utils) get the image bytes given a Pillow Image object

Basic example

from pdf2dataset import extract

extract('my_pdfs_dir', 'all_features.parquet.gzip')

Small data

One feature not available in the CLI is the custom behavior for handling small volumes of data ("small" meaning the extraction won't run for hours or days and doesn't need to be distributed).

The complete list of differences is:

  • Faster initialization (uses multiprocessing instead of ray)
  • Doesn't save processing progress
  • Distributed processing not supported
  • Doesn't write the DataFrame to disk
  • Returns the DataFrame
Example:
from pdf2dataset import extract_text

df = extract_text('my_pdfs_dir', small=True)
# ...

Pass a list of file paths

Instead of specifying a directory, one can specify a list of files to be processed.

Example:
from pdf2dataset import extract


my_files = [
    './tests/samples/single_page1.pdf',
    './tests/samples/invalid1.pdf',
]

df = extract(my_files, small=True)
# ...

Pass files from memory

If you don't want to specify a directory for the documents, you can specify the tasks to be processed directly.

Each task is of the form (document_name, document_bytes, page_number) or just (document_name, document_bytes): document_name must end with .pdf but doesn't need to correspond to a real file, document_bytes holds the bytes of the PDF document, and page_number is the number of the page to process (all pages, if omitted).

Example:
from pdf2dataset import extract_text

tasks = [
    ('a.pdf', a_bytes),  # Processing all pages of this document
    ('b.pdf', b_bytes, 1),
    ('b.pdf', b_bytes, 2),
]

# 'df' will contain results from all pages from 'a.pdf' and page 1 and 2 from 'b.pdf'
df = extract_text(tasks, 'my_df.parquet.gzip', small=True)

# ...

Returning a list

If you don't want to handle the DataFrame, it is possible to return a nested list with the feature values. The structure of the resulting list is:

result = List[documents]
documents = List[pages]
pages = List[features]
features = List[feature]
feature = any
  • any is any type supported by pyarrow.
  • features are ordered by the feature name (text, image, etc)
Example:
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[[None]],
 [['First page'], ['Second page'], ['Third page']],
 [['My beautiful sample!']],
 [['First page'], ['Second page'], ['Third page']],
 [['My beautiful sample!']]]
  • Features with errors will have None as their value
  • Here, extract_text was used, so the only feature is text
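
The nesting above can be navigated with plain indexing. A minimal sketch using a hand-built list that mirrors that structure (not real extraction output):

```python
# Hand-built stand-in for a return_list result: documents -> pages -> features
result = [
    [['First page'], ['Second page'], ['Third page']],  # a three-page document
    [['My beautiful sample!']],                         # a one-page document
    [[None]],                                           # a document whose extraction failed
]

second_doc = result[1]            # all pages of the second document
first_page_feats = second_doc[0]  # features of its first page (only 'text' here)
text = first_page_feats[0]        # the single feature value

print(text)  # -> 'My beautiful sample!'
```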

Custom Features

With version >= 0.4.0, it is also possible to easily implement extraction of custom features:

Example:

This is the structure:

from pdf2dataset import extract, feature, PdfExtractTask


class MyCustomTask(PdfExtractTask):

    @feature('bool_')
    def get_is_page_even(self):
        return self.page % 2 == 0

    @feature('binary')
    def get_doc_first_bytes(self):
        return self.file_bin[:10]

    @feature('string', exceptions=[ValueError])
    def get_wrong(self):
        raise ValueError("There was a problem!")


if __name__ == '__main__':
    df = extract('tests/samples', small=True, task_class=MyCustomTask)
    print(df)

    df.dropna(subset=['text'], inplace=True)  # Discard invalid documents
    print(df.iloc[0].error)
  • First print:
                         path  page doc_first_bytes  ...                  text  wrong                                              error
0                invalid1.pdf    -1   b"I'm invali"  ...                  None   None  image_original:\nTraceback (most recent call l...
1             multi_page1.pdf     2  b'%PDF-1.5\n%'  ...           Second page   None  wrong:\nTraceback (most recent call last):\n  ...
2             multi_page1.pdf     3  b'%PDF-1.5\n%'  ...            Third page   None  wrong:\nTraceback (most recent call last):\n  ...
3   sub1/copy_multi_page1.pdf     1  b'%PDF-1.5\n%'  ...            First page   None  wrong:\nTraceback (most recent call last):\n  ...
4   sub1/copy_multi_page1.pdf     3  b'%PDF-1.5\n%'  ...            Third page   None  wrong:\nTraceback (most recent call last):\n  ...
5             multi_page1.pdf     1  b'%PDF-1.5\n%'  ...            First page   None  wrong:\nTraceback (most recent call last):\n  ...
6  sub2/copy_single_page1.pdf     1  b'%PDF-1.5\n%'  ...  My beautiful sample!   None  wrong:\nTraceback (most recent call last):\n  ...
7   sub1/copy_multi_page1.pdf     2  b'%PDF-1.5\n%'  ...           Second page   None  wrong:\nTraceback (most recent call last):\n  ...
8            single_page1.pdf     1  b'%PDF-1.5\n%'  ...  My beautiful sample!   None  wrong:\nTraceback (most recent call last):\n  ...

[9 rows x 8 columns]
  • Second print:
wrong:
Traceback (most recent call last):
  File "/home/icaro/Desktop/pdf2dataset/pdf2dataset/extract_task.py", line 32, in inner
    result = feature_method(*args, **kwargs)
  File "example.py", line 16, in get_wrong
    raise ValueError("There was a problem!")
ValueError: There was a problem!

Notes:

  • @feature is the decorator used to define new features.
  • The extraction method name must start with the prefix get_ (avoids collisions with attribute names and improves readability)
  • The first argument to @feature must be a valid PyArrow type (see the complete list in the PyArrow documentation)
  • The exceptions param specifies a list of exceptions to be recorded in the DataFrame; any others are raised
  • For this example, all available features plus the custom ones are extracted
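
As a rough mental model of the notes above (illustrative only, not pdf2dataset's actual implementation), a decorator like @feature can tag a get_-prefixed method with its PyArrow type, derive the feature name from the method name, and turn listed exceptions into None results instead of raising:

```python
def feature(pyarrow_type, exceptions=()):
    """Simplified, hypothetical stand-in for pdf2dataset's @feature decorator."""
    def decorator(method):
        def wrapper(self):
            try:
                return method(self)
            except tuple(exceptions):
                return None  # the real library records the traceback in 'error'
        wrapper.is_feature = True
        wrapper.pyarrow_type = pyarrow_type
        # Feature name is the method name without the 'get_' prefix
        wrapper.feature_name = method.__name__[len('get_'):]
        return wrapper
    return decorator


class Task:
    page = 2

    @feature('bool_')
    def get_is_page_even(self):
        return self.page % 2 == 0

    @feature('string', exceptions=[ValueError])
    def get_wrong(self):
        raise ValueError('There was a problem!')


t = Task()
print(t.get_is_page_even())  # -> True
print(t.get_wrong())         # -> None (ValueError was listed in exceptions)
```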

Results File

The resulting "file" is a directory with the structure produced by dask with the pyarrow engine; it can be easily read with pandas or dask:

Example with pandas

>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip', engine='pyarrow')
>>> df
                             path  page                  text                   
