img2table
img2table is a simple, easy to use, table identification and extraction Python Library based on OpenCV image
processing that supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to Neural Networks based solutions, especially for usage on CPU.
Installation <a name="installation"></a>
The library can be installed via pip:
<code>pip install img2table</code>: Standard installation, supporting Tesseract<br> <code>pip install img2table[paddle]</code>: For usage with Paddle OCR<br> <code>pip install img2table[easyocr]</code>: For usage with EasyOCR<br> <code>pip install img2table[surya]</code>: For usage with Surya OCR<br> <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR<br> <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR<br> <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR
Features <a name="features"></a>
- Table identification for images and PDF files, including bounding boxes at the table cell level
- Handling of complex table structures such as merged cells
- Handling of implicit content - see example
- Table content extraction by providing support for OCR services / tools
- Extracted tables are returned as a simple object, including a Pandas DataFrame representation
- Export extracted tables to an Excel file, preserving their original structure
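The features above can be sketched end to end. This is a hedged example, not a definitive recipe: it assumes img2table and Tesseract are installed, and the helper name and file paths are illustrative. Imports are kept local so the helper stays importable without those dependencies.

```python
def image_tables_to_excel(src, dest):
    """Sketch: detect tables in an image, OCR their content, export to Excel.

    `src`/`dest` are illustrative; requires img2table with Tesseract when called.
    """
    # Local imports: only needed when the function actually runs
    from img2table.document import Image
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(n_threads=1, lang="eng")
    doc = Image(src)

    # Each extracted table carries a bounding box and a Pandas DataFrame
    tables = doc.extract_tables(ocr=ocr)
    for table in tables:
        print(table.df)

    # Export all identified tables to a single Excel workbook
    doc.to_xlsx(dest, ocr=ocr)
    return tables
```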
Supported file formats <a name="supported-file-formats"></a>
Images <a name="images-formats"></a>
Images are loaded using the opencv-python library; any image format readable by OpenCV (e.g. JPEG, PNG, BMP, TIFF) is supported.
PDF <a name="pdf-formats"></a>
Both native and scanned PDF files are supported.
Usage <a name="usage"></a>
Documents <a name="documents"></a>
Images <a name="images-doc"></a>
Images are instantiated as follows:
from img2table.document import Image

image = Image(src,
              detect_rotation=False)
<h4>Parameters</h4> <dl> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt> <dd style="font-style: italic;">Image source</dd> <dt>detect_rotation : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Detect and correct skew/rotation of the image</dd> </dl><br> The implemented method to handle skewed/rotated images supports skew angles up to 45° and is based on the publication by <a href="https://www.mdpi.com/2079-9292/9/1/55">Huang, 2020</a>.<br> When the <code>detect_rotation</code> parameter is set to <code>True</code>, image coordinates and bounding boxes returned by other methods might not correspond to the original image.
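Since <code>src</code> also accepts bytes or <code>io.BytesIO</code>, an image can be handed over entirely in memory. A small sketch, assuming img2table is installed (the helper name and flow are illustrative, with the import kept local so it is only required at call time):

```python
from io import BytesIO
from pathlib import Path

def load_image(path, correct_skew=False):
    """Sketch: open an image for img2table from in-memory bytes (hypothetical helper)."""
    from img2table.document import Image  # local import: only needed when called

    data = BytesIO(Path(path).read_bytes())
    # With detect_rotation=True, coordinates returned by later methods refer
    # to the deskewed image rather than the original file
    return Image(data, detect_rotation=correct_skew)
```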
PDF <a name="pdf-doc"></a>
PDF files are instantiated as follows:
from img2table.document import PDF

pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
<h4>Parameters</h4> <dl> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt> <dd style="font-style: italic;">PDF source</dd> <dt>pages : list, optional, default <code>None</code></dt> <dd style="font-style: italic;">List of PDF page indexes to be processed. If None, all pages are processed</dd> <dt>detect_rotation : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Detect and correct skew/rotation of extracted images from the PDF</dd> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt> <dd style="font-style: italic;">Extract text from the PDF file for native PDFs</dd> </dl>
PDF pages are converted to images at 200 DPI for table identification.
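A sketch combining the parameters above, assuming img2table with Tesseract is installed (the helper name is hypothetical; the local imports keep it importable without those dependencies):

```python
def pdf_tables_by_page(src, pages=None):
    """Sketch: extract tables from selected pages of a PDF."""
    # Local imports: only needed when the function actually runs
    from img2table.document import PDF
    from img2table.ocr import TesseractOCR

    # pages is a list of 0-indexed page numbers; None processes all pages
    pdf = PDF(src,
              pages=pages,
              detect_rotation=False,
              pdf_text_extraction=True)

    # For scanned PDFs the OCR engine is used; for native PDFs the embedded
    # text layer is preferred thanks to pdf_text_extraction=True
    ocr = TesseractOCR(lang="eng")
    return pdf.extract_tables(ocr=ocr)
```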
OCR <a name="ocr"></a>
img2table provides an interface for several OCR services and tools in order to parse table content.<br>
When possible (i.e. for native PDFs), text is extracted directly from the file and the OCR service/tool is not called.
<details> <summary>Tesseract<a name="tesseract"></a></summary> <br><a href="https://github.com/tesseract-ocr/tesseract">Tesseract</a> is an open-source OCR engine.
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1,
                   lang="eng",
                   psm=11,
                   tessdata_dir="...")
<h4>Parameters</h4> <dl> <dt>n_threads : int, optional, default <code>1</code></dt> <dd style="font-style: italic;">Number of concurrent threads used to call Tesseract</dd> <dt>lang : str, optional, default <code>"eng"</code></dt> <dd style="font-style: italic;">Lang parameter used in Tesseract for text extraction</dd> <dt>psm : int, optional, default <code>11</code></dt> <dd style="font-style: italic;">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd> <dt>tessdata_dir : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd> </dl>
Usage of Tesseract-OCR requires prior installation. Check the documentation for installation instructions. <br> Windows users getting environment variable errors can check this tutorial. <br>
</details> <details> <summary>PaddleOCR<a name="paddle"></a></summary> <br><a href="https://github.com/PaddlePaddle/PaddleOCR">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br> At first use, relevant languages models will be downloaded.
from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})
<h4>Parameters</h4> <dl> <dt>lang : str, optional, default <code>"en"</code></dt> <dd style="font-style: italic;">Lang parameter used in Paddle for text extraction, check <a href="https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations">documentation for available languages</a></dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd> </dl><br> <b>NB:</b> For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually as stated in this <a href="https://github.com/PaddlePaddle/PaddleOCR/issues/7993">issue</a>.
# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table
If you get an error trying to run PaddleOCR on Ubuntu, please check this <a href="https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037">issue</a> for a working solution.
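To illustrate the `kw` pass-through, here is a minimal sketch assuming img2table is installed. The helper name is hypothetical, and `use_gpu` is an assumed PaddleOCR constructor option used purely for illustration:

```python
def make_paddle_ocr(lang="en", extra=None):
    """Sketch: build img2table's PaddleOCR wrapper, forwarding extra kwargs.

    Entries in `extra` are passed verbatim to the underlying PaddleOCR
    constructor via the documented `kw` parameter.
    """
    from img2table.ocr import PaddleOCR  # local import: only needed when called

    # Assumed example: {"use_gpu": True} would request GPU inference, provided
    # the CUDA build of paddlepaddle-gpu is installed (see above)
    return PaddleOCR(lang=lang, kw=extra or {})
```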
<br> </details> <details> <summary>EasyOCR<a name="easyocr"></a></summary> <br><a href="https://github.com/JaidedAI/EasyOCR">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br> At first use, relevant languages models will be downloaded.
from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})
<h4>Parameters</h4> <dl> <dt>lang : list, optional, default <code>["en"]</code></dt> <dd style="font-style: italic;">Lang parameter used in EasyOCR for text extraction, check <a href="https://www.jaided.ai/easyocr">documentation for available languages</a></dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd> </dl><br> </details> <details> <summary>docTR<a name="docTR"></a></summary> <br>
<a href="https://github.com/mindee/doctr">docTR</a> is an open-source OCR based on Deep Learning models.<br> In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in the package documentation.
from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})
<h4>Parameters</h4> <dl> <dt>detect_language : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Parameter indicating if language prediction is run on the document</dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd> </dl><br> </details>
