Document Layout Analysis repos for development with PdfPig.

From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.

Related projects

PdfPig - Read text content from PDFs in C# (port of PdfBox)
camelot-sharp (port of camelot) - Extract tables from PDF files
tabula-sharp (port of tabula-java) - Extract tables from PDF files
PublayNetSharp - Extract and convert PubLayNet data to PageXml format
PublayNet-maskrcnn-mlnet - Using a MaskRCNN model trained on the PublayNet dataset with ML.NET in C# / .NET for document layout analysis and page segmentation tasks.
PdfPig MLNet Block Classifier - Proof of concept of training a simple Region Classifier using PdfPig and ML.NET (LightGBM).
PdfPig SVM Region Classifier - Proof of concept of a simple SVM Region Classifier using PdfPig and Accord.Net.
simple-docstrum - A step-by-step implementation of the Docstrum algorithm for PDF documents

Cited by

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis | Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li | website | github

Resources

Text extraction

High precision text extraction from PDF documents | Øyvind Raddum Berg
User-Guided Information Extraction from Print-Oriented Documents | Tamir Hassan
Combining Linguistic and Spatial Information for Document Analysis | Aiello, Monz and Todoran
New Methods for Metadata Extraction from Scientific Literature | Dominika Tkaczyk
A System for Converting PDF Documents into Structured XML Format | Hervé Déjean, Jean-Luc Meunier
Layout and Content Extraction for PDF Documents | Hui Chao, Jian Fan
DocParser: Hierarchical Structure Parsing of Document Renderings | J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel

Word segmentation

An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents | M. Makridis, N. Nikolaou, B. Gatos
Word Extraction Using Area Voronoi Diagram | Zhe Wang, Yue Lu, Chew Lim Tan
A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model | Young-Jung Yu, Hwan-Gue Cho
Recognition of Multi-Oriented, Multi-Sized, and Curved Text | Yao-Yi Chiang, Craig A. Knoblock


Page segmentation

Performance Comparison of Six Algorithms for Page Segmentation | Faisal Shafait, Daniel Keysers, and Thomas M. Breuel
A Fast Algorithm for Bottom-Up Document Layout Analysis | Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson
Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms: A Review | Pinky Gather, Avininder Singh
Layout Analysis based on Text Line Segment Hypotheses | Thomas M. Breuel
Hybrid Page Layout Analysis via Tab-Stop Detection | presentation | Ray Smith
Extending the Page Segmentation Algorithms of the Ocropus Document Layout Analysis System | Amy Alison Winder
Object-Level Document Analysis of PDF Files | Tamir Hassan
Document Image Segmentation as a Spectral Partitioning Problem | Dasigi, Jain and Jawahar
Benchmarking Page Segmentation Algorithms | S. Randriamasy, L. Vincent

Recursive XY Cut

The X-Y cut segmentation algorithm, also referred to as recursive X-Y cuts (RXYC) algorithm, is a tree-based top-down algorithm. The root of the tree represents the entire document page. All the leaf nodes together represent the final segmentation. The RXYC algorithm recursively splits the document into two or more smaller rectangular blocks which represent the nodes of the tree. At each step of the recursion, the horizontal a
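The top-down recursion described above can be illustrated with a minimal Python sketch. It is not the PdfPig implementation. It assumes the page is given as word bounding boxes in a hypothetical `(x0, y0, x1, y1)` tuple layout, and it assumes a simple split criterion: a node is cut wherever the projection of its boxes onto an axis leaves an empty gap wider than a threshold. The names `find_cuts`, `split_at`, `xy_cut`, `min_gap_x` and `min_gap_y` are made up for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def find_cuts(intervals: List[Tuple[float, float]], min_gap: float) -> List[float]:
    """Return midpoints of empty gaps wider than min_gap between 1-D intervals."""
    cuts = []
    ordered = sorted(intervals)
    reach = ordered[0][1]  # furthest coordinate covered so far
    for lo, hi in ordered[1:]:
        if lo - reach > min_gap:
            cuts.append((reach + lo) / 2.0)
        reach = max(reach, hi)
    return cuts


def split_at(boxes: List[Box], cuts: List[float], axis: int) -> List[List[Box]]:
    """Group boxes by the number of cut positions lying before them."""
    groups: List[List[Box]] = [[] for _ in range(len(cuts) + 1)]
    for box in boxes:
        index = sum(1 for c in cuts if box[axis] > c)
        groups[index].append(box)
    return groups


def xy_cut(boxes: List[Box], min_gap_x: float, min_gap_y: float) -> List[List[Box]]:
    """Recursively split boxes; return the leaf blocks as lists of boxes."""
    if len(boxes) <= 1:
        return [list(boxes)] if boxes else []
    # Try cuts along x first (column gaps), then along y (row gaps).
    for axis, min_gap in ((0, min_gap_x), (1, min_gap_y)):
        cuts = find_cuts([(b[axis], b[axis + 2]) for b in boxes], min_gap)
        if cuts:
            leaves: List[List[Box]] = []
            for group in split_at(boxes, cuts, axis):
                leaves.extend(xy_cut(group, min_gap_x, min_gap_y))
            return leaves
    return [list(boxes)]  # no gap wide enough on either axis: leaf node


# Two columns; the left column holds two paragraphs separated by a wide gap.
page = [(0, 0, 40, 10), (0, 12, 40, 22), (0, 40, 40, 50),
        (60, 0, 100, 10), (60, 12, 100, 22)]
blocks = xy_cut(page, min_gap_x=10, min_gap_y=5)
```

Each list in `blocks` is one leaf of the tree, so together they form the final segmentation. The two thresholds control how aggressively the page is split: larger values merge nearby lines and columns into the same block.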
