DocumentLayoutAnalysis
Document layout analysis resources and repositories for development with PdfPig.
From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.
Related projects
- PdfPig - Read text content from PDFs in C# (port of PdfBox)
- camelot-sharp (port of camelot) - Extract tables from PDF files
- tabula-sharp (port of tabula-java) - Extract tables from PDF files
- PublayNetSharp - Extract and convert PubLayNet data to PageXml format
- PublayNet-maskrcnn-mlnet - Using a MaskRCNN model trained on the PubLayNet dataset with ML.NET in C# / .NET for document layout analysis and page segmentation tasks.
- PdfPig MLNet Block Classifier - Proof of concept of training a simple Region Classifier using PdfPig and ML.NET (LightGBM).
- PdfPig SVM Region Classifier - Proof of concept of a simple SVM Region Classifier using PdfPig and Accord.NET.
- simple-docstrum - A step-by-step implementation of the Docstrum algorithm for PDF documents
Cited by
- LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis | Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li | website | github
Resources
- Text extraction
- Word segmentation
- Page segmentation
- Zone classification/extraction & Reading order
- NLP & ML
- Related topics
- Datasets
- Output file format
Text extraction
- High precision text extraction from PDF documents | Øyvind Raddum Berg
- User-Guided Information Extraction from Print-Oriented Documents | Tamir Hassan
- Combining Linguistic and Spatial Information for Document Analysis | Aiello, Monz and Todoran
- New Methods for Metadata Extraction from Scientific Literature | Dominika Tkaczyk
- A System for Converting PDF Documents into Structured XML Format | Hervé Déjean, Jean-Luc Meunier
- Layout and Content Extraction for PDF Documents | Hui Chao, Jian Fan
- DocParser: Hierarchical Structure Parsing of Document Renderings | J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel
Word segmentation
- An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents | M. Makridis, N. Nikolaou, B. Gatos
- Word Extraction Using Area Voronoi Diagram | Zhe Wang, Yue Lu, Chew Lim Tan
- A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model | Young-Jung Yu, Hwan-Gue Cho
- Recognition of Multi-Oriented, Multi-Sized, and Curved Text | Yao-Yi Chiang, Craig A. Knoblock

Page segmentation
- Performance Comparison of Six Algorithms for Page Segmentation | Faisal Shafait, Daniel Keysers, and Thomas M. Breuel
- A Fast Algorithm for Bottom-Up Document Layout Analysis | Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson
- Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms: A Review | Pinky Gather, Avininder Singh
- Layout Analysis based on Text Line Segment Hypotheses | Thomas M. Breuel
- Hybrid Page Layout Analysis via Tab-Stop Detection | presentation | Ray Smith
- Extending the Page Segmentation Algorithms of the Ocropus Documentation Layout Analysis System | Amy Alison Winder
- Object-Level Document Analysis of PDF Files | Tamir Hassan
- Document Image Segmentation as a Spectral Partitioning Problem | Dasigi, Jain and Jawahar
- Benchmarking Page Segmentation Algorithms | S. Randriamasy, L. Vincent
Recursive XY Cut
The X-Y cut segmentation algorithm, also referred to as the recursive X-Y cuts (RXYC) algorithm, is a tree-based top-down algorithm. The root of the tree represents the entire document page. All the leaf nodes together represent the final segmentation. The RXYC algorithm recursively splits the document into two or more smaller rectangular blocks which represent the nodes of the tree. At each step of the recursion, the horizontal a
