TesseractStudio.Net

A free Windows graphical interface to the Tesseract 4.0 OCR engine.


Introduction

TessStudio is a Windows program to create, review and correct OCR data in searchable PDF files using the Tesseract engine.

Features

Supports image and multipage PDF files, with or without prior OCR data.
Can run or re-run the Tesseract OCR process on the current page, all pages or selected pages.
Preserves any visible text on a PDF page while performing OCR on the image elements only.
For multi-page files, multiple instances of the tesseract engine run in parallel for improved performance. The speed improvement depends on the number of processor cores.
Identify and display OCR text at the word level with detected word boundaries visible.
The built-in spell checker automatically tags words not found in the dictionary.
Display PDF pages in the following modes:
- Image with OCR text hidden
- OCR text visible and image hidden
- OCR text visible on faded image
Use any installed font to display OCR text. Fonts are automatically scaled to fit word boundaries.
Click on a visible word to open a text editor to correct OCR mistakes.
Split a selected word at the current cursor position into two words, or merge the selected word with the next word.
Modify or move word boundaries.
Create new OCR words, delete existing words.
Supports any number of Undo and Redo operations.
Save corrections as searchable PDF files. Optionally save as PDF/A or encrypted PDF files.
Experimental support for removing grid lines and handling a mixed-mode page with both light text on dark background and dark text on light background. This is common with table headers.
Capture and examine debug intermediary images and OCR output in text.

License

TessStudio is released under the Opait Freeware license, which allows any legal use for free. The only limitation is that the product cannot be sold for a fee.

Installation

TessStudio is packaged as a Windows MSI and can be downloaded from the product web site:

https://www.opait.com/tessstudio

Acknowledgments

Tesseract OCR Apache License Version 2.0

Leptonica by Dan Bloomberg [Creative Commons Attribution 3.0 United States License](http://creativecommons.org/licenses/by/3.0/us/)

User Guide

In this section we will use a classic TIFF sample file from CCITT which has been converted to a searchable PDF using Tesseract OCR.

We could also start with the original TIFF file and perform the OCR processing within TessStudio, as described later in this document.

The PDF page is currently in the image display mode which shows the original TIFF image with the OCR text hidden behind the image.

Display Modes

The slider on the toolbar controls relative opacities of the text and image layers:

| Mode | Toolbar |
| ----------------- | --------------------- |
| Image Mode | |
| Text Mode | |
| Text Overlay Mode | |
| Layout Mode | |

You may also toggle quickly between image and text modes by pressing the text or image icons at the two ends of the slider bar. OCR data can be corrected in either text mode or text overlay mode.

Text Mode

The page is displayed in the text mode with the image layer hidden. The display font, text colors and the opacity of word boundaries can all be controlled using the options dialog.
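The automatic font scaling used in text display can be pictured with a small calculation. The following is only a sketch under stated assumptions, not TessStudio's actual code: `fit_font_size` is a hypothetical helper, and it assumes the word's text has already been measured at some reference font size and that rendered text grows linearly with font size.

```python
def fit_font_size(text_width, text_height, box_width, box_height,
                  reference_size=10.0):
    """Return the largest font size at which the text fits the word box.

    text_width and text_height are the measured extents of the word when
    rendered at reference_size (hypothetical inputs, supplied by whatever
    text-measuring API the display layer uses).
    """
    if text_width <= 0 or text_height <= 0:
        raise ValueError("measured text extents must be positive")
    # The tighter of the two directions limits the scale factor.
    scale = min(box_width / text_width, box_height / text_height)
    return reference_size * scale


# A word measuring 50 x 10 at size 10 is placed in a 100 x 30 box:
# width allows a 2x scale, height a 3x scale, so width is the limit.
size = fit_font_size(50, 10, 100, 30)
```

Taking the minimum of the two ratios keeps the text inside the bounding rectangle in both directions.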

Text Overlay Mode

The page is displayed in text overlay mode, where the original image is visible in the background at an opacity level determined by the position of the slider bar.

Layout Mode

The layout mode toggles between correcting OCR text and correcting the layout of the OCR words.

| Layout Mode is Off | Layout Mode is On |
| ---------------------------------- | ---------------------------------------- |
| Edit text of a recognized word | Resize the bounding rectangle of word |
| Split a word at the current cursor | Move the bounding rectangle horizontally |
| Merge a word with the next word | Add a new word |
| | Delete a word |

Correcting Recognition Mistakes

The only OCR mistake in this document is the cursive signature which has been mis-recognized as "ThA." instead of "Phil.". To correct this problem, we click on the word which opens an edit box, allowing us to replace the text:

Correcting Segmentation Mistakes

Another type of OCR mistake originates in the layout analysis: the OCR engine breaks a word into two or more words, or lumps adjacent words together. The space character is invisible to the OCR process. The decision to insert one or more spaces between characters relies on algorithms that use either heuristics or machine learning to perform text segmentation. Such algorithms are not always perfect. In fact, whether or not one or more spaces were intended between characters of an image is often ambiguous, even to the human eye, without considering the context of the characters.

TessStudio can correct these mistakes by splitting a word at the cursor location or by merging two adjacent words. In both cases, select a word, then right-click and use the context menu to split or merge words.

In the example above, we have split SLEREXE into two words, SLER and EXE. We have also merged the two words COMPANY LIMITED into a single word. These are, of course, for illustration purposes.
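The split and merge operations can be sketched as operations on a word record holding text and a bounding rectangle. This is an illustrative model only: the `Word` class, `split_word` and `merge_words` are hypothetical names, and dividing the rectangle in proportion to character counts is an assumption, not a description of how TessStudio places the cut.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    left: float    # bounding rectangle edges, in page units
    top: float
    right: float
    bottom: float


def split_word(word, cursor):
    """Split `word` before character index `cursor` into two words.

    Assumption: the rectangle is cut in proportion to character counts.
    """
    if not 0 < cursor < len(word.text):
        raise ValueError("cursor must fall strictly inside the word")
    cut = word.left + (word.right - word.left) * cursor / len(word.text)
    first = Word(word.text[:cursor], word.left, word.top, cut, word.bottom)
    second = Word(word.text[cursor:], cut, word.top, word.right, word.bottom)
    return first, second


def merge_words(first, second):
    """Merge a word with the next one: join the text, union the rectangles."""
    return Word(
        first.text + second.text,
        min(first.left, second.left),
        min(first.top, second.top),
        max(first.right, second.right),
        max(first.bottom, second.bottom),
    )


# Splitting SLEREXE after the fourth character yields SLER and EXE.
sler, exe = split_word(Word("SLEREXE", 0, 0, 70, 10), 4)
```

Merging the two halves again restores the original text and rectangle, which is what makes the pair of operations easy to undo.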

Undo / Redo

To restore the original words, you can use the Undo button or press Ctrl+Z several times.

The software supports unlimited levels of Undo and Redo operations.
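Unlimited Undo and Redo is commonly built from two stacks of earlier and later states. The sketch below shows that general technique; the `History` class is hypothetical and the source does not say how TessStudio implements it.

```python
class History:
    """Two-stack undo/redo over immutable document states."""

    def __init__(self, state):
        self.state = state
        self._undo = []   # states to return to on Undo
        self._redo = []   # states to return to on Redo

    def apply(self, new_state):
        """Record an edit; a fresh edit discards any pending Redo states."""
        self._undo.append(self.state)
        self._redo.clear()
        self.state = new_state

    def undo(self):
        """Step back one edit, if there is one, and return the state."""
        if self._undo:
            self._redo.append(self.state)
            self.state = self._undo.pop()
        return self.state

    def redo(self):
        """Step forward one undone edit, if there is one."""
        if self._redo:
            self._undo.append(self.state)
            self.state = self._redo.pop()
        return self.state


history = History(("SLEREXE",))
history.apply(("SLER", "EXE"))   # split
history.undo()                   # back to the original word
```

Because the stacks are plain lists, the number of levels is limited only by available memory.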

Correcting Layout Mistakes

The OCR engine will occasionally create spurious words (e.g. recognize a vertical line segment as 'I' or ']'), miss a word altogether (especially if some text is in inverted colors, such as white on gray) or misplace a word horizontally (an artifact of classification). The layout mode is provided to correct such mistakes.

Move or Resize Word

To move or resize a recognized word, make sure the layout mode is on and click the bounding rectangle of the word. The rectangle will be selected and can be moved or resized horizontally.

Delete Word

To delete a word, select its bounding rectangle in the layout mode and press the Del key or use the menus to delete the word and its bounding rectangle.

Add New Word

To add a new word, select the layout mode and move the mouse cursor to where you wish to insert the word. A new word can only be added to an existing text line. The cursor will change to a cross hair cursor when it detects an existing text line. Draw a bounding rectangle for the new word and enter its text value. The new word will inherit its attributes, such as font size, from a neighboring word.
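The two rules above (a new word must land on an existing text line, and it inherits attributes from a neighbor) can be sketched as follows. All names here are hypothetical, and choosing the neighbor by nearest horizontal center is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    left: float
    right: float
    font_size: float


@dataclass
class TextLine:
    top: float
    bottom: float
    words: list = field(default_factory=list)


def find_line(lines, y):
    """Return the text line whose vertical extent contains y, else None."""
    for line in lines:
        if line.top <= y <= line.bottom:
            return line
    return None


def add_word(lines, text, left, right, y):
    """Insert a new word on the line under y; return None if there is none."""
    line = find_line(lines, y)
    if line is None or not line.words:
        return None
    center = (left + right) / 2
    # Assumption: inherit attributes from the horizontally nearest word.
    neighbor = min(line.words,
                   key=lambda w: abs((w.left + w.right) / 2 - center))
    word = Word(text, left, right, neighbor.font_size)
    line.words.append(word)
    line.words.sort(key=lambda w: w.left)
    return word
```

Returning `None` when no line is found mirrors the behavior described above, where the cross-hair cursor only appears over an existing text line.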

Running OCR

You may run the OCR process on the current document by selecting the "Start OCR..." command under the OCR menus. TessStudio will create new OCR data for the current page, all pages in the document or for a selected number of pages.

The OCR process will delete any existing OCR data on a page, including all edits made to the data. It will, however, preserve and combine any non-OCR text elements that might already be on a page. The original graphics of the page will also be preserved.

This approach has several advantages over the customary approach of rasterizing the entire PDF page and running the resulting monochrome image through the OCR engine:

Protects fidelity, accuracy and font attributes of regular text.
Excludes graphical paths such as line drawings from the OCR engine.
Reduces the amount of text that the OCR needs to process.
Maintains original look of mixed-mode (text, image, graphics) pages.

When running the OCR process on multiple pages, TessStudio will automatically create multiple instances of the tesseract OCR engine and run them in parallel to improve the performance. The performance improvement depends on the number of processing cores in the computer.
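The idea of running one engine instance per page in parallel can be sketched with a worker pool. This is a minimal illustration, not TessStudio's implementation: `ocr_pages` and `tesseract_cli` are hypothetical helpers, the pool size defaulting to the core count is an assumption, and `tesseract_cli` assumes the `tesseract` command-line program is installed and on the PATH.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_cli(image_path, lang="eng"):
    """OCR one page image by launching a separate tesseract process.

    `tesseract <image> stdout -l <lang>` writes the recognized text to
    standard output.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages(pages, ocr_page=tesseract_cli, max_workers=None):
    """Run `ocr_page` on every page in parallel; results keep page order."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, pages))
```

Threads are sufficient here because each one only waits on its own external process, so the actual recognition work is spread across cores by the operating system. Passing a different `ocr_page` callable makes the pool easy to try out without Tesseract installed.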

To run the OCR process, select the "Start OCR..." command from the OCR menus, specify the pages that you wish to process and click the Start button.

OCR Options

Various options that control how the OCR is performed can be set from the dialog accessible via the Options... button on the OCR dialog.

General Tab

On the General tab, you can select OCR and spelling languages as well as the number of parallel processes that can be used to accelerate OCR of multi-page documents.

The selection of the OCR and spelling languages is limited to the language files that have been installed on the computer. You may install or remove these language files using the Manage Languages option:

You may also choose more than one language for processing multi-lingual documents. The spell-checker, however, can only use a single (primary) language.
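Tesseract accepts several languages joined with `+` in its language argument (for example `-l deu+eng` on the command line). A minimal sketch of deriving both settings from one selection follows; the function names are hypothetical, and treating the first selected language as the primary (spelling) language is an assumption.

```python
def ocr_language_arg(languages):
    """Join language codes into Tesseract's `lang1+lang2` form."""
    if not languages:
        raise ValueError("at least one OCR language is required")
    return "+".join(languages)


def spelling_language(languages):
    """Assumption: the first selected language is the primary one."""
    if not languages:
        raise ValueError("at least one OCR language is required")
    return languages[0]


selected = ["deu", "eng"]
lang_arg = ocr_language_arg(selected)     # "deu+eng"
spell_lang = spelling_language(selected)  # "deu"
```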

Selection of multiple languages will be reflected on the OCR dialog (e.g. deu+eng for documents that ha
