Bjjocr
Script to automatically perform zonal OCR on a PDF and rename the PDF according to the results.
Install / Use
/learn @bjjanssen/BjjocrREADME
bjjocr
Script to automatically perform zonal OCR on a PDF and rename the PDF according to the results. It uses tesseract, imagemagick and pdftk.
This script scratches a specific itch our HR department has, namely to process thousands of uniform PDFs twice a month. As such it is geared towards many same-looking PDFs that come in bunches of several hundreds pages per PDF. We need each page as a single sheet identifiable by an ID on the page. So we first burst the PDF into single page PDFs and then run OCR over a zone on each PDF and put the result into a temporary file. We read the temporary file and rename the single page PDF accordingly.
In principle we can do this with as many zones as we like but we must initiate a new tesseract run for every zone.
Quickstart:
1) Install required packages
We run a Debian 7 installation:
$apt-get install tesseract-ocr tesseract-ocr-[yourlanguage] imagemagick pdftk
2) Convert your source file to a tiff-image.
The tiff must fullfill the following requirement
- 300dpi
- 8 bit colordepth
I recommend stripping and trimming, too. Use ImageMagick's convert to create the tiff file:
$convert -depth 8 -density 300 -strip -trim input.pdf output.tiff
3) Use zonal OCR with tesseract.
Tesseract can perform zonal OCR if one or more appropriate zone files are provided. Zonefiles must use the extension .uzn and the same name as the input file, e.g input file = input.tiff then the uzn file must be named input.uzn
The .uzn file format is:
x-coordinate y-coordinate width height identifier
- The x- and y-coordinates define the top left corner of a rectangle.
- Width and height define the dimensions of the rectangle.
- The identifier is unused and currently only helps the user in remembering the defined zone.
- All parameters must be one space apart.
- You can define several zones in one file, but only the first one is used.
$tesseract input.tiff - -l <yourlanguage> -psm 4
Instead of stdout you can use a freely chosen filename. The file will have the extension .txt.
#Future plans
If we can't get the commercial solution working to our demands, we will implement
-
parallel processing of many PDFs.
-
a folder watchdog ([id]notify) to run the script whenever a PDF is dropped into the watched folder.
-
Learn github's markup.
Related Skills
node-connect
336.5kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
82.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
336.5kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
82.9kCommit, push, and open a PR
