Pdftojson
Using XPDF, pdftojson extracts text from PDF files as JSON, including the bounding box of each word.
Compile
./configure
make
On macOS, you might need to specify the libpng and FreeType locations, for example:
./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/ --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/
After the build, you will find pdftojson inside the directory xpdf/pdftojson.
Usage
pdftojson <input.pdf> <output.json>
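To convert several PDFs, the command above can be driven from a script. The following is a minimal sketch, not part of pdftojson itself: the binary path `xpdf/pdftojson/pdftojson` is an assumption based on the build directory mentioned above, and the helper names `build_command`, `convert`, and `convert_directory` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: the compiled binary sits in the build directory named in the README.
DEFAULT_BINARY = "xpdf/pdftojson/pdftojson"


def build_command(input_pdf, output_json, binary: str = DEFAULT_BINARY) -> List[str]:
    """Return the argument list for: pdftojson <input.pdf> <output.json>."""
    return [binary, str(input_pdf), str(output_json)]


def convert(input_pdf, output_json, binary: str = DEFAULT_BINARY) -> None:
    """Run pdftojson on one file; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(input_pdf, output_json, binary), check=True)


def convert_directory(pdf_dir, binary: str = DEFAULT_BINARY) -> List[Path]:
    """Convert every *.pdf in pdf_dir, writing each .json next to its PDF."""
    outputs = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out = pdf.with_suffix(".json")
        convert(pdf, out, binary)
        outputs.append(out)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.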
File format
The JSON produced looks like:
[
  {
    "pages":14,
    "number":1,
    "width":612,
    "height":792,
    "text":[ [115,162,41,14,0,"What "], ... ]
  },
  {
    "pages":14,
    "number":2,
    "width":612,
    "height":792,
    "text":[ [115,162,41,14,0,"Here "], ... ]
  },
  ...
];
For each page, the text array contains one entry per word, in the form: [top,left,width,height,0,text]
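The format described above can be read with any JSON parser. Here is a minimal Python sketch, assuming the field order is exactly as documented ([top,left,width,height,0,text]). The `Word` type and `parse_pages` helper are hypothetical names, the meaning of the fifth field is not documented and is ignored here, and the trailing semicolon from the sample is stripped in case the tool emits it.

```python
import json
from typing import List, NamedTuple


class Word(NamedTuple):
    page: int      # value of the page's "number" field
    top: float
    left: float
    width: float
    height: float
    text: str


def parse_pages(raw: str) -> List[Word]:
    """Flatten pdftojson output into one Word per entry of each page's text array."""
    # Assumption: the output may end with ";" as in the sample; strip it if present.
    pages = json.loads(raw.strip().rstrip(";"))
    words = []
    for page in pages:
        for top, left, width, height, _unused, text in page["text"]:
            words.append(Word(page["number"], top, left, width, height, text))
    return words


# Illustrative input built from the first entry of the sample above.
sample = '[{"pages":14,"number":1,"width":612,"height":792,"text":[[115,162,41,14,0,"What "]]}];'
words = parse_pages(sample)
print(words[0])
```

A flat list of words is convenient for searching or highlighting; the page-level width and height are still available in the raw JSON if coordinates need to be scaled.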