Dpsprep
Python DJVU to PDF converter which preserves OCR text and bookmark metadata (e.g. TOC)
dpsprep
This tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).
Usage
Full example (the name of the PDF is optional and inferred from the input name):
dpsprep --pool=8 --quality=50 input.djvu output.pdf
If you have OCRmyPDF installed, you can use its PDF optimizer:
dpsprep -O3 input.djvu
You can also skip translating the text layer, which is sometimes not translated well, and redo the OCR instead. Rather than launching the ocrmypdf CLI, we use its API directly and accept options in JSON format:
dpsprep --ocr '{"language": ["rus", "eng"]}' input.djvu
Consult the man file (online) for details; there are a lot of options to consider.
See the next section for different ways to run the program.
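When dpsprep is driven from a script, the JSON value for --ocr is easy to get wrong by hand. The sketch below assembles an argument list from Python values, using only the flags shown above. The helper name build_dpsprep_command and its parameters are hypothetical and not part of dpsprep; nothing is executed.

```python
import json
import shlex


def build_dpsprep_command(src, dest=None, pool=None, quality=None, ocr_options=None):
    """Assemble an argument list for the dpsprep CLI (hypothetical helper; runs nothing)."""
    cmd = ["dpsprep"]
    if pool is not None:
        cmd.append(f"--pool={pool}")
    if quality is not None:
        cmd.append(f"--quality={quality}")
    if ocr_options is not None:
        # --ocr takes its options as a JSON document, so serialize the dict.
        cmd += ["--ocr", json.dumps(ocr_options)]
    cmd.append(src)
    if dest is not None:
        # The destination is optional; dpsprep infers it from the input name.
        cmd.append(dest)
    return cmd


cmd = build_dpsprep_command("input.djvu", ocr_options={"language": ["rus", "eng"]})
# shlex.join adds the shell quoting that the JSON argument needs.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids quoting problems with the JSON argument.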
Installation
Automated
An easy way to install a dpsprep executable for the current user is via uv:
uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep
For better compression (see below), the compress extra must be specified:
uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep[compress]
Sometimes a particular feature branch needs to be tested. To install a fixed revision (i.e. commit/branch/tag), the following should work (if extra-name is needed, use dpsprep@rev[extra-name]):
uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep@rev
The only hard prerequisite is djvulibre (e.g. djvulibre on Arch, libdjvulibre-dev on Ubuntu, etc.). We use the Python bindings from the package djvulibre-python (not to be confused with the unmaintained python-djvulibre; see this pull request).
[!TIP] A few people have reported installation problems; see this possible solution and this sample Dockerfile.
[!NOTE] Note that Windows support in djvulibre-python requires 64-bit djvulibre, and they only officially distribute 32-bit Windows packages. If you manage to make it work, consider opening a pull request.
Optional prerequisites are:
- libtiff for bitonal image compression.
- libjpeg (or libjpeg-turbo) for multitonal (RGB or grayscale) compression.
- OCRmyPDF and jbig2enc for PDF optimization (see the next section).
libtiff depends on libjpeg, so installing libtiff will likely install both.
For details on how these dependencies can be installed, see the GitHub Actions workflow and the dpsprep package for Arch Linux.
Manual
Setting up the project is again done via uv. Once inside the cloned repository, the environment for the program can be set up by simply running uv sync --all-extras. After that, the following should work:
uv run dpsprep [OPTIONS] SRC [DEST]
[!NOTE] Previous versions used pyenv for managing Python versions and poetry for managing dependencies and building. Since then the project migrated to uv, which subsumes both and provides other niceties.
You can also build and install the project, for example via pipx:
uv build --wheel
pipx install --include-deps dist/*.whl
[!TIP] The build can fail if the uv_build Python package is not installed. Make sure not only the uv binary, but also the corresponding Python package is available. For example, in the Arch repositories, these are distinct packages, uv and python-uv. Alternatively, try to install the uv-build PyPI package (python-uv-build in Arch) explicitly in this case.
If you want dpsprep to be able to use ocrmypdf from pipx's isolated environment, you must inject it explicitly via
pipx inject dpsprep ocrmypdf
[!TIP] If you are packaging this for some other package manager, consider using PEP-517 tools as shown in this PKGBUILD file.
[!NOTE] Previous versions of the tool itself used to depend on third-party binaries, but this is no longer the case. The test fixtures are checked in; however, regenerating them (see ./fixtures/Makefile) requires pdflatex (texlive, among others), gs (Ghostscript), oxipng (oxipng), pdftotext (Poppler), djvudigital (GSDjVU) and djvused (DjVuLibre). Similarly, the man file is checked in, but building it from markdown depends on ronn.
Details
Compression
We perform compression in two stages:
- The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if libtiff is available, group4 compression is used.
- If OCRmyPDF is installed, its PDF optimization can be used via the flags -O1 to -O3 (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via jbig2enc.
If manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting --tesseract-timeout to 0) may ruin existing text layers. To perform only PDF optimization you can use the following undocumented tool instead:
python -m ocrmypdf.optimize <input_file> <level> <output_file>
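To call the optimizer from a script, the command above can be assembled as an argument list. This is a minimal sketch: the helper name build_optimize_command is hypothetical, and restricting the level to 1 through 3 is an assumption that mirrors the -O1 to -O3 flags mentioned earlier. It only builds the command and does not run it.

```python
import sys


def build_optimize_command(input_file, level, output_file):
    """Build the argv for `python -m ocrmypdf.optimize` (hypothetical helper)."""
    # Assumption: only the levels matching -O1 to -O3 are accepted here.
    if level not in (1, 2, 3):
        raise ValueError(f"optimization level must be 1, 2 or 3, got {level!r}")
    # sys.executable keeps the call inside the current Python environment,
    # which matters when OCRmyPDF is installed in a virtual environment.
    return [
        sys.executable,
        "-m",
        "ocrmypdf.optimize",
        str(input_file),
        str(level),
        str(output_file),
    ]


optimize_cmd = build_optimize_command("input.pdf", 2, "output.pdf")
print(optimize_cmd[1:])
```

The list can then be handed to subprocess.run(optimize_cmd, check=True), provided OCRmyPDF is importable from the same interpreter.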
Text layer
The visible contents of a DjVu file are well-compressed images (see here). But a DjVu file also contains a "text layer" stored as metadata attached to invisible rectangular blocks. PDF does not support such constructs, so we do a little hack.
We render each page as an image and put it as a background in the PDF. We then use a font, invisible1.ttf, taken from here, to "draw" text. Every time we draw a block of text, we rescale the font so that the width of the text matches that of the corresponding DjVu block.
[!NOTE] The font is small (12 KB) and contains (invisible) Latin, Cyrillic and Greek characters. Even Chinese characters seem to be working correctly, at least with Evince.
The following screenshot displays the result of converting a DjVu document:

The following screenshot displays the same document without the background image and with the invisible font replaced by Times New Roman:

Since the image is actually drawn on top of the text, there is no harm in using an actual visible font, possibly rendered using a transparent "color". Still, when searching and selecting text, the scrambled letters from the second image would be highlighted. With the invisible font, there are no visible glyphs to highlight, so an illusory "block" containing the text is highlighted instead.
See ./dpsprep/text.py for the implementation.
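The font rescaling described above comes down to a single proportion, because the rendered width of a string grows linearly with the font size. The following sketch illustrates the arithmetic only and is not the code from ./dpsprep/text.py. The names fitted_font_size and toy_measure are hypothetical, and the toy metric (every glyph half an em wide) stands in for real font metrics.

```python
def fitted_font_size(text, block_width, measure_width, base_size=10.0):
    """Return the font size at which `text` spans exactly `block_width`."""
    base_width = measure_width(text, base_size)
    if base_width <= 0:
        # Nothing measurable (e.g. empty text); keep the base size.
        return base_size
    # Width is proportional to font size, so one scaling step is enough.
    return base_size * block_width / base_width


def toy_measure(text, size):
    # Hypothetical stand-in for font metrics: each glyph is 0.5 em wide.
    return 0.5 * size * len(text)


# A 9-character word that must fill a block 90 units wide.
size = fitted_font_size("invisible", block_width=90.0, measure_width=toy_measure)
print(size)  # 20.0
```

With a real font, measure_width would query the glyph advances of the chosen font. The proportion itself stays the same.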
Kevin's notes regarding the first version
I wrote this with the specific intent of converting ebooks in the DJVU format into PDFs for use with the fantastic (but pricey) Sony Digital Paper System.
DjVu technology is strikingly superior for many ebook applications, yet the Sony Digital Paper System (rev 1.3 US) only supports PDF technology: this is because its primary design purpose is not as an ereader. The device, however, is quite nearly the perfect ereader.
Unfortunately, all presently available DjVu to PDF tools seem to just dump flattened enormous TIFF images. This is ridiculous. Since PDF really can't do that much better on the way it stores image data, a 5-6x bloat cannot be avoided. However, none of the existing tools preserve:
- The OCR'd text content
- Table of Contents or Internal links
This is kind of silly, but until Sony's Digital Paper, there was no need to move functional DjVu files to PDFs. In order to make workable PDFs from DjVu files for use on the Digital Paper System, I have implemented in one location the following procedures detailed here:
- By automating the procedure of user zetah for extracting the text and getting it in the correct locations: http://askubuntu.com/questions/46233/converting-djvu-to-pdf (OCR text transfer)
- By implementing the procedure of user pyrocrasty for extracting the outline, and putting it into the PDF generated above: http://superuser.com/questions/801893/converting-djvu-to-pdf-and-preserving-table-of-contents-how-is-it-possible (bookmark transfer)
