Pdf2doi
A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.
pdf2doi
pdf2doi is a Python library/command-line tool to automatically extract the DOI or other identifiers (e.g. arXiv ID) starting from the .pdf file of a publication
(or from a folder containing several .pdf files), and to retrieve bibliographic information.
It uses several methods (described in detail below) to find a valid identifier of a pdf file, and it validates any result
via web queries to public archives (e.g. http://dx.doi.org).
The validation process also returns raw BibTeX information, which can be used for further processing, such as generating BibTeX entries (pdf2bib) or
automatically renaming pdf files (pdf-renamer).
pdf2doi can be used from the command line, inside a Python script, or (Windows only) directly from the right-click context menu of a pdf file or a folder.
Warning
Versions of pdf2doi prior to 1.6 are affected by a very annoying bug. By default, after finding the DOI of a pdf paper, pdf2doi stores the DOI in the metadata of the pdf file. Due to a bug, the size of the pdf file doubled every time metadata was added. This bug has been fixed in all versions >= 1.6.
If you have pdf files that have been affected by this bug, you can use pdf2doi to fix them. After updating to a version >= 1.6, run pdf2doi path/to/folder/containing/pdf/files -id ''. This will restore the pdf files to their original size.
Thanks Ole Steuernagel for pointing out this issue.
Latest stable version
The latest stable version of pdf2doi is 1.7. See here for the full change log.
[v1.7] - 2024-11-10
Main changes
- Changed url for dx.doi.org validation (https://github.com/MicheleCotrufo/pdf2doi/issues/35)
- Added 'r' in front of strings to suppress warnings in recent Python versions (https://github.com/MicheleCotrufo/pdf2doi/pull/36)
- Changed pymupdf dependency to pymupdf>=1.21.0 (https://github.com/MicheleCotrufo/pdf2doi/issues/32 https://github.com/MicheleCotrufo/pdf2doi/issues/28 https://github.com/MicheleCotrufo/pdf2doi/issues/37)
Installation
Use the package manager pip to install pdf2doi.
pip install pdf2doi==1.7
- Many users have reported (https://github.com/MicheleCotrufo/pdf2doi/issues/32 https://github.com/MicheleCotrufo/pdf2doi/issues/28 https://github.com/MicheleCotrufo/pdf2doi/issues/37) that the installation fails because of an issue related to installing the library pymupdf. We are still not sure what the issue is. A possible fix seems to be installing pymupdf separately (before installing pdf2doi), via pip install "pymupdf>=1.21.0" (the quotes prevent the shell from interpreting >= as a redirection).
- The library textract provides additional ways to analyze pdf files, and it is sometimes more powerful than PyPDF2, but it comes with a large overhead of additional required dependencies, and sometimes it generates version conflicts. The user can decide whether to install it or not. pdf2doi will only try to use this library if it detects that it is installed. To install it,
pip install textract==1.6.4
pip install pdfminer.six==20191110
Under Windows, after installation of pdf2doi it is also possible to add shortcuts to the right-click context menu.
Used by
Here is a list of applications/repositories that make use of pdf2doi. If you use pdf2doi in your application and you wish to add it to this list, send me a message.
Table of Contents
- Installation
- Description
- Usage
- Installing the shortcuts in the right-click context menu of Windows
- Contributing
- License
- Acknowledgment
- Donating
Description
Automatically associating a DOI or other identifiers (e.g. arXiv ID) to a pdf file can be either a very easy or a very difficult (sometimes nearly impossible) task, depending on how much care was placed in crafting the file. In the simplest case (which typically works with most recent publications) it is enough to look into the file metadata. For older publications, the identifier is often found within the pdf text and it can be extracted with the help of regular expressions. In the unluckiest cases, the only method left is to google some details of the publication (e.g. the title or parts of the text) and hope that a valid identifier is contained in one of the first results.
pdf2doi applies all these methods sequentially (starting from the simplest ones) until an identifier is found and validated.
Specifically, for a given .pdf file it will, in order,
- Look into the metadata of the .pdf file (extracted via the library PyPDF2) and check if any of them contains a string that matches the pattern of a DOI or an arXiv ID. Priority is given to metadata which contain the word 'doi' in their label.
- Check if the name of the pdf file contains any sub-string that matches the pattern of a DOI or an arXiv ID.
- Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with the libraries PyPDF2 and pdfminer. If the library textract is installed, pdf2doi will try to use that too.
- Try to find possible titles of the publication. In the current version, possible titles are identified via the libraries pdftitle and PyMuPDF, and by the file name. For each possible title a google search is performed and the plain text of the first results is scanned for valid identifiers.
- As a last desperate attempt, the first N=1000 characters of the pdf text are used as a query for a google search. The plain text of the first results is scanned for valid identifiers.
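The pattern matching used in the first three steps can be illustrated with a minimal sketch. The regular expressions and the function name find_candidates below are assumptions made for illustration only; the patterns actually used by pdf2doi may differ.

```python
import re

# Illustrative patterns only; pdf2doi's own regular expressions may differ.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_PATTERN = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def find_candidates(text):
    """Return (kind, identifier) pairs in order of appearance in `text`."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Strip punctuation that often trails a DOI at the end of a sentence.
        found.append((match.start(), "DOI", match.group(0).rstrip(".,;)")))
    for match in ARXIV_PATTERN.finditer(text):
        # Keep only the numeric part, without the optional version suffix.
        found.append((match.start(), "arXiv ID", match.group(1)))
    return [(kind, identifier) for _, kind, identifier in sorted(found)]


sample = "See https://doi.org/10.1103/PhysRevLett.116.061102. Also arXiv:1602.03837v1."
candidates = find_candidates(sample)
```

The same function could be applied to a metadata value, a file name, or the extracted text. Every candidate it returns is only a string that looks like an identifier, and it still needs to be validated.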
Any time that a potential identifier is found, it is also validated by performing a query to a relevant website (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arXiv IDs). This validation process also returns raw BibTeX info when the identifier is valid.
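One possible way to perform such a validation for a DOI is sketched below. It relies on the content negotiation offered by the DOI resolver, which returns a BibTeX record when the request carries the header Accept: application/x-bibtex. The URL https://doi.org/ and the function names build_bibtex_request and validate_doi are assumptions chosen for this example, and this is not necessarily how pdf2doi performs the query internally.

```python
import urllib.parse
import urllib.request


def build_bibtex_request(doi):
    """Build a request that asks the DOI resolver for a BibTeX record."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def validate_doi(doi, timeout=10):
    """Return the BibTeX text if the DOI resolves, otherwise None."""
    try:
        with urllib.request.urlopen(build_bibtex_request(doi), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors (e.g. 404 for an unknown DOI), network errors and timeouts.
        return None


request = build_bibtex_request("10.1103/PhysRevLett.116.061102")
```

Calling validate_doi requires a network connection. A return value of None only means that the query did not succeed, so a temporary network problem is indistinguishable from an invalid DOI in this sketch.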
When a valid identifier is found with any method other than the first one, the identifier is also stored inside the metadata of the pdf file. In this way, future lookups of this same file will be able to extract the identifier with the first method, speeding up the search (this feature can be disabled by the user, in case edits to the pdf file are not desired).
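The overall strategy (try the methods in order, validate each candidate, stop at the first valid one) can be summarized with the following sketch. The function find_identifier, the method names, and the stub finders are hypothetical and only illustrate the control flow; they are not part of the pdf2doi API.

```python
def find_identifier(path, methods, validate):
    """Try each (name, finder) pair in order and return the first validated hit.

    `finder(path)` returns candidate identifier strings, and `validate(candidate)`
    returns bibliographic data, or None when the candidate is not valid.
    """
    for name, finder in methods:
        for candidate in finder(path):
            info = validate(candidate)
            if info is not None:
                return {"identifier": candidate, "method": name, "validation_info": info}
    return None


# Stub finders and a stub validator, standing in for the real lookups.
methods = [
    ("metadata", lambda path: []),
    ("filename", lambda path: ["10.0000/wrong", "10.1234/abcd"]),
    ("text", lambda path: ["10.9999/never-reached"]),
]
known_identifiers = {"10.1234/abcd": "@article{...}"}

result = find_identifier("paper.pdf", methods, known_identifiers.get)
# Here the hit comes from the second method, so a caller could now store it in the metadata.
should_store_in_metadata = result is not None and result["method"] != "metadata"
```

Returning the name of the method together with the identifier lets the caller decide whether the identifier still needs to be written into the metadata of the file.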
The library is far from perfect. Often, especially for old publications, none of the currently implemented methods will work. Other times the wrong DOI might be extracted: this can happen, for example,
if the DOI of another paper is present in the pdf text and it appears before the correct DOI. A quick and dirty solution to this problem is to look up the identifier manually and then add it to the metadata
of the file, with the methods shown here (from the Python console) or here (from the command line).
In this way, pdf2doi will always retrieve the correct DOI when analyzing this same file in the future, which can be useful when pdf2doi is used to automate
bibliographic procedures for a large number of files (e.g. via pdf2bib or
pdf-renamer).
Currently, only the format of arXiv identifiers in use after 1 April 2007 is supported.
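For reference, arXiv identifiers issued since April 2007 have the form YYMM.NNNN (four digits up to December 2014) or YYMM.NNNNN (five digits from January 2015), optionally followed by a version suffix such as v2. Older identifiers have the form archive/YYMMNNN, e.g. hep-th/9901001. A minimal check for the supported format could look as follows; the function name is_new_style_arxiv_id is hypothetical and the check is a sketch, not the code used by pdf2doi.

```python
import re

NEW_STYLE = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$")


def is_new_style_arxiv_id(identifier):
    """Return True if `identifier` follows the arXiv scheme in use since April 2007."""
    match = NEW_STYLE.match(identifier)
    if match is None:
        return False
    year, month, number = int(match.group(1)), int(match.group(2)), match.group(3)
    if not 1 <= month <= 12 or (year, month) < (7, 4):
        return False
    # Four-digit sequence numbers until 1412, five-digit ones from 1501 onwards.
    expected_digits = 4 if (year, month) < (15, 1) else 5
    return len(number) == expected_digits
```

With this check, is_new_style_arxiv_id("1602.03837v1") is accepted, while an old-style identifier such as hep-th/9901001 is rejected.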
Usage
pdf2doi can be used as a stand-alone application invoked from the command line, by importing it in your Python project, or (Windows only) directly from the right-click context menu of a pdf file or a folder.
Command line usage
pdf2doi can be invoked directly from the command line, without having to open a Python console.
The simplest command-line invocation is
$ pdf2doi 'path/to/target'
where target is either a valid pdf file or a directory containin
