Pdf2bib
A python library/command-line tool to quickly and automatically generate BibTeX data starting from the pdf file of a scientific publication.
pdf2bib
pdf2bib is a Python library/command-line tool to extract bibliographic information from the .pdf file of a publication
(or from a folder containing several .pdf files), and automatically generate BibTeX entries. The pdf file can be either a paper published in a scientific journal (i.e. with
a DOI associated with it), or an arXiv preprint. The bibliographic information is retrieved by querying public archives, so an internet connection is required.
pdf2bib can be used either from the command line, or inside your Python script or, only for Windows, directly from the right-click context menu of a pdf file or a folder.
Warning
pdf2bib uses pdf2doi to find the DOI of a paper. Versions of pdf2doi prior to 1.6 are affected by a very annoying bug. By default, after finding the DOI of a pdf paper, pdf2doi stores the DOI in the metadata of the pdf file. Due to the bug, the size of the pdf file doubled every time metadata was added. This bug has been fixed in all versions of pdf2doi >= 1.6.
If you have pdf files that have been affected by this bug, you can use pdf2doi to fix them. After updating pdf2doi to a version >= 1.6, run pdf2doi path/to/folder/containing/pdf/files -id ''. This will restore the pdf files to their original size.
Thanks to Ole Steuernagel for pointing out this issue.
Latest stable version
The latest stable version of pdf2bib is 1.2. See here for the full change log.
[v1.2] - 2024-06-18
Main changes
- Added the CLI option -nostore, which allows the user to opt out of the default behaviour of pdf2doi regarding storing the found identifier into the pdf metadata. When -nostore is added to the CLI invocation of pdf2bib, the pdf files will not be modified by pdf2doi.
Added
- Make sure the entry id cannot contain commas https://github.com/MicheleCotrufo/pdf2bib/pull/8.
- Make sure that the input variable target is converted to a string before processing, and fix the trailing colon for some PDF files https://github.com/MicheleCotrufo/pdf2bib/pull/16.
Installation
Use the package manager pip to install pdf2bib.
pip install pdf2bib==1.2
Under Windows, it is also possible to add shortcuts to the right-click context menu.
Table of Contents
- Installation
- Description
- Usage
- Installing the shortcuts in the right-click context menu of Windows
- Contributing
- License
- Acknowledgment
- Donating
Description
pdf2bib relies on the library pdf2doi, which can automatically find a valid identifier of a scientific publication (i.e. either a DOI or an arXiv ID)
starting from a .pdf file. pdf2doi will query public archives (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arXiv IDs) in order to validate any identifier found. The validation process returns raw BibTeX data (see also here), which is then used by
pdf2bib to generate a valid BibTeX entry in the format
@article{[LastNameFirstAuthor][PublicationYear][FirstWordTitle],
title = ...,
volume = ...,
issue = ...,
page = ...,
publisher = ...,
url = ...,
doi = ...,
journal = ...,
year = ...,
month = ...,
author = ...
}
In the current version the format of the BibTeX entry is not customizable by the user (unless you want to change the code - have fun :D), but this functionality will be implemented in future releases.
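To make the entry format described above concrete, here is a minimal Python sketch that builds the entry id and the entry text from a metadata dictionary. It is not the code used by pdf2bib: the names make_entry_key, format_entry and FIELD_ORDER, as well as the shape of the metadata dictionary, are assumptions made only for this illustration.

```python
import re

# Field order taken from the entry template shown above; the author field goes last.
FIELD_ORDER = ["title", "volume", "issue", "page", "publisher",
               "url", "doi", "journal", "year", "month"]


def make_entry_key(authors, year, title):
    """Build [LastNameFirstAuthor][PublicationYear][FirstWordTitle], lowercased.

    Hypothetical helper: 'authors' is assumed to be a list of full names.
    """
    last_name = authors[0].split()[-1] if authors else ""
    words = title.split()
    first_word = words[0] if words else ""
    key = f"{last_name}{year}{first_word}".lower()
    # Drop anything that is not a letter, digit or underscore,
    # so the key cannot contain commas or spaces.
    return re.sub(r"\W", "", key)


def format_entry(metadata):
    """Format a metadata dictionary (assumed layout) as an @article entry."""
    authors = metadata.get("author", [])
    key = make_entry_key(authors, metadata.get("year", ""), metadata.get("title", ""))
    lines = ["    " + name + " = {" + str(metadata[name]) + "}"
             for name in FIELD_ORDER if name in metadata]
    if authors:
        lines.append("    author = {" + " and ".join(authors) + "}")
    return "@article{" + key + ",\n" + ",\n".join(lines) + "\n}"


example = {
    "title": "An efficient numerical evaluation of the Green's function "
             "for the Helmholtz operator on periodic structures",
    "volume": 63,
    "year": 1986,
    "author": ["Kirk E Jordan", "Gerard R Richter", "Ping Sheng"],
}
print(format_entry(example))
```

When the metadata contains no author, the key reduces to the year followed by the first word of the title.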
Usage
pdf2bib can be used either as a stand-alone application invoked from the command line, or by importing it into your Python project or, only for Windows,
directly from the right-click context menu of a pdf file or a folder.
Command line usage
pdf2bib can be invoked directly from the command line, without having to open a python console.
The simplest command-line invocation is
pdf2bib 'path/to/target'
where target is either a valid pdf file or a directory containing pdf files. Adding the optional flag -v increases the output verbosity,
documenting all steps.
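If you want to drive the command-line tool from a script (for example to process several folders), the invocation described above can be assembled programmatically. The sketch below only builds the argument list; the function name build_pdf2bib_command is hypothetical, and the only flags it uses are -v and -nostore, both mentioned in this README.

```python
import shlex


def build_pdf2bib_command(target, verbose=False, nostore=False):
    """Return the argument list for a pdf2bib command-line call.

    Hypothetical helper: 'target' is a pdf file or a folder containing pdf files.
    """
    command = ["pdf2bib", str(target)]
    if verbose:
        command.append("-v")        # increase output verbosity
    if nostore:
        command.append("-nostore")  # do not let pdf2doi modify the pdf files
    return command


# Show the command as it would be typed in a shell.
print(shlex.join(build_pdf2bib_command("examples", verbose=True)))
```

Assuming pdf2bib is installed and available on the PATH, the returned list can be passed to subprocess.run(command, check=True) to execute it.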
For example, when targeting the folder examples we get the following output
pdf2bib examples -v
[pdf2bib]: Looking for pdf files in the folder examples...
[pdf2bib]: Found 4 pdf files.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\1-s2.0-0021999186900938-main.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1016/0021-9991(86)90093-8 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1016/0021-9991(86)90093-8 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\chaumet_JAP_07.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\PhysRevLett.116.061102.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1103/PhysRevLett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/PhysRevLett.116.061102 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\s41586-019-1666-5.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/doi'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
@article{jordan1986an,
title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
volume = {63},
issue = {1},
page = {222-235},
publisher = {Elsevier BV},
url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
doi = {10.1016/0021-9991(86)90093-8},
journal = {Journal of Computational Physics},
year = {1986},
month = {3},
author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
@article{chaumet2007coupled,
title = {Coupled dipole method to compute optical torque: Application to a micropropeller},
volume = {101},
issue = {2},
page = {023106},
publisher = {AIP Publishing},
url = {http://dx.doi.org/10.1063/1.2409490},
doi = {10.1063/1.2409490},
journal = {Journal of Applied Physics},
year = {2007},
month = {1},
author = {Patrick C. Chaumet and C. Billaudeau}
}
@article{2016observation,
title = {Observation of Gravitational Waves from a Binary Black Hole Merger},
volume = {116},
issue = {6},
publisher = {American Physical Society (APS)},
url = {http://dx.doi.org/10.1103/PhysRevLett.116.061102},
doi = {10.1103/physrevlett.116.061102},
journal = {Physical Review Letters},
year = {
