Pdf2bib

A Python library/command-line tool to quickly and automatically generate BibTeX data starting from the pdf file of a scientific publication.


pdf2bib is a Python library/command-line tool to extract bibliographic information from the .pdf file of a publication (or from a folder containing several .pdf files), and automatically generate BibTeX entries. The pdf file can be either a paper published in a scientific journal (i.e. with a DOI associated with it), or an arXiv preprint. The bibliographic information is retrieved by querying public archives, so an internet connection is required.

pdf2bib can be used from the command line, inside your Python script or, on Windows only, directly from the right-click context menu of a pdf file or a folder.

Warning

pdf2bib uses pdf2doi to find the DOI of a paper. Versions of pdf2doi prior to 1.6 are affected by a very annoying bug. By default, after finding the DOI of a pdf paper, pdf2doi stores the DOI in the metadata of the pdf file. Because of the bug, the size of the pdf file doubled every time metadata was added. This bug is fixed in all versions of pdf2doi >= 1.6.

If you have pdf files that were affected by this bug, you can use pdf2doi to repair them. After updating pdf2doi to a version >= 1.6, run pdf2doi path/to/folder/containing/pdf/files -id ''. This restores the pdf files to their original size.

Thanks to Ole Steuernagel for pointing out this issue.
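To check whether an installed pdf2doi predates the fix, a script along the following lines may help. It is only a sketch: the helper names parse_version and is_affected are hypothetical (they are not part of pdf2doi or pdf2bib), and the version parsing is deliberately simple, reading only the leading numeric parts of the version string.

```python
from importlib import metadata

# First pdf2doi release without the file-size bug, according to the warning above.
FIXED_IN = (1, 6)


def parse_version(text):
    """Turn '1.5.1' into (1, 5, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_affected(version_text):
    """Return True if this pdf2doi version predates the fix."""
    return parse_version(version_text) < FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("pdf2doi")
    except metadata.PackageNotFoundError:
        print("pdf2doi is not installed")
    else:
        state = "affected, please upgrade" if is_affected(installed) else "not affected"
        print(f"pdf2doi {installed}: {state}")
```

If the script reports an affected version, upgrading pdf2doi and then running the repair command above is the suggested path.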

Latest stable version

The latest stable version of pdf2bib is 1.2. See the change log for the full list of changes.

[v1.2] - 2024-06-18

Main changes

Added the CLI option -nostore, which allows the user to opt out of the default behaviour of pdf2doi of storing the found identifier in the pdf metadata. When -nostore is added to the CLI invocation of pdf2bib, the pdf files will not be modified by pdf2doi.
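For scripted use, the CLI call with -nostore could be assembled from Python roughly as follows. This is a sketch under assumptions: build_command and run_pdf2bib are hypothetical helper names, the pdf2bib executable is assumed to be on the PATH, and the output is assumed to arrive on standard output. Only the flags -nostore and -v (the verbosity flag described under Command line usage) are taken from this README.

```python
import subprocess


def build_command(target, verbose=False, nostore=False):
    """Assemble the argument list for a pdf2bib CLI call."""
    command = ["pdf2bib", str(target)]
    if verbose:
        command.append("-v")        # more detailed output
    if nostore:
        command.append("-nostore")  # leave the pdf metadata untouched
    return command


def run_pdf2bib(target, **options):
    """Run pdf2bib and return whatever it printed to standard output."""
    completed = subprocess.run(
        build_command(target, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

For example, build_command("examples", nostore=True) yields the argument list for pdf2bib examples -nostore, which processes the folder without touching the pdf files.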

Added

Make sure the entry id cannot contain commas (https://github.com/MicheleCotrufo/pdf2bib/pull/8).
Make sure that the input variable target is converted to a string before processing, and fix the trailing colon for some PDF files (https://github.com/MicheleCotrufo/pdf2bib/pull/16).

Installation

Use the package manager pip to install pdf2bib.

pip install pdf2bib==1.2

Under Windows, it is also possible to add shortcuts to the right-click context menu.

Installation
Description
Usage
- Command line usage
  - Creating a bib file from a folder
  - Manually associate the correct identifier to a file from command line
- Usage inside a python script
  - Manually associate the correct identifier to a file
Installing the shortcuts in the right-click context menu of Windows
Contributing
License
Acknowledgment
Donating

Description

pdf2bib relies on the library pdf2doi, which can automatically find a valid identifier of a scientific publication (i.e. either a DOI or an arXiv ID) starting from a .pdf file. pdf2doi queries public archives (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arXiv IDs) in order to validate any identifier found. The validation process returns raw BibTeX data (see also the pdf2doi documentation), which pdf2bib then uses to generate a valid BibTeX entry in the format

@article{[LastNameFirstAuthor][PublicationYear][FirstWordTitle],
        title = ...,
        volume = ...,
        issue = ...,
        page = ...,
        publisher = ...,
        url = ...,
        doi = ...,
        journal = ...,
        year = ...,
        month = ...,
        author = ...
}

In the current version the format of the BibTeX entry is not customizable by the user (unless you want to change the code - have fun :D), but this functionality will be implemented in future releases.
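The entry id pattern [LastNameFirstAuthor][PublicationYear][FirstWordTitle] can be illustrated with a short sketch. This is not pdf2bib's actual code: make_entry_id is a hypothetical helper, and lowercasing the id and dropping every non-alphanumeric character (which also removes commas) are assumptions inferred from ids such as jordan1986an in the example output.

```python
def make_entry_id(authors, year, title):
    """Build an id like 'jordan1986an' from a list of 'First Last' author names."""
    # Last name of the first author, or nothing if no author is known.
    name_parts = authors[0].split() if authors else []
    last_name = name_parts[-1] if name_parts else ""

    # First word of the title, or nothing if the title is empty.
    title_words = title.split()
    first_word = title_words[0] if title_words else ""

    raw = f"{last_name}{year}{first_word}".lower()
    # Keep letters and digits only, so the id cannot contain commas or spaces.
    return "".join(ch for ch in raw if ch.isalnum())


print(make_entry_id(["Kirk E Jordan", "Gerard R Richter", "Ping Sheng"], 1986,
                    "An efficient numerical evaluation of the Green's function"))
print(make_entry_id([], 2016,
                    "Observation of Gravitational Waves from a Binary Black Hole Merger"))
```

When no author is available, the id falls back to the year and the first title word, which matches the shape of an id like 2016observation.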

Usage

pdf2bib can be used as a stand-alone application invoked from the command line, by importing it in your Python project or, on Windows only, directly from the right-click context menu of a pdf file or a folder.

Command line usage

pdf2bib can be invoked directly from the command line, without having to open a Python console. The simplest command-line invocation is

pdf2bib 'path/to/target'

where target is either a valid pdf file or a directory containing pdf files. Adding the optional flag -v increases the output verbosity, documenting all steps. For example, running pdf2bib on the folder examples produces the following output:

pdf2bib examples -v
[pdf2bib]: Looking for pdf files in the folder examples...
[pdf2bib]: Found 4 pdf files.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\1-s2.0-0021999186900938-main.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1016/0021-9991(86)90093-8 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1016/0021-9991(86)90093-8 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\chaumet_JAP_07.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\PhysRevLett.116.061102.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1103/PhysRevLett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/PhysRevLett.116.061102 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\s41586-019-1666-5.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/doi'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
@article{jordan1986an,
        title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
        volume = {63},
        issue = {1},
        page = {222-235},
        publisher = {Elsevier BV},
        url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
        doi = {10.1016/0021-9991(86)90093-8},
        journal = {Journal of Computational Physics},
        year = {1986},
        month = {3},
        author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
@article{chaumet2007coupled,
        title = {Coupled dipole method to compute optical torque: Application to a micropropeller},
        volume = {101},
        issue = {2},
        page = {023106},
        publisher = {AIP Publishing},
        url = {http://dx.doi.org/10.1063/1.2409490},
        doi = {10.1063/1.2409490},
        journal = {Journal of Applied Physics},
        year = {2007},
        month = {1},
        author = {Patrick C. Chaumet and C. Billaudeau}
}
@article{2016observation,
        title = {Observation of Gravitational Waves from a Binary Black Hole Merger},
        volume = {116},
        issue = {6},
        publisher = {American Physical Society (APS)},
        url = {http://dx.doi.org/10.1103/PhysRevLett.116.061102},
        doi = {10.1103/physrevlett.116.061102},
        journal = {Physical Review Letters},
        year = {

