
pdf2bib

pdf2bib is a Python library/command-line tool that extracts bibliographic information from the .pdf file of a publication (or from a folder containing several .pdf files) and automatically generates BibTeX entries. The pdf file can be either a paper published in a scientific journal (i.e. with a DOI associated with it) or an arXiv preprint. The bibliographic information is retrieved by querying public archives, so an internet connection is required.

pdf2bib can be used from the command line, inside your Python script or, on Windows only, directly from the right-click context menu of a pdf file or a folder.


Warning

pdf2bib uses pdf2doi to find the DOI of a paper. Versions of pdf2doi prior to 1.6 are affected by a very annoying bug. By default, after finding the DOI of a pdf paper, pdf2doi stores the DOI in the metadata of the pdf file. Due to the bug, the size of the pdf file doubled every time metadata was added. This bug has been fixed in all versions of pdf2doi >= 1.6.

If you have pdf files that have been affected by this bug, you can use pdf2doi to fix them. After updating pdf2doi to a version >= 1.6, run pdf2doi path/to/folder/containing/pdf/files -id ''. This will restore the pdf files to their original size.

Thanks to Ole Steuernagel for pointing out this issue.
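To check whether an installation is affected, the version requirement from the warning above can be tested from Python. The following is a minimal sketch, not part of pdf2bib or pdf2doi: the helper names parse_version, is_fixed_version and pdf2doi_is_safe are hypothetical, and the comparison only looks at the leading numeric components of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Convert '1.6' or '1.6.1' into a tuple of ints, e.g. (1, 6, 1).

    Only the leading digits of each dot-separated piece are kept, so a
    suffix such as 'rc1' is ignored (a simplification for this sketch).
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_fixed_version(installed, minimum="1.6"):
    """Return True if 'installed' is at least 'minimum' (1.6 per the warning)."""
    return parse_version(installed) >= parse_version(minimum)


def pdf2doi_is_safe():
    """Return True/False for the installed pdf2doi, or None if it is not installed."""
    try:
        return is_fixed_version(version("pdf2doi"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("pdf2doi >= 1.6:", pdf2doi_is_safe())
```

If the check reports False, updating pdf2doi before processing more files avoids further growth of the pdf files.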

Latest stable version

The latest stable version of pdf2bib is 1.2. See here for the full change log.

[v1.2] - 2024-06-18

Main changes

  • Added the CLI option -nostore, which lets the user opt out of pdf2doi's default behaviour of storing the found identifier in the pdf metadata. When -nostore is added to the CLI invocation of pdf2bib, the pdf files will not be modified by pdf2doi.
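When pdf2bib is driven from another script, the -nostore option can be passed like any other command-line argument. The sketch below assumes only what this README documents (the pdf2bib executable, a target path, and the -v and -nostore options); the helper names build_command and run_pdf2bib are hypothetical and are not part of pdf2bib.

```python
import shutil
import subprocess


def build_command(target, verbose=False, nostore=False):
    """Assemble the argument list for a pdf2bib CLI call."""
    command = ["pdf2bib", str(target)]
    if verbose:
        command.append("-v")        # more detailed log output
    if nostore:
        command.append("-nostore")  # leave the pdf metadata untouched
    return command


def run_pdf2bib(target, verbose=False, nostore=True):
    """Run the pdf2bib CLI and return the CompletedProcess object.

    Raises FileNotFoundError if the executable is not on PATH.
    """
    if shutil.which("pdf2bib") is None:
        raise FileNotFoundError("pdf2bib executable not found on PATH")
    return subprocess.run(
        build_command(target, verbose=verbose, nostore=nostore),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Show the command that would be executed for the 'examples' folder.
    print(build_command("examples", verbose=True, nostore=True))
```

Keeping the argument list separate from the subprocess call makes it easy to inspect the exact command before any pdf file is touched.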

Added

  • Make sure the entry id cannot contain commas https://github.com/MicheleCotrufo/pdf2bib/pull/8.
  • Make sure that the input variable target is converted to a string before processing, and fix the trailing colon for some PDF files https://github.com/MicheleCotrufo/pdf2bib/pull/16.

Installation

Use the package manager pip to install pdf2bib.

pip install pdf2bib==1.2

Under Windows, it is also possible to add shortcuts to the right-click context menu.

Description

pdf2bib relies on the library pdf2doi, which can automatically find a valid identifier of a scientific publication (i.e. either a DOI or an arXiv ID) starting from a .pdf file. pdf2doi queries public archives (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arXiv IDs) to validate any identifier it finds. The validation process returns raw BibTeX data (see also here), which pdf2bib then uses to generate a valid BibTeX entry in the format

@article{[LastNameFirstAuthor][PublicationYear][FirstWordTitle],
        title = ...,
        volume = ...,
        issue = ...,
        page = ...,
        publisher = ...,
        url = ...,
        doi = ...,
        journal = ...,
        year = ...,
        month = ...,
        author = ...
}

In the current version the format of the BibTeX entry cannot be customized by the user (unless you want to change the code - have fun :D), but this functionality will be implemented in future releases.
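The entry ID pattern in the template above, [LastNameFirstAuthor][PublicationYear][FirstWordTitle], can be illustrated with a short sketch. This is not pdf2bib's actual code: the function name make_entry_id is hypothetical, and the choices to lowercase the ID, to take the last whitespace-separated token as the last name, and to drop every non-alphanumeric character (which also removes commas) are assumptions made for illustration.

```python
def make_entry_id(first_author, year, title):
    """Build an ID of the form [LastName][Year][FirstWordOfTitle].

    Assumptions of this sketch: the last name is the last
    whitespace-separated token of the author string, the result is
    lowercased, and anything that is not a letter or digit is removed.
    """
    author_tokens = first_author.split()
    title_tokens = title.split()
    last_name = author_tokens[-1] if author_tokens else ""
    first_word = title_tokens[0] if title_tokens else ""
    raw = f"{last_name}{year}{first_word}".lower()
    # Keeping only alphanumeric characters guarantees a comma-free ID.
    return "".join(ch for ch in raw if ch.isalnum())


if __name__ == "__main__":
    print(make_entry_id("Kirk E Jordan", 1986,
                        "An efficient numerical evaluation"))
    print(make_entry_id("Patrick C. Chaumet", 2007,
                        "Coupled dipole method to compute optical torque"))
```

When no author is available, the sketch simply leaves the first component empty, so the ID starts with the year.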

Usage

pdf2bib can be used as a stand-alone application invoked from the command line, by importing it in your Python project or, on Windows only, directly from the right-click context menu of a pdf file or a folder.

Command line usage

pdf2bib can be invoked directly from the command line, without having to open a Python console. The simplest command-line invocation is

pdf2bib 'path/to/target'

where target is either a valid pdf file or a directory containing pdf files. Adding the optional flag -v increases the output verbosity and documents every step. For example, when targeting the folder examples we get the following output

pdf2bib examples -v
[pdf2bib]: Looking for pdf files in the folder examples...
[pdf2bib]: Found 4 pdf files.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\1-s2.0-0021999186900938-main.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1016/0021-9991(86)90093-8 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1016/0021-9991(86)90093-8 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\chaumet_JAP_07.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\PhysRevLett.116.061102.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1103/PhysRevLett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/PhysRevLett.116.061102 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: examples\s41586-019-1666-5.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/doi'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf2bib]: ................
@article{jordan1986an,
        title = {An efficient numerical evaluation of the Green's function for the Helmholtz operator on periodic structures},
        volume = {63},
        issue = {1},
        page = {222-235},
        publisher = {Elsevier BV},
        url = {http://dx.doi.org/10.1016/0021-9991(86)90093-8},
        doi = {10.1016/0021-9991(86)90093-8},
        journal = {Journal of Computational Physics},
        year = {1986},
        month = {3},
        author = {Kirk E Jordan and Gerard R Richter and Ping Sheng}
}
@article{chaumet2007coupled,
        title = {Coupled dipole method to compute optical torque: Application to a micropropeller},
        volume = {101},
        issue = {2},
        page = {023106},
        publisher = {AIP Publishing},
        url = {http://dx.doi.org/10.1063/1.2409490},
        doi = {10.1063/1.2409490},
        journal = {Journal of Applied Physics},
        year = {2007},
        month = {1},
        author = {Patrick C. Chaumet and C. Billaudeau}
}
@article{2016observation,
        title = {Observation of Gravitational Waves from a Binary Black Hole Merger},
        volume = {116},
        issue = {6},
        publisher = {American Physical Society (APS)},
        url = {http://dx.doi.org/10.1103/PhysRevLett.116.061102},
        doi = {10.1103/physrevlett.116.061102},
        journal = {Physical Review Letters},
        year = {
