Molminer
Python library and command-line tool for extracting compounds from scientific literature. Written in Python.
Repository has been archived
Due to lack of time for this project, the repository has been archived and is now read-only.
Most of the issues seem to come from broken dependencies in the conda recipe,
such as libtiff reported here.
The molminer conda package contains all needed shared libraries, and so no extra packages should be installed
into a virtual environment, as conflicts may arise.
If there are problems after installation of the molminer package, such as segmentation fault or illegal instruction,
you can check conda list output and try to uninstall some library packages (lib*) with
conda uninstall <library package> --force.
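To find candidate library packages, the text printed by conda list can be filtered for names starting with lib. The following is a minimal sketch; the helper name find_lib_packages is hypothetical, and the sample text (package names, versions, builds, paths) is invented for illustration and is not real output from a MolMiner environment.

```python
def find_lib_packages(conda_list_output):
    """Return package names starting with 'lib' from `conda list` text output."""
    names = []
    for line in conda_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines that conda prints.
        if not line or line.startswith("#"):
            continue
        # The first whitespace-separated column is the package name.
        name = line.split()[0]
        if name.startswith("lib"):
            names.append(name)
    return names


# Invented sample; real input would come from running `conda list`.
SAMPLE = """\
# packages in environment at /path/to/envs/my_new_env:
#
# Name                    Version                   Build  Channel
libtiff                   1.0                           0
molminer                  1.0                           0    jirinovo
libpng                    1.0                           0
"""

# Print the uninstall command for each candidate instead of running it.
for package in find_lib_packages(SAMPLE):
    print(f"conda uninstall {package} --force")
```

The sketch only prints the commands, so each candidate can be reviewed before anything is removed.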
This is probably against the philosophy of conda, where dependencies such as shared libraries should be
installed as separate packages, but OSRA compilation was so complicated that
at the time I decided to keep all dependencies in one package.
Besides the "third-party" shared libraries, OSRA in the conda package is linked against a quite old glibc to ensure compatibility with older distros (I can't remember exactly, but I think the compilation was done under Ubuntu 12).
Also, the included tools (ChemSpot, OPSIN and Python libraries) have probably been updated over time,
and it is definitely possible to use their newer versions (see installing from source).
You can also try to compile a newer version of OSRA,
as there are notes (1,
2) I have made from my experience
(however, there might be different requirements in the new version).
I hope molminer can still serve well :slightly_smiling_face:
MolMiner
MolMiner is a library and command-line interface for extracting compounds (called "chemical entities") from scientific literature. It extracts chemical entities both from text (Chemical Named Entity Recognition) and from 2D structures (Optical Chemical Structure Recognition). It's written in Python (currently supporting only Python 3). It should work on all platforms, but some dependencies are very hard to compile on Windows. It is essentially a wrapper around several open-source tools for chemical information retrieval, namely [ChemSpot][1], [OSRA][2] and [OPSIN][3], using their command-line interfaces and adding some extended functionality.
Overview
MolMiner is able to extract chemical entities from scientific literature in various formats including PDF and scanned images. It extracts entities both from text and from 2D structures. Text is normalized using part of the code from ChemDataExtractor. Text entities are assigned by [ChemSpot][1] to one of the following classes: "SYSTEMATIC", "IDENTIFIER", "FORMULA", "TRIVIAL", "ABBREVIATION", "FAMILY", "MULTIPLE". IUPAC names are converted to a computer-readable format such as SMILES or InChI with [OPSIN][3]. 2D structures are recognised in the document and converted to a computer-readable format with [OSRA][2]. Entities successfully converted to a computer-readable format are standardized using the MolVS library. Entities are also annotated in the PubChem and ChemSpider databases using PubChemPy and ChemSpiPy. [GraphicsMagick][4] is used for processing PDF files and [Tesseract][5] for OCR.
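The entity classes listed above can be modelled as an enumeration, which makes it easy to group extracted text entities by class. This is only an illustrative sketch: EntityClass and group_by_class are hypothetical names that are not part of MolMiner, and the example entities were labelled by hand rather than by ChemSpot.

```python
from collections import defaultdict
from enum import Enum


class EntityClass(Enum):
    """The ChemSpot classes named in this README."""
    SYSTEMATIC = "SYSTEMATIC"
    IDENTIFIER = "IDENTIFIER"
    FORMULA = "FORMULA"
    TRIVIAL = "TRIVIAL"
    ABBREVIATION = "ABBREVIATION"
    FAMILY = "FAMILY"
    MULTIPLE = "MULTIPLE"


def group_by_class(entities):
    """Group (text, class_name) pairs into {EntityClass: [text, ...]}."""
    groups = defaultdict(list)
    for text, class_name in entities:
        # EntityClass(...) raises ValueError for an unknown class name.
        groups[EntityClass(class_name)].append(text)
    return dict(groups)


# Hand-labelled examples for illustration; not actual ChemSpot output.
entities = [
    ("propan-2-ol", "SYSTEMATIC"),
    ("aspirin", "TRIVIAL"),
    ("C2H6O", "FORMULA"),
    ("DMSO", "ABBREVIATION"),
]
groups = group_by_class(entities)
print(groups[EntityClass.SYSTEMATIC])
```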
Installation
MolMiner itself is written in Python, but it uses several binaries, some of which have complicated compilation dependencies. The easiest way is therefore to install MolMiner together with its dependencies as a conda package hosted on Anaconda Cloud.
To install MolMiner without dependencies, just download this repository and run $ python setup.py install. MolMiner will then be available from the shell as molminer and also as a Python library.
Conda package (currently only for linux64)
[Conda][6] is a package, dependency and environment manager for any language, including Python. The MolMiner package includes precompiled dependencies and data files. It also manages all the needed environment variables and enables bash auto-completion.
- Download and install conda.
- Add channels:
  $ conda config --add channels rdkit; conda config --add channels bioconda; conda config --add channels jirinovo; conda config --add channels conda-forge
- Create new virtual environment and install MolMiner:
  $ conda create -n my_new_env molminer
- Activate environment:
  $ source activate my_new_env
- Use MolMiner:
  $ molminer --help
Note that you must always activate the virtual environment before using MolMiner, because the activation script also modifies the environment variables that store the paths to MolMiner data files.
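A quick way to see whether the data-path variables are visible in the current shell is to check them from Python. The sketch below assumes the variable names that this README gives in its "From source" section (OSRA_DATA_PATH, CHEMSPOT_DATA_PATH, TESSDATA_PREFIX); whether the conda activation script sets exactly these names is an assumption, and missing_variables is a hypothetical helper.

```python
import os

# Names taken from the "From source" section of this README. That the conda
# activation script sets exactly these variables is an assumption.
DATA_PATH_VARIABLES = ("OSRA_DATA_PATH", "CHEMSPOT_DATA_PATH", "TESSDATA_PREFIX")


def missing_variables(environ=None, names=DATA_PATH_VARIABLES):
    """Return the variable names that are unset or empty in `environ`."""
    if environ is None:
        environ = os.environ
    return [name for name in names if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Not set (is the environment activated?):", ", ".join(missing))
else:
    print("All data-path variables are set.")
```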
From source (linux)
Binaries
You need all of these binaries for MolMiner. They should be installed in a directory listed in the PATH environment variable (such as /usr/local/bin). I haven't tried to compile these dependencies on Windows, but that doesn't mean it's impossible.
- [OSRA][2]. This is probably the most complicated binary. Official information is here and here. My installation notes are here.
  - Compile GraphicsMagick with as many supported image formats as possible (dependencies). It's also used for converting PDF to images and for image editing/transformation.
  - Use Tesseract version 4 and up.
  - A patched version of OpenBabel is needed.
  - Put the OSRA data files (spelling.txt, superatom.txt) in some directory and add this directory to the OSRA_DATA_PATH environment variable.
- [ChemSpot][1]. Just download it and:
  - Put the ChemSpot JAR file in a directory accessible from PATH and rename it to chemspot.jar.
  - Also put this bash script there. It's used for running ChemSpot. Its first argument is the maximum amount of memory for the ChemSpot process. Subsequent arguments are forwarded to the ChemSpot CLI.
  - Put the ChemSpot data files (dict.zip, ids.zip, multiclass.bin) in some directory and add this directory to the CHEMSPOT_DATA_PATH environment variable.
- [OPSIN][3]. Just download it and:
  - Put the OPSIN JAR file in a directory accessible from PATH and rename it to opsin.jar.
  - Also put this bash script there. It's used for running OPSIN. All arguments are forwarded to the OPSIN CLI.
- [GraphicsMagick][4]. OSRA needs it for compilation, but its binary is also directly used by MolMiner. Compile it with as many supported image formats as possible (dependencies).
- [Tesseract][5]. OSRA needs it for compilation, but its binary is also directly used by MolMiner. Use version 4 and up.
  - Tesseract needs language data files. Download them here, put them in some directory and add this directory to the TESSDATA_PREFIX environment variable.
- poppler-utils. Utils for PDF files built on top of Poppler library.
  - Ubuntu (or any OS with apt packaging): $ sudo apt-get install poppler-utils
- libmagic. Reads the magic bytes of a file and determines its MIME type.
  - Ubuntu (or any OS with apt packaging): $ sudo apt-get install libmagic1 libmagic-dev
- OpenJDK. Java runtime environment. Installation.
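The ChemSpot and OPSIN launcher scripts described in the list above are not reproduced here, but the behaviour they are described as having (an optional memory limit, then arguments forwarded to the tool) can be sketched as building a java command line. This is an assumption about what such a launcher does, not the content of the actual scripts; build_java_command is a hypothetical helper and "ARG" stands in for real tool arguments. The -Xmx option is the standard JVM setting for maximum heap size.

```python
def build_java_command(jar_path, args, max_memory=None):
    """Build an argument list that runs a JAR, optionally capping the heap."""
    command = ["java"]
    if max_memory:
        # -Xmx sets the maximum JVM heap size, for example -Xmx8g.
        command.append(f"-Xmx{max_memory}")
    # Everything after the JAR path is passed through to the tool unchanged.
    command += ["-jar", jar_path, *args]
    return command


# ChemSpot-style call: the first value is the memory limit, the rest is forwarded.
print(build_java_command("chemspot.jar", ["ARG"], max_memory="8g"))

# OPSIN-style call: all arguments are forwarded, no memory limit.
print(build_java_command("opsin.jar", ["ARG"]))
```

Note that java -jar takes a file path and does not search PATH for the JAR, so a real launcher would have to pass the full path to the JAR file.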
Paths to data files can also be specified in both the MolMiner CLI and the library, but defining them in environment variables is the easiest way.
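One common way to combine the two mechanisms is to prefer an explicitly given path and fall back to the environment variable. The sketch below illustrates that pattern only; the precedence order is an assumption and is not confirmed for MolMiner by this README, and resolve_data_path is a hypothetical helper.

```python
import os


def resolve_data_path(explicit_path, env_variable, environ=None):
    """Return the explicit path if given, else the value of the env variable."""
    if environ is None:
        environ = os.environ
    if explicit_path:
        return explicit_path
    value = environ.get(env_variable)
    if value:
        return value
    raise ValueError(
        f"No data path given and {env_variable} is not set in the environment"
    )


# Example with a stand-in environment instead of the real os.environ.
fake_environ = {"OSRA_DATA_PATH": "/opt/osra/data"}
print(resolve_data_path(None, "OSRA_DATA_PATH", fake_environ))
print(resolve_data_path("/tmp/osra-data", "OSRA_DATA_PATH", fake_environ))
```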
Python dependencies
Dependencies listed in setup.py will be installed automatically when you run $ python setup.py install. Unfortunately, RDKit is a complicated dependency, and it's best to install it as a conda package.
Usage
- Basic syntax is:
  $ molminer COMMAND [OPTIONS] [ARGS]
- MolMiner has four commands (you can view them with $ molminer --help):
  - ocsr: Extract 2D structures with OSRA. OCSR stands for Optical Chemical Structure Recognition.
  - ner: Extract textual chemical entities with ChemSpot. NER stands for Named Entity Recognition.
  - convert: Convert IUPAC names to computer-readable format with OPSIN.
  - extract: Combine all the previous commands.
- To view the options of a command, use:
  $ molminer COMMAND --help
- Bash auto-completion is automatically available when MolMiner is installed through conda and the virtual environment is activated. You can then double-press the TAB key to show MolMiner commands and options: $ molminer <TAB><TAB> to see commands and $ molminer ocsr --<TAB><TAB> to see options.
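When calling the CLI from a Python script, the same COMMAND [OPTIONS] [ARGS] shape can be assembled as an argument list. The sketch below only builds and prints the command line; build_molminer_command is a hypothetical helper, and the only option it uses is --help, which is the one shown in this README.

```python
import shlex

# The four commands listed in this README.
COMMANDS = ("ocsr", "ner", "convert", "extract")


def build_molminer_command(command, *extra):
    """Return an argv list of the form: molminer COMMAND [OPTIONS] [ARGS]."""
    if command not in COMMANDS:
        raise ValueError(f"Unknown command {command!r}, expected one of {COMMANDS}")
    return ["molminer", command, *extra]


argv = build_molminer_command("ocsr", "--help")
# shlex.join renders the list as a shell-safe string for display.
print(shlex.join(argv))
```

With MolMiner installed and the environment activated, the list could be passed to subprocess.run(argv, check=True) to actually run the command.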
Input
- Input can be a single PDF, image or text file. The type of input file is determined automatically, but you can specify it with the -i [pdf|pdf_scan|image|text] option (the text value is of course not supported by OSRA, resp. `ocs
