BioConceptVec: <br><small>creating and evaluating literature-based biomedical concept embeddings on a large scale</small>

Text corpora
Named Entity Recognition (NER) tools
BioConceptVec: embeddings and concept files
Tutorial
Datasets
References
Acknowledgments

Text corpora

<a name="text-corpora"></a> We created BioConceptVec using the entire PubMed corpus. The texts were split and tokenized using NLTK, and all words were lowercased.

Using PubTator to annotate concepts in PubMed

<a name="pubtator"></a> We employed PubTator to annotate biomedical concepts in PubMed. PubTator covers genes, mutations, chemicals, diseases, and cell lines. The trained embeddings contain over 400,000 concepts.

BioConceptVec: embeddings and concept files

<a name="bioconceptvec"></a> We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we provide both the full embedding (which contains concepts and other words) in binary format and a concept-only file in JSON format.

BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
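
As a rough illustration of working with a concept-only file, the sketch below loads a JSON file and ranks concepts by cosine similarity. It assumes the JSON maps each concept ID to a list of floats; the helper names (`load_concept_vectors`, `most_similar`), the toy file, and the concept IDs are all made up for this example and are not part of the release.

```python
import json
import math
import os
import tempfile


def load_concept_vectors(path):
    """Load a concept-only JSON file (assumed layout: concept ID -> list of floats)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, query_id, top_n=3):
    """Rank every other concept by cosine similarity to query_id."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy stand-in for a downloaded concept-only file; IDs and values are invented.
toy = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.9, 0.1, 0.0],
    "concept_c": [0.0, 1.0, 0.0],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_concepts.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump(toy, handle)
    vectors = load_concept_vectors(toy_path)

neighbors = most_similar(vectors, "concept_a", top_n=2)
print(neighbors)
```

For a real file, replace the toy path with the location of a downloaded concept-only file. A pure-Python loop like this is slow over several hundred thousand concepts, so a vectorized library would likely be preferable in practice.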

Tutorial

<a name="pubtator"></a> For a quick start, see this tutorial on how to use BioConceptVec (both the embedding and concept-only files).

Datasets

<a name="dataset"></a> We also make all 9 evaluation datasets publicly available. They cover 4 applications:

Drug-Gene interactions. The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.
Gene-Gene interactions. The 5 gene-gene interaction datasets have the same format as the drug-gene dataset above.
Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: query protein ID, (2) subject: subject protein ID, (3) score: STRING score and (4) label: whether it is a protein-protein interaction.
Drug-Drug interactions. This dataset is from the SemEval-2013 drug-drug interaction task. Please see the details there.
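
To make the drug-gene record format more concrete, here is a minimal sketch of one way such a record could be scored against concept vectors: it counts how often a related gene is more similar to a query concept than an unrelated gene. This is an illustration only and not necessarily the evaluation protocol used in the paper. It assumes `pos_rel_genes` and `neg_rel_genes` are lists of concept IDs; the function `pairwise_ranking_accuracy`, the separate `query_id` argument, and all IDs and vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_ranking_accuracy(instance, vectors, query_id):
    """Fraction of (related, unrelated) gene pairs where the related gene is
    strictly more similar to the query concept. Genes without a vector are skipped."""
    query = vectors[query_id]
    pos = [g for g in instance["pos_rel_genes"] if g in vectors]
    neg = [g for g in instance["neg_rel_genes"] if g in vectors]
    total = len(pos) * len(neg)
    if total == 0:
        return None  # nothing to compare
    wins = sum(
        1
        for p in pos
        for n in neg
        if cosine_similarity(query, vectors[p]) > cosine_similarity(query, vectors[n])
    )
    return wins / total


# Invented record following the four documented fields; all IDs are placeholders.
instance = {
    "ID": "instance_1",
    "num_of_genes": 4,
    "pos_rel_genes": ["gene_p1", "gene_p2"],
    "neg_rel_genes": ["gene_n1", "gene_n2"],
}
# Invented two-dimensional vectors standing in for real embeddings.
vectors = {
    "drug_x": [1.0, 0.0],
    "gene_p1": [0.9, 0.1],
    "gene_p2": [0.5, 0.5],
    "gene_n1": [0.0, 1.0],
    "gene_n2": [0.6, 0.4],
}

accuracy = pairwise_ranking_accuracy(instance, vectors, "drug_x")
print(accuracy)
```

How the query concept is tied to each record (for example, whether the instance ID is itself the drug ID) should be checked against the dataset files before adapting this.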

References

When using our resources, please cite the following paper:

Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.

Acknowledgments

<a name="acknowledgments"></a> This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.
