BioConceptVec
BioConceptVec: <br><small>creating and evaluating literature-based biomedical concept embeddings on a large scale</small>
Table of contents
- Text corpora
- Named Entity Recognition (NER) tools
- BioConceptVec: embeddings and concept files
- Tutorial
- Datasets
- References
- Acknowledgments
Text corpora
<a name="text-corpora"></a> We created BioConceptVec using the entire PubMed corpus. The texts were split and tokenized with NLTK, and all words were lowercased.
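The preprocessing described above can be sketched roughly as follows. This is only an illustration, not the exact pipeline used to build BioConceptVec: it assumes that "split" means sentence splitting, and it uses NLTK's `PunktSentenceTokenizer` (untrained) and `TreebankWordTokenizer` because they need no extra model download. The function name `preprocess` and the example sentence are made up for this sketch.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Assumption: these two tokenizers stand in for whatever NLTK calls the
# authors actually used; neither requires downloading NLTK data.
_sentence_splitter = PunktSentenceTokenizer()
_word_tokenizer = TreebankWordTokenizer()


def preprocess(text):
    """Split text into sentences, tokenize each one, and lowercase every token."""
    sentences = _sentence_splitter.tokenize(text)
    return [
        [token.lower() for token in _word_tokenizer.tokenize(sentence)]
        for sentence in sentences
    ]


# Hypothetical input text, for illustration only.
example = "BRCA1 is a tumor suppressor gene. Aspirin inhibits COX-2."
for tokens in preprocess(example):
    print(tokens)
```

Lowercasing is applied after tokenization here so that the tokenizer still sees the original capitalization when it looks for sentence boundaries.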
Using PubTator to annotate concepts in PubMed
<a name="pubtator"></a> We employed PubTator to annotate biomedical concepts in PubMed. It covers genes, mutations, chemicals, diseases, and cell lines. The trained embeddings contain over 400,000 concepts.
BioConceptVec: embeddings and concept files
<a name="bioconceptvec"></a> We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we provide both the full embedding (concepts and other words) in binary format and a concept-only file in JSON format.
- BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
- BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
- BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
- BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
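As a rough illustration of working with a concept-only file, the sketch below loads the JSON and ranks concepts by cosine similarity using only the standard library. It assumes the JSON is a single object that maps each concept ID to a list of floats; check the downloaded file before relying on that. The function names and the toy concept IDs are hypothetical, and the binary embedding files need a separate reader that is not covered here.

```python
import json
import math


def load_concept_vectors(path):
    """Load a concept-only file. Assumed layout: {concept_id: [float, ...]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, query_id, top_n=5):
    """Return the top_n (concept_id, similarity) pairs closest to query_id."""
    query = vectors[query_id]
    scored = [
        (concept_id, cosine_similarity(query, vector))
        for concept_id, vector in vectors.items()
        if concept_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors with made-up IDs; real vectors come from load_concept_vectors(...).
toy_vectors = {
    "Gene_A": [1.0, 0.0],
    "Gene_B": [0.9, 0.1],
    "Disease_C": [0.0, 1.0],
}
print(most_similar(toy_vectors, "Gene_A", top_n=2))
```

A pure-Python loop keeps the example free of dependencies, but for files of several hundred megabytes a NumPy matrix of the vectors would be considerably faster.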
Tutorial
<a name="pubtator"></a> For a quick start, see the tutorial on how to use BioConceptVec (for both the embedding and concept-only files).
Datasets
<a name="dataset"></a> We also make all 9 evaluation datasets publicly available. They cover 4 applications:
- Drug-Gene interactions. The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.
- Gene-Gene interactions. The 5 gene-gene interaction datasets have the same format as above.
- Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: the query protein ID, (2) subject: the subject protein ID, (3) score: the STRING score and (4) label: whether the pair is a protein-protein interaction.
- Drug-Drug interactions. This dataset is from the SemEval-2013 drug-drug interaction task. Please see the details there.
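To make the protein-protein interaction fields concrete, here is a minimal sketch that reads such a file into typed records. The on-disk layout is an assumption: the sketch expects a tab-separated file with a header row naming the four columns and a 0/1 label, which may not match the released files. `PPIRecord`, `read_ppi_records`, and the sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PPIRecord:
    query: str    # query protein ID
    subject: str  # subject protein ID
    score: float  # STRING score
    label: int    # assumed encoding: 1 = interaction, 0 = no interaction


def read_ppi_records(handle, delimiter="\t"):
    """Parse rows with query/subject/score/label columns from an open text file.

    Assumption: the file has a header row and uses the given delimiter.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        PPIRecord(
            query=row["query"],
            subject=row["subject"],
            score=float(row["score"]),
            label=int(row["label"]),
        )
        for row in reader
    ]


# Made-up rows standing in for a train, valid or test file.
sample = io.StringIO(
    "query\tsubject\tscore\tlabel\n"
    "P1\tP2\t912\t1\n"
    "P1\tP3\t150\t0\n"
)
records = read_ppi_records(sample)
positives = [record for record in records if record.label == 1]
print(len(records), len(positives))
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, and adjust the delimiter if the columns are separated differently.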
References
When using our resources, please cite the following paper:
Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.
Acknowledgments
<a name="acknowledgments"></a> This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.
