# OAQA Biomedical Question Answering (BioASQ) System
The OAQA Biomedical Question Answering (BioASQ) System aims to identify relevant documents, concepts, and passages (snippets), and to automatically generate exact answer texts for arbitrary biomedical questions (factoid, list, yes/no). It was the best-performing system in the BioASQ QA Challenges in the factoid and list categories two years in a row, in 2015 and 2016 (see the official results).
The system description papers give the most detail about the design and implementation of the architecture and the algorithms:
- Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and Eric Nyberg. Learning to Answer Biomedical Factoid & List Questions: OAQA at BioASQ 3B. In Proceedings of CLEF 2015 Evaluation Labs and Workshop, 2015. [pdf]
- Zi Yang, Yue Zhou, and Eric Nyberg. Learning to Answer Biomedical Questions: OAQA at BioASQ 4B. In Proceedings of Workshop on Biomedical Natural Language Processing, 2016. [pdf]
Please contact Zi Yang if you have any questions or comments.
## Overview
This system uses the ECD/CSE framework (an extension of the Apache UIMA framework that supports formal, declarative YAML-based descriptors for the space of system and component configurations to be explored during system optimization), the BaseQA type system, and various natural language processing and information retrieval algorithms and tools.
The system employs a three-layered design for both the Java source code and the YAML descriptors:
| Layer | Description |
| --- | --- |
| baseqa | Domain-independent QA components, including the basic input/output definition of a QA pipeline, intermediate data objects, QA evaluation components, and data processing components. [source] [descriptor] |
| bioqa | Biomedical resources that can be used in any biomedical QA task (outside the context of BioASQ). [source] [descriptor] |
| bioasq | BioASQ-specific components, e.g. GoPubMed services. [source] [descriptor] |
Each layer contains packages for each processing step, e.g. preprocessing, question analysis, abstract query generation, document retrieval and reranking, concept retrieval and reranking, passage retrieval, answer type prediction, evidence gathering, and answer generation and ranking. Please refer to the architecture diagrams in the system description papers.
| Workflow | Description | Diagram |
| --- | --- | :---: |
| Phase A | Document, concept, and snippet retrieval | <a href="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-a.png"><img src="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-a.png" width="10%"></a> |
| Phase B (factoid & list) | Exact answer generation for factoid and list questions | <a href="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-b-factlist.png"><img src="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-b-factlist.png" width="10%"></a> |
| Phase B (yes/no) | Answer prediction for yes/no questions | <a href="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-b-yesno.png"><img src="http://www.cs.cmu.edu/~ziy/images/bioasq-phase-b-yesno.png" width="10%"></a> |
We define the following workflow descriptors (i.e. entry points) under bioasq for preprocessing, training, evaluating, and testing Phase A (retrieval tasks) and Phase B (factoid, list, and yes/no answer generation).
| Descriptor | Description |
| --- | --- |
| preprocess-kb-cache | Cache the requests and responses of concept and concept search services |
| preprocess-answer-type-gslabel | Label gold-standard answer types |
| phase-a-train-concept-document | Train document and concept reranking models |
| phase-a-train-snippet | Train snippet reranking models |
| phase-a-evaluate, phase-a-test | Evaluate (using development subset) and test (using test set) retrieval performance |
| phase-b-train-answer-type | Train answer type prediction model for factoid and list questions |
| phase-b-train-answer-score | Train answer scoring model for factoid and list questions |
| phase-b-train-answer-collective-score | Train answer collective scoring model for list questions |
| phase-b-train-yesno | Train yes/no prediction model |
| phase-b-evaluate-factoid-list, phase-b-test-factoid-list | Evaluate (using development subset) and test (using test set) factoid and list QA |
| phase-b-evaluate-yesno, phase-b-test-yesno | Evaluate (using development subset) and test (using test set) yes/no QA |
A workflow descriptor can be executed by the `ECDDriver`, which is configured as the main class of the Maven exec goal, and thus a workflow can be run from the command line with the config specified as the `path.to.the.descriptor` (e.g. `bioasq.phase-a-evaluate`).
The system also depends on other types of resources, including dictionaries, pretrained machine learning models, and service related properties.
## Change Notes
- Update Lucene from version 5.5.1 to 6.2.1, which changes the default similarity.
- Update skr-webapi from version 0.0.4 to 0.0.6, due to an upstream API update to version 2.3.
- Update uts-api from version 0.0.2 to 0.0.3, due to an upstream API update.
- Update the TmTool URL to HTTPS (https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#RESTfulAPIs).
- Bug fixes, including: improved stability to avoid a `ConcurrentModificationException` in `LuceneDocumentScorer` and `ShapeDistanceCollectiveAnswerScorer`; a possible `DuplicateKey` in `LuceneInMemorySentenceRetrievalExecutor`; and retrying when the UTS service fails to obtain a service ticket.
## Setting Up the System
### Prerequisites
This system needs access to external structured and unstructured resources for question answering, and to files for evaluating the system. Due to licensing issues, you may have to obtain these resources or credentials on your own. If you are a CMU OAQA person, please read the internal resource preparation instructions instead.
- Pre-prerequisites. Java 8, Maven 3, Python 2.

- (Recommended) UMLS license/account. The system needs to access the online UMLS services (UTS and MetaMap), which require a UMLS license/account (username, password, email). You can request one from https://uts.nlm.nih.gov//license.html. Otherwise, you need to remove all the `*-uts-*` and `*-metamap-*` steps from the descriptors, which will significantly hurt performance. If you want to increase the system's throughput, you may consider downloading and installing local instances of the UMLS and MetaMap services. Currently, only the Web services are integrated.
- (Recommended) Medline corpus and Lucene index. The system can use either a local Medline index or the GoPubMed Web API to search PubMed. However, we recommend a local index, because the reranking component may send up to hundreds of search requests per question, and a Web API can take a very long time to process a single question.

  - Download `.xml.gz` or `.xml` files from https://www.nlm.nih.gov/databases/download/pubmed_medline.html.
  - (Optional) Check out the `medline-indexer` project.
  - Create a Lucene index using the `StandardAnalyzer`. The index should contain three mandatory fields: `pmid`, `abstractText`, and `articleTitle`. We include example Java code, `MedlineCitationIndexer.java`, that indexes `.xml.gz` or `.xml` files inside a directory.
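The Lucene indexing step can be sketched as follows, using the Lucene 6.2.1 API (the version the system targets). This is a minimal sketch, not the repository's `MedlineCitationIndexer.java`: the index path, class name, and field values are placeholders, and parsing the Medline XML files is elided.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class MedlineIndexSketch {

  public static void main(String[] args) throws Exception {
    // Use the StandardAnalyzer, as required by the retrieval components.
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("medline-index")), config)) {
      // One Lucene document per Medline citation, with the three mandatory fields.
      Document doc = new Document();
      // pmid is an exact-match identifier, so it is not tokenized.
      doc.add(new StringField("pmid", "12345678", Field.Store.YES));
      doc.add(new TextField("articleTitle", "Placeholder title", Field.Store.YES));
      doc.add(new TextField("abstractText", "Placeholder abstract text.", Field.Store.YES));
      writer.addDocument(doc);
    }
  }
}
```

Compiling this sketch requires `lucene-core` and `lucene-analyzers-common` 6.2.1 on the classpath.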
  - Create a sqlite database that has a `pmid2abstract` table with two fields, `pmid` and `abstract`, which is used to fix the section label errors in the provided development set. We include example Java code, `MedlineAbstractStoreBuilder.java`, that builds the sqlite file.
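The sqlite step can be sketched with plain JDBC. This is a minimal sketch, not the repository's `MedlineAbstractStoreBuilder.java`: it assumes a sqlite JDBC driver (e.g. `org.xerial:sqlite-jdbc`) is on the classpath, and the database filename and row values are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class AbstractStoreSketch {

  public static void main(String[] args) throws Exception {
    try (Connection conn =
        DriverManager.getConnection("jdbc:sqlite:medline-abstracts.db")) {
      // The table layout named in the text: two fields, pmid and abstract.
      try (Statement stmt = conn.createStatement()) {
        stmt.executeUpdate(
            "CREATE TABLE IF NOT EXISTS pmid2abstract (pmid TEXT PRIMARY KEY, abstract TEXT)");
      }
      // Insert one (pmid, abstract) row per citation parsed from the Medline files.
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT OR REPLACE INTO pmid2abstract VALUES (?, ?)")) {
        ps.setString(1, "12345678");
        ps.setString(2, "Placeholder abstract text.");
        ps.executeUpdate();
      }
    }
  }
}
```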
- Biomedical ontology dumps and Lucene index. You can skip this step if you don't need relevant concept retrieval, but if you do skip it, please also remove the `concept-retrieval` and `concept-rerank` steps from the descriptors. If you prefer using a local biomedical ontology index (recommended) to the official GoPubMed service
