A question answering system for biomedical literature.
- Java 1.8+
- Gradle
- An Elasticsearch endpoint with indexed PubMed abstracts
The datasets generated and/or analysed during the current study are available in the Zenodo repository.
This includes the learned GloVe vectors, the vocabulary, the named entity databases, and the nominalization, acronym, and morphology databases.
The system is configured through a configuration file. An example configuration file is provided in the project at `src/main/resources/bio-answerfinder.properties.example`.
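A minimal sketch of a working properties file. Only the Elasticsearch endpoint key is documented in this README (see the PubMed index generation steps near the end); the value shown is the local-endpoint example used there, and all remaining keys should be taken from the provided example file:

```properties
# Elasticsearch endpoint serving the indexed PubMed abstracts
# (see the index generation steps later in this README)
elasticsearch.url=http://localhost:9200/pubmed/abstracts
```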
The code is written in Java (1.8+) and uses Gradle to build the project.
Named entity databases used by the `LookupUtils` and `LookupUtils2` classes:
- Drugs: `/usr/local/lt/drug_bank.db`
- Diseases: `/usr/local/lt/diseases.db`
- Gene names: `/usr/local/lt/hugo.db`
- SciGraph ontology lookup:
  - `data/scigraph_lt/ontology-classes-with-labels-synonyms-parents.json`
  - `data/scigraph_lt/scr-classes-with-labels-synonyms-parents.json`
- Nominalization database: `/usr/local/lt/nominalization.db`
- Acronym database: `/usr/local/lt/acronyms.db`
- Morphology database used by `org.bio_answerfinder.nlp.morph.Lemmanizer` and `org.bio_answerfinder.util.SRLUtils`: `/usr/local/morph/morph.db`
Python 3 code for the keyword selection classifier and for BERT fine-tuning is in the `scripts` directory of the project.
You need TensorFlow (1.12+) and Keras installed in a Python virtual environment.
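As a quick sanity check of the virtual environment (a sketch assuming only the TensorFlow 1.12+ and Keras requirement stated above):

```python
# Confirm the environment satisfies the stated requirements:
# TensorFlow 1.12+ (1.x) with Keras importable.
import tensorflow as tf
import keras

print("TensorFlow:", tf.__version__)  # expect >= 1.12 and < 2.0
print("Keras:", keras.__version__)
```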
You need to clone the Google BERT GitHub project:

git clone https://github.com/google-research/bert

After that, follow the instructions on the BERT GitHub site to download the BERT-Base model `uncased_L-12_H-768_A-12`. Similarly, download the BioBERT model from https://github.com/naver/biobert-pretrained/releases/tag/v1.1-pubmed.
Copy the `run_bioasq_classifier.py`, `train_bioasq.sh`, `train_biobert_bioasq.sh`, and `predict_biobert_bioasq.sh` scripts to the Google BERT clone directory and update the environment variables in the driver Bash shell scripts to match your project and model installation directories.
All the annotated datasets generated during this study are under the project directory `data`.
To be used by the Python 3 script `scripts/question_keyword_selector_v2.py`:
- data/bioasq/bioasq_manual_100/qsc/qsc_set_train.txt
- data/bioasq/bioasq_manual_100/qsc/qsc_set_test.txt
- data/bioasq/bioasq_manual_100/qsc/qsc_set_pos_tags_train.txt
- data/bioasq/bioasq_manual_100/qsc/qsc_set_pos_tags_test.txt
Used for BERT fine-tuning. Note that there is no separate development set: dev.tsv and test.tsv are identical; dev.tsv is present only because the Google BERT code expects it.
- data/bioasq/bioasq_manual_100/train.tsv
- data/bioasq/bioasq_manual_100/dev.tsv
- data/bioasq/bioasq_manual_100/test.tsv
- data/bioasq/bioasq_manual_100/qaengine1/question_results_wmd_defn_focus.txt
- data/evaluation/annotator_1_bert_rank.csv
- data/evaluation/annotator_1_rwmd_rank.csv
- data/evaluation/annotator_2_bert_rank.csv
- data/evaluation/annotator_2_rwmd_rank.csv
- data/evaluation/annotator_3_bert_rank.csv
- data/evaluation/annotator_3_rwmd_rank.csv
- data/evaluation/rwmd_question_answer_candidates.csv
- data/evaluation/bert_question_answer_candidates.csv
To be used by the Python 3 script `scripts/show_perfomance.py`:
- data/evaluation/bert_biobert_comparison.csv
- data/evaluation/biobert_perf_records.json
Datasets for the following paper are also included: Ibrahim Burak Ozyurt and Jeffrey S. Grethe, "Iterative Document Retrieval via Deep Learning Approaches for Biomedical Question Answering," 15th International Conference on eScience (2019), doi: 10.1109/eScience.2019.00072.
- data/rank_test (training/testing datasets for NN models)
- data/rank_annotations (curator rankings for each method tested)
- data/evaluation/bert_factoid_test_perf.csv
- data/evaluation/rank_test_factoid_perf.csv
Due to PubMed license restrictions, the PubMed abstracts Elasticsearch index cannot be provided. It can be generated using our ETL system Foundry, available on GitHub at https://github.com/biocaddie/Foundry-ES.
- You need about 500 GB of available disk space for the processing.
- Install Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).
- Download the PubMed baseline from the PubMed FTP site ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline (all files ending with `xml.gz`).
- The gzipped XML files contain multiple abstract records, which need to be expanded and parsed. The top element of each file is `PubmedArticleSet`; each abstract record starts with the XML tag `PubmedArticle`.
- From each `PubmedArticle` tag, extract the PMID, title, and abstract text from the following nested XML tags:
  - PMID: `MedlineCitation/PMID`
  - title: `MedlineCitation/Article/ArticleTitle`
  - abstract: `MedlineCitation/Article/Abstract/AbstractText`
- To index the PMID, title, and abstract text of each PubMed abstract into Elasticsearch, generate a JSON object of the following form:

  ```json
  {
    "dc": {
      "identifier": "<PMID>",
      "title": "<title>",
      "description": "<description>"
    }
  }
  ```
- The JSON objects for the PubMed abstracts can be indexed using the Elasticsearch bulk indexing web API (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html); a minimal end-to-end sketch follows this list.
- Assuming the index is generated at the local ES endpoint http://localhost:9200/pubmed/abstracts, update your `bio-answerfinder.properties` file to include the line `elasticsearch.url=http://localhost:9200/pubmed/abstracts`.
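The parsing and bulk indexing steps above can be sketched in a short Python script. This is a minimal illustration under stated assumptions, not the Foundry-ES pipeline: the `baseline/*.xml.gz` location, the use of the `requests` library, and the batch size of 500 are assumptions; the `pubmed` index and `abstracts` type names are taken from the endpoint example above, and only the first `AbstractText` element of multi-section abstracts is kept.

```python
"""Minimal PubMed-baseline-to-Elasticsearch loader (illustrative sketch)."""
import glob
import gzip
import json
import xml.etree.ElementTree as ET

import requests  # assumed dependency; any HTTP client works

ES_BULK_URL = "http://localhost:9200/_bulk"


def text_of(parent, path):
    """Full text content of the first element at `path`, or None."""
    el = parent.find(path)
    return "".join(el.itertext()).strip() if el is not None else None


def iter_abstracts(path):
    """Yield (pmid, title, abstract) from one gzipped baseline file.

    Loads the whole file into memory; for tighter memory use, switch to
    xml.etree.ElementTree.iterparse. Only the first AbstractText element
    of multi-section abstracts is kept.
    """
    with gzip.open(path, "rb") as f:
        root = ET.parse(f).getroot()  # top element: PubmedArticleSet
    for article in root.iter("PubmedArticle"):
        pmid = text_of(article, "MedlineCitation/PMID")
        title = text_of(article, "MedlineCitation/Article/ArticleTitle")
        abstract = text_of(
            article, "MedlineCitation/Article/Abstract/AbstractText")
        if pmid and title and abstract:
            yield pmid, title, abstract


def bulk_lines(pmid, title, abstract):
    """Build the two newline-delimited JSON lines the bulk API expects.

    The `_type` field matches the pre-7.x style /pubmed/abstracts endpoint
    used in this README; drop it on Elasticsearch 7+.
    """
    action = {"index": {"_index": "pubmed", "_type": "abstracts", "_id": pmid}}
    doc = {"dc": {"identifier": pmid, "title": title, "description": abstract}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


def flush(batch):
    """POST one newline-delimited JSON batch to the bulk endpoint."""
    resp = requests.post(ES_BULK_URL, data="".join(batch).encode("utf-8"),
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()


if __name__ == "__main__":
    batch = []
    for path in sorted(glob.glob("baseline/*.xml.gz")):  # assumed location
        for record in iter_abstracts(path):
            batch.append(bulk_lines(*record))
            if len(batch) >= 500:
                flush(batch)
                batch = []
    if batch:
        flush(batch)
```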
To build the web application archive, run

gradle clean war

Then deploy it to Tomcat or any other Java web app container.
For any questions, please contact iozyurt@ucsd.edu.