This repository contains source code for searching COVID-19-related papers in the COVID-19 Open Research Dataset (CORD-19). It also provides a solution to the tasks in the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle. Update: 2020-04-14.
- Supports multiple bag-of-words models (count, tf-idf, bm25).
- Supports semantic search models such as fasttext and glove.
- Allows combining the two types of models above.
- Provides a live web application with models that can be customized for end users.
git clone https://github.com/wangcongcong123/covidsearch.git
cd covidsearch
pip install -e .
import time

from cord import *
# Make sure the paper collections (four .tar.gz files) and the metadata CSV file are under dataset_folder
dataset_folder = "dataset/"
# load metadata and full texts of papers
metadata = load_metadata_papers(dataset_folder, "metadata.csv")
full_papers = load_full_papers(dataset_folder)
# full_input_instances includes title, abstract, and body text
full_input_instances = [(id_, metadata[id_]["title"], metadata[id_]["abstract"], body)
                        for id_, body in full_papers.items() if id_ in metadata]
tfidf_model = FullTextModel(full_input_instances, weights=[3, 2, 1], vectorizer_type="tfidf")
query = "covid-19 transmission characteristics"
top_k = 10
start = time.time()
results = tfidf_model.query(query, top_k=top_k)
print("Query time: ", time.time() - start)
# Around 0.3 s on re-runs (the first run takes longer because of object serialisation)
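How `FullTextModel` applies `weights=[3, 2, 1]` is not described here. One plausible reading, which is an assumption and not the repository's actual implementation, is that title, abstract, and body are scored separately and summed with those weights. The self-contained sketch below illustrates that idea with a plain tf-idf and cosine similarity; `tokenize`, `tfidf_vectors`, `cosine`, and `weighted_field_search` are hypothetical names introduced only for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokenization; a real system would do more.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (sparse tf-idf dicts, idf dict) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, so terms that occur in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_field_search(instances, query, weights=(3, 2, 1), top_k=10):
    """Rank (id, title, abstract, body) tuples by a weighted sum of per-field scores.

    Assumption: weights apply to title, abstract, and body in that order.
    """
    scores = defaultdict(float)
    query_counts = Counter(tokenize(query))
    for field_index, weight in enumerate(weights, start=1):
        docs = [tokenize(inst[field_index]) for inst in instances]
        vectors, idf = tfidf_vectors(docs)
        # Query terms unseen in this field contribute nothing.
        query_vec = {t: tf * idf[t] for t, tf in query_counts.items() if t in idf}
        for inst, vec in zip(instances, vectors):
            scores[inst[0]] += weight * cosine(query_vec, vec)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


toy_instances = [
    ("a", "covid transmission", "study of spread", "long body text here"),
    ("b", "vaccine trial", "immune response", "notes on transmission in body text"),
    ("c", "unrelated", "nothing", "other words"),
]
toy_results = weighted_field_search(toy_instances, "transmission", top_k=3)
```

With these toy papers, a match in the title outranks a match that appears only in the body, which is the effect a 3:2:1 weighting is presumably meant to produce.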
- Bag-of-words search: includes count, tf-idf, and bm25 (examples/full_text_run.py).
- Embedding-based search: includes fasttext and glove (examples/embedding_run.py).
- Model combinations: combines the two types of models above (examples/ensemble_run.py).
- Pre-train insights: pre-trains insights based on the Kaggle tasks (examples/insight_from_scratch.py).
- Insights extraction: loads pre-trained insights for the Kaggle tasks (examples/insight_extract.py).
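To make the idea of combining a bag-of-words model with an embedding model concrete, here is a minimal sketch of one common approach: rescale each model's scores to the range [0, 1] and take a weighted average. This is an illustration under stated assumptions and not necessarily what examples/ensemble_run.py does; `min_max`, `combine_scores`, and the `alpha` parameter are hypothetical names.

```python
def min_max(scores):
    """Rescale a {paper_id: score} dict to [0, 1]; constant inputs map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def combine_scores(bow_scores, emb_scores, alpha=0.5, top_k=10):
    """Weighted average of two normalized score dicts.

    alpha is the weight of the bag-of-words scores; 1 - alpha goes to the
    embedding scores. A paper missing from one model scores 0 for that model.
    """
    bow, emb = min_max(bow_scores), min_max(emb_scores)
    combined = {
        paper_id: alpha * bow.get(paper_id, 0.0) + (1 - alpha) * emb.get(paper_id, 0.0)
        for paper_id in set(bow) | set(emb)
    }
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy scores on deliberately different scales (e.g. bm25 vs. cosine similarity).
bow_scores = {"p1": 12.0, "p2": 3.0, "p3": 0.0}
emb_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.1}
balanced = combine_scores(bow_scores, emb_scores, alpha=0.5)
bow_only = combine_scores(bow_scores, emb_scores, alpha=1.0)
```

Normalizing first matters because raw bag-of-words scores and cosine similarities live on different scales; without it, one model would dominate regardless of `alpha`.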
Run python examples/insight_extract.py to load and display a pre-trained insights file. If you do not want to use the pre-trained insights, you can pre-train them from scratch with python examples/insight_from_scratch.py (have a look at that file to customize the pre-training process).
The web application demonstrates only the pre-trained insights as an example. For customisation (such as query search), edit app.py and templates/layout.html. Make sure you first download metadata.csv from the CORD-19 dataset and put it under ./dataset, then run:
python app.py
Then open http://127.0.0.1:5000 in your browser to see the web application.
- The server also accepts cross-origin requests.
- Send a GET or POST request to obtain insights by task name.
- A GET request example:
http://127.0.0.1:5000/kaggle_task?task_name=task1
- A POST request example:
curl -i -X POST -H "Content-Type: application/json" -d "{\"task_name\":\"task1\"}" http://127.0.0.1:5000/kaggle_task
- Adapt these to Ajax GET/POST requests if you want to embed the insights in your own front-end HTML pages.
- Try the live one: https://www.thinkingso.cf/kaggle_task?task_name=task1
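The same GET and POST requests can also be issued from a script. The sketch below builds them with Python's standard library, assuming the local server from this README is running; the helper names `build_get_request`, `build_post_request`, and `fetch` are hypothetical. The response format is not documented here, so `fetch` simply returns the body as text.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint of the locally running app (python app.py), as given in this README.
BASE_URL = "http://127.0.0.1:5000/kaggle_task"


def build_get_request(task_name, base_url=BASE_URL):
    """GET request with the task name passed as a query parameter."""
    return Request(f"{base_url}?{urlencode({'task_name': task_name})}", method="GET")


def build_post_request(task_name, base_url=BASE_URL):
    """POST request with the task name passed as a JSON body."""
    body = json.dumps({"task_name": task_name}).encode("utf-8")
    return Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


get_request = build_get_request("task1")
post_request = build_post_request("task1")
# Uncomment once the server is running:
# print(fetch(get_request))
# print(fetch(post_request))
```

Building the requests separately from sending them makes it easy to inspect the URL and payload before any network call is made.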
Feedback and pull requests are welcome and will help improve the project.