The CancerMine resource is a text-mined knowledgebase of drivers, oncogenes and tumor suppressors in cancer. Abstracts from PubMed and full-text articles from PubMed Central Open Access subset and Author Manuscript Collections are processed to find references to genes as drivers, oncogenes and tumor suppressors in different cancer types.
CancerMine is an automatically updated dataset. You can navigate the data using the web viewer or you can download the latest data from Zenodo (or through the web viewer). You will likely not need to run any of the code in this repository.
This is a Python 3 project that has been tested on CentOS 6/7 but should work on other Linux operating systems and macOS. An individual process can be run on a laptop or desktop computer, but processing all of the literature (PubMed, etc.) should really be done on a cluster or server-like machine. Clusters that use Slurm or the Sun Grid Engine (SGE) are supported. Each node needs only 4 GB of RAM.
This project relies on text mining using Kindred and Snakemake. These can be installed through pip.
It uses biomedical text converted using BioText.
You can clone this repo using Git or download the ZIP file of it.
git clone https://github.com/jakelever/cancermine.git
It uses the BioText project which downloads and converts the biomedical research literature into the BioC XML format. You'll need to clone it and run the conversion scripts.
The code dependencies can be installed with the commands below. Remember to install the English language model for spaCy.
pip install kindred snakemake zenodo_get
python -m spacy download en
Installation should take a maximum of 15 minutes (mostly due to installing spaCy and its language models).
We include example input and the expected output data for a small test run in the exampledata/ directory. To run a small test run of the scripts you can follow the steps below. Alternatively, all these commands are in the demoRun.sh file which can be executed independently (after installing dependencies). This file is run during the TravisCI test. This should only take six minutes.
First you need to build the machine learning models. This will extract the training data and build the necessary models from it.
sh buildModelsIfNeeded.sh
Then you need to process the input wordlists to get them ready for quick access. This generates various data structures that are stored in a Python pickle.
python wordlistLoader.py --genes exampledata/mini_terms_genes.tsv --cancers exampledata/mini_terms_cancers.tsv --drugs exampledata/mini_terms_drugs.tsv --conflicting exampledata/mini_terms_conflicting.tsv --wordlistPickle exampledata/mini_terms.pickle
There is a small test input file (exampledata/input.bioc.xml). It's in BioC XML format, which is a format for biomedical corpora. There are also mini wordlists for test usage, which are tiny subsets of the wordlists from the BioWordlists project used for this. You can run the relation extraction process with the commands below.
# Find sentences that contain cancer types and gene names and filter using the terms in the filterTerms.txt
python parseAndFindEntities.py --biocFile exampledata/input.bioc.xml --filterTerms filterTerms.txt --wordlistPickle exampledata/mini_terms.pickle --outSentencesFilename exampledata/intermediate_sentences.json
# Apply the machine learning models to identify actual mentions of drivers, oncogenes and tumor suppressors
python applyModelsToSentences.py --models models/cancermine.driver.model,models/cancermine.oncogene.model,models/cancermine.tumorsuppressor.model --filterTerms filterTerms.txt --wordlistPickle exampledata/mini_terms.pickle --genes exampledata/mini_terms_genes.tsv --cancerTypes exampledata/mini_terms_cancers.tsv --sentenceFile exampledata/intermediate_sentences.json --outData exampledata/intermediate_relations.tsv
# Add the header to this file
cat header.tsv exampledata/intermediate_relations.tsv > exampledata/out_unfiltered.tsv
Then you can run the filter and collate process on that file using the command below.
python filterAndCollate.py --inUnfiltered exampledata/out_unfiltered.tsv --outCollated exampledata/out_collated.tsv --outSentences exampledata/out_sentences.tsv
To run the full thing, you should use Snakemake. It manages the download of all the inputs outlined below. But first, you should do a test run (which should only last a minute or so):
MODE=test snakemake --cores 1
Then, to do the full run, which may take a long time, run the following with BIOTEXT set to the path of your BioText directory:
MODE=full BIOTEXT=$BIOTEXT snakemake --cores 1
Practically, you'll likely want to use a cluster to parallelize this. Please refer to the Snakemake documentation for information about how to use a cluster.
For uploading the output to Zenodo, this project uses bigzenodo and the submission.json file.
It is not possible to exactly reproduce the results, as new data is constantly being added to PubMed and PMC. The data used in the paper can be downloaded from Zenodo as the June 30th 2018 release.
The text inputs for processing are:
- PubMed abstracts
- PubMed Central Open Access subset (PMCOA) full-text articles
- PubMed Central Author Manuscript Collection (PMCAMC) full-text articles
The text is scanned for references to genes and cancer types. These are based on HUGO gene names and cancer types from the Disease Ontology. These are managed through the BioWordlists project which can be downloaded at https://doi.org/10.5281/zenodo.1286661.
The training data used to build the machine learning models can be found at data/cancermine_corpus.zip. This is stored in BioNLP Shared Task format and has one file per sentence. The raw annotations from the three annotators can be found at data/raw_annotations.
There are three final results files from CancerMine. These are hosted on Zenodo and can also be downloaded through the web viewer. Each file is a tab-delimited file with a header, no comments and no quoting.
You likely want cancermine_collated.tsv if you just want the list of cancer gene roles. If you want the supporting sentences, look at cancermine_sentences.tsv. You can use the matching_id column to connect the two files. If you want to dig further and are okay with a higher false positive rate, look at cancermine_unfiltered.tsv.
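As a rough illustration of connecting the two files, the sketch below groups supporting sentences under each collated row by matching_id, using only the Python standard library. Only the matching_id column is taken from the description above; the other column names (role, gene_normalized, cancer_normalized, sentence) and the tiny inline rows are assumptions made up for the example, so check the real file headers before relying on them.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Read a tab-delimited file with a header row and no quoting."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def group_sentences(collated_rows, sentence_rows):
    """Map each collated row's matching_id to its supporting sentence rows."""
    by_id = defaultdict(list)
    for row in sentence_rows:
        by_id[row["matching_id"]].append(row)
    return {row["matching_id"]: by_id.get(row["matching_id"], [])
            for row in collated_rows}


# Hypothetical miniature stand-ins for cancermine_collated.tsv and
# cancermine_sentences.tsv; every column except matching_id is an assumption.
collated_text = (
    "matching_id\trole\tgene_normalized\tcancer_normalized\n"
    "abc123\tOncogene\tKRAS\tlung cancer\n"
    "def456\tTumor_Suppressor\tTP53\tbreast cancer\n"
)
sentences_text = (
    "matching_id\tsentence\n"
    "abc123\tKRAS acts as an oncogene in lung cancer.\n"
    "abc123\tOncogenic KRAS drives lung cancer.\n"
    "def456\tTP53 is a tumor suppressor in breast cancer.\n"
)

collated = read_tsv(io.StringIO(collated_text))
sentences = read_tsv(io.StringIO(sentences_text))
support = group_sentences(collated, sentences)

# Print each cancer gene role with the number of supporting sentences found.
for row in collated:
    count = len(support[row["matching_id"]])
    print(row["gene_normalized"], row["cancer_normalized"], row["role"], count)
```

To try this on the real downloads, replace the `io.StringIO(...)` handles with `open("cancermine_collated.tsv", newline="", encoding="utf-8")` and the equivalent for the sentences file.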
cancermine_collated.tsv: This contains the cancer gene roles with citation counts supporting them. It contains the normalized cancer and gene names along with IDs for HUGO, Entrez Gene and the Disease Ontology.
cancermine_sentences.tsv: This contains the supporting sentences for the cancer gene roles in the collated file. Each row is a single supporting sentence for one cancer gene role. This file contains information on the source publication (e.g. journal, publication date, etc), the actual sentence and the cancer gene role extracted.
cancermine_unfiltered.tsv: This is the raw output of the applyModelsToSentences.py script across all of PubMed, PubMed Central Open Access and PubMed Central Author Manuscript Collection. It contains every predicted relation with a prediction score above 0.5, so it may contain many false positives. Each row contains information on the publication (e.g. journal, publication date, etc.) along with the sentence and the specific cancer gene role extracted (with HUGO, Entrez Gene and Disease Ontology IDs). This file is further processed to create the other two.
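If you work with the unfiltered file directly, one way to reduce false positives is to apply a stricter score cutoff yourself. The sketch below is only an illustration under stated assumptions: the score column name `predictprob`, the other column names, the sample rows and the 0.9 cutoff are all hypothetical choices for the example, and this is not necessarily the filtering that filterAndCollate.py performs.

```python
import csv
import io


def filter_by_score(handle, score_column, threshold):
    """Yield rows whose prediction score is at least `threshold`."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row[score_column]) >= threshold:
            yield row


# Hypothetical miniature stand-in for cancermine_unfiltered.tsv;
# all column names and values here are assumptions for illustration.
sample = (
    "gene_normalized\tcancer_normalized\trole\tpredictprob\n"
    "KRAS\tlung cancer\tOncogene\t0.97\n"
    "GENE1\tsome cancer\tDriver\t0.55\n"
)

kept = list(filter_by_score(io.StringIO(sample), "predictprob", 0.9))
print([row["gene_normalized"] for row in kept])
```

A higher cutoff keeps fewer, more confident predictions; the right value depends on how many false positives you can tolerate.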
The code in shiny/ is the Shiny code used for the web viewer. If it is helpful, please use the code for your own projects. The list of dependencies is found at the top of the app.R file.
The associated pseudocode.md file contains detailed information about the purpose of the various scripts here.
The code to generate all the figures and text for the paper can be found in paper/. This may be useful for generating an up-to-date version of the plots for a newer version of CancerMine.
- v11 data release: Kindred's EntityRecognizer was changed to use strict string matching instead of token matching, so results are slightly different
- v29 data release: WARNING - this release contained buggy data and missed publications in the Author Manuscript Collection. This followed a conversion from PubRunner to BioText+snakemake that had some issues
- v30 data release: Fixed issues with conversion to BioText+snakemake with an updated Biowordlist dataset
The paper is now published in Nature Methods. The preprint can still be accessed at bioRxiv. It'd be wonderful if you would cite the paper if you use the methods or data set.
@article{lever2019cancermine,
title={Cancer{M}ine: A literature-mined resource for drivers, oncogenes and tumor suppressors in cancer},
author={Lever, Jake and Zhao, Eric Y and Grewal, Jasleen and Jones, Martin R and Jones, Steven JM},
journal={Nature methods},
volume={16},
number={6},
pages={505},
year={2019},
publisher={Nature Publishing Group}
}
If you encounter any problems, please file an issue along with a detailed description.