Using the dataset validator
To facilitate the loading of new studies into its database, cBioPortal provides a set of staging file formats for the various data types. To validate your files, you can use the dataset validator script. This document describes how to install and use the validator script.
If you have a git clone of cBioPortal, the validation script validateData.py can be found in the folder <your_cbioportal_dir>/core/src/main/scripts/import_data_validator
To update your scripts run
git pull
The script runs in Python 2. If you want the script to generate HTML reports (the recommended way to read any validation errors), you will also need to install jinja2 and markupsafe. You can use these commands:
sudo pip2 install jinja2
and (not always necessary):
sudo pip2 install markupsafe
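Before requesting an HTML report, it can help to confirm that the report dependencies are importable by the interpreter that runs the validator. The following is a minimal sketch, not part of the validator itself: the helper name missing_modules is hypothetical, and the import names jinja2 and markupsafe (the import name of the MarkupSafe package that Jinja2 relies on) are assumed to be the ones needed.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names of the two HTML report dependencies.
    missing = missing_modules(["jinja2", "markupsafe"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("HTML report dependencies are importable.")
```

Run it with the same Python interpreter that you use for validateData.py, since packages installed for another interpreter will not be found.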
To run the validator, type the following from the folder <your_cbioportal_dir>/core/src/main/scripts/import_data_validator:
./validateData.py
This will tell you the parameters you can use:
./validateData.py
usage: validateData.py [-h] -s STUDY_DIRECTORY [-u URL_SERVER]
[-html HTML_TABLE] [-v]
validateData.py: error: argument -s/--study_directory is required
-s : point this to the folder where your data files are found
-u : (optional) URL to your local cBioPortal server. When not provided, the default is http://localhost/cbioportal
-html : (optional) name of the HTML report file to be generated. The dependencies mentioned above must be installed first
-v : (optional) verbose, print out all messages. By default, this option is not set.
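If you call the validator from another script, the options above can be assembled into an argument list programmatically. This is a small sketch under stated assumptions: build_validator_command is a hypothetical helper, report.html is a made-up output file name, and only the flags shown in the usage message (-s, -u, -html, -v) are used.

```python
def build_validator_command(study_dir, url=None, html_report=None, verbose=False):
    """Build the argument list for validateData.py from the documented options."""
    command = ["./validateData.py", "-s", study_dir]  # -s is required
    if url is not None:
        command += ["-u", url]  # optional portal URL
    if html_report is not None:
        command += ["-html", html_report]  # optional HTML report file
    if verbose:
        command.append("-v")  # optional verbose output
    return command


if __name__ == "__main__":
    command = build_validator_command(
        "../../../test/scripts/test_data/study_es_0/",
        url="http://localhost:8080/cbioportal",
        html_report="report.html",  # hypothetical file name
        verbose=True,
    )
    # Print the command; it could be passed to subprocess.call(command)
    # from within the import_data_validator folder.
    print(" ".join(command))
```

Keeping the arguments as a list rather than a single string avoids quoting problems when the study folder path contains spaces.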
As an example, you can try it out with one of the test studies found in <your_cbioportal_dir>/core/src/test/scripts/test_data. For example, using the -v option to also see the progress:
./validateData.py -s ../../../test/scripts/test_data/study_es_0/ -u http://localhost:8080/cbioportal -v
Results in:
INFO: -: Requesting genes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting cancertypes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/patients from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/samples from portal at 'http://localhost:8080/cbioportal'
INFO: data_clinical2.txt: Starting validation of file
INFO: data_clinical2.txt: Validation of file complete
INFO: data_methylation_hm27.txt: Starting validation of file
INFO: data_methylation_hm27.txt: Validation of file complete
INFO: data_expression_median.txt: Starting validation of file
INFO: data_expression_median.txt: Validation of file complete
INFO: brca_tcga_pub.maf: Starting validation of file
INFO: brca_tcga_pub.maf: Validation of file complete
INFO: data_CNA.txt: Starting validation of file
INFO: data_CNA.txt: Validation of file complete
INFO: data_log2CNA.txt: Starting validation of file
INFO: data_log2CNA.txt: Validation of file complete
INFO: -: Validating case lists
INFO: -: Validation of case lists complete
INFO: -: Validation complete
Validation of study succeeded.
When using the -html option, you will also get an HTML report of the results for the case above.
Also try out the other test studies (study_es_1 and study_es_3) in <your_cbioportal_dir>/core/src/test/scripts/test_data. For example, using the -v option:
./validateData.py -s ../../../test/scripts/test_data/study_es_1/ -u http://localhost:8080/cbioportal -v
Results in:
INFO: -: Requesting genes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting cancertypes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/patients from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/samples from portal at 'http://localhost:8080/cbioportal'
INFO: data_clinical2.txt: Starting validation of file
INFO: data_clinical2.txt: Validation of file complete
INFO: data_expression_median.txt: Starting validation of file
ERROR: data_expression_median.txt: line 1: column 3: Sample ID not defined in clinical file; found in file: 'TEST2-A1-A0SD-01'
ERROR: data_expression_median.txt: Invalid column header, file cannot be parsed
INFO: -: Validating case lists
ERROR: cases_all.txt: Sample id not defined in clinical file; found in file: 'INVALID-A2-A0T2-01'
INFO: -: Validation of case lists complete
INFO: -: Validation complete
Validation of study failed.
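The console messages above follow a "LEVEL: file: message" layout, so a captured log can be summarized per file with a few lines of Python. This is only a sketch based on the output shown here, not on the validator's internals: the function name errors_by_file is hypothetical, and the regular expression assumes that every message line has exactly this layout.

```python
import re
from collections import defaultdict

# Assumed layout of a message line: "LEVEL: source: message".
LOG_LINE = re.compile(r"^(?P<level>[A-Z]+): (?P<source>[^:]+): (?P<message>.*)$")


def errors_by_file(log_text):
    """Group the messages of all ERROR lines by the file they refer to."""
    errors = defaultdict(list)
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors[match.group("source")].append(match.group("message"))
    return dict(errors)


if __name__ == "__main__":
    sample_log = "\n".join([
        "INFO: data_expression_median.txt: Starting validation of file",
        "ERROR: data_expression_median.txt: Invalid column header, file cannot be parsed",
        "ERROR: cases_all.txt: Sample id not defined in clinical file; found in file: 'INVALID-A2-A0T2-01'",
        "Validation of study failed.",
    ])
    for source, messages in sorted(errors_by_file(sample_log).items()):
        print("%s: %d error(s)" % (source, len(messages)))
```

Lines that do not match the layout, such as the final "Validation of study failed." summary, are ignored.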
With the -html option, the same errors are listed in the HTML report.