
Using the dataset validator

Sander de Ridder edited this page Mar 1, 2016 · 21 revisions

To facilitate the loading of new studies into its database, cBioPortal provides a set of staging file formats for the various data types. You can check your files against these formats with the dataset validator script. This document describes how to install and use that script.

Installation

If you have a git clone of cBioPortal, the validation script validateData.py can be found in the folder: <your_cbioportal_dir>/core/src/main/scripts/importer

To update your scripts, run:

git pull

Dependencies

The script runs in Python 2. If you want the script to be able to generate HTML reports (the recommended way to read the validation errors, if any), you will also need to install jinja2 and markupsafe. You can use these commands:

sudo pip2 install jinja2

and (often unnecessary, since pip normally installs markupsafe as a dependency of jinja2):

sudo pip2 install markupsafe
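To confirm that the report dependencies are available before running the validator, a quick import check can help. The following is a minimal sketch, not part of cBioPortal: the helper name `missing_modules` is hypothetical, and the import names are assumed to be `jinja2` and `markupsafe`. Run it with the same Python interpreter that will run the validator.

```python
# Report which of the optional HTML-report dependencies cannot be imported.
# Written to run unchanged on Python 2 and Python 3.


def missing_modules(names):
    """Return the module names from `names` that fail to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names of the two packages needed for HTML reports.
REPORT_DEPENDENCIES = ["jinja2", "markupsafe"]

absent = missing_modules(REPORT_DEPENDENCIES)
if absent:
    print("Missing modules for HTML reports: " + ", ".join(absent))
else:
    print("HTML report dependencies are importable.")
```

If a module is reported as missing, installing it with pip2 as described above should resolve it.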

Running the validation

To run the validator, first go to the importer folder <your_cbioportal_dir>/core/src/main/scripts/importer and then run the following command:

./validateData.py

When run without arguments, the script prints its usage message, which shows the parameters you can use:

./validateData.py
usage: validateData.py [-h] -s STUDY_DIRECTORY [-u URL_SERVER]
                       [-html HTML_TABLE] [-v]
validateData.py: error: argument -s/--study_directory is required
  • -s : (required) path to the folder that contains your data files
  • -u : (optional) URL of the cBioPortal server to validate against. If not provided, the default is http://localhost/cbioportal
  • -html : (optional) name of the HTML report file to generate. Requires the dependencies described above to be installed
  • -v : (optional) verbose mode, which prints all messages. Off by default.
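When the validator is called from another script, the flags listed above can be assembled programmatically. The sketch below is an illustration under stated assumptions rather than part of cBioPortal: the helper names `build_validator_command` and `run_validator` and the report file name `report.html` are hypothetical, and only the flags shown in the usage message are used.

```python
import subprocess


def build_validator_command(study_directory, url_server=None,
                            html_table=None, verbose=False):
    """Assemble the argument list for validateData.py from the documented flags."""
    command = ["./validateData.py", "-s", study_directory]
    if url_server is not None:
        command += ["-u", url_server]      # omit to use the validator's default URL
    if html_table is not None:
        command += ["-html", html_table]   # needs the HTML report dependencies
    if verbose:
        command.append("-v")
    return command


def run_validator(command):
    """Run the command from the importer folder and return its exit status.

    Hypothetical helper: this page does not say how the exit status relates
    to the validation result, so callers should not rely on its meaning.
    """
    return subprocess.call(command)


example = build_validator_command(
    "../../../test/scripts/test_data/study_es_0/",
    url_server="http://localhost:8080/cbioportal",
    html_table="report.html",  # hypothetical report file name
    verbose=True,
)
print(" ".join(example))
```

Passing the arguments as a list avoids shell quoting problems when a study path contains spaces.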

Example 1

As an example, you can try the validator with one of the test studies found in <your_cbioportal_dir>/core/src/test/scripts/test_data. For example, assuming the portal runs on port 8080, and using the -v option to also see the progress:

./validateData.py -s ../../../test/scripts/test_data/study_es_0/ -u http://localhost:8080/cbioportal -v

This results in:

INFO: -: Requesting genes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting cancertypes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/patients from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/samples from portal at 'http://localhost:8080/cbioportal'
INFO: data_clinical2.txt: Starting validation of file
INFO: data_clinical2.txt: Validation of file complete
INFO: data_methylation_hm27.txt: Starting validation of file
INFO: data_methylation_hm27.txt: Validation of file complete
INFO: data_expression_median.txt: Starting validation of file
INFO: data_expression_median.txt: Validation of file complete
INFO: brca_tcga_pub.maf: Starting validation of file
INFO: brca_tcga_pub.maf: Validation of file complete
INFO: data_CNA.txt: Starting validation of file
INFO: data_CNA.txt: Validation of file complete
INFO: data_log2CNA.txt: Starting validation of file
INFO: data_log2CNA.txt: Validation of file complete
INFO: -: Validating case lists
INFO: -: Validation of case lists complete
INFO: -: Validation complete
Validation of study succeeded.

When the -html option is used, an HTML report is also generated. For the previous example, the report looks like this:

[Screenshot: HTML validation report for study_es_0]

Example 2

More test studies for trying the validator (study_es_1 and study_es_3) are available in <your_cbioportal_dir>/core/src/test/scripts/test_data. For example, assuming the portal runs on port 8080, and using the -v option:

./validateData.py -s ../../../test/scripts/test_data/study_es_1/ -u http://localhost:8080/cbioportal -v

This results in:

INFO: -: Requesting genes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting cancertypes from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/patients from portal at 'http://localhost:8080/cbioportal'
INFO: -: Requesting clinicalattributes/samples from portal at 'http://localhost:8080/cbioportal'
INFO: data_clinical2.txt: Starting validation of file
INFO: data_clinical2.txt: Validation of file complete
INFO: data_expression_median.txt: Starting validation of file
ERROR: data_expression_median.txt: line 1: column 3: Sample ID not defined in clinical file; found in file: 'TEST2-A1-A0SD-01'
ERROR: data_expression_median.txt: Invalid column header, file cannot be parsed
INFO: -: Validating case lists
ERROR: cases_all.txt: Sample id not defined in clinical file; found in file: 'INVALID-A2-A0T2-01'
INFO: -: Validation of case lists complete
INFO: -: Validation complete
Validation of study failed.

And the corresponding HTML report:

[Screenshot: HTML validation report for study_es_1]
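If the HTML report is not available, the console messages can also be summarized with a few lines of Python. This is a minimal sketch that assumes the messages follow the `LEVEL: file: message` layout seen in the examples on this page and that the output has been captured as text; the function name `errors_by_file` is hypothetical and is not part of the validator.

```python
import re
from collections import defaultdict

# Matches lines such as "ERROR: cases_all.txt: <message>".
# The layout is inferred from the example output, not from the validator source.
LINE_PATTERN = re.compile(r"^([A-Z]+): ([^:]+): (.*)$")


def errors_by_file(log_text):
    """Group ERROR messages from validator output by the file they refer to."""
    errors = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match and match.group(1) == "ERROR":
            errors[match.group(2)].append(match.group(3))
    return dict(errors)


# Excerpt of the output shown for study_es_1.
SAMPLE_LOG = """\
INFO: data_expression_median.txt: Starting validation of file
ERROR: data_expression_median.txt: line 1: column 3: Sample ID not defined in clinical file; found in file: 'TEST2-A1-A0SD-01'
ERROR: data_expression_median.txt: Invalid column header, file cannot be parsed
INFO: -: Validating case lists
ERROR: cases_all.txt: Sample id not defined in clinical file; found in file: 'INVALID-A2-A0T2-01'
Validation of study failed.
"""

errors = errors_by_file(SAMPLE_LOG)
for file_name in sorted(errors):
    print("%s: %d error(s)" % (file_name, len(errors[file_name])))
```

Lines that do not match the assumed layout, such as the final status line, are ignored.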