Z Score normalization script

pieterlukasse edited this page Feb 11, 2016 · 14 revisions

Introduction

For some data types, cBioPortal currently requires you to upload a z-score transformed version of your input file alongside the original. cBioPortal uses this version in, for example, the oncoprint functionality. The extra z-score input file is needed for the following data types:

  • mRNA
  • RPPA

How to proceed

cBioPortal expects z-score normalization to take place per gene. You can produce this extra z-score file yourself, or let cBioPortal generate it from your input files with NormalizeExpressionLevels.java. Most of the information that follows was taken from the comments in NormalizeExpressionLevels.java. We have also added an example calculation and an example of running the program.

The cBioPortal NormalizeExpressionLevels method

Given expression and Copy Number Variation data for a set of samples (patients), generate normalized expression values.

Each gene is normalized separately. First, the expression distribution for unaltered copies of the gene is estimated by calculating the mean and variance of the expression values for samples in which the gene is diploid (as reported by the CNV data). We call this the unaltered distribution. If the gene has no diploid samples, then its normalized expression is reported as NA. Otherwise, for every sample, the gene's normalized expression is reported as

(r - mu)/sigma

where r is the raw expression value, and mu and sigma are the mean and standard deviation of the unaltered distribution, respectively.
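The per-gene rule can be sketched in a few lines of Python. This is an illustration only, not the code in NormalizeExpressionLevels.java: the function name `normalize_gene` and the `min_diploids` parameter are hypothetical, `None` stands in for NA, and the sketch assumes the sample standard deviation (n - 1), which is what reproduces the "Diploid std" value in the example calculation on this page.

```python
import statistics


def normalize_gene(expression, copy_number, min_diploids=2):
    """Z-score one gene against its diploid samples (hypothetical helper).

    expression and copy_number are parallel lists, one entry per sample.
    Returns None (standing in for NA) when too few diploid samples exist.
    """
    # The unaltered distribution: expression values where CNA reports diploid (0).
    diploid = [r for r, c in zip(expression, copy_number) if c == 0]
    if len(diploid) < min_diploids:
        return None
    mu = statistics.mean(diploid)
    sigma = statistics.stdev(diploid)  # assumption: sample sd (n - 1)
    if sigma == 0:
        return None  # assumption: treat a constant diploid set as NA
    # Every sample, diploid or not, is scored against the diploid mean and sd.
    return [(r - mu) / sigma for r in expression]


scores = normalize_gene([2.0, 4.0, 6.0, 10.0], [0, 0, 0, 1])
no_diploids = normalize_gene([2.0, 4.0], [1, -1])
print(scores)
print(no_diploids)
```

In the toy call, the three diploid samples have mean 4 and standard deviation 2, so the amplified sample with raw value 10 gets a z-score of 3.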

Syntax

java NormalizeExpressionLevels <copy_number_file> <expression_file> <output_file> <normal_sample_suffix> <[min_number_of_diploids]>

Any number of columns may precede the data. However, the following must be satisfied:

  • the first column of (?) provides gene identifiers

NormalizeExpressionLevels Transformation Algorithm 

Input: copy number (CNA) and expression (exp) files

    for each gene{
       identify diploid cases in CNA
       obtain mean and sd of exp of diploid cases
       for each case{
          zScore <- (value - mean)/sd
       }
    }

Implementation

    read CNA: build hash geneCopyNumberStatus: gene -> Array of (caseID, value ) pairs
    read expression: skip normal cases;
    for each gene{
      get mean and s.d. of elements of diploids
      get zScore for each case
    }
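The implementation outline above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a port of the Java class: it assumes tab-delimited files whose first two columns are Hugo_Symbol and Entrez_Gene_Id, and the function names, the "-11" normal-sample suffix and the tiny in-memory files are all made up for the example.

```python
import csv
import io
import statistics


def read_matrix(handle, skip_suffix=None):
    """Read a gene-by-sample matrix into {gene: {sample_id: value}}.

    Assumed layout: Hugo_Symbol, Entrez_Gene_Id, then one column per sample.
    Samples whose ID ends with skip_suffix (normal cases) are dropped.
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[2:]
    table = {}
    for row in reader:
        table[row[0]] = {
            sample: float(cell)
            for sample, cell in zip(samples, row[2:])
            if not (skip_suffix and sample.endswith(skip_suffix))
        }
    return table


def normalize(cna, expr, min_diploids=2):
    """Z-score every gene against the samples that CNA reports as diploid (0)."""
    out = {}
    for gene, values in expr.items():
        status = cna.get(gene, {})
        diploid = [v for s, v in values.items() if status.get(s) == 0]
        sigma = statistics.stdev(diploid) if len(diploid) >= 2 else 0.0
        if len(diploid) < min_diploids or sigma == 0:
            out[gene] = {s: "NA" for s in values}
            continue
        mu = statistics.mean(diploid)
        out[gene] = {s: (v - mu) / sigma for s, v in values.items()}
    return out


# Hypothetical miniature inputs; "S4-11" plays the role of a normal sample.
header = "Hugo_Symbol\tEntrez_Gene_Id\tS1-01\tS2-01\tS3-01\tS4-11\n"
cna_file = io.StringIO(header + "GENE1\t1\t0\t0\t1\t0\n")
expr_file = io.StringIO(header + "GENE1\t1\t1.0\t3.0\t5.0\t9.0\n")

cna = read_matrix(cna_file)
expr = read_matrix(expr_file, skip_suffix="-11")  # skip normal cases
result = normalize(cna, expr)
print(result["GENE1"])
```

Keying both tables by gene and then by sample ID means the two files do not need to list their samples in the same order.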

Example Calculation

Hugo_Symbol: RPS11, Entrez_Gene_Id: 6205 (samples shown as rows)

    Sample        RPS11 Expr    RPS11 CNA
    A1-A0SD-01     0.765          0
    A1-A0SE-01     0.716          0
    A1-A0SH-01     0.417125       0
    A1-A0SJ-01     0.115          1
    A1-A0SK-01     0.492875       1
    A1-A0SM-01    -0.525          0
    A1-A0SO-01    -0.169         -1
    A1-A0SP-01     0.396          0
    A2-A04N-01     0.50475        0
    A2-A04P-01     0.400875       2
    A2-A04Q-01     0.393125       0
    A2-A04R-01     0.9165         0
    A2-A04T-01     0.627125       1
    A2-A04U-01     0.337125      -1
    A2-A04V-01     0.705          0
    A2-A04W-01     0.16425        0
    A2-A04X-01     0.325         -1
    A2-A04Y-01     0.11175        0

Calculate mean and stdev where CNA is 0 (=diploid): 

    Diploid avg          Diploid std
    0.414954545454545    0.399504498851105

Calculate the z-scores:

Hugo_Symbol: RPS11, Entrez_Gene_Id: 6205 (samples shown as rows)

    Sample        RPS11 Expr Output
    A1-A0SD-01     0.8762
    A1-A0SE-01     0.7535
    A1-A0SH-01     0.0054
    A1-A0SJ-01    -0.7508
    A1-A0SK-01     0.1950
    A1-A0SM-01    -2.3528
    A1-A0SO-01    -1.4617
    A1-A0SP-01    -0.0474
    A2-A04N-01     0.2248
    A2-A04P-01    -0.0352
    A2-A04Q-01    -0.0546
    A2-A04R-01     1.2554
    A2-A04T-01     0.5311
    A2-A04U-01    -0.1948
    A2-A04V-01     0.7260
    A2-A04W-01    -0.6275
    A2-A04X-01    -0.2252
    A2-A04Y-01    -0.7590

Note: because the mean and standard deviation are computed from the diploid samples only, the z-scores over your full dataset will generally not have average=0, std=1.
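The example calculation can be checked with a few lines of Python. This is a sketch using only the standard library `statistics` module, not the cBioPortal implementation; the variable names are chosen here for illustration, and the values are copied from the RPS11 example above.

```python
import statistics

# RPS11 expression and CNA values, in sample order A1-A0SD-01 ... A2-A04Y-01.
expr = [0.765, 0.716, 0.417125, 0.115, 0.492875, -0.525, -0.169, 0.396,
        0.50475, 0.400875, 0.393125, 0.9165, 0.627125, 0.337125, 0.705,
        0.16425, 0.325, 0.11175]
cna = [0, 0, 0, 1, 1, 0, -1, 0, 0, 2, 0, 0, 1, -1, 0, 0, -1, 0]

# Mean and sample standard deviation (n - 1) over the diploid samples only.
diploid = [r for r, c in zip(expr, cna) if c == 0]
mu = statistics.mean(diploid)
sigma = statistics.stdev(diploid)

# Z-scores for all samples, diploid or not.
z = [(r - mu) / sigma for r in expr]

print(round(mu, 6), round(sigma, 6))  # 0.414955 0.399504
print([round(v, 4) for v in z[:3]])   # [0.8762, 0.7535, 0.0054]

# Mean and sd of the z-scores over the full set of samples.
print(round(statistics.mean(z), 4), round(statistics.stdev(z), 4))
```

The last line prints the mean and standard deviation of the z-scores over all 18 samples. They differ from 0 and 1 because only the 11 diploid samples define the reference distribution.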

Running NormalizeExpressionLevels Example

NormalizeExpressionLevels is fairly memory intensive, which is why the example below raises the Java heap limit with -Xmx4096m. You can normalize an expression file with the following commands:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export CORESNAPSHOT=/app/core/target/core-0.2.0-SNAPSHOT.jar
export CONNECTORJAR=/vagrant/tomcat/mysql-connector*.jar
export CNA_FILE=/vagrant/zTest/RPS11_CNA_small.txt
export EXPR_FILE=/vagrant/zTest/RPS11_expression_small.txt
export OUT_FILE=/vagrant/zTest/out2.txt
export NORMALS=nvt

$JAVA_HOME/bin/java -Dspring.profiles.active=dbcp -Xmx4096m -cp :$CORESNAPSHOT org.mskcc.cbio.portal.scripts.NormalizeExpressionLevels $CNA_FILE $EXPR_FILE $OUT_FILE $NORMALS