CELLECT MAGMA Docs

Pascal N Timshel edited this page Sep 22, 2020 · 11 revisions

Technical notes on the workflow, regression and resource usage

The CELLECT-MAGMA workflow consists of the following three main steps:

  1. an annotation step to map SNPs onto genes
  2. a gene analysis step to compute gene p-values
  3. prioritizing cell-type annotations (i.e. fitting the regression)

1. Annotation step: This is a preprocessing step that runs before the analysis. SNPs are mapped to genes by genomic location, which is defined relative to a particular human genome reference build. CELLECT-MAGMA uses the locations of protein-coding genes in build 37 (hg19). The reference data files are derived from Phase 3 of the 1000 Genomes Project, and their SNP locations are also given in build 37 (hg19). Data-specific SNP synonym files are included with the data. All of these auxiliary files were obtained from the MAGMA website and are packaged inside data/magma.

To include SNPs in a window around each gene’s transcribed region, CELLECT-MAGMA applies the WINDOW_DEFINITION parameter. See Input & Output and its section on the WINDOW_DEFINITION parameter for details.
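The idea of a window around a gene can be sketched in a few lines of Python. This is only an illustration of location-based mapping, not the code MAGMA or CELLECT runs. The function name `map_snps_to_genes`, the symmetric padding in kilobases, and all coordinates below are assumptions made for the example.

```python
def map_snps_to_genes(snps, genes, window_kb):
    """Return {gene_id: [snp_id, ...]} for SNPs inside each gene's padded region.

    snps:  {snp_id: (chrom, position)}
    genes: {gene_id: (chrom, start, stop)}
    window_kb: padding added on each side of the transcribed region (assumed symmetric).
    """
    pad = int(window_kb * 1000)
    mapping = {}
    for gene_id, (chrom, start, stop) in genes.items():
        lo = max(start - pad, 1)  # do not run past the start of the chromosome
        hi = stop + pad
        mapping[gene_id] = [
            snp_id
            for snp_id, (snp_chrom, pos) in snps.items()
            if snp_chrom == chrom and lo <= pos <= hi
        ]
    return mapping


# Hypothetical coordinates, chosen only to exercise the three cases.
genes = {"GENE_A": ("1", 500_000, 600_000)}
snps = {
    "rs1": ("1", 450_000),  # inside the 100 kb window upstream of the gene
    "rs2": ("1", 750_000),  # beyond the window
    "rs3": ("2", 550_000),  # right position, wrong chromosome
}
mapping = map_snps_to_genes(snps, genes, window_kb=100)
```

Because the mapping depends only on coordinates, SNP and gene locations must refer to the same genome build, which is why both are fixed to build 37 here.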

2. Gene analysis step to compute gene p-values: This step computes the gene p-values, other gene-level metrics and the correlations between neighbouring genes. The results are written to a formatted file with the .genes.out suffix. The same results plus the gene correlations are also stored in a .genes.raw file.

Resource notes: This step accounts for the majority of the computation time. It is CPU intensive but has a low memory and I/O footprint. Importantly, the p-values need to be calculated only once per GWAS.
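For readers who want to inspect the gene-level results, the following is a minimal sketch of reading a .genes.out file. It assumes the file is a whitespace-delimited table whose header includes GENE, ZSTAT and P columns, so check this against your own output. The helper `read_genes_out` is hypothetical and the sample rows are invented.

```python
# Invented sample rows; the column layout is an assumption about .genes.out.
SAMPLE = """\
GENE       CHR      START       STOP  NSNPS  NPARAM       N        ZSTAT            P
148398       1     859993     879961     50      12  100000       1.2345      0.10851
26155        1     879583     894679     30       8  100000      -0.5000      0.69146
"""


def read_genes_out(text):
    """Parse the table into a list of dicts with GENE, ZSTAT and P."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        rows.append(
            {
                "GENE": record["GENE"],  # keep IDs as strings
                "ZSTAT": float(record["ZSTAT"]),
                "P": float(record["P"]),
            }
        )
    return rows


rows = read_genes_out(SAMPLE)
```

In practice the text would come from `open(path).read()`. The ZSTAT column is what the regression in the next step uses as its response variable.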

3. Prioritizing cell-type annotations: During the cell-type prioritization analysis, a linear regression model is fit between the MAGMA ZSTATs calculated in the previous step and the specificity values (see the SPECIFICITY_INPUT section of the config) for each cell-type annotation. For the cell-type conditional analysis, CELLECT adds the annotation being conditioned on to the model, taking the annotations listed in the CONDITIONAL_INPUT config section one at a time. The regression model itself is fit with Ordinary Least Squares (OLS) as implemented in the Python statsmodels module.
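The difference between the two analysis types can be illustrated with a small least-squares sketch. CELLECT itself uses statsmodels. To stay dependency-free, the example below swaps in a hand-written normal-equations solver. The function `ols`, the variable names and the toy numbers are all assumptions, and the sketch leaves out p-values and any additional covariates the real model may include.

```python
def ols(predictors, y):
    """Least-squares fit of y ~ intercept + predictors.

    predictors: list of equal-length columns; returns [intercept, b1, b2, ...].
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    n = len(y)
    X = [[1.0] + [column[i] for column in predictors] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    for k in range(p):
        pivot_row = max(range(k, p), key=lambda r: abs(A[r][k]))  # partial pivoting
        A[k], A[pivot_row] = A[pivot_row], A[k]
        pivot = A[k][k]
        A[k] = [v / pivot for v in A[k]]
        for r in range(p):
            if r != k:
                factor = A[r][k]
                A[r] = [vr - factor * vk for vr, vk in zip(A[r], A[k])]
    return [A[r][p] for r in range(p)]


# Toy per-gene specificity values for two hypothetical annotations.
spec_a = [0.0, 0.2, 0.5, 0.9, 1.0]
spec_b = [0.1, 0.8, 0.3, 0.0, 0.6]
# Toy gene ZSTATs built from a known linear relationship, so the fit is checkable.
zstat = [0.5 + 2.0 * a - 1.0 * b for a, b in zip(spec_a, spec_b)]

# Prioritization-style fit: one annotation at a time.
marginal = ols([spec_a], zstat)
# Conditional-style fit: annotation A with annotation B added to the model.
conditional = ols([spec_a, spec_b], zstat)
```

Here `conditional` recovers the coefficients used to build `zstat`, while the slope in `marginal` differs because part of the signal from `spec_b` is absorbed by `spec_a` when `spec_b` is left out. That shift is what a conditional analysis is designed to expose.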

Resource notes: This step is CPU intensive. The computation time scales linearly with the number of GWAS traits and the total number of cell-type annotations analyzed.

Runtime

  • The gene analysis takes the longest time but only needs to be done once per GWAS.
  • The computational time for analysis_type=prioritization or analysis_type=conditional generally scales linearly with the number of GWAS traits and cell-type annotations analyzed.
  • Building the snakemake workflow DAG can take a long time if you are analyzing many cell-type annotations (and/or GWAS).
  • If an input GWAS file is provided in a compressed format (gz and bz2 are supported), a temporary uncompressed copy is stored in <BASE_OUTPUT_DIR>/precomputation. Decompression may take a few minutes.
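The handling of compressed GWAS input described in the last point can be sketched with the Python standard library. This is an illustration of the behaviour, not CELLECT's own code. The helper `uncompressed_copy`, the file name and the summary-statistics columns are hypothetical.

```python
import bz2
import gzip
import shutil
import tempfile
from pathlib import Path

# Map the supported compressed suffixes to their stdlib openers.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def uncompressed_copy(gwas_path, out_dir):
    """Write an uncompressed copy of a .gz/.bz2 file into out_dir and return its path.

    Files with any other suffix are assumed to be uncompressed and returned as-is.
    """
    src = Path(gwas_path)
    opener = OPENERS.get(src.suffix)
    if opener is None:
        return src
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / src.stem  # "trait.sumstats.gz" -> "trait.sumstats"
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks; large files never sit in memory
    return dst


# Demonstration in a throwaway directory with invented summary statistics.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "trait.sumstats.gz"
    with gzip.open(gz_path, "wt") as fh:
        fh.write("SNP\tZ\tN\nrs1\t1.5\t1000\n")
    copy_path = uncompressed_copy(gz_path, Path(tmp) / "precomputation")
    copy_name = copy_path.name
    copy_text = copy_path.read_text()
```

The copy is as large as the uncompressed file, so the output directory needs enough free space for every compressed GWAS being analyzed.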

Runtime examples:

  • analysis_type=prioritization: 3 specificity inputs (~450 annotations in total), 40 GWAS: ~0.5 hours using 80 parallel jobs (-j 80)
  • analysis_type=prioritization: 1 specificity input (~450 annotations in total), 10 GWAS: ~0.2 hours using 10 parallel jobs (-j 10)
  • analysis_type=conditional: 1 specificity input (~450 annotations in total, 10 conditional annotations), 10 GWAS: ~0.5 hours using 10 parallel jobs (-j 10)

Citation

If you use CELLECT-MAGMA, please cite: