CELLECT MAGMA Docs

Technical notes on the workflow, regression and resource usage

The CELLECT-MAGMA workflow consists of the following three main steps:

an annotation step to map SNPs onto genes
a gene analysis step to compute gene p-values
a prioritization step to rank cell-type annotations (i.e. fitting the regression)

1. Annotation step: This preprocessing step runs before the analysis. SNPs are mapped to genes by genomic location, which is defined relative to a particular human genome reference build. CELLECT-MAGMA uses gene locations for protein-coding genes in build 37 (hg19). The reference data files are derived from Phase 3 of the 1000 Genomes Project, and their SNP locations also refer to build 37 (hg19). Data-specific SNP synonym files are included with the data. All of these auxiliary files were obtained from the MAGMA website and are packaged in data/magma.

To also include SNPs in a window around each gene's transcribed region, set the WINDOW_DEFINITION parameter. See its section in Input & Output for details.
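The location-based mapping with a window can be sketched as follows. This is only an illustration of the idea, not MAGMA's implementation: the Gene type, the genes_for_snp helper, the gene coordinates, and the assumption of a symmetric window given in kilobases are all hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # start of the transcribed region (bp)
    end: int    # end of the transcribed region (bp)


def genes_for_snp(chrom, pos, genes, window_kb=0):
    """Return the names of genes whose window-extended region contains the SNP.

    Assumption: the window is symmetric and given in kilobases.
    """
    pad = window_kb * 1000
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start - pad <= pos <= g.end + pad
    ]


# Hypothetical gene locations, for illustration only
genes = [
    Gene("GENE_A", "1", 100_000, 120_000),
    Gene("GENE_B", "1", 200_000, 230_000),
]

# A SNP 5 kb downstream of GENE_A is mapped only once a window is applied
print(genes_for_snp("1", 125_000, genes))                # no window
print(genes_for_snp("1", 125_000, genes, window_kb=10))  # 10 kb window
```

Note that with a large enough window a single SNP can be assigned to more than one gene.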

2. Gene analysis step to compute gene p-values: In this step, the gene p-values, other gene-level metrics, and correlations between neighbouring genes are computed. The results are written to a formatted output file with the .genes.out suffix. The same results, plus the gene correlations, are also stored in a .genes.raw file.
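For inspecting the gene-level results, a minimal reader for a .genes.out file might look like the sketch below. It assumes a whitespace-delimited table with a header row containing GENE, ZSTAT and P columns; check this against your own output files. The read_genes_out helper and the example rows are hypothetical.

```python
import io


def read_genes_out(handle):
    """Parse a whitespace-delimited gene results table into {gene: (zstat, p)}.

    Assumption: the first non-blank, non-comment line is a header that
    includes the columns GENE, ZSTAT and P.
    """
    header = None
    results = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if header is None:
            header = fields
            continue
        record = dict(zip(header, fields))
        results[record["GENE"]] = (float(record["ZSTAT"]), float(record["P"]))
    return results


# Hypothetical file content, for illustration only
example = io.StringIO(
    "GENE CHR START STOP NSNPS NPARAM N ZSTAT P\n"
    "1001 1 100000 120000 35 12 50000 2.31 0.0104\n"
    "1002 1 200000 230000 58 20 50000 -0.42 0.6628\n"
)
gene_stats = read_genes_out(example)
print(gene_stats["1001"])
```

For a real file, open it with open(path) and pass the handle to the same function.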

Resource notes: This step accounts for most of the computation time. It is CPU intensive but has a low memory and I/O footprint. Importantly, the p-values need to be calculated only once per GWAS.

3. Prioritizing cell-type annotations: For each cell-type annotation, a linear regression model is fit between the MAGMA ZSTATs calculated in the previous step and the annotation's specificity values (see the SPECIFICITY_INPUT section of the config). For cell-type conditional analysis, CELLECT adds the annotation being conditioned on to the model as a covariate, taking the annotations listed in the CONDITIONAL_INPUT config section one at a time. The regression itself is fit by Ordinary Least Squares (OLS) as implemented in the Python statsmodels module.

Resource notes: This step is CPU intensive. The computation time scales linearly with the number of GWAS traits and the total number of cell-type annotations analyzed.

Runtime

The gene analysis takes the longest time but only needs to be done once per GWAS.
The computational time for analysis_type=prioritization or analysis_type=conditional generally scales linearly with the number of GWAS traits and cell-type annotations analyzed.
Building the snakemake workflow DAG can take a long time if you are analyzing many cell-type annotations (and/or GWAS).
If an input GWAS file is provided in a compressed format (gz and bz2 are supported), a temporary uncompressed copy will be stored in <BASE_OUTPUT_DIR>/precomputation. Decompression might take a few minutes.
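The decompression of a gz or bz2 GWAS file into an uncompressed copy can be sketched with the Python standard library as below. This is an illustration only: the decompress_gwas helper and the way the output file is named are assumptions, not CELLECT's actual implementation.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map each supported compression suffix to the function that opens it
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def decompress_gwas(src, out_dir):
    """Write an uncompressed copy of src into out_dir and return its path.

    Files without a .gz or .bz2 suffix are returned unchanged.
    """
    src = Path(src)
    opener = OPENERS.get(src.suffix)
    if opener is None:
        return src
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / src.stem  # file name without the compression suffix
    # Stream the data in chunks so large files are never held fully in memory
    with opener(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, decompress_gwas("trait.sumstats.gz", "out/precomputation") would write out/precomputation/trait.sumstats, where both paths are hypothetical.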

Runtime examples:

analysis_type=prioritization: 3 specificity inputs (~450 annotations total), 40 GWAS takes ~0.5 hours using 80 parallel jobs (-j 80)
analysis_type=prioritization: 1 specificity input, 10 GWAS (~450 annotations total) takes ~0.2 hours using 10 parallel jobs (-j 10)
analysis_type=conditional: 1 specificity input, 10 GWAS (10 conditional annotations, with ~450 annotations in total) takes ~0.5 hours using 10 parallel jobs (-j 10)

Citation

If you use CELLECT-MAGMA, please cite:
