CELLECT MAGMA Docs
- Technical notes: Technical notes on the workflow, regression and resource usage.
- Runtime: Runtime estimation.
- Citation: Reference and citation.
The CELLECT-MAGMA workflow consists of the following three main steps:
- an annotation step to map SNPs onto genes
- a gene analysis step to compute gene p-values
- a cell-type prioritization step to fit the regression for each cell-type annotation
1. Annotation step: This is a preprocessing step that maps SNPs onto genes before the analysis. The mapping is based on genomic location, which is relative to a particular human genome reference build. In this step, we use the locations of protein-coding genes in build 37 (hg19). The reference data files are derived from Phase 3 of the 1000 Genomes Project, and their SNP locations are also relative to build 37 (hg19). Data-specific SNP synonym files are included with the data. All of these auxiliary files were obtained from the MAGMA website and are packaged in data/magma.
To include SNPs in a window around each gene's transcribed region, we use the WINDOW_DEFINITION parameter. See Input & Output and its section on the WINDOW_DEFINITION parameter for details.
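The window-based mapping described above can be illustrated with a minimal Python sketch. This is not MAGMA's implementation: the function name `map_snps_to_genes`, the `window_kb` argument, the toy coordinates, and the choice to ignore gene strand are all assumptions made for illustration. The actual window format is set through WINDOW_DEFINITION as documented in Input & Output.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, window_kb=(10, 10)):
    """Assign each SNP to every gene whose padded region contains it.

    snps:  {snp_id: (chromosome, position)}
    genes: {gene_id: (chromosome, start, stop)}
    window_kb: hypothetical (before_start, after_stop) padding in kilobases.
    """
    before, after = (kb * 1000 for kb in window_kb)
    mapping = defaultdict(list)
    for gene_id, (chrom, start, stop) in genes.items():
        lo, hi = start - before, stop + after  # padded gene region
        for snp_id, (snp_chrom, pos) in snps.items():
            if snp_chrom == chrom and lo <= pos <= hi:
                mapping[gene_id].append(snp_id)
    return dict(mapping)


# Toy coordinates, made up for illustration only.
genes = {"GENE_A": ("1", 100_000, 120_000), "GENE_B": ("1", 200_000, 210_000)}
snps = {
    "rs1": ("1", 95_000),   # inside GENE_A's 10 kb window
    "rs2": ("1", 110_000),  # inside GENE_A's transcribed region
    "rs3": ("1", 150_000),  # between genes, outside both windows
    "rs4": ("2", 110_000),  # different chromosome
}
annotation = map_snps_to_genes(snps, genes)
print(annotation)
```

Genes with no SNPs in their window (GENE_B here) simply do not appear in the result. The nested loop is quadratic and only suitable for toy inputs.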
2. Gene analysis step: This step computes the gene p-values, other gene-level metrics, and the correlations between neighbouring genes. The gene analysis results are written to a formatted output file with the .genes.out suffix. The same results, plus the gene correlations, are also stored in a .genes.raw file.
Resource notes: This step accounts for most of the computation time. It is CPU-intensive but has a low memory and I/O footprint. Importantly, the gene p-values only need to be calculated once per GWAS.
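As a rough illustration of how the gene-level results might be read back, the sketch below parses a whitespace-delimited .genes.out table into a gene-to-ZSTAT dictionary. The column names (GENE, CHR, START, STOP, NSNPS, NPARAM, N, ZSTAT, P) are an assumption based on typical MAGMA output and should be checked against your own files. The helper names and the example rows are hypothetical.

```python
import io


def read_genes_out(handle):
    """Parse a whitespace-delimited table with a single header line."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        if line.strip():  # skip blank lines
            rows.append(dict(zip(header, line.split())))
    return rows


def zstat_by_gene(rows):
    """Map each gene ID to its ZSTAT as a float."""
    return {row["GENE"]: float(row["ZSTAT"]) for row in rows}


# Made-up rows standing in for a real .genes.out file.
example = io.StringIO(
    "GENE CHR START STOP NSNPS NPARAM N ZSTAT P\n"
    "GENE_A 1 90000 130000 25 8 50000 2.31 0.0104\n"
    "GENE_B 1 190000 220000 12 5 50000 -0.40 0.6554\n"
)
zstats = zstat_by_gene(read_genes_out(example))
print(zstats)
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object.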
3. Prioritizing cell-type annotations: In the cell-type prioritization analysis, a linear regression model is fit between the MAGMA ZSTATs calculated in the previous step and the specificity values (see the SPECIFICITY_INPUT section of the config) for each cell-type annotation. For the cell-type conditional analysis, CELLECT adds the cell-type annotation being conditioned on as a covariate (one at a time from the list in the CONDITIONAL_INPUT section of the config) when fitting the model. To fit the regression model, CELLECT uses Ordinary Least Squares (OLS) as implemented in the Python statsmodels module.
Resource notes: This step is CPU-intensive. The computation time scales linearly with the number of GWAS traits and the total number of cell-type annotations analyzed.
- The gene analysis takes the longest time but only needs to be done once per GWAS.
- The computation time for analysis_type=prioritization or analysis_type=conditional generally scales linearly with the number of GWAS traits and cell-type annotations analyzed.
- Building the snakemake workflow DAG can take a long time if you are analyzing many cell-type annotations (and/or GWAS).
- If an input GWAS file is provided in a compressed format (gz and bz2 are supported), a temporary uncompressed copy is stored in <BASE_OUTPUT_DIR>/precomputation. Decompression may take a few minutes.
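The handling of compressed GWAS input can be pictured with the following sketch, which writes an uncompressed copy of a gz or bz2 file into a target directory. It is an illustration only and not the workflow's own code: the function `uncompressed_copy`, the file names, and the directory layout are assumptions.

```python
import bz2
import gzip
import shutil
import tempfile
from pathlib import Path

# Map supported suffixes to the matching stdlib opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def uncompressed_copy(gwas_path, out_dir):
    """Return a path to an uncompressed version of gwas_path.

    Files without a supported compression suffix are returned unchanged.
    """
    gwas_path = Path(gwas_path)
    opener = OPENERS.get(gwas_path.suffix)
    if opener is None:
        return gwas_path
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / gwas_path.stem  # drops the .gz / .bz2 suffix
    with opener(gwas_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are fine
    return target


# Demonstration with a tiny made-up summary statistics file.
with tempfile.TemporaryDirectory() as tmp:
    compressed = Path(tmp) / "trait.sumstats.gz"
    with gzip.open(compressed, "wt") as fh:
        fh.write("SNP\tZ\tN\nrs1\t1.2\t50000\n")
    copy = uncompressed_copy(compressed, Path(tmp) / "precomputation")
    copy_name = copy.name
    copy_text = copy.read_text()

print(copy_name)
```

Streaming with `shutil.copyfileobj` keeps memory use flat, so the time cost is dominated by reading and writing the file.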
Runtime examples:
- analysis_type=prioritization: 3 specificity inputs (~450 annotations in total), 40 GWAS: ~0.5 hours using 80 parallel jobs (-j 80)
- analysis_type=prioritization: 1 specificity input (~450 annotations in total), 10 GWAS: ~0.2 hours using 10 parallel jobs (-j 10)
- analysis_type=conditional: 1 specificity input, 10 GWAS (10 conditional annotations, with ~450 annotations in total): ~0.5 hours using 10 parallel jobs (-j 10)
If you use CELLECT-MAGMA, please cite: