Notes for LD Score Regression.
TODO: Delete sliding window, because the l2 in reference panel is already the LD score. Figure out why 'M=1173569' which is the number of SNPs in the reference panel.
- 'master' branch: the source code of my implementation of LDSC.
- 'ldscore' folder: the source code of LDSC.
- 'ldsc.py' file: the entry point of LDSC.
- 'data' folder: the data used by LDSC, i.e.
$\ell^2$ and reference panel. - 'results' folder: the output log of LDSC.
- 'test' folder: the test code and results. Mostly used for analysis performance.
- 'gh-pages' branch: the code of the presentation page.
- Install dependencies:
pip3 install -r requirements.txt
Note: only need install jackknife
to estimate the variance of heritability.
- Calculate LD scores and estimate heritability:
python3 ldsc.py \
-M all \
-s ./data/full.sumstats \
-r ./data/eur_w_ld_chr/ \
-m CM -w 1e-4 \
-o ./results/test
- Calculate LD scores only:
python3 ldsc.py \
-mi ldsc \
-s ./data/full.sumstats \
-r ./data/eur_w_ld_chr/ \
-m CM -w 1e-2 \
-o ./results/test
- Estimate heritability only:
python3 ldsc.py \
-mi h2 \
-s ./results/test.txt \
-n 61220 \
-o ./results/test
- LD Score Calculation with MPI
- Delete MAF <= 0.01 SNPs, ignore the order of 'A1' and 'A2'
- Mapping from SNP in sumsstats to SNP in reference panel
- LD Score is the 'l2' column in the reference panel
- Iterative ReWeighted Least Squares(IRWLS):
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$ $$\text{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^n ||y_i - \mathbf{x}_i\boldsymbol{\beta}||_p$$ where$\mathbf{X}=[\mathbf{1},\mathbf{\ell}]$ is a$n\times 2$ matrix.-
Initial with pesudo-inverse:
$$\hat{\boldsymbol{\beta}}_0 = \text{pinv}(\mathbf{X})\mathbf{Y}$$ -
Iteratively update the weight matrix, for
$k=1,\ldots$ , until convergence or max iteration:- Calculate the residuals:
$$\mathbf{r}_k = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}_k$$ - Calculate the weight matrix:
$$\mathbf{w}_k = | \mathbf{r}_k |^{p-2}$$ $$\mathbf{W}_k = \text{diag}(\mathbf{w}_k/\sum \mathbf{w}_k)$$ - Update the regression coefficient:
$$\hat{\boldsymbol{\beta}}_{k+1} = (\mathbf{X}^T\mathbf{W}_k\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}_k\mathbf{Y}$$ - Check convergence:
$$||\mathbf{r}_k - \mathbf{r}_{k+1}||_2 < \epsilon$$
- Calculate the residuals:
-
jackknife estimate of the variance:
-
vizviewer result_mpi.json
You may install vizviewer
python package first.
The presentation on LDSC is here, which is generated base on the 'gh-pages' branch and deployed by github action.
- Bulik-Sullivan, Brendan K., et al. "LD score regression distinguishes confounding from polygenicity in genome-wide association studies." Nature genetics 47.3 (2015): 291-295. doi: 10.1038/ng.3211
- Cai M, Wang Z, Xiao J, et al. XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias[J]. bioRxiv, 2023: 2023.03. 30.534832.
- Burrus C S. Iterative reweighted least squares[J]. OpenStax CNX. Available online: web.archive, 2012, 12.