PCA workflow
For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data. Next, use high-ld.txt to extract all SNPs that are located in the regions described in the file using the code as follows:
plink --file ${plinkFile} --make-set high-ld.txt --write-set --out hild
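The variant IDs written to hild.set can then be excluded before running PCA. A minimal sketch (assuming the same ${plinkFile} prefix as above; the output prefix is just an illustration):
plink --file ${plinkFile} --exclude hild.set --make-bed --out plink_no_hild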
Note: this tutorial is being updated to Version 2024
This GitHub page aims to provide a hands-on tutorial on common analyses in Complex Trait Genomics. This tutorial is designed for the course Fundamental Exercise II
provided by The Laboratory of Complex Trait Genomics at the University of Tokyo. For more information, please see About.
This tutorial covers the minimum skills and knowledge required to perform a typical genome-wide association study (GWAS). The contents are categorized into the following groups. Additionally, for absolute beginners, we also prepared a section on command lines in Linux.
If you have any questions or suggestions, please feel free to let us know in the Issue section of this repository.
"},{"location":"#contents","title":"Contents","text":""},{"location":"#command-lines","title":"Command lines","text":"In these sections, we will briefly introduce the Post-GWAS analyses, which will dig deeper into the GWAS summary statistics. \u00a0
Introductions on GWAS-related issues
504 EAS individuals from 1000 Genomes Project Phase 3 version 5
Url: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Genome build: human_g1k_v37.fasta (hg19)
"},{"location":"01_Dataset/#genotype-data-processing","title":"Genotype Data Processing","text":"plink --mac 2 --max--maf 0.01 --thin 0.02
)plink --maf 0.01 --thin 0.15
)Note
The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
has been included in 01_Dataset
when you clone the repository. There is no need to download it again if you clone this repository.
You can also simply run download_sampledata.sh
in 01_Dataset
and the dataset will be downloaded and decompressed.
./download_sampledata.sh\n
Sample dataset is currently hosted on Dropbox, which may not be accessible for users in certain regions.
Or you can manually download it from this link.
Unzip the dataset unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
, and you will get the following files:
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
"},{"location":"01_Dataset/#phenotype-simulation","title":"Phenotype Simulation","text":"Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.
gcta \\\n --bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \\\n --simu-cc 250 254 \\\n --simu-causal-loci causal.snplist \\\n --simu-hsq 0.8 \\\n --simu-k 0.5 \\\n --simu-rep 1 \\\n --out 1kgeas_binary\n
$ cat causal.snplist\n2:55620927:G:A 3\n8:97094292:C:T 3\n20:42758834:T:C 3\n7:134326056:G:T 3\n1:167562605:G:A 3\n
Warning
This simulation is used only to show the analysis pipeline and data formats. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the result itself is meaningless.
Allele frequency and Effect size
"},{"location":"01_Dataset/#reference","title":"Reference","text":"This section is intended to provide a minimum introduction of the command line in Linux system for handling genomic data. (If you are alreay familiar with Linux commands, it is completely ok to skip this section.)
If you are a beginner with no background in programming, it would be helpful if you could learn some basic commands first before any analysis. In this section, we will introduce the most basic commands which enable you to handle genomic files in the terminal using command lines in a linux system.
For Mac users
This tutorial will probably work with no problems. Simply open your terminal and follow the tutorial. (Note: a few commands might be different on macOS.)
For Windows users
You can simply install WSL to get a Linux environment. Please check here for how to install WSL.
"},{"location":"02_Linux_basics/#table-of-contents","title":"Table of Contents","text":"man
Main functions of the Linux kernel
Some of the most common Linux distributions
Linux and Linus
Linux is named after Linus Benedict Torvalds, the legendary Finnish software engineer who led the development of the Linux kernel. He also developed the amazing version control software - Git.
Reference: https://en.wikipedia.org/wiki/Linux
"},{"location":"02_Linux_basics/#how-do-we-interact-with-computers","title":"How do we interact with computers?","text":"GUI and CUI
Shell
$ is the prompt for the Bash shell, which indicates that you can type commands after the $ sign. Other shells may use different prompt signs, such as % or > (C shell). In this tutorial, we will use bash.
Tip
The reason why we want to use CUI for large-scale data analysis is that CUI is better in terms of precision, memory usage, and processing speed.
"},{"location":"02_Linux_basics/#overview-of-the-basic-commands-in-linux","title":"Overview of the basic commands in Linux","text":"Unlike clicking and dragging files in Windows or MacOS, in Linux, we usually handle files by typing commands in the terminal.
Here is a list of the basic commands we are going to cover in this brief tutorial:
Basic Linux commands
| Function group | Commands | Description |
| --- | --- | --- |
| Directories | pwd, ls, mkdir, rmdir | Commands for checking, creating and removing directories |
| Files | touch, cp, mv, rm | Commands for creating, copying, moving and removing files |
| Checking files | cat, zcat, head, tail, less, more, wc | Commands for inspecting files |
| Archiving and compression | tar, gzip, gunzip, zip, unzip | Commands for archiving and compressing files |
| Manipulating text | sort, uniq, cut, join, tr | Commands for manipulating text files |
| Modifying permission | chmod, chown, chgrp | Commands for changing the permissions of files and directories |
| Links | ln | Commands for creating symbolic and hard links |
| Pipe, redirect and others | pipe, >, >>, *, ., .. | A group of miscellaneous commands |
| Advanced text editing | awk, sed | Commands for more complicated text manipulation and editing |
"},{"location":"02_Linux_basics/#how-to-check-the-usage-of-a-command-using-man","title":"How to check the usage of a command using man:","text":"The first command we might want to learn is man, which shows the manual for a certain command. When you forget how to use a command, you can always use man to check.
man: Check the manual of a command (e.g., man chmod), or use the --help option (e.g., chmod --help).
For example, we want to check the usage of pwd
:
Use man
to get the manual for commands
$ man pwd\n
Then you will see the manual of pwd
in your terminal. PWD(1) User Commands PWD(1)\n\nNAME\n pwd - print name of current/working directory\n\nSYNOPSIS\n pwd [OPTION]...\n\nDESCRIPTION\n Print the full filename of the current working directory.\n....\n
Explain shell
Or you can use this wonderful website to get explanations for your commands.
URL : https://explainshell.com/
"},{"location":"02_Linux_basics/#commands","title":"Commands","text":""},{"location":"02_Linux_basics/#directories","title":"Directories","text":"The first set of commands are: pwd
, cd
, ls
, mkdir
and rmdir
, which are related to directories (like the folders in a Windows system).
pwd
","text":"pwd
: Print working directory, which means printing the path of the current directory (working directory)
Use pwd
to print the current directory you are in
$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
This command prints the absolute path.
An example of Linux file system and file paths
| Type | Description | Example |
| --- | --- | --- |
| Absolute path | path starting from root | /home/User3/GWASTutorial/02_Linux_basics/README.md |
| Relative path | path starting from the current directory | ./GWASTutorial/02_Linux_basics/README.md |
Tip: use readlink
to obtain the absolute path of a file
To get the absolute path of a file, you can use readlink -f [filename]
.
$ readlink -f README.md \n/home/he/work/GWASTutorial/02_Linux_basics/README.md\n
"},{"location":"02_Linux_basics/#cd","title":"cd
","text":"cd
: Change the current working directory.
Use cd
to change directory to 02_Linux_basics
and then print the current directory
$ cd 02_Linux_basics\n$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
"},{"location":"02_Linux_basics/#ls","title":"ls
","text":"ls
: List the contents in the working directory
Some frequently used options for ls:
-l : in a list-like format
-h : convert file sizes into a human-readable format (KB, MB, GB...)
-a : list all files (including hidden files, namely those files with a period at the beginning of the filename)
Simply list the files and directories in the current directory
$ ls\nREADME.md sumstats.txt\n
List the files and directories with options -lha
$ ls -lha\ndrwxr-xr-x 4 he staff 128B Dec 23 14:07 .\ndrwxr-xr-x 17 he staff 544B Dec 23 12:13 ..\n-rw-r--r-- 1 he staff 0B Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n
Tip: use tree
to visualize the structure of a directory
You can use tree
command to visualize the structure of a directory.
$ tree ./02_Linux_basics/\n./02_Linux_basics/\n├── README.md\n└── sumstats.txt\n\n0 directories, 2 files\n
"},{"location":"02_Linux_basics/#mkdir-rmdir","title":"mkdir
& rmdir
","text":"mkdir
: Create a new empty directory
rmdir: Delete an empty directory
Make a directory and delete it
$ mkdir new_directory\n$ ls\nnew_directory README.md sumstats.txt\n$ rmdir new_directory/\n$ ls\nREADME.md sumstats.txt\n
"},{"location":"02_Linux_basics/#manipulating-files","title":"Manipulating files","text":"This set of commands includes: touch
, mv
, rm
and cp
touch
","text":"touch
command is used to create a new empty file.
Create an empty text file called newfile.txt
in this directory
$ ls -l\ntotal 64048\n-rw-r--r-- 1 he staff 0 Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 32790417 Dec 23 14:07 sumstats.txt\n\ntouch newfile.txt\n\n$ touch newfile.txt\n$ ls -l\ntotal 64048\n-rw-r--r-- 1 he staff 0 Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 0 Dec 23 14:14 newfile.txt\n-rw-r--r-- 1 he staff 32790417 Dec 23 14:07 sumstats.txt\n
"},{"location":"02_Linux_basics/#mv","title":"mv
","text":"mv
has two functions:
The following command will create a new directory called new_directory, and move sumstats.txt into that directory, just like dragging a file into a folder in Windows.
Move a file to a different directory
# make a new directory\n$ mkdir new_directory\n\n#move sumstats to the new directory\n$ mv sumstats.txt new_directory/\n\n# list the item in new_directory\n$ ls new_directory/\nsumstats.txt\n
Now, let's move it back to the current directory and rename it to sumstats_new.txt
.
Rename a file using mv
$ mv ./new_directory/sumstats.txt ./\n
Note: ./ means the current directory.
You can also use mv to rename a file: #rename\n$ mv sumstats.txt sumstats_new.txt \n
"},{"location":"02_Linux_basics/#rm","title":"rm
","text":"rm
: Remove files or directories
Remove a file and a directory
# remove a file\n$rm file\n\n#remove files in a directory (recursive mode)\n$rm -r directory/\n
There is no trash can in the Linux command-line interface
If you delete a file with rm, it will be very difficult to restore it. Please be careful when using rm.
cp
","text":"cp
command is used to copy files or directories.
Copy a file and a directory
#cp files\n$cp file1 file2\n\n# copy directory\n$cp -r directory1/ directory2/\n
"},{"location":"02_Linux_basics/#links","title":"Links","text":"Symbolic link is like a shortcut on window system, which is a special type of file that points to another file.
It is very useful when you want to organize your tool box or working space.
You can use ln -s pathA pathB
to create such a link.
Create a symbolic link for plink
Let's create a symbolic link for plink first.
# /home/he/tools/plink/plink is the original file\n# /home/he/tools/bin is the path for the symbolic link \nln -s /home/he/tools/plink/plink /home/he/tools/bin\n
And then check the link.
cd /home/he/tools/bin\nls -lha\nlrwxr-xr-x 1 he staff 27B Aug 30 11:30 plink -> /home/he/tools/plink/plink\n
"},{"location":"02_Linux_basics/#archiving-and-compression","title":"Archiving and Compression","text":"Results for millions of variants are usually very large, sometimes >10GB, or consists of multiple files.
To save space and make it easier to transfer, we need to archive and compress these files.
Archiving and Compression
Commonly used commands for archiving and compression:
| Extensions | Create | Extract | Functions |
| --- | --- | --- | --- |
| file.gz | gzip | gunzip | compress files |
| file.tar | tar -cvf | tar -xvf | archive files |
| file.tar.gz or file.tgz | tar -czvf | tar -xvzf | archive and compress |
| file.zip | zip | unzip | archive and compress |
Compress and decompress a file using gzip and gunzip
$ ls -lh\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n\n$ gzip sumstats.txt\n$ ls -lh\n-rw-r--r-- 1 he staff 9.9M Dec 23 14:07 sumstats.txt.gz\n\n$ gunzip sumstats.txt.gz\n$ ls -lh\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n
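The tar commands from the table above work in a similar way; a minimal sketch (the directory name results/ is just an illustration):
# archive and compress a directory into results.tar.gz
tar -czvf results.tar.gz results/

# extract it again
tar -xvzf results.tar.gz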
"},{"location":"02_Linux_basics/#read-and-check-files","title":"Read and check files","text":"We have a group of handy commands to check part of or the entire file, including cat
, zcat
, less
, head
, tail
, wc
cat
","text":"cat
command can print the contents of files or concatenate the files.
Create and then cat
the file a_text_file.txt
$ ls -lha > a_text_file.txt\n$ cat a_text_file.txt \ntotal 32M\ndrwxr-x--- 2 he staff 4.0K Apr 2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr 1 22:20 ..\n-rw-r----- 1 he staff 0 Apr 2 00:37 a_text_file.txt\n-rw-r----- 1 he staff 5.0K Apr 1 22:20 README.md\n-rw-r----- 1 he staff 32M Mar 30 18:17 sumstats.txt\n
Warning
Be careful not to cat
a text file with a huge number of lines. You can try to cat sumstats.txt
and see what happens.
By the way, > a_text_file.txt here means redirecting the output to the file a_text_file.txt.
zcat
","text":"zcat
is similar to cat
, but can only be applied to compressed files.
cat
and zcat
a gzipped text file
$ gzip a_text_file.txt \n$ cat a_text_file.txt.gz TGba_text_file. txt\u044f\n@\u0231\u00bbO\ud8ac\udc19v\u0602\ud85e\udca9\u00bc\ud9c3\udce0bq}\udb06\udca4\\\ueee0\u00a4n\u0662\u00aa\uda40\udc2cn\u00bb\u06a1\u01ed\n w5J_\u00bd\ud88d\ude27P\u07c9=\u00ffK\n(\u05a3\u0530\u00a7\u04a4\u0176a\u0786 \u00acM\u00adR\udbb5\udc8am\u00b3\u00fee\u00b8\u00a4\u00bc\u05cdSd\ufff1\u07f2\ub4e4\u00aa\u00adv\n \u5a41 resize: unknown character, exiting.\n\n$ zcat a_text_file.txt.gz \ntotal 32M\ndrwxr-x--- 2 he staff 4.0K Apr 2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr 1 22:20 ..\n-rw-r----- 1 he staff 0 Apr 2 00:37 a_text_file.txt\n-rw-r----- 1 he staff 5.0K Apr 1 22:20 README.md\n-rw-r----- 1 he staff 32M Mar 30 18:17 sumstats.txt\n
gzcat
Use gzcat
instead of zcat
if your device is running MacOS.
head
","text":"head
: Print the first 10 lines.
-n
: option to change the number of lines.
Check the first 10 lines and only the first line of the file sumstats.txt
$ head sumstats.txt \nCHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 319 17 2 1 1 ADD 10000 1.04326 0.0495816 0.854176 0.393008 .\n1 319 22 1 2 2 ADD 10000 1.03347 0.0493972 0.666451 0.505123 .\n1 418 23 1 2 2 ADD 10000 1.02668 0.0498185 0.528492 0.597158 .\n1 537 30 1 2 2 ADD 10000 1.01341 0.0498496 0.267238 0.789286 .\n1 546 31 2 1 1 ADD 10000 1.02051 0.0336786 0.60284 0.546615 .\n1 575 33 2 1 1 ADD 10000 1.09795 0.0818305 1.14199 0.25346 .\n1 752 44 2 1 1 ADD 10000 1.02038 0.0494069 0.408395 0.682984 .\n1 913 50 2 1 1 ADD 10000 1.07852 0.0493585 1.53144 0.12566 .\n1 1356 77 2 1 1 ADD 10000 0.947521 0.0339805 -1.5864 0.112649 .\n\n$ head -n 1 sumstats.txt \nCHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n
"},{"location":"02_Linux_basics/#tail","title":"tail
","text":"Similar to head
, you can use tail
to check the last 10 lines. -n
works in the same way.
Check the last 10 lines of the file sumstats.txt
$ tail sumstats.txt \n22 99996057 9959945 2 1 1 ADD 10000 1.03234 0.0335547 0.948413 0.342919.\n22 99996465 9959971 2 1 1 ADD 10000 1.04755 0.0337187 1.37769 0.1683 .\n22 99997041 9960013 2 1 1 ADD 10000 1.01942 0.0937548 0.205195 0.837419.\n22 99997608 9960051 2 1 1 ADD 10000 0.969928 0.0397711 -0.767722 0. 442652 .\n22 99997629 9960055 2 1 1 ADD 10000 0.986949 0.0395305 -0.332315 0. 739652 .\n22 99997742 9960061 2 1 1 ADD 10000 0.990829 0.0396614 -0.232298 0. 816307 .\n22 99998121 9960086 2 1 1 ADD 10000 1.04448 0.0335879 1.29555 0.19513 .\n22 99998455 9960106 2 1 1 ADD 10000 0.880953 0.152754 -0.829771 0. 406668 .\n22 99999208 9960146 2 1 1 ADD 10000 0.944604 0.065187 -0.874248 0. 381983 .\n22 99999382 9960164 2 1 1 ADD 10000 0.970509 0.033978 -0.881014 0.37831 .\n
"},{"location":"02_Linux_basics/#wc","title":"wc
","text":"wc
: short for word count, which counts the lines, words, and characters in a file.
For example,
Count the lines, words, and characters in sumstats.txt
$ wc sumstats.txt \n 445933 5797129 32790417 sumstats.txt\n
This means that sumstats.txt
has 445933 lines, 5797129 words, and 32790417 characters. "},{"location":"02_Linux_basics/#edit-files","title":"Edit files","text":"Vim is a handy text editor for the command line.
Vim - text editor
vim README.md\n
Simple workflow using Vim
Open a file with vim file_to_edit.txt
Press i to enter the INSERT mode.
Press the Esc key to escape the INSERT mode.
Type :wq to quit and also save the file.
Vim is a little bit hard to learn for beginners, but when you get familiar with it, it will be a mighty and convenient tool. For more detailed tutorials on Vim, you can check: https://github.com/iggredible/Learn-Vim
Other common command line text editors
The permissions of a file or directory are represented as a 10-character string (1+3+3+3) :
For example, this represents a directory (the initial d) which is readable, writable and executable for the owner (the first 3: rwx), users in the same group (the 3 characters in the middle: rwx) and others (last 3 characters: rwx).
drwxrwxrwx
-> d (directory or file) rwx (permissions for owner) rwx (permissions for users in the same group) rwx (permissions for other users)
r : readable
w : writable
x : executable
d : directory
- : file
Command for checking the permissions of files in the current directory: ls -l
Command for changing permissions: chmod
, chown
, chgrp
Syntax:
chmod [3-digit Binary notation] [path]\n
| Number notation | Permission | 3-digit binary notation |
| --- | --- | --- |
| 7 | rwx | 111 |
| 6 | rw- | 110 |
| 5 | r-x | 101 |
| 4 | r-- | 100 |
| 3 | -wx | 011 |
| 2 | -w- | 010 |
| 1 | --x | 001 |
| 0 | --- | 000 |
Change the permissions of the file README.md to 660
# there is a readme file in the directory, and its permissions are -rw-r----- \n$ ls -lh\ntotal 4.0K\n-rw-r----- 1 he staff 2.1K Feb 24 01:16 README.md\n\n# let's change the permissions to 660, which is a numeric notation of -rw-rw---- based on the table above\n$ chmod 660 README.md \n\n# check again, and it was changed.\n$ ls -lh\ntotal 4.0K\n-rw-rw---- 1 he staff 2.1K Feb 24 01:16 README.md\n
Note
These commands are very important because we use genome data, which could raise severe ethical and privacy issues if there is a data leak.
Warning
Please always be cautious when handling human genomic data.
"},{"location":"02_Linux_basics/#others","title":"Others","text":"There are a group of very handy and flexible commands which will greatly improve your efficiency. These include |
, >
, >>
,*
,.
,..
,~
,and -
.
|
(pipe)","text":"Pipe basically is used to pass the output of the previous command to the next command as input, instead of printing is in terminal. Using pipe you can do very complicated manipulations of the files.
An example of Pipe
cat sumstats.txt | sort | uniq | wc\n
This means (1) print sumstats, (2) sort the output, (3) then keep the unique lines and finally (4) count the lines and words."},{"location":"02_Linux_basics/#_1","title":">
","text":">
redirects output to a new file (if the file already exists, it will be overwritten)
Redirects the output of cat sumstats.txt | sort | uniq | wc
to count.txt
cat sumstats.txt | sort | uniq | wc > count.txt\n
"},{"location":"02_Linux_basics/#_2","title":">>
","text":">>
redirects output to a file by appending to the end of the file (if the file already exists, it will not be overwritten)
Redirects the output of cat sumstats.txt | sort | uniq | wc
to count.txt
by appending
cat sumstats.txt | sort | uniq | wc >> count.txt\n
Other useful commands include :
| Command | Description | Example Code | Example code meaning |
| --- | --- | --- | --- |
| * | represents zero or more characters | - | - |
| ? | represents a single character | - | - |
| . | the current directory | - | - |
| .. | the parent directory of the current directory | cd .. | change to the parent directory of the current directory |
| ~ | the home directory | cd ~ | change to the current user's home directory |
| - | the last directory you were working in | cd - | change to the last directory you were working in |
Wildcards
The asterisk *
and the question mark ?
are called wildcard characters or wildcards in Linux, which are special symbols that can represent other normal characters. Wildcards are especially useful when handling multiple files with similar patterns in their names.
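For instance (a sketch; the file names are just illustrations):
# list all files ending with .txt
ls *.txt

# copy all files whose names start with sumstats into backup/ (the directory must already exist)
cp sumstats* backup/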
Warning
Be extremely careful when you use rm together with *. It is disastrous if you mistakenly type rm *.
If you have a lot of commands to run, or if you want to automate some complex manipulations, bash scripts are a good way to address this issue.
We can use vim to create a bash script called hello.sh
A simple example of bash scripts:
Example
hello.sh#!/bin/bash\necho \"Hello, world1\"\necho \"Hello, world2\"\n
#!
is called shebang, which tells the system which interpreter to use to execute the shell script.
Then use chmod
to give it permission to execute.
chmod +x hello.sh \n
Now we can run the script by ./hello.sh
:
./hello.sh\n\"Hello, world1\" \n\"Hello, world2\" \n
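As a slightly richer sketch, a script can also take command-line arguments (greet.sh is a hypothetical file name; make it executable with chmod +x as above):
#!/bin/bash
# greet.sh: print a greeting for the name given as the first argument
echo "Hello, $1"
Running ./greet.sh Alice would then print Hello, Alice.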
"},{"location":"02_Linux_basics/#advanced-text-editing","title":"Advanced text editing","text":"(optional: awk, sed, cut, sort, join, uniq)
cut : cutting out columns from files.
sort : sorting the lines of a file.
uniq : filtering out duplicated lines in a file.
join : joining two tabular files based on specified keys (a sketch combining these tools follows below).
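A quick sketch of combining these tools on the sumstats.txt file used above (assuming it is space-delimited; adjust the -d delimiter for your file):
# count how many lines there are per chromosome:
# cut out column 1, sort it, then collapse duplicates with counts
cut -d ' ' -f 1 sumstats.txt | sort | uniq -c | head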
Advanced commands:
awk : https://cloufield.github.io/GWASTutorial/60_awk/
sed : https://cloufield.github.io/GWASTutorial/61_sed/
Git is a powerful version control software, and GitHub is a platform where you can share your code.
Currently you just need to learn git clone
, which simply downloads an existing repository.
git clone https://github.com/Cloufield/GWASTutorial.git
You can also check here for more information.
Quote
We can use wget [option] [url]
command to download files to the local machine.
The -O option specifies the file name you want for the downloaded file.
Use wget to download the hg19 reference genome from UCSC
# Download hg19 reference genome from UCSC\nwget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n\n# Download hg19 reference genome from UCSC and rename it to my_refgenome.fa.gz\nwget -O my_refgenome.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n
"},{"location":"02_Linux_basics/#exercise","title":"Exercise","text":"The questions are generated by Microsoft Bing!
What is the command to list all files and directories in your current working directory?
ls
cd
pwd
mkdir
What is the command to create a new directory named "test"?
cd test
pwd test
mkdir test
ls test
What is the command to copy a file named "data.txt" from your current working directory to another directory named "backup"?
cp data.txt backup/
mv data.txt backup/
rm data.txt backup/
cat data.txt backup/
What is the command to display the first 10 lines of a file named "results.csv"?
head results.csv
tail results.csv
less results.csv
more results.csv
What is the command to count the number of lines, words, and characters in a file named "report.txt"?
wc report.txt
count report.txt
size report.txt
stat report.txt
What is the command to search for a pattern in a file named "log.txt" and print only the matching lines?
grep pattern log.txt
find pattern log.txt
locate pattern log.txt
search pattern log.txt
What is the command to sort the contents of a file named "names.txt" in alphabetical order and save the output to a new file named "sorted_names.txt"?
sort names.txt > sorted_names.txt
sort names.txt < sorted_names.txt
sort names.txt >> sorted_names.txt
sort names.txt << sorted_names.txt
What is the command to display the difference between two files named "old_version.py" and "new_version.py"?
diff old_version.py new_version.py
cmp old_version.py new_version.py
diffy old_version.py new_version.py
compare old_version.py new_version.py
What is the command to change the permissions of a file named "script.sh" to make it executable by everyone?
chmod +x script.sh
chmod 777 script.sh
chmod ugo+x script.sh
All of the above
What is the command to run a program named "program.exe" in the background and redirect its output to a file named "output.log"?
program.exe & > output.log
program.exe > output.log &
program.exe < output.log &
program.exe & < output.log
This section lists some of the most commonly used formats in complex trait genomic analysis.
"},{"location":"03_Data_formats/#table-of-contents","title":"Table of Contents","text":"Simple text file
.txt
cat sample_text.txt \nLorem ipsum dolor sit amet, consectetur adipiscing elit. In ut sem congue, tristique tortor et, ullamcorper elit. Nulla elementum, erat ac fringilla mattis, nisi tellus euismod dui, interdum laoreet orci velit vel leo. Vestibulum neque mi, pharetra in tempor id, malesuada at ipsum. Duis tellus enim, suscipit sit amet vestibulum in, ultricies vitae erat. Proin consequat id quam sed sodales. Ut a magna non tellus dictum aliquet vitae nec mi. Suspendisse potenti. Vestibulum mauris sem, viverra ac metus sed, scelerisque ornare arcu. Vivamus consequat, libero vitae aliquet tempor, lorem leo mattis arcu, et viverra erat ligula sit amet tortor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Praesent ut massa ac tortor lobortis placerat. Pellentesque aliquam tortor augue, at rutrum magna molestie et. Etiam congue nulla in venenatis congue. Nunc ac felis pharetra, cursus leo et, finibus eros.\n
Random texts are generated using - https://www.lipsum.com/"},{"location":"03_Data_formats/#tsv","title":"tsv","text":"Tab-separated values Tabular data format
.tsv
head sample_data.tsv\n#CHROM POS ID REF ALT A1 FIRTH? TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C N ADD 503 0.750168 0.280794 -1.02373 0.305961 .\n1 14599 1:14599:T:A T A A N ADD 503 1.80972 0.231595 2.56124 0.0104299 .\n1 14604 1:14604:A:G A G G N ADD 503 1.80972 0.231595 2.56124 0.0104299 .\n1 14930 1:14930:A:G A G G N ADD 503 1.70139 0.240245 2.21209 0.0269602 .\n1 69897 1:69897:T:C T C T N ADD 503 1.58002 0.194774 2.34855 0.0188466 .\n1 86331 1:86331:A:G A G G N ADD 503 1.47006 0.236102 1.63193 0.102694 .\n1 91581 1:91581:G:A G A A N ADD 503 0.924422 0.122991 -0.638963 0.522847 .\n1 122872 1:122872:T:G T G G N ADD 503 1.07113 0.180776 0.380121 0.703856 .\n1 135163 1:135163:C:T C T T N ADD 503 0.711822 0.23908 -1.42182 0.155079 .\n
"},{"location":"03_Data_formats/#csv","title":"csv","text":"Comma-separated values Tabular data format
.csv
head sample_data.csv \n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
"},{"location":"03_Data_formats/#data-formats-in-bioinformatics","title":"Data formats in bioinformatics","text":"A typical workflow for generating genotype data for genome-wide association analysis.
"},{"location":"03_Data_formats/#sequence","title":"Sequence","text":""},{"location":"03_Data_formats/#fasta","title":"fasta","text":"text-based format for representing either nucleotide sequences or amino acid (protein) sequences
.fa
or .fasta
>SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n
"},{"location":"03_Data_formats/#fastq","title":"fastq","text":"text-based format for storing both a nucleotide sequence and its corresponding quality scores
.fastq
@SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n+\n!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n
Reference: https://en.wikipedia.org/wiki/FASTQ_format"},{"location":"03_Data_formats/#alingment","title":"Alignment","text":""},{"location":"03_Data_formats/#sambam","title":"SAM/BAM","text":"Sequence Alignment/Map Format is a TAB-delimited text file format consisting of a header section and an alignment section.
.sam
@HD VN:1.6 SO:coordinate\n@SQ SN:ref LN:45\nr001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *\nr002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *\nr003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;\nr004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *\nr003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;\nr001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1\n
Reference : https://samtools.github.io/hts-specs/SAMv1.pdf"},{"location":"03_Data_formats/#variant-and-genotype","title":"Variant and genotype","text":""},{"location":"03_Data_formats/#vcf-vcfgz-vcfgztbi","title":"vcf / vcf.gz / vcf.gz.tbi","text":"VCF is a text file format consisting of meta-information lines, a header line, and then data lines. Each data line contains information about a variant in the genome (and the genotype information on samples for each variant).
.vcf
##fileformat=VCFv4.2\n##fileDate=20090805\n##source=myImputationProgramV3.1\n##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta\n##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>\n##phasing=partial\n##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">\n##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">\n##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">\n##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">\n##FILTER=<ID=q10,Description=\"Quality below 10\">\n##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">\n##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">\n##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">\n#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003\n20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.\n20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3\n20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4\n20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2\n20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3\n
Reference : https://samtools.github.io/hts-specs/VCFv4.2.pdf "},{"location":"03_Data_formats/#plink-format","title":"PLINK format","text":"The figure shows how genotypes are stored in files.
We have 3 parts of information:
And there are different ways (format sets) to represent this information in PLINK1.9 and PLINK2:
.ped
(PLINK/MERLIN/Haploview text pedigree + genotype table)
Original standard text format for sample pedigree information and genotype calls.Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.
.ped
# check the first 16 rows and 16 columns of the ped file\ncut -d \" \" -f 1-16 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped | head\n0 HG00403 0 0 0 -9 G G T T A A G A C C\n0 HG00404 0 0 0 -9 G G T T A A G A T C\n0 HG00406 0 0 0 -9 G G T T A A G A T C\n0 HG00407 0 0 0 -9 G G T T A A A A C C\n0 HG00409 0 0 0 -9 G G T T A A G A C C\n0 HG00410 0 0 0 -9 G G T T A A G A C C\n0 HG00419 0 0 0 -9 G G T T A A A A T C\n0 HG00421 0 0 0 -9 G G T T A A G A C C\n0 HG00422 0 0 0 -9 G G T T A A G A C C\n0 HG00428 0 0 0 -9 G G T T A A G A C C\n0 HG00436 0 0 0 -9 G G A T G A A A C C\n0 HG00437 0 0 0 -9 C G T T A A G A C C\n0 HG00442 0 0 0 -9 G G T T A A G A C C\n0 HG00443 0 0 0 -9 G G T T A A G A C C\n0 HG00445 0 0 0 -9 G G T T A A G A C C\n0 HG00446 0 0 0 -9 C G T T A A G A T C\n
.map
(PLINK text fileset variant information file)
Variant information file accompanying a .ped text pedigree + genotype table. A text file with no header line, and one line per variant with the following 3-4 fields:
.map
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n1 1:13273:G:C 0 13273\n1 1:14599:T:A 0 14599\n1 1:14604:A:G 0 14604\n1 1:14930:A:G 0 14930\n1 1:69897:T:C 0 69897\n1 1:86331:A:G 0 86331\n1 1:91581:G:A 0 91581\n1 1:122872:T:G 0 122872\n1 1:135163:C:T 0 135163\n1 1:233473:C:G 0 233473\n
Reference: https://www.cog-genomics.org/plink/1.9/formats
"},{"location":"03_Data_formats/#bed-fam-bim","title":"bed / fam /bim","text":"bed/fam/bim formats are the binary implementation of ped/map formats. bed/bim/fam files contain the same information as ped/map but are much smaller in size.
-rw-r----- 1 yunye yunye 135M Dec 23 11:45 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed\n-rw-r----- 1 yunye yunye 36M Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n-rw-r----- 1 yunye yunye 9.4K Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n-rw-r--r-- 1 yunye yunye 32M Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n-rw-r--r-- 1 yunye yunye 2.2G Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped\n
.fam
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n0 HG00403 0 0 0 -9\n0 HG00404 0 0 0 -9\n0 HG00406 0 0 0 -9\n0 HG00407 0 0 0 -9\n0 HG00409 0 0 0 -9\n0 HG00410 0 0 0 -9\n0 HG00419 0 0 0 -9\n0 HG00421 0 0 0 -9\n0 HG00422 0 0 0 -9\n0 HG00428 0 0 0 -9\n
.bim
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n1 1:13273:G:C 0 13273 C G\n1 1:14599:T:A 0 14599 A T\n1 1:14604:A:G 0 14604 G A\n1 1:14930:A:G 0 14930 G A\n1 1:69897:T:C 0 69897 C T\n1 1:86331:A:G 0 86331 G A\n1 1:91581:G:A 0 91581 A G\n1 1:122872:T:G 0 122872 G T\n1 1:135163:C:T 0 135163 T C\n1 1:233473:C:G 0 233473 G C\n
.bed
\"Primary representation of genotype calls at biallelic variants The first three bytes should be 0x6c, 0x1b, and 0x01 in that order. The rest of the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.\"
hexdump -C 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed | head\n00000000 6c 1b 01 ff ff bf bf ff ff ff ef fb ff ff ff fe |l...............|\n00000010 ff ff ff ff fb ff bb ff ff fb af ff ff fe fb ff |................|\n00000020 ff ff ff fe ff ff ff ff ff bf ff ff ef ff ff ef |................|\n00000030 bb ff ff ff ff ff ff ff fa ff ff ff ff ff ff ff |................|\n00000040 ff ff ff fb ff ff ff ff ff ff ff ff ff ff ff ef |................|\n00000050 ff ff ff fb fe ef fe ff ff ff ff eb ff ff fe fe |................|\n00000060 ff ff fe ff bf ff fa fb fb eb be ff ff 3b ff be |.............;..|\n00000070 fe be bf ef fe ff ef ee ff ff bf ea fe bf fe ff |................|\n00000080 bf ff ff ef ff ff ff ff ff fa ff ff eb ff ff ff |................|\n00000090 ff ff fb fe af ff bf ff ff ff ff ff ff ff ff ff |................|\n
Reference: https://www.cog-genomics.org/plink/1.9/formats
"},{"location":"03_Data_formats/#imputation-dosage","title":"Imputation dosage","text":""},{"location":"03_Data_formats/#bgen-bgi","title":"bgen / bgi","text":"Reference: https://www.well.ox.ac.uk/~gav/bgen_format/
"},{"location":"03_Data_formats/#pgenpsampvar","title":"pgen,psam,pvar","text":"Reference: https://www.cog-genomics.org/plink/2.0/formats#pgen
NOTE: pgen
only saved the dosage for each individual (a scalar ranged from 0 to 2). It could not been converted back to the genotype probability (a vector of length 3) or allele probability (a matrix of dimension 2 x 2) saved in bgen
.
In this module, we will learn the basics of genotype data QC using PLINK, which is one of the most commonly used software packages in complex trait genomics. (Huge thanks to the developers: PLINK1.9 and PLINK2)
"},{"location":"04_Data_QC/#table-of-contents","title":"Table of Contents","text":"To get prepared for genotype QC, we will need to make directories, download software and add the software to your environment path.
First, we will simply create some directories to keep the tools we need to use.
Create directories
cd ~\nmkdir tools\ncd tools\nmkdir bin\nmkdir plink\nmkdir plink2\n
You can download each tool into its corresponding directories.
The bin
directory here is for keeping all the symbolic links to the executable files of each tool.
In this way, it is much easier to manage and organize the paths and tools. We will only add the bin
directory here to the environment path.
Next, go to the Plink webpage to download the software. We will need both PLINK1.9 and PLINK2.
Download PLINK1.9 and PLINK2 from the following webpage to the corresponding directories:
Info
If you are using Mac or Windows, then please download the Mac or Windows version. In this tutorial, we will use a Linux system and the Linux version of PLINK.
Find the suitable version on the PLINK website, right-click and copy the link address.
Download PLINK2 (Linux AVX2 AMD)
cd ~/tools/plink2\nwget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_amd_avx2_20231212.zip\nunzip plink2_linux_amd_avx2_20231212.zip\n
Then do the same for PLINK1.9
Download PLINK1.9 (Linux 64-bit)
cd ~/tools/plink\nwget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip\nunzip plink_linux_x86_64_20231211.zip\n
"},{"location":"04_Data_QC/#create-symbolic-links","title":"Create symbolic links","text":"After downloading and unzipping, we will create symbolic links for the plink binary files, and then move the link to ~/tools/bin/
.
Create symbolic links
cd ~\nln -s ~/tools/plink2/plink2 ~/tools/bin/plink2\nln -s ~/tools/plink/plink ~/tools/bin/plink\n
"},{"location":"04_Data_QC/#add-paths-to-the-environment-path","title":"Add paths to the environment path","text":"Then add ~/tools/bin/
to the environment path.
Example
export PATH=$PATH:~/tools/bin/\n
This command will add the path to your current shell. If you restart the terminal, it will be lost. So you may need to add it to the Bash configuration file. Then run
echo \"export PATH=$PATH:~/tools/bin/\" >> ~/.bashrc\n
This will add a new line at the end of .bashrc
, which will be run every time you open a new bash shell.
All done. Let's test if we installed PLINK successfully or not.
Check if PLINK is installed successfully.
./plink\nPLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\n\nplink <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink --help [flag name(s)...]\n\nCommands include --make-bed, --recode, --flip-scan, --merge-list,\n--write-snplist, --list-duplicate-vars, --freqx, --missing, --test-mishap,\n--hardy, --mendel, --ibc, --impute-sex, --indep-pairphase, --r2, --show-tags,\n--blocks, --distance, --genome, --homozyg, --make-rel, --make-grm-gz,\n--rel-cutoff, --cluster, --pca, --neighbour, --ibs-test, --regress-distance,\n--model, --bd, --gxe, --logistic, --dosage, --lasso, --test-missing,\n--make-perm-pheno, --tdt, --qfam, --annotate, --clump, --gene-report,\n--meta-analysis, --epistasis, --fast-epistasis, and --score.\n\n\"plink --help | more\" describes all functions (warning: long).\n
./plink2\nPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\n\nplink2 <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink2 --help [flag name(s)...]\n\nCommands include --rm-dup list, --make-bpgen, --export, --freq, --geno-counts,\n--sample-counts, --missing, --hardy, --het, --fst, --indep-pairwise, --ld,\n--sample-diff, --make-king, --king-cutoff, --pmerge, --pgen-diff,\n--write-samples, --write-snplist, --make-grm-list, --pca, --glm, --adjust-file,\n--gwas-ssf, --clump, --score, --variant-score, --genotyping-rate, --pgen-info,\n--validate, and --zst-decompress.\n\n\"plink2 --help | more\" describes all functions.\n
Well done. We have successfully installed plink1.9 and plink2.
"},{"location":"04_Data_QC/#download-genotype-data","title":"Download genotype data","text":"Next, we need to download the sample genotype data. The way to create the sample data is described [here].(https://cloufield.github.io/GWASTutorial/01_Dataset/) This dataset contains 504 EAS individuals from 1000 Genome Project Phase 3v5 with around 1 million variants.
Simply run download_sampledata.sh
in 01_Dataset to download this dataset (from Dropbox). See here
Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.
Download sample data
cd ../01_Dataset\n./download_sampledata.sh\n
And you will get the following three PLINK files:
-rw-r--r-- 1 yunye yunye 149M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n-rw-r--r-- 1 yunye yunye 40M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n-rw-r--r-- 1 yunye yunye 13K Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
Check the bim file:
head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1 1:14930:A:G 0 14930 G A\n1 1:15774:G:A 0 15774 A G\n1 1:15777:A:G 0 15777 G A\n1 1:57292:C:T 0 57292 T C\n1 1:77874:G:A 0 77874 A G\n1 1:87360:C:T 0 87360 T C\n1 1:92917:T:A 0 92917 A T\n1 1:104186:T:C 0 104186 T C\n1 1:125271:C:T 0 125271 C T\n1 1:232449:G:A 0 232449 A G\n
Check the fam file:
head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\nHG00403 HG00403 0 0 0 -9\nHG00404 HG00404 0 0 0 -9\nHG00406 HG00406 0 0 0 -9\nHG00407 HG00407 0 0 0 -9\nHG00409 HG00409 0 0 0 -9\nHG00410 HG00410 0 0 0 -9\nHG00419 HG00419 0 0 0 -9\nHG00421 HG00421 0 0 0 -9\nHG00422 HG00422 0 0 0 -9\nHG00428 HG00428 0 0 0 -9\n
"},{"location":"04_Data_QC/#plink-tutorial","title":"PLINK tutorial","text":"Detailed descriptions can be found on plink's website: PLINK1.9 and PLINK2.
The functions we will learn in this tutorial:
All sample codes and results for this module are available in ./04_data_QC
QC Step Summary
QC step Option in PLINK Commonly used threshold to exclude Sample missing rate--geno
, --missing
missing rate > 0.01 (0.02, or 0.05) SNP missing rate --mind
, --missing
missing rate > 0.01 (0.02, or 0.05) Minor allele frequency --freq
, --maf
maf < 0.01 Sample Relatedness --genome
pi_hat > 0.2 to exclude second-degree relatives Hardy-Weinberg equilibrium --hwe
,--hardy
hwe < 1e-6 Inbreeding F coefficient --het
outside of 3 SD from the mean First, we can calculate some basic statistics of our simulated data:
"},{"location":"04_Data_QC/#missing-rate-call-rate","title":"Missing rate (call rate)","text":"The first thing we want to know is the missing rate of our data. Usually, we need to check the missing rate of samples and SNPs to decide a threshold to exclude low-quality samples and SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#missing)
Missing rate and Call rate
Suppose we have N samples and M SNPs for each sample.
For sample \\(j\\) :
\\[Sample\\ Missing\\ Rate_{j} = {{N_{missing\\ SNPs\\ for\\ j}}\\over{M}} = 1 - Call\\ Rate_{sample, j}\\]For SNP \\(i\\) :
\\[SNP\\ Missing\\ Rate_{i} = {{N_{missing\\ samples\\ at\\ i}}\\over{N}} = 1 - Call\\ Rate_{SNP, i}\\]The input is PLINK bed/bim/fam file. Usually, they have the same prefix, and we just need to pass the prefix to --bfile
option.
PLINK syntax
To calculate the missing rate, we need the flag --missing
, which tells PLINK to calculate the missing rate in the dataset specified by --bfile
.
Calculate missing rate
cd ../04_Data_QC\ngenotypeFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" #!!! Please add your own path here. \"1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" is the prefix of PLINK bed file. \n\nplink \\\n --bfile ${genotypeFile} \\\n --missing \\\n --out plink_results\n
Remeber to set the value for ${genotypeFile}
. This code will generate two files plink_results.imiss
and plink_results.lmiss
, which contain the missing rate information for samples and SNPs respectively.
Take a look at the .imiss
file. The last column shows the missing rate for samples. Since we used part of the 1000 Genome Project data this time, there are no missing SNPs in the original datasets. But for educational purposes, we randomly make some of the genotypes missing.
# missing rate for each sample\nhead plink_results.imiss\n FID IID MISS_PHENO N_MISS N_GENO F_MISS\nHG00403 HG00403 Y 10020 1235116 0.008113\nHG00404 HG00404 Y 9192 1235116 0.007442\nHG00406 HG00406 Y 15751 1235116 0.01275\nHG00407 HG00407 Y 14653 1235116 0.01186\nHG00409 HG00409 Y 5667 1235116 0.004588\nHG00410 HG00410 Y 6066 1235116 0.004911\nHG00419 HG00419 Y 20000 1235116 0.01619\nHG00421 HG00421 Y 17542 1235116 0.0142\nHG00422 HG00422 Y 18608 1235116 0.01507\n
# missing rate for each SNP\nhead plink_results.lmiss\n CHR SNP N_MISS N_GENO F_MISS\n 1 1:14930:A:G 2 504 0.003968\n 1 1:15774:G:A 3 504 0.005952\n 1 1:15777:A:G 3 504 0.005952\n 1 1:57292:C:T 6 504 0.0119\n 1 1:77874:G:A 3 504 0.005952\n 1 1:87360:C:T 1 504 0.001984\n 1 1:92917:T:A 7 504 0.01389\n 1 1:104186:T:C 3 504 0.005952\n 1 1:125271:C:T 2 504 0.003968\n
Distribution of sample missing rate and SNP missing rate
Note: The missing values were simulated based on normal distributions for each individual.
Sample missing rate
SNP missing rate
For the meaning of headers, please refer to PLINK documents.
"},{"location":"04_Data_QC/#allele-frequency","title":"Allele Frequency","text":"One of the most important statistics of SNPs is their frequency in a certain population. Many downstream analyses are based on investigating differences in allele frequencies.
Usually, variants can be categorized into 3 groups based on their Minor Allele Frequency (MAF):
How to calculate Minor Allele Frequency (MAF)
Suppose the reference allele(REF) is A and the alternative allele(ALT) is B for a certain SNP. The posible genotypes are AA, AB and BB. In a population of N samples (2N alleles), \\(N = N_{AA} + 2 \\times N_{AB} + N_{BB}\\) :
So we can calculate the allele frequency:
The MAF for this SNP in this specific population is defined as:
\\(MAF = min( AF_{REF}, AF_{ALT} )\\)
For different downstream analyses, we might use different sets of variants. For example, for PCA, we might use only common variants. For gene-based tests, we might use only rare variants.
Using PLINK1.9 we can easily calculate the MAF of variants in the input data.
Calculate the MAF of variants using PLINK1.9
plink \\\n --bfile ${genotypeFile} \\\n --freq \\\n --out plink_results\n
# results from plink1.9\nhead plink_results.frq\nCHR SNP A1 A2 MAF NCHROBS\n1 1:14930:A:G G A 0.4133 1004\n1 1:15774:G:A A G 0.02794 1002\n1 1:15777:A:G G A 0.07385 1002\n1 1:57292:C:T T C 0.1054 996\n1 1:77874:G:A A G 0.01996 1002\n1 1:87360:C:T T C 0.02286 1006\n1 1:92917:T:A A T 0.003018 994\n1 1:104186:T:C T C 0.499 1002\n1 1:125271:C:T C T 0.03088 1004\n
Next, we use plink2 to run the same options to check the difference between the results.
Calculate the alternative allele frequencies of variants using PLINK2
plink2 \\\n --bfile ${genotypeFile} \\\n --freq \\\n --out plink_results\n
# results from plink2\nhead plink_results.afreq\n#CHROM ID REF ALT PROVISIONAL_REF? ALT_FREQS OBS_CT\n1 1:14930:A:G A G Y 0.413347 1004\n1 1:15774:G:A G A Y 0.0279441 1002\n1 1:15777:A:G A G Y 0.0738523 1002\n1 1:57292:C:T C T Y 0.105422 996\n1 1:77874:G:A G A Y 0.0199601 1002\n1 1:87360:C:T C T Y 0.0228628 1006\n1 1:92917:T:A T A Y 0.00301811 994\n1 1:104186:T:C T C Y 0.500998 1002\n1 1:125271:C:T C T Y 0.969124 1004\n
We need to pay attention to the concepts here.
In PLINK1.9, the concept here is minor (A1) and major(A2) allele, while in PLINK2 it is the reference (REF) allele and the alternative (ALT) allele.
For SNP QC, besides checking the missing rate, we also need to check if the SNP is in Hardy-Weinberg equilibrium:
--hardy
will perform Hardy-Weinberg equilibrium exact test for each variant. Variants with low P value usually suggest genotyping errors, or indicate evolutionary selection for these variants.
The following command can calculate the Hardy-Weinberg equilibrium exact test statistics for all SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#hardy)
Info
Suppose we have N unrelated samples (2N alleles). Under HWE, the exact probability of observing \\(n_{AB}\\) sample with genotype AB in N samples is:
\\[P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}}N!\\over{n_{AA}!n_{AB}!n_{BB}!}} \\times {{n_A!n_B!}\\over{n_A!n_B!}} \\]To compute the Hardy-Weinberg equilibrium exact test statistics, we will sum up the probabilities of all configurations with probability equal to or less than the observed configuration :
\\[P_{HWE} = \\sum_{n^{*}_AB} I[P(N_{AB} = n_{AB} | N, n_A) \\geqq P(N_{AB} = n^{*}_{AB} | N, n_A)] \\times P(N_{AB} = n^{*}_{AB} | N, n_A)\\]\\(I(x)\\) is the indicator function. If x is true, \\(I(x) = 1\\); otherwise, \\(I(x) = 0\\).
Reference : Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link
Calculate the Hardy-Weinberg equilibrium exact test statistics for a single SNP using Python
This code is converted from here (Jeremy McRae) to python. Orginal citation: Wigginton, JE, Cutler, DJ, and Abecasis, GR (2005) A Note on Exact Tests of Hardy-Weinberg Equilibrium. AJHG 76: 887-893
def snphwe(obs_hets, obs_hom1, obs_hom2):\n obs_homr = min(obs_hom1, obs_hom2)\n obs_homc = max(obs_hom1, obs_hom2)\n\n rare = 2 * obs_homr + obs_hets\n genotypes = obs_hets + obs_homc + obs_homr\n\n probs = [0.0 for i in range(rare +1)]\n\n mid = rare * (2 * genotypes - rare) // (2 * genotypes)\n if mid % 2 != rare%2:\n mid += 1\n\n probs[mid] = 1.0\n sum_p = 1 #probs[mid]\n\n curr_homr = (rare - mid) // 2\n curr_homc = genotypes - mid - curr_homr\n\n for curr_hets in range(mid, 1, -2):\n probs[curr_hets - 2] = probs[curr_hets] * curr_hets * (curr_hets - 1.0)/ (4.0 * (curr_homr + 1.0) * (curr_homc + 1.0))\n sum_p+= probs[curr_hets - 2]\n curr_homr += 1\n curr_homc += 1\n\n curr_homr = (rare - mid) // 2\n curr_homc = genotypes - mid - curr_homr\n\n for curr_hets in range(mid, rare-1, 2):\n probs[curr_hets + 2] = probs[curr_hets] * 4.0 * curr_homr * curr_homc/ ((curr_hets + 2.0) * (curr_hets + 1.0))\n sum_p += probs[curr_hets + 2]\n curr_homr -= 1\n curr_homc -= 1\n\n target = probs[obs_hets]\n p_hwe = 0.0\n for p in probs:\n if p <= target :\n p_hwe += p / sum_p \n\n return min(p_hwe,1)\n
Calculate the Hardy-Weinberg equilibrium exact test statistics using PLINK
plink \\\n --bfile ${genotypeFile} \\\n --hardy \\\n --out plink_results\n
head plink_results.hwe\n CHR SNP TEST A1 A2 GENO O(HET) E(HET) P\n1 1:14930:A:G ALL(NP) G A 4/407/91 0.8108 0.485 4.864e-61\n1 1:15774:G:A ALL(NP) A G 0/28/473 0.05589 0.05433 1\n1 1:15777:A:G ALL(NP) G A 1/72/428 0.1437 0.1368 0.5053\n1 1:57292:C:T ALL(NP) T C 3/99/396 0.1988 0.1886 0.3393\n1 1:77874:G:A ALL(NP) A G 0/20/481 0.03992 0.03912 1\n1 1:87360:C:T ALL(NP) T C 0/23/480 0.04573 0.04468 1\n1 1:92917:T:A ALL(NP) A T 0/3/494 0.006036 0.006018 1\n1 1:104186:T:C ALL(NP) T C 74/352/75 0.7026 0.5 6.418e-20\n1 1:125271:C:T ALL(NP) C T 1/29/472 0.05777 0.05985 0.3798\n
"},{"location":"04_Data_QC/#applying-filters","title":"Applying filters","text":"Previously we calculated the basic statistics using PLINK. But when performing certain analyses, we just want to exclude the bad-quality samples or SNPs instead of calculating the statistics for all samples and SNPs.
In this case we can apply the following filters for example:
--maf 0.01
: exlcude snps with maf<0.01--geno 0.02
:filters out all variants with missing rates exceeding 0.02--mind 0.02
:filters out all samples with missing rates exceeding 0.02--hwe 1e-6
: filters out all variants which have Hardy-Weinberg equilibrium exact test p-value below the provided threshold. NOTE: With case/control data, cases and missing phenotypes are normally ignored. (see https://www.cog-genomics.org/plink/1.9/filter#hwe)We will apply these filters in the following example if LD-pruning.
"},{"location":"04_Data_QC/#ld-pruning","title":"LD Pruning","text":"There is often strong Linkage disequilibrium(LD) among SNPs, for some analysis we don't need all SNPs and we need to remove the redundant SNPs to avoid bias in genetic estimations. For example, for relatedness estimation, we will use only LD-Pruned SNP set.
We can use --indep-pairwise 50 5 0.2
to filter out those in strong LD and keep only the independent SNPs.
Meaning of --indep-pairwise x y z
x
SNPsz
y
SNPs forward and repeat the procedure.Please check https://www.cog-genomics.org/plink/1.9/ld#indep for details.
Combined with the filters we just introduced, we can run:
Example
plink \\\n --bfile ${genotypeFile} \\\n --maf 0.01 \\\n --geno 0.02 \\\n --mind 0.02 \\\n --hwe 1e-6 \\\n --indep-pairwise 50 5 0.2 \\\n --out plink_results\n
This command generates two outputs: plink_results.prune.in
and plink_results.prune.out
plink_results.prune.in
is the independent set of SNPs we will use in the following analysis. You can check the PLINK log for how many variants were removed based on the filters you applied:
Total genotyping rate in remaining samples is 0.993916.\n108837 variants removed due to missing genotype data (--geno).\n--hwe: 9754 variants removed due to Hardy-Weinberg exact test.\n87149 variants removed due to minor allele threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1029376 variants and 501 people pass filters and QC.\n
Let's take a look at the LD-pruned SNP file. Basically, it just contains one SNP id per line.
head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
"},{"location":"04_Data_QC/#inbreeding-f-coefficient","title":"Inbreeding F coefficient","text":"Next, we can check the heterozygosity F of samples (https://www.cog-genomics.org/plink/1.9/basic_stats#ibc) :
-het
option will compute observed and expected autosomal homozygous genotype counts for each sample. Usually, we need to exclude individuals with high or low heterozygosity coefficients, which suggests that the sample might be contaminated.
Inbreeding F coefficient calculation by PLINK
\\[F = {{O(HOM) - E(HOM)}\\over{ M - E(HOM)}}\\]High F may indicate a relatively high level of inbreeding.
Low F may suggest the sample DNA was contaminated.
Performing LD-pruning beforehand since these calculations do not take LD into account.
Calculate inbreeding F coefficient
plink \\\n --bfile ${genotypeFile} \\\n --extract plink_results.prune.in \\\n --het \\\n --out plink_results\n
Check the output:
head plink_results.het\n FID IID O(HOM) E(HOM) N(NM) F\nHG00403 HG00403 180222 1.796e+05 217363 0.01698\nHG00404 HG00404 180127 1.797e+05 217553 0.01023\nHG00406 HG00406 178891 1.789e+05 216533 -0.0001138\nHG00407 HG00407 178992 1.79e+05 216677 -0.0008034\nHG00409 HG00409 179918 1.801e+05 218045 -0.006049\nHG00410 HG00410 179782 1.801e+05 218028 -0.009268\nHG00419 HG00419 178362 1.783e+05 215849 0.001315\nHG00421 HG00421 178222 1.785e+05 216110 -0.008288\nHG00422 HG00422 178316 1.784e+05 215938 -0.0022\n
A commonly used method is to exclude samples with heterozygosity F deviating more than 3 standard deviations (SD) from the mean. Some studies use a fixed threshold such as +-0.15 or +-0.2.
Usually we will use only LD-pruned SNPs for the calculation of F.
We can plot the distribution of F:
Distribution of \\(F_{het}\\) in sample data
Here we use +-0.1 as the \\(F_{het}\\) threshold for convenience.
Create sample list of individuals with extreme F using awk
# only one sample\nawk 'NR>1 && ($6>0.1 || $6<-0.1) {print $1,$2}' plink_results.het > high_het.sample\n
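As a cross-check, here is a minimal Python sketch of the mean +- 3 SD rule mentioned above (the file name plink_results.het and its column layout follow the --het output shown earlier; the 3 SD cutoff is a convention, not a fixed rule):

import pandas as pd

# read PLINK --het output (whitespace-delimited)
het = pd.read_csv("plink_results.het", sep=r"\s+")

# flag samples whose F deviates more than 3 SD from the mean
mean_f, sd_f = het["F"].mean(), het["F"].std()
outliers = het[(het["F"] > mean_f + 3 * sd_f) | (het["F"] < mean_f - 3 * sd_f)]

# write FID and IID for later use with plink --remove
outliers[["FID", "IID"]].to_csv("high_het.sample", sep=" ", index=False, header=False)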
"},{"location":"04_Data_QC/#sample-snp-filtering-extractexcludekeepremove","title":"Sample & SNP filtering (extract/exclude/keep/remove)","text":"Sometimes we will use only a subset of samples or SNPs included the original dataset. In this case, we can use --extract
or --exclude
to select or exclude SNPs from analysis, --keep
or --remove
to select or exclude samples.
For --keep
or --remove
, the input is the filename of a sample FID and IID file. For --extract
or --exclude
, the input is the filename of an SNP list file.
head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
"},{"location":"04_Data_QC/#ibd-pi_hat-kinship-coefficient","title":"IBD / PI_HAT / kinship coefficient","text":"--genome
will estimate IBS/IBD. Usually, for this analysis, we need to prune our data first since the strong LD will cause bias in the results. (This step is computationally intensive)
Combined with the --extract
, we can run:
How PLINK estimates IBD
The prior probability of IBS sharing can be modeled as:
\[P(I=i) = \sum_{z=0}^{2}P(I=i|Z=z)P(Z=z)\]So the proportion of alleles shared IBD (\(\hat{\pi}\)) can be estimated by:
\\[\\hat{\\pi} = {{P(Z=1)}\\over{2}} + P(Z=2)\\]Estimate IBD
plink \\\n --bfile ${genotypeFile} \\\n --extract plink_results.prune.in \\\n --genome \\\n --out plink_results\n
PI_HAT is the IBD estimation. Please check https://www.cog-genomics.org/plink/1.9/ibd for more details.
head plink_results.genome\n FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\nHG00403 HG00403 HG00404 HG00404 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.858562 0.3679 1.9774\nHG00403 HG00403 HG00406 HG00406 UN NA 0.9805 0.0044 0.0151 0.0173 -1 0.858324 0.8183 2.0625\nHG00403 HG00403 HG00407 HG00407 UN NA 0.9790 0.0000 0.0210 0.0210 -1 0.857794 0.8034 2.0587\nHG00403 HG00403 HG00409 HG00409 UN NA 0.9912 0.0000 0.0088 0.0088 -1 0.857024 0.2637 1.9578\nHG00403 HG00403 HG00410 HG00410 UN NA 0.9699 0.0235 0.0066 0.0184 -1 0.858194 0.6889 2.0335\nHG00403 HG00403 HG00419 HG00419 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.857643 0.8597 2.0745\nHG00403 HG00403 HG00421 HG00421 UN NA 0.9773 0.0218 0.0010 0.0118 -1 0.857276 0.2186 1.9484\nHG00403 HG00403 HG00422 HG00422 UN NA 0.9880 0.0000 0.0120 0.0120 -1 0.857224 0.8277 2.0652\nHG00403 HG00403 HG00428 HG00428 UN NA 0.9801 0.0069 0.0130 0.0164 -1 0.858162 0.9812 2.1471\n
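As a quick sanity check on the output above, a small Python sketch that lists putatively related pairs (the PI_HAT > 0.2 cutoff used here is a common convention for flagging up to roughly second-degree relatives, chosen only for illustration):

import pandas as pd

# read PLINK --genome output (whitespace-delimited)
genome = pd.read_csv("plink_results.genome", sep=r"\s+")

# flag pairs whose estimated IBD proportion exceeds the assumed cutoff
related = genome[genome["PI_HAT"] > 0.2]
print(related[["FID1", "IID1", "FID2", "IID2", "PI_HAT"]])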
KING-robust kinship estimator
PLINK2 uses the KING-robust kinship estimator, which is more robust in the presence of population substructure. See here.
Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W. M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.
Since the samples are unrelated, we do not need to remove any samples at this step. But remember to check this for your dataset.
"},{"location":"04_Data_QC/#ld-calculation","title":"LD calculation","text":"We can also use our data to estimate the LD between a pair of SNPs.
Details on LD can be found here
--chr
option in PLINK allows us to include only SNPs on a specific chromosome. To calculate LD r2 for SNPs on chr22, we can run:
Example
plink \\\n --bfile ${genotypeFile} \\\n --chr 22 \\\n --r2 \\\n --out plink_results\n
head plink_results.ld\n CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2\n22 16069141 22:16069141:C:G 22 16071624 22:16071624:A:G 0.771226\n22 16069784 22:16069784:A:T 22 16149743 22:16149743:T:A 0.217197\n22 16069784 22:16069784:A:T 22 16150589 22:16150589:C:A 0.224992\n22 16069784 22:16069784:A:T 22 16159060 22:16159060:G:A 0.2289\n22 16149743 22:16149743:T:A 22 16150589 22:16150589:C:A 0.965109\n22 16149743 22:16149743:T:A 22 16152606 22:16152606:T:C 0.692157\n22 16149743 22:16149743:T:A 22 16159060 22:16159060:G:A 0.721796\n22 16149743 22:16149743:T:A 22 16193549 22:16193549:C:T 0.336477\n22 16149743 22:16149743:T:A 22 16212542 22:16212542:C:T 0.442424\n
"},{"location":"04_Data_QC/#data-management-make-bedrecode","title":"Data management (make-bed/recode)","text":"By far the input data we use is in binary form, but sometimes we may want the text version.
Info
To convert the formats, we can run:
Convert PLINK formats
#extract the 1000 samples with the pruned SNPs, and make a bed file.\nplink \\\n --bfile ${genotypeFile} \\\n --extract plink_results.prune.in \\\n --make-bed \\\n --out plink_1000_pruned\n\n#convert the bed/bim/fam to ped/map\nplink \\\n --bfile plink_1000_pruned \\\n --recode \\\n --out plink_1000_pruned\n
"},{"location":"04_Data_QC/#apply-all-the-filters-to-obtain-a-clean-dataset","title":"Apply all the filters to obtain a clean dataset","text":"We can then apply the filters and remove samples with high \\(F_{het}\\) to get a clean dataset for later use.
plink \\\n --bfile ${genotypeFile} \\\n --maf 0.01 \\\n --geno 0.02 \\\n --mind 0.02 \\\n --hwe 1e-6 \\\n --remove high_het.sample \\\n --keep-allele-order \\\n --make-bed \\\n --out sample_data.clean\n
1224104 variants and 500 people pass filters and QC.\n
-rw-r--r-- 1 yunye yunye 146M Dec 26 15:40 sample_data.clean.bed\n-rw-r--r-- 1 yunye yunye 39M Dec 26 15:40 sample_data.clean.bim\n-rw-r--r-- 1 yunye yunye 13K Dec 26 15:40 sample_data.clean.fam\n
"},{"location":"04_Data_QC/#other-common-qc-steps-not-included-in-this-tutorial","title":"Other common QC steps not included in this tutorial","text":"Learn the meaning of each QC step.
Visualize the results of QC (using Python or R)
PCA aims to find the orthogonal directions of maximum variance and project the data onto a new subspace with equal or fewer dimensions than the original one. Simply speaking, GRM (genetic relationship matrix; covariance matrix) is first estimated and then PCA is applied to this matrix to generate eigenvectors and eigenvalues. Finally, the \\(k\\) eigenvectors with the largest eigenvalues are used to transform the genotypes to a new feature subspace.
Genetic relationship matrix (GRM)
Citation: Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.
A simple PCA
Source data:
import numpy as np\n\ncov = np.array([[6, -3], [-3, 3.5]])\npts = np.random.multivariate_normal([0, 0], cov, size=800)\n
The red arrow shows the first principal component axis (PC1) and the blue arrow shows the second principal component axis (PC2). The two axes are orthogonal.
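A minimal numpy sketch of how these two axes can be computed from the simulated points, mirroring the estimate-covariance-then-eigendecompose logic described above:

import numpy as np

# (continuing from the simulated pts above)
sample_cov = np.cov(pts, rowvar=False)                  # estimate the 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(sample_cov)  # eigenvalues in ascending order
pc1 = eigenvectors[:, -1]                               # direction of maximum variance (PC1)
pc2 = eigenvectors[:, -2]                               # orthogonal direction (PC2)
projected = pts @ np.column_stack([pc1, pc2])           # transform points to the PC subspace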
Interpretation of PCs
The first principal component of a set of p variables, presumed to be jointly normally distributed, is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p iterations until all the variance is explained.
PCA is by far the most commonly used dimensionality reduction approach in population genetics; it can identify differences in ancestry among the sample individuals. Population outliers can then be excluded from the main cluster. For GWAS, we also need to include the top PCs as covariates to adjust for population stratification.
Please read the following paper on how we apply PCA to genetic data: Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904\u2013909 (2006). https://doi.org/10.1038/ng1847 https://www.nature.com/articles/ng1847
So, before association analysis, we will first learn how to run PCA.
For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data.
The reason why we want to exclude such high-LD or HLA regions
You can simply copy the list of high-LD or HLA regions for the matching genome build (.bed format) to a text file high-ld.txt
.
High LD regions were obtained from
https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)
High LD regions of hg19
high-ld-hg19.txt1 48000000 52000000 highld\n2 86000000 100500000 highld\n2 134500000 138000000 highld\n2 183000000 190000000 highld\n3 47500000 50000000 highld\n3 83500000 87000000 highld\n3 89000000 97500000 highld\n5 44500000 50500000 highld\n5 98000000 100500000 highld\n5 129000000 132000000 highld\n5 135500000 138500000 highld\n6 25000000 35000000 highld\n6 57000000 64000000 highld\n6 140000000 142500000 highld\n7 55000000 66000000 highld\n8 7000000 13000000 highld\n8 43000000 50000000 highld\n8 112000000 115000000 highld\n10 37000000 43000000 highld\n11 46000000 57000000 highld\n11 87500000 90500000 highld\n12 33000000 40000000 highld\n12 109500000 112000000 highld\n20 32000000 34500000 highld\n
"},{"location":"05_PCA/#create-a-list-of-snps-in-high-ld-or-hla-regions","title":"Create a list of SNPs in high-LD or HLA regions","text":"Next, use high-ld.txt
to extract all SNPs that are located in the regions described in the file using the code as follows:
plink --bfile ${plinkFile} --make-set high-ld.txt --write-set --out hild\n
Create a list of SNPs in the regions specified in high-ld.txt
plinkFile=\"../04_Data_QC/sample_data.clean\"\n\nplink \\\n --bfile ${plinkFile} \\\n --make-set high-ld-hg19.txt \\\n --write-set \\\n --out hild\n
And all SNPs in the regions will be extracted to hild.set.
$head hild.set\nhighld\n1:48000156:C:G\n1:48002096:C:G\n1:48003081:T:C\n1:48004776:C:T\n1:48006500:A:G\n1:48006546:C:T\n1:48008102:T:G\n1:48009994:C:T\n1:48009997:C:A\n
For downstream analysis, we can exclude these SNPs using --exclude hild.set
.
Steps to perform a typical genomic PCA analysis: (1) LD-prune the QCed genotypes, excluding high-LD and HLA regions; (2) remove related samples; (3) run PCA on the remaining unrelated samples; (4) project the PCs onto all samples.
MAF filter for LD-pruning and PCA
For LD-pruning and PCA, we usually only use variants with MAF > 0.01 or MAF>0.05 ( --maf 0.01
or --maf 0.05
) for robust estimation.
Sample codes for performing PCA
plinkFile=\"\" #please set this to your own path\noutPrefix=\"plink_results\"\nthreadnum=2\nhildset = hild.set \n\n# LD-pruning, excluding high-LD and HLA regions\nplink2 \\\n --bfile ${plinkFile} \\\n --maf 0.01 \\\n --threads ${threadnum} \\\n --exclude ${hildset} \\ \n --indep-pairwise 500 50 0.2 \\\n --out ${outPrefix}\n\n# Remove related samples using king-cuttoff\nplink2 \\\n --bfile ${plinkFile} \\\n --extract ${outPrefix}.prune.in \\\n --king-cutoff 0.0884 \\\n --threads ${threadnum} \\\n --out ${outPrefix}\n\n# PCA after pruning and removing related samples\nplink2 \\\n --bfile ${plinkFile} \\\n --keep ${outPrefix}.king.cutoff.in.id \\\n --extract ${outPrefix}.prune.in \\\n --freq counts \\\n --threads ${threadnum} \\\n --pca approx allele-wts 10 \\ \n --out ${outPrefix}\n\n# Projection (related and unrelated samples)\nplink2 \\\n --bfile ${plinkFile} \\\n --threads ${threadnum} \\\n --read-freq ${outPrefix}.acount \\\n --score ${outPrefix}.eigenvec.allele 2 5 header-read no-mean-imputation variance-standardize \\\n --score-col-nums 6-15 \\\n --out ${outPrefix}_projected\n
--pca
and --pca approx
For step 3, please note that approx
flag is only recommended for analysis of >5000 samples. (It was applied in the sample code anyway because in real analysis you usually have a much larger sample size, though the sample size of our data is just ~500)
After step 3, the allele-wts 10
modifier requests an additional one-line-per-allele .eigenvec.allele
file with the first 10 PCs
expressed as allele weights instead of sample weights.
We will get the plink_results.eigenvec.allele
file, which will be used to project onto all samples along with an allele count plink_results.acount
file.
In the projection, score ${outPrefix}.eigenvec.allele 2 5
sets the ID
(2nd column) and A1
(5th column), score-col-nums 6-15
sets the first 10 PCs to be projected.
Please check https://www.cog-genomics.org/plink/2.0/score#pca_project for more details on the projection.
Allele weight and count files
plink_results.eigenvec.allele#CHROM ID REF ALT PROVISIONAL_REF? A1 PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10\n1 1:15774:G:A G A Y G 0.57834 -1.03002 0.744557 -0.161887 0.389223 -0.0514592 0.133195 -0.0336162 -0.846376 0.0542876\n1 1:15774:G:A G A Y A -0.57834 1.03002 -0.744557 0.161887 -0.389223 0.0514592 -0.133195 0.0336162 0.846376 -0.0542876\n1 1:15777:A:G A G Y A -0.585215 0.401872 -0.393071 -1.79583 0.89579 -0.700882 -0.103729 -0.694495 -0.007313 0.513223\n1 1:15777:A:G A G Y G 0.585215 -0.401872 0.393071 1.79583 -0.89579 0.700882 0.103729 0.694495 0.007313 -0.513223\n1 1:57292:C:T C T Y C -0.123768 0.912046 -0.353606 -0.220148 -0.893017 -0.374505 -0.141002 -0.249335 0.625097 0.206104\n1 1:57292:C:T C T Y T 0.123768 -0.912046 0.353606 0.220148 0.893017 0.374505 0.141002 0.249335 -0.625097 -0.206104\n1 1:77874:G:A G A Y G 1.49202 -1.12567 1.19915 0.0755314 0.401134 -0.015842 0.0452086 0.273072 -0.00716098 0.237545\n1 1:77874:G:A G A Y A -1.49202 1.12567 -1.19915 -0.0755314 -0.401134 0.015842 -0.0452086 -0.273072 0.00716098 -0.237545\n1 1:87360:C:T C T Y C -0.191803 0.600666 -0.513208 -0.0765155 -0.656552 0.0930399 -0.0238774 -0.330449 -0.192037 -0.727729\n
plink_results.acount#CHROM ID REF ALT PROVISIONAL_REF? ALT_CTS OBS_CT\n1 1:15774:G:A G A Y 28 994\n1 1:15777:A:G A G Y 73 994\n1 1:57292:C:T C T Y 104 988\n1 1:77874:G:A G A Y 19 994\n1 1:87360:C:T C T Y 23 998\n1 1:125271:C:T C T Y 967 996\n1 1:232449:G:A G A Y 185 996\n1 1:533113:A:G A G Y 129 992\n1 1:565697:A:G A G Y 334 996\n
Eventually, we will get the PCA results for all samples.
PCA results for all samples
plink_results_projected.sscore#FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG\nHG00403 HG00403 390256 390256 0.00290265 -0.0248649 0.0100408 0.00957591 0.00694349 -0.00222251 0.0082228 -0.00114937 0.00335249 0.00437471\nHG00404 HG00404 390696 390696 -0.000141221 -0.027965 0.025389 -0.00582538 -0.00274707 0.00658501 0.0113803 0.0077766 0.0159976 0.0178927\nHG00406 HG00406 388524 388524 0.00707397 -0.0315445 -0.00437011 -0.0012621 -0.0114932 -0.00539483 -0.00620153 0.00452379 -0.000870627 -0.00227979\nHG00407 HG00407 388808 388808 0.00683977 -0.025073 -0.00652723 0.00679729 -0.0116 -0.0102328 0.0139572 0.00618677 0.0138063 0.00825269\nHG00409 HG00409 391646 391646 0.000398695 -0.0290334 -0.0189352 -0.00135977 0.0290436 0.00942829 -0.0171194 -0.0129637 0.0253596 0.022907\nHG00410 HG00410 391600 391600 0.00277094 -0.0280021 -0.0209991 -0.00799085 0.0318038 -0.00284209 -0.031517 -0.0010026 0.0132541 0.0357565\nHG00419 HG00419 387118 387118 0.00684154 -0.0326244 0.00237159 0.0167284 -0.0119737 -0.0079637 -0.0144339 0.00712756 0.0114292 0.00404426\nHG00421 HG00421 387720 387720 0.00157095 -0.0338115 -0.00690541 0.0121058 0.00111378 0.00530794 -0.0017545 -0.00121793 0.00393407 0.00414204\nHG00422 HG00422 387466 387466 0.00439167 -0.0332386 0.000741526 0.0124843 -0.00362248 -0.00343393 -0.00735112 0.00944759 -0.0107516 0.00376537\n
"},{"location":"05_PCA/#plotting-the-pcs","title":"Plotting the PCs","text":"You can now create scatterplots of the PCs using R or Python.
For plotting using Python: plot_PCA.ipynb
Scatter plot of PC1 and PC2 using 1KG EAS individuals
Note : We only used a small proportion of all available variants. This figure only very roughly shows the population structure in East Asia.
Requirements: - python>3 - numpy,pandas,seaborn,matplotlib
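If you prefer a quick script over the notebook, a minimal sketch assuming the plink_results_projected.sscore file produced in the previous step:

import pandas as pd
import matplotlib.pyplot as plt

# read the projected PC scores (whitespace-delimited PLINK2 .sscore file)
scores = pd.read_csv("plink_results_projected.sscore", sep=r"\s+")

plt.scatter(scores["PC1_AVG"], scores["PC2_AVG"], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("pca_pc1_pc2.png", dpi=300)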
"},{"location":"05_PCA/#pca-umap","title":"PCA-UMAP","text":"(optional) We can also apply another non-linear dimension reduction algorithm called UMAP to the PCs to further identify the local structures. (PCA-UMAP)
For more details, please check: - https://umap-learn.readthedocs.io/en/latest/index.html
An example of PCA and PCA-UMAP for population genetics: - Sakaue, S., Hirata, J., Kanai, M., Suzuki, K., Akiyama, M., Lai Too, C., ... & Okada, Y. (2020). Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nature communications, 11(1), 1-11.
"},{"location":"05_PCA/#references","title":"References","text":"To test the association between a phenotype and genotypes, we need to group the genotypes based on genetic models.
There are three basic genetic models:
Three genetic models
For example, suppose we have a biallelic SNP whose reference allele is A and the alternative allele is G.
There are three possible genotypes for this SNP: AA, AG, and GG.
This table shows how we group different genotypes under each genetic model for association tests using linear or logistic regressions.
Genetic models AA AG GG Additive model 0 1 2 Dominant model 0 1 1 Recessive model 0 0 1Contingency table and non-parametric tests
A simple way to test association is to use the 2x2 or 2x3 contingency table. For dominant and recessive models, Chi-square tests are performed using the 2x2 table. For the additive model, Cochran-Armitage trend tests are performed for the 2x3 table. However, the non-parametric tests do not adjust for the bias caused by other covariates like sex, age and so forth.
"},{"location":"06_Association_tests/#association-testing-basics","title":"Association testing basics","text":"For quantitative traits, we can employ a simple linear regression model to test associations:
\\[ y = G\\beta_G + X\\beta_X + e \\]Interpretation of linear regression
For binary traits, we can utilize the logistic regression model to test associations:
\\[ logit(p) = G\\beta_G + X\\beta_X + e \\]Linear regression and logistic regression
"},{"location":"06_Association_tests/#file-preparation","title":"File Preparation","text":"To perform genome-wide association tests, usually, we need the following files:
Phenotype and covariate files
Phenotype file for a simulated binary trait; B1 is the phenotype name; 1 means the control, 2 means the case.
1kgeas_binary.txtFID IID B1\nHG00403 HG00403 1\nHG00404 HG00404 2\nHG00406 HG00406 1\nHG00407 HG00407 1\nHG00409 HG00409 2\nHG00410 HG00410 2\nHG00419 HG00419 1\nHG00421 HG00421 1\nHG00422 HG00422 1\n\nCovariate file (only top PCs calculated in the previous PCA section)\n\n```txt title=\"plink_results_projected.sscore\"\n#FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVGPC9_AVG PC10_AVG\nHG00403 HG00403 390256 390256 0.00290265 -0.0248649 -0.0100407 0.00957595 0.00694056 0.00222996 0.00823028 0.00116497 -0.00334937 0.00434627\nHG00404 HG00404 390696 390696 -0.000141221 -0.027965 -0.025389 -0.00582553 -0.00274711 -0.00657958 0.0113769 -0.00778919 -0.0159685 0.0180678\nHG00406 HG00406 388524 388524 0.00707397 -0.0315445 0.00437013 -0.00126195 -0.0114938 0.00538932 -0.00619657 -0.00454686 0.000969112 -0.00217617\nHG00407 HG00407 388808 388808 0.00683977 -0.025073 0.00652723 0.00679731 -0.0116001 0.0102403 0.0139674 -0.00621948 -0.013797 0.00827744\nHG00409 HG00409 391646 391646 0.000398695 -0.0290334 0.0189352 -0.00135996 0.0290464 -0.00941851 -0.0171911 0.01293 -0.0252628 0.0230819\nHG00410 HG00410 391600 391600 0.00277094 -0.0280021 0.0209991 -0.00799089 0.0318043 0.00283456 -0.0315157 0.000978664 -0.0133768 0.0356721\nHG00419 HG00419 387118 387118 0.00684154 -0.0326244 -0.00237159 0.0167284 -0.0119684 0.00795149 -0.0144241 -0.00716183 -0.0115059 0.0038652\nHG00421 HG00421 387720 387720 0.00157095 -0.0338115 0.00690542 0.0121058 0.00111448 -0.00531714 -0.00175494 0.00118513 -0.00391494 0.00414682\nHG00422 HG00422 387466 387466 0.00439167 -0.0332386 -0.000741482 0.0124843 -0.00362885 0.00342491 -0.0073205 -0.00939123 0.010718 0.00360906\n
"},{"location":"06_Association_tests/#association-tests-using-plink","title":"Association tests using PLINK","text":"Please check https://www.cog-genomics.org/plink/2.0/assoc for more details.
We will perform logistic regression with firth correction for a simulated binary trait under the additive model using the 1KG East Asian individuals.
Firth correction
Adding a penalty term to the log-likelihood function when fitting the logistic model results in less bias. - Firth, David. \"Bias reduction of maximum likelihood estimates.\" Biometrika 80.1 (1993): 27-38.
Quantitative traits
For quantitative traits, linear regressions will be performed and in this case, we do not need to add firth
(since Firth correction is not appliable).
Sample codes for association test using plink for binary traits
genotypeFile=\"../04_Data_QC/sample_data.clean\" # the clean dataset we generated in previous section\nphenotypeFile=\"../01_Dataset/1kgeas_binary.txt\" # the phenotype file\ncovariateFile=\"../05_PCA/plink_results_projected.sscore\" # the PC score file\n\ncovariateCols=6-10\ncolName=\"B1\"\nthreadnum=2\n\nplink2 \\\n --bfile ${genotypeFile} \\\n --pheno ${phenotypeFile} \\\n --pheno-name ${colName} \\\n --maf 0.01 \\\n --covar ${covariateFile} \\\n --covar-col-nums ${covariateCols} \\\n --glm hide-covar firth firth-residualize single-prec-cc \\\n --threads ${threadnum} \\\n --out 1kgeas\n
Note
Using the latest version of PLINK2, you need to add firth-residualize single-prec-cc
to generate the results. (The algorithm and precision have been changed since 2023 for firth regression)
You will see a similar log like:
Log
1kgeas.logPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to 1kgeas.log.\nOptions in effect:\n--bfile ../04_Data_QC/sample_data.clean\n--covar ../05_PCA/plink_results_projected.sscore\n--covar-col-nums 6-10\n--glm hide-covar firth firth-residualize single-prec-cc\n--maf 0.01\n--out 1kgeas\n--pheno ../01_Dataset/1kgeas_binary.txt\n--pheno-name B1\n--threads 2\n\nStart time: Tue Dec 26 15:52:10 2023\n31934 MiB RAM detected, ~30479 available; reserving 15967 MiB for main\nworkspace.\nUsing up to 2 compute threads.\n500 samples (0 females, 0 males, 500 ambiguous; 500 founders) loaded from\n../04_Data_QC/sample_data.clean.fam.\n1224104 variants loaded from ../04_Data_QC/sample_data.clean.bim.\n1 binary phenotype loaded (248 cases, 250 controls).\n5 covariates loaded from ../05_PCA/plink_results_projected.sscore.\nCalculating allele frequencies... done.\n95372 variants removed due to allele frequency threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1128732 variants remaining after main filters.\n--glm Firth regression on phenotype 'B1': done.\nResults written to 1kgeas.B1.glm.firth .\nEnd time: Tue Dec 26 15:53:49 2023\n
Let's check the first lines of the output:
Association test results
1kgeas.B1.glm.firth #CHROM POS ID REF ALT PROVISIONAL_REF? A1 OMITTED A1_FREQ TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 15774 1:15774:G:A G A Y A G 0.0282828 ADD 495 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 15777 1:15777:A:G A G Y G A 0.0737374 ADD 495 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 57292 1:57292:C:T C T Y T C 0.104675 ADD 492 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 77874 1:77874:G:A G A Y A G 0.0191532 ADD 496 1.12228 0.46275 0.249299 0.80313 .\n1 87360 1:87360:C:T C T Y T C 0.0231388 ADD 497 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 125271 1:125271:C:T C T Y C T 0.0292339 ADD 496 1.53387 0.373358 1.1458 0.25188 .\n1 232449 1:232449:G:A G A Y A G 0.185484 ADD 496 0.884097 0.168961 -0.729096 0.465943 .\n1 533113 1:533113:A:G A G Y G A 0.129555 ADD 494 0.90593 0.196631 -0.50243 0.615365 .\n1 565697 1:565697:A:G A G Y G A 0.334677 ADD 496 1.04653 0.15286 0.297509 0.766078 .\n
Usually, other options are added to enhance the sumstats
cols=
requests the following columns in the sumstats: here are allele1 frequency and (MaCH)Rsq, firth-fallback
will test the common variants without firth correction, which could improve the speed, omit-ref
will force the ALT==A1==effect allele, otherwise the minor allele would be tested (see the above result, which ALT may not equal A1).Genomic control (GC) is a basic method for controlling for confounding factors including population stratification.
We will calculate the genomic control factor (lambda GC) to evaluate the inflation. The genomic control factor is calculated by dividing the median of observed Chi square statistics by the median of Chi square distribution with the degree of freedom being 1 (which is approximately 0.455).
\\[ \\lambda_{GC} = {median(\\chi^{2}_{observed}) \\over median(\\chi^{2}_1)} \\]Then, we can used the genomic control factor to correct observed Chi suqare statistics.
\\[ \\chi^{2}_{corrected} = {\\chi^{2}_{observed} \\over \\lambda_{GC}} \\]Genomic inflation is based on the idea that most of the variants are not associated, thus no deviation between the observed and expected Chi square distribution, except the spikes at the end. However, if the trait is highly polygenic, this assumption may be violated.
Reference: Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.
"},{"location":"06_Association_tests/#significant-loci","title":"Significant loci","text":"Please check Visualization using gwaslab
Loci that reached genome-wide significance threshold (P value < 5e-8) :
SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT\n1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A\n2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T\n7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G\n20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C\n
Warning
This is just to show the analysis pipeline and data format. The trait was simulated under an unreal condition (effect sizes are extremely large) so the result is meaningless here.
Allele frequency and Effect size
"},{"location":"06_Association_tests/#visualization","title":"Visualization","text":"To visualize the sumstats, we will create the Manhattan plot, QQ plot and regional plot.
Please check for codes : Visualization using gwaslab
"},{"location":"06_Association_tests/#manhattan-plot","title":"Manhattan plot","text":"Manhattan plot is the most classic visualization of GWAS summary statistics. It is a form of scatter plot. Each dot represents the test result for a variant. variants are sorted by their genome coordinates and are aligned along the X axis. Y axis shows the -log10(P value) for tests of variants in GWAS.
Note
This kind of plot was named after Manhattan in New York City since it resembles the Manhattan skyline.
A real Manhattan plot
I took this photo in 2020 just before the COVID-19 pandemic. It was a cloudy and misty day. Those birds formed a significance threshold line. And the skyscrapers above that line resembled the significant signals in your GWAS. I believe you could easily get how the GWAS Manhattan plot was named.
Data we need from sumstats to create Manhattan plots:
Steps to create Manhattan plot
Quantile-quantile plot (also known as Q-Q plot), is commonly used to compare an observed distribution with its expected distribution. For a specific point (x,y) on Q-Q plot, its y coordinate corresponds to one of the quantiles of the observed distribution, while its x coordinate corresponds to the same quantile of the expected distribution.
Quantile-quantile plot is used to check if there is any significant inflation in P value distribution, which usually indicates population stratification or cryptic relatedness.
Data we need from sumstats to create the Manhattan plot:
Steps to create Q-Q plot
Suppose we have n
variants in our sumstats,
n
P value to -log10(P).n
numbers from (0,1)
with equal intervals.n
numbers to -log10(P) and sort in ascending order.Note
The expected distribution of P value is a Uniform distribution from 0 to 1.
\\[P_{expected} \\sim U(0,1)\\]"},{"location":"06_Association_tests/#regional-plot","title":"Regional plot","text":"Manhattan plot is very useful to check the overview of our sumstats. But if we want to check a specific genomic locus, we need a plot with finer resolution. This kind of plot is called a regional plot. It is basically the Manhattan plot of only a small region on the genome, with points colored by its LD r2 with the lead variant in this region.
Such a plot is especially helpful to understand the signal and loci, e.g., LD structure, independent signals, and genes.
The regional plot for the loci of 2:55513738:C:T.
Please check Visualization using gwaslab
"},{"location":"06_Association_tests/#gwas-ssf","title":"GWAS-SSF","text":"To standardize the format of GWAS summary statistics for sharing, GWAS-SSF format was proposed in 2022. This format is now used as the standard format for GWAS Catalog.
GWAS-SSF consists of :
Schematic representation of GWAS-SSF data file
GWAS-SSF
Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv, 2022-07.
For details, please check:
ANNOVAR is a simple and efficient command line tool for variant annotation.
In this tutorial, we will use ANNOVAR to annotate the variants in our summary statistics (hg19).
"},{"location":"07_Annotation/#install","title":"Install","text":"Download ANNOVAR from here (registration required; freely available to personal, academic and non-profit use only.)
You will receive an email with the download link after registration. Download it and decompress:
tar -xvzf annovar.latest.tar.gz\n
For refGene annotation for hg19, we do not need to download additional files.
"},{"location":"07_Annotation/#format-input-file","title":"Format input file","text":"The default input file for ANNOVAR is a 1-based coordinate file.
We will only use the first 100000 variants as an example.
annovar_input
awk 'NR>1 && NR<100000 {print $1,$2,$2,$4,$5}' ../06_Association_tests/1kgeas.B1.glm.logistic. hybrid > annovar_input.txt\n
head annovar_input.txt \n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
With -vcfinput
option, ANNOVAR can accept input files in VCF format.
Annotate the variants with gene information.
A minimal example of annotation using refGene
input=annovar_input.txt\nhumandb=/home/he/tools/annovar/annovar/humandb\ntable_annovar.pl ${input} ${humandb} -buildver hg19 -out myannotation -remove -protocol refGene -operation g -nastring . -polish\n
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange. refGene\n1 13273 13273 G C ncRNA_exonic DDX11L1;LOC102725121 . . .\n1 14599 14599 T A ncRNA_exonic WASH7P . . .\n1 14604 14604 A G ncRNA_exonic WASH7P . . .\n1 14930 14930 A G ncRNA_intronic WASH7P . . .\n1 69897 69897 T C exonic OR4F5 . synonymous SNV OR4F5:NM_001005484:exon1:c.T807C:p.S269S\n1 86331 86331 A G intergenic OR4F5;LOC729737 dist=16323;dist=48442 . .\n1 91581 91581 G A intergenic OR4F5;LOC729737 dist=21573;dist=43192 . .\n1 122872 122872 T G intergenic OR4F5;LOC729737 dist=52864;dist=11901 . .\n1 135163 135163 C T ncRNA_exonic LOC729737 . . .\n
"},{"location":"07_Annotation/#additional-databases","title":"Additional databases","text":"ANNOVAR supports a wide range of commonly used databases including dbsnp
, dbnsfp
, clinvar
, gnomad
, 1000g
, cadd
and so forth. For details, please check ANNOVAR's official documents
You can check the Table Name listed in the link above and download the database you need using the following command.
Example: Downloading avsnp150 for hg19 from ANNOVAR
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb/\n
An example of annotation using multiple databases
# input file is in vcf format\ntable_annovar.pl \\\n ${in_vcf} \\\n ${humandb} \\\n -buildver hg19 \\\n -protocol refGene,avsnp150,clinvar_20200316,gnomad211_exome \\\n -operation g,f,f,f \\\n -remove \\\n -out ${out_prefix} \\ \n -vcfinput\n
"},{"location":"07_Annotation/#vep-under-construction","title":"VEP (under construction)","text":""},{"location":"07_Annotation/#install_1","title":"Install","text":"git clone https://github.com/Ensembl/ensembl-vep.git\ncd ensembl-vep\nperl INSTALL.pl\n
Hello! This installer is configured to install v108 of the Ensembl API for use by the VEP.\nIt will not affect any existing installations of the Ensembl API that you may have.\n\nIt will also download and install cache files from Ensembl's FTP server.\n\nChecking for installed versions of the Ensembl API...done\n\nSetting up directories\nDestination directory ./Bio already exists.\nDo you want to overwrite it (if updating VEP this is probably OK) (y/n)? y\n - fetching BioPerl\n - unpacking ./Bio/tmp/release-1-6-924.zip\n - moving files\n\nDownloading required Ensembl API files\n - fetching ensembl\n - unpacking ./Bio/tmp/ensembl.zip\n - moving files\n - getting version information\n - fetching ensembl-variation\n - unpacking ./Bio/tmp/ensembl-variation.zip\n - moving files\n - getting version information\n - fetching ensembl-funcgen\n - unpacking ./Bio/tmp/ensembl-funcgen.zip\n - moving files\n - getting version information\n - fetching ensembl-io\n - unpacking ./Bio/tmp/ensembl-io.zip\n - moving files\n - getting version information\n\nTesting VEP installation\n - OK!\n\nThe VEP can either connect to remote or local databases, or use local cache files.\nUsing local cache files is the fastest and most efficient way to run the VEP\nCache files will be stored in /home/he/.vep\nDo you want to install any cache files (y/n)? y\n\nThe following species/files are available; which do you want (specify multiple separated by spaces or 0 for all): \n1 : acanthochromis_polyacanthus_vep_108_ASM210954v1.tar.gz (69 MB)\n2 : accipiter_nisus_vep_108_Accipiter_nisus_ver1.0.tar.gz (55 MB)\n...\n466 : homo_sapiens_merged_vep_108_GRCh37.tar.gz (16 GB)\n467 : homo_sapiens_merged_vep_108_GRCh38.tar.gz (26 GB)\n468 : homo_sapiens_refseq_vep_108_GRCh37.tar.gz (13 GB)\n469 : homo_sapiens_refseq_vep_108_GRCh38.tar.gz (22 GB)\n470 : homo_sapiens_vep_108_GRCh37.tar.gz (14 GB)\n471 : homo_sapiens_vep_108_GRCh38.tar.gz (22 GB)\n\n Total: 221 GB for all 471 files\n\n? 470\n - downloading https://ftp.ensembl.org/pub/release-108/variation/indexed_vep_cache/homo_sapiens_vep_108_GRCh37.tar.gz\n
"},{"location":"08_LDSC/","title":"LD score regression","text":""},{"location":"08_LDSC/#table-of-contents","title":"Table of Contents","text":"LDSC is one of the most commonly used command line tool to estimate inflation, hertability, genetic correlation and cell/tissue type specificity from GWAS summary statistics.
"},{"location":"08_LDSC/#ld-linkage-disequilibrium","title":"LD: Linkage disequilibrium","text":"Linkage disequilibrium (LD) : non-random association of alleles at different loci in a given population. (Wiki)
"},{"location":"08_LDSC/#ld-score","title":"LD score","text":"LD score \\(l_j\\) for a SNP \\(j\\) is defined as the sum of \\(r^2\\) for the SNP and other SNPs in a region.
\\[ l_j= \\Sigma_k{r^2_{j,k}} \\]"},{"location":"08_LDSC/#ld-score-regression_1","title":"LD score regression","text":"Key idea: A variant will have higher test statistics if it is in LD with causal variant, and the elevation is proportional to the correlation ( \\(r^2\\) ) with the causal variant.
\\[ E[\\chi^2|l_j] = {{Nh^2l_j}\\over{M}} + Na + 1 \\]For more details of LD score regression, please refer to : - Bulik-Sullivan, Brendan K., et al. \"LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.\" Nature genetics 47.3 (2015): 291-295.
"},{"location":"08_LDSC/#install-ldsc","title":"Install LDSC","text":"LDSC can be downloaded from github (GPL-3.0 license): https://github.com/bulik/ldsc
For ldsc, we need anaconda to create virtual environment (for python2). If you haven't installed Anaconda, please check how to install anaconda.
# change to your directory for tools\ncd ~/tools\n\n# clone the ldsc github repository\ngit clone https://github.com/bulik/ldsc.git\n\n# create a virtual environment for ldsc (python2)\ncd ldsc\nconda env create --file environment.yml \n\n# activate ldsc environment\nconda activate ldsc\n
"},{"location":"08_LDSC/#data-preparation","title":"Data Preparation","text":"In this tutoial, we will use sample summary statistics for HDLC and LDLC from Jenger. - Kanai, Masahiro, et al. \"Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases.\" Nature genetics 50.3 (2018): 390-400.
The Miami plot for the two traits:
"},{"location":"08_LDSC/#download-sample-summary-statistics","title":"Download sample summary statistics","text":"# HDL-c and LDL-c in Biobank Japan\nwget -O BBJ_LDLC.txt.gz http://jenger.riken.jp/61analysisresult_qtl_download/\nwget -O BBJ_HDLC.txt.gz http://jenger.riken.jp/47analysisresult_qtl_download/\n
"},{"location":"08_LDSC/#download-reference-files","title":"Download reference files","text":"# change to your ldsc directory\ncd ~/tools/ldsc\nmkdir resource\ncd ./resource\n\n# snplist\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2\n\n# EAS ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/eas_ldscores.tar.bz2\n\n# EAS weight\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz\n\n# EAS frequency\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_plinkfiles.tgz\n\n# EAS baseline model\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz\n\n# Cell type ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/LDSC_SEG_ldscores/Cahoy_EAS_1000Gv3_ldscores.tar.gz\n
You can then decompress the files and organize them."},{"location":"08_LDSC/#munge-sumstats","title":"Munge sumstats","text":"Before the analysis, we need to format and clean the raw sumstats.
Note
Rsid is used here. If the sumstats only contained id like CHR:POS:REF:ALT, annotate it first.
snplist=~/tools/ldsc/resource/w_hm3.snplist\nmunge_sumstats.py \\\n --sumstats BBJ_HDLC.txt.gz \\\n --merge-alleles $snplist \\\n --a1 ALT \\\n --a2 REF \\\n --chunksize 500000 \\\n --out BBJ_HDLC\nmunge_sumstats.py \\\n --sumstats BBJ_LDLC.txt.gz \\\n --a1 ALT \\\n --a2 REF \\\n --chunksize 500000 \\\n --merge-alleles $snplist \\\n --out BBJ_LDLC\n
After munging, you will get two munged and formatted files:
BBJ_HDLC.sumstats.gz\nBBJ_LDLC.sumstats.gz\n
And these are the files we will use to run LD score regression."},{"location":"08_LDSC/#ld-score-regression_2","title":"LD score regression","text":"Univariate LD score regression is utilized to estimate heritbility and confuding factors (cryptic relateness and population stratification) of a certain trait.
Using the munged sumstats, we can run:
ldsc.py \\\n --h2 BBJ_HDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_HDLC\n\nldsc.py \\\n --h2 BBJ_LDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_LDLC\n
Lest's check the results for HDLC:
cat BBJ_HDLC.log\n*********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--h2 BBJ_HDLC.sumstats.gz \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Sat Dec 24 20:40:34 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nUsing two-step estimator with cutoff at 30.\nTotal Observed scale h2: 0.1583 (0.0281)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.0563 (0.0114)\nRatio: 0.1981 (0.0402)\nAnalysis finished at Sat Dec 24 20:40:41 2022\nTotal time elapsed: 6.57s\n
We can see that from the log:
According to LDSC documents, Ratio measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of ratio should be close to zero, though in practice values of 10-20% are not uncommon.
\\[ Ratio = {{intercept-1}\\over{mean(\\chi^2)-1}} \\]"},{"location":"08_LDSC/#distribution-of-h2-and-intercept-across-traits-in-ukb","title":"Distribution of h2 and intercept across traits in UKB","text":"The Neale Lab estimated SNP heritability using LDSC across more than 4,000 primary GWAS in UKB. You can check the distributions of SNP heritability and intercept estimates using the following link to get the idea of what you can expect from LD score regresion:
https://nealelab.github.io/UKBB_ldsc/viz_h2.html
"},{"location":"08_LDSC/#cross-trait-ld-score-regression","title":"Cross-trait LD score regression","text":"Cross-trait LD score regression is employed to estimate the genetic correlation between a pair of traits.
Key idea: replace \\chi^2
in univariate LD score regression and the relationship (SNPs with high LD ) still holds.
Then we can get the genetic correlation by :
\\[ r_g = {{\\rho_g}\\over{\\sqrt{h_1^2h_2^2}}} \\]ldsc.py \\\n --rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_HDLC_LDLC\n
Let's check the results: *********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC_LDLC \\\n--rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Thu Dec 29 21:02:37 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nComputing rg for phenotype 2/2\nReading summary statistics from BBJ_LDLC.sumstats.gz ...\nRead summary statistics for 1217311 SNPs.\nAfter merging with summary statistics, 1012040 SNPs remain.\n1012040 SNPs with valid alleles.\n\nHeritability of phenotype 1\n---------------------------\nTotal Observed scale h2: 0.1054 (0.0383)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.1234 (0.0607)\nRatio: 0.4342 (0.2134)\n\nHeritability of phenotype 2/2\n-----------------------------\nTotal Observed scale h2: 0.0543 (0.0211)\nLambda GC: 1.0833\nMean Chi^2: 1.1465\nIntercept: 1.0583 (0.0335)\nRatio: 0.398 (0.2286)\n\nGenetic Covariance\n------------------\nTotal Observed scale gencov: 0.0121 (0.0106)\nMean z1*z2: -0.001\nIntercept: -0.0198 (0.0121)\n\nGenetic Correlation\n-------------------\nGenetic Correlation: 0.1601 (0.1821)\nZ-score: 0.8794\nP: 0.3792\n\n\nSummary of Genetic Correlation Results\np1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se\nBBJ_HDLC.sumstats.gz BBJ_LDLC.sumstats.gz 0.1601 0.1821 0.8794 0.3792 0.0543 0.0211 1.0583 0.0335 -0.0198 0.0121\n\nAnalysis finished at Thu Dec 29 21:02:47 2022\nTotal time elapsed: 10.39s\n
"},{"location":"08_LDSC/#partitioned-ld-regression","title":"Partitioned LD regression","text":"Partitioned LD regression is utilized to evaluate the contribution of each functional group to the total SNP heriatbility.
\\[ E[\\chi^2] = N \\sum\\limits_C \\tau_C l(j,C) + Na + 1 \\]\\(\\tau_C\\) : per-SNP contribution of category C to heritability
Reference: Finucane, Hilary K., et al. \"Partitioning heritability by functional annotation using genome-wide association summary statistics.\" Nature genetics 47.11 (2015): 1228-1235.
ldsc.py \\\n --h2 BBJ_HDLC.sumstats.gz \\\n --overlap-annot \\\n --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n --frqfile-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_plinkfiles/1000G.EAS.QC. \\\n --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n --out BBJ_HDLC_baseline\n
"},{"location":"08_LDSC/#celltype-specificity-ld-regression","title":"Celltype specificity LD regression","text":"LDSC-SEG : LD score regression applied to specifically expressed genes
An extension of Partitioned LD regression. Categories are defined by tissue or cell-type specific genes.
ldsc.py \\\n --h2-cts BBJ_HDLC.sumstats.gz \\\n --ref-ld-chr-cts ~/tools/ldsc/resource/Cahoy_EAS_1000Gv3_ldscores/Cahoy.EAS.ldcts \\\n --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n --out BBJ_HDLC_baseline_cts\n
"},{"location":"08_LDSC/#reference","title":"Reference","text":"MAGMA is one the most commonly used tools for gene-based and gene-set analysis.
Gene-level analysis in MAGMA uses two models:
1.Multiple linear principal components regression
MAGMA employs a multiple linear principal components regression, and F test to obtain P values for genes. The multiple linear principal components regression:
\\[ Y = \\alpha_{0,g} + X_g \\alpha_g + W \\beta_g + \\epsilon_g \\]\\(X_g\\) is obtained by first projecting the variant matrix of a gene onto its PC, and removing PCs with samll eigenvalues.
Note
The linear principal components regression model requires raw genotype data.
2.SNP-wise models
SNP-wise Mean: perform tests on mean SNP association
Note
SNP-wise models use summary statistics and reference LD panel
Gene-set analysis
Quote
Competitive gene-set analysis tests whether the genes in a gene-set are more strongly associated with the phenotype of interest than other genes.
P values for each gene were converted to Z scores to perform gene-set level analysis.
\\[ Z = \\beta_{0,S} + S_S \\beta_S + \\epsilon \\]Dowload MAGMA for your operating system from the following url:
MAGMA: https://ctg.cncr.nl/software/magma
For example:
cd ~/tools\nmkdir MAGMA\ncd MAGMA\nwget https://ctg.cncr.nl/software/MAGMA/prog/magma_v1.10.zip\nunzip magma_v1.10.zip\n
Add magma to your environment path. Test if it is successfully installed.
$ magma --version\nMAGMA version: v1.10 (linux)\n
"},{"location":"09_Gene_based_analysis/#download-reference-files","title":"Download reference files","text":"We nedd the following reference files:
The gene location files and LD reference panel can be downloaded from magma website.
-> https://ctg.cncr.nl/software/magma
The third one can be downloaded form MsigDB.
-> https://www.gsea-msigdb.org/gsea/msigdb/
"},{"location":"09_Gene_based_analysis/#format-input-files","title":"Format input files","text":"zcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,$2,$3}' > HDLC_chr3.magma.input.snp.chr.pos.txt\nzcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,10^(-$11)}' > HDLC_chr3.magma.input.p.txt\n
"},{"location":"09_Gene_based_analysis/#annotate-snps","title":"Annotate SNPs","text":"snploc=./HDLC_chr3.magma.input.snp.chr.pos.txt\nncbi37=~/tools/magma/NCBI37/NCBI37.3.gene.loc\nmagma --annotate \\\n --snp-loc ${snploc} \\\n --gene-loc ${ncbi37} \\\n --out HDLC_chr3\n
Tip
Usually to capture the variants in the regulatory regions, we will add windows upstream and downstream of the genes with --annotate window
.
For example, --annotate window=35,10
set a 35 kilobase pair(kb) upstream and 10kb downstream window.
ref=~/tools/magma/g1000_eas/g1000_eas\nmagma \\\n --bfile $ref \\\n --pval ./HDLC_chr3.magma.input.p.txt N=70657 \\\n --gene-annot HDLC_chr3.genes.annot \\\n --out HDLC_chr3\n
"},{"location":"09_Gene_based_analysis/#gene-set-level-analysis","title":"Gene-set level analysis","text":"geneset=/home/he/tools/magma/MSigDB/msigdb_v2022.1.Hs_files_to_download_locally/msigdb_v2022.1.Hs_GMTs/msigdb.v2022.1.Hs.entrez.gmt\nmagma \\\n --gene-results HDLC_chr3.genes.raw \\\n --set-annot ${geneset} \\\n --out HDLC_chr3\n
"},{"location":"09_Gene_based_analysis/#reference","title":"Reference","text":"Polygenic risk score(PRS), as known as polygenic score (PGS) or genetic risk score (GRS), is a score that summarizes the effect sizes of genetic variants on a certain disease or trait (weighted sum of disease/trait-associated alleles).
To calculate the PRS for sample j,
\\[PRS_j = \\sum_{i=0}^{i=M} x_{i,j} \\beta_{i}\\]In this tutorial, we will first briefly introduce how to develop PRS model using the sample data and then demonstrate how we can download PRS models from PGS Catalog and apply to our sample genotype data.
"},{"location":"10_PRS/#ctpt-using-plink","title":"C+T/P+T using PLINK","text":"P+T stands for Pruning + Thresholding, also known as Clumping and Thresholding(C+T), which is a very simple and straightforward approach to constructing PRS models.
Clumping
Clumping: LD-pruning based on P value. It is a approach to select variants when there are multiple significant associations in high LD in the same region.
The three important parameters for clumping in PLINK are:
Clumping using PLINK
#!/bin/bash\n\nplinkFile=../04_Data_QC/sample_data.clean\nsumStats=../06_Association_tests/1kgeas.B1.glm.firth\n\nplink \\\n --bfile ${plinkFile} \\\n --clump-p1 0.0001 \\\n --clump-r2 0.1 \\\n --clump-kb 250 \\\n --clump ${sumStats} \\\n --clump-snp-field ID \\\n --clump-field P \\\n --out 1kg_eas\n
log
--clump: 40 clumps formed from 307 top variants.\n
check only the header and the first \"clump\" of SNPs. head -n 2 1kg_eas.clumped\n CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2\n2 1 2:55513738:C:T 55513738 1.69e-15 52 0 3 1 6 42 2:55305475:A:T(1),2:55338196:T:C(1),2:55347135:G:A(1),2:55351853:A:G(1),2:55363460:G:A(1),2:55395372:A:G(1),2:55395578:G:A(1),2:55395807:C:T(1),2:55405847:C:A(1),2:55408556:C:A(1),2:55410835:C:T(1),2:55413644:C:G(1),2:55435439:C:T(1),2:55449464:T:C(1),2:55469819:A:T(1),2:55492154:G:A(1),2:55500529:A:G(1),2:55502651:A:G(1),2:55508333:G:C(1),2:55563020:A:G(1),2:55572944:T:C(1),2:55585915:A:G(1),2:55599810:C:T(1),2:55605943:A:G(1),2:55611766:T:C(1),2:55612986:G:C(1),2:55619923:C:T(1),2:55622624:G:A(1),2:55624520:C:T(1),2:55628936:G:C(1),2:55638830:T:C(1),2:55639023:A:T(1),2:55639980:C:T(1),2:55640649:G:A(1),2:55641045:G:A(1),2:55642887:C:T(1),2:55647729:A:G(1),2:55650512:G:A(1),2:55659155:A:G(1),2:55665620:A:G(1),2:55667476:G:T(1),2:55670729:A:G(1),2:55676257:C:T(1),2:55685927:C:A(1),2:55689569:A:T(1),2:55689913:T:C(1),2:55693097:C:G(1),2:55707583:T:C(1),2:55720135:C:G(1)\n
"},{"location":"10_PRS/#beta-shrinkage-using-prs-cs","title":"Beta shrinkage using PRS-CS","text":"\\[ \\beta_j | \\Phi_j \\sim N(0,\\phi\\Phi_j) , \\Phi_j \\sim g \\] Reference: Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications, 10(1), 1-10.
"},{"location":"10_PRS/#parameter-tuning","title":"Parameter tuning","text":"Method Description Cross-validation 10-fold cross validation. This method usually requires large-scale genotype dataset. Independent population Perform validation in an independent population of the same ancestry. Pseudo-validation A few methods can estimate a single optimal shrinkage parameter using only the base GWAS summary statistics."},{"location":"10_PRS/#pgs-catalog","title":"PGS Catalog","text":"Just like GWAS Catalog, you can now download published PRS models from PGS catalog.
URL: http://www.pgscatalog.org/
Reference: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.
"},{"location":"10_PRS/#calculate-prs-using-plink","title":"Calculate PRS using PLINK","text":"plink --score <score_filename> [variant ID col.] [allele col.] [score col.] ['header']\n
<score_filename>
: the score file[variant ID col.]
: the column number for variant IDs[allele col.]
: the column number for effect alleles[score col.]
: the column number for betas['header']
: skip the first header linePlease check here for detailed documents on plink --score
.
Example
# genotype data\nplinkFile=../04_Data_QC/sample_data.clean\n# summary statistics for scoring\nsumStats=./t2d_plink_reduced.txt\n# SNPs after clumpping\nawk 'NR!=1{print $3}' 1kgeas.clumped > 1kgeas.valid.snp\n\nplink \\\n --bfile ${plinkFile} \\\n --score ${sumStats} 1 2 3 header \\\n --extract 1kgeas.valid.snp \\\n --out 1kgeas\n
For thresholding using P values, we can create a range file and a p-value file.
The options we use:
--q-score-range <range file> <data file> [variant ID col.] [data col.] ['header']\n
Example
# SNP - P value file for thresholding\nawk '{print $1,$4}' ${sumStats} > SNP.pvalue\n\n# create a range file with 3 columns: range label, p-value lower bound, p-value upper bound\nhead range_list\npT0.001 0 0.001\npT0.05 0 0.05\npT0.1 0 0.1\npT0.2 0 0.2\npT0.3 0 0.3\npT0.4 0 0.4\npT0.5 0 0.5\n
and then calculate the scores using the p-value ranges:
plink2 \\\n--bfile ${plinkFile} \\\n--score ${sumStats} 1 2 3 header cols=nallele,scoreavgs,denom,scoresums\\\n--q-score-range range_list SNP.pvalue \\\n--extract 1kgeas.valid.snp \\\n--out 1kgeas\n
You will get the following files:
1kgeas.pT0.001.sscore\n1kgeas.pT0.05.sscore\n1kgeas.pT0.1.sscore\n1kgeas.pT0.2.sscore\n1kgeas.pT0.3.sscore\n1kgeas.pT0.4.sscore\n1kgeas.pT0.5.sscore\n
Take a look at the files:
head 1kgeas.pT0.1.sscore\n#IID ALLELE_CT DENOM SCORE1_AVG SCORE1_SUM\nHG00403 54554 54976 2.84455e-05 1.56382\nHG00404 54574 54976 5.65172e-05 3.10709\nHG00406 54284 54976 -3.91872e-05 -2.15436\nHG00407 54348 54976 -9.87606e-05 -5.42946\nHG00409 54760 54976 1.67157e-05 0.918963\nHG00410 54656 54976 3.74405e-05 2.05833\nHG00419 54052 54976 -6.4035e-05 -3.52039\nHG00421 54210 54976 -1.55942e-05 -0.857305\nHG00422 54102 54976 5.28824e-05 2.90726\n
"},{"location":"10_PRS/#meta-scoring-methods-for-prs","title":"Meta-scoring methods for PRS","text":"It has been shown recently that the PRS models generated from multiple traits using a meta-scoring method potentially outperforms PRS models generated from a single trait. Inouye et al. first used this approach for generating a PRS model for CAD from multiple PRS models.
Potential advantages of meta-score for PRS generation
Reference: Inouye, M., Abraham, G., Nelson, C. P., Wood, A. M., Sweeting, M. J., Dudbridge, F., ... & UK Biobank CardioMetabolic Consortium CHD Working Group. (2018). Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology, 72(16), 1883-1893.
elastic net
Elastic net is a common approach for variable selection when there are highly correlated variables (for example, PRS of correlated diseases are often highly correlated.). When fitting linear or logistic models, L1 and L2 penalties are added (regularization).
\\[ \\hat{\\beta} \\equiv argmin({\\parallel y- X \\beta \\parallel}^2 + \\lambda_2{\\parallel \\beta \\parallel}^2 + \\lambda_1{\\parallel \\beta \\parallel} ) \\]After validation, PRS can be generated from distinct PRS for other genetically correlated diseases :
\\[PRS_{meta} = {w_1}PRS_{Trait1} + {w_2}PRS_{Trait2} + {w_3}PRS_{Trait3} + ... \\]An example: Abraham, G., Malik, R., Yonova-Doing, E., Salim, A., Wang, T., Danesh, J., ... & Dichgans, M. (2019). Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nature communications, 10(1), 1-10.
"},{"location":"10_PRS/#reference","title":"Reference","text":"Meta-analysis is one of the most commonly used statistical methods to combine the evidence from multiple studies into a single result.
Potential problems for small-scale genome-wide association studies
To address these problems, meta-analysis is a powerful approach to integrate multiple GWAS summary statistics, especially as more and more summary statistics become publicly available. This method allows us to increase statistical power as the effective sample size increases.
What we could achieve by conducting meta-analysis
Before performing any type of meta-analysis, we need to make sure our datasets contain sufficient information and the datasets are QCed and harmonized. It is important to perform this step to avoid any unexpected errors and heterogeneity.
Key points for Dataset selection
Key points for Quality control
Key points for Harmonization
Simply put, the fixed effects mentioned here mean that the between-study variance is zero. Under the fixed effect model, we assume a common effect size across studies for a given SNP.
Fixed effect model
\\[ \\bar{\\beta_{ij}} = {{\\sum_{i=1}^{k} {w_{ij} \\beta_{ij}}}\\over{\\sum_{i=1}^{k} {w_{ij}}}} \\]Cochran's Q test and \\(I^2\\)
\\[ Q = \\sum_{i=1}^{k} {w_i (\\beta_i - \\bar{\\beta})^2} \\] \\[ I_j^2 = {{Q_j - df_j}\\over{Q_j}}\\times 100% = {{Q - (k - 1)}\\over{Q}}\\times 100% \\]"},{"location":"11_meta_analysis/#metal","title":"METAL","text":"METAL is one of the most commonly used tools for GWA meta-analysis. Its official documentation can be found here. METAL supports two models: (1) Sample size based approach and (2) Inverse variance based approach.
A minimal example of meta-analysis using the IVW method
metal_script.txt# classical approach, uses effect size estimates and standard errors\nSCHEME STDERR \n\n# === DESCRIBE AND PROCESS THE FIRST INPUT FILE ===\nMARKER SNP\nALLELE REF_ALLELE OTHER_ALLELE\nEFFECT BETA\nPVALUE PVALUE \nSTDERR SE \nPROCESS inputfile1.txt\n\n# === THE SECOND INPUT FILE HAS THE SAME FORMAT AND CAN BE PROCESSED IMMEDIATELY ===\nPROCESS inputfile2.txt\n\nANALYZE\n
Then, just run the following command to execute the metal script.
metal metal_script.txt\n
"},{"location":"11_meta_analysis/#random-effects-meta-analysis","title":"Random effects meta-analysis","text":"On the other hand, random effects mean that we need to model the between-study variance, which is not zero in this case. Under the random effect model, we assume the true effect size for a certain SNP varies across studies.
If heterogeneity of effects exists across studies, we need to model the between-study variance to correct for the deflation of variance in fixed-effect estimates.
"},{"location":"11_meta_analysis/#gwama","title":"GWAMA","text":"Random effect model
The random effect variance component can be estimated by:
\[ r_j^2 = max\left(0, {{Q_j - (N_j -1)}\over{\sum_iw_{ij} - ({{\sum_iw_{ij}^2} \over {\sum_iw_{ij}}})}}\right)\]
\\[ \\bar{\\beta_j}^* = {{\\sum_{i=1}^{k} {w_{ij}^* \\beta_i}}\\over{\\sum_{i=1}^{k} {w_{ij}^*}}} \\]The weights are estimated by:
\\[w_{ij}^* = {{1}\\over{r_j^2 + Var(\\beta_{ij})}} \\]The random effect model was implemented in GWAMA, which is another very popular GWA meta-analysis tool. Its official documentation can be found here.
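A numerical sketch of the fixed-effect and random-effects formulas above in Python (toy betas and SEs for one SNP across k studies):
# Toy sketch: fixed-effect (IVW) and random-effects meta-analysis of one SNP\nimport numpy as np\n\nbeta = np.array([0.12, 0.08, 0.15])  # per-study effect sizes (toy values)\nse = np.array([0.05, 0.04, 0.06])  # per-study standard errors (toy values)\n\n# fixed-effect model: w = 1 / Var(beta)\nw = 1 / se**2\nbeta_fe = np.sum(w * beta) / np.sum(w)\nse_fe = np.sqrt(1 / np.sum(w))\n\n# Cochran's Q and the between-study variance component r2\nk = len(beta)\nQ = np.sum(w * (beta - beta_fe)**2)\nr2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))\n\n# random-effects model: w* = 1 / (r2 + Var(beta))\nw_star = 1 / (r2 + se**2)\nbeta_re = np.sum(w_star * beta) / np.sum(w_star)\nse_re = np.sqrt(1 / np.sum(w_star))\nprint(beta_fe, se_fe, beta_re, se_re)\n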
A minimal example of random effect meta-analysis using GWAMA
The input file for GWAMA contains the path to each sumstats. Column names need to be standardized.
GWAMA_script.inPop1.txt\nPop2.txt\nPop3.txt\n
GWAMA \\\n -i GWAMA_script.in \\\n --random \\\n -o myresults\n
"},{"location":"11_meta_analysis/#cross-ancestry-meta-analysis","title":"Cross-ancestry meta-analysis","text":""},{"location":"11_meta_analysis/#mantra","title":"MANTRA","text":"MANTRA (Meta-ANalysis of Transethnic Association studies) is one of the early efforts to address the heterogeneity for cross-ancestry meta-analysis.
MANTRA implements a Bayesian partition model in which GWASs are clustered into ancestry clusters based on a prior model of similarity between them. MANTRA then uses Markov chain Monte Carlo (MCMC) algorithms to approximate the posterior distribution of parameters (which might be quite computationally intensive). MANTRA has been shown to increase power and mapping resolution over random-effects meta-analysis across a range of heterogeneity scenarios.
"},{"location":"11_meta_analysis/#mr-mega","title":"MR-MEGA","text":"MR-MEGA employs meta-regression to model the heterogeneity in effect sizes across ancestries. Its official documentation can be found here (The same first author as GWAMA).
Meta-regression implemented in MR-MEGA
It first constructs a matrix \(D\) of pairwise Euclidean distances between GWAS across autosomal variants. The elements of \(D\), \(d_{k'k}\), for a pair of studies can be expressed as follows. For each variant \(j\), let \(p_{kj}\) be the allele frequency of \(j\) in study \(k\); then:
\\[d_{k'k} = {{\\sum_jI_j(p_{kj}-p_{k'j})^2}\\over{\\sum_jI_j}}\\]Then multi-dimensional scaling (MDS) will be performed to derive T axes of genetic variation (\\(x_k\\) for study k)
For each variant j, the effect size of the reference allele can be modeled in a linear regression model as :
\\[E[\\beta_{kj}] = \\beta_j + \\sum_{t=1}^T\\beta_{tj}x_{kj}\\]A minimal example of meta-analysis using MR-MEGA
The input file for MR-MEGA contains the path to each sumstats. Column names need to be standardized like GWAMA.
MRMEGA_script.inPop1.txt.gz\nPop2.txt.gz\nPop3.txt.gz\nPop4.txt.gz\nPop5.txt.gz\nPop6.txt.gz\nPop7.txt.gz\nPop8.txt.gz\n
MR-MEGA \\\n -i MRMEGA_script.in \\\n --pc 4 \\\n -o myresults\n
"},{"location":"11_meta_analysis/#global-biobank-meta-analysis-initiative-gbmi","title":"Global Biobank Meta-analysis Initiative (GBMI)","text":"As a recent success achieved by meta-analysis, GBMI showed an example of the improvement of our understanding of diseases by taking advantage of large-scale meta-analyses.
For more details, you can check here.
"},{"location":"11_meta_analysis/#reference","title":"Reference","text":"Fine-mapping : Fine-mapping aims to identify the causal variant(s) within a locus for a disease, given the evidence of the significant association of the locus (or genomic region) in GWAS of a disease.
Fine-mapping using individual data is usually performed by fitting the multiple linear regression model:
\\[y = Xb + e\\]Fine-mapping (using Bayesian methods) aims to estimate the PIP (posterior inclusion probability), which indicates the evidence for SNP j having a non-zero effect (namely, causal).
PIP(Posterior Inclusion Probability)
PIP is often calculated by the sum of the posterior probabilities over all models that include variant j as causal.
\\[ PIP_j:=Pr(b_j\\neq0|X,y) \\]Bayesian methods and Posterior probability
\\[ Pr(M_m | O) = {{Pr(O | M_m) Pr(M_m)}\\over{\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}} \\]\\(O\\) : Observed data
\\(M\\) : Models (the configurations of causal variants in the context of fine-mapping).
\\(Pr(M_m | O)\\): Posterior Probability of Model m
\\(Pr(O | M_m)\\): Likelihood (the probability of observing your dataset given Model m is true.)
\\(Pr(M_m)\\): Prior distribution of Model m (the probability of Model m being true)
\\({\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}\\): Evidence (the probability of observing your dataset), namely \\(Pr(O)\\)
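As a toy illustration of these quantities, the sketch below enumerates single-causal-variant models with made-up likelihoods and a flat prior, and computes each variant's posterior, which under this assumption equals its PIP:
# Toy sketch: posterior model probabilities and PIPs under a flat prior,\n# assuming exactly one causal variant per locus (made-up likelihoods)\nimport numpy as np\n\nlik = np.array([0.70, 0.20, 0.10])  # Pr(O | M_i): model i = \"variant i is causal\"\nprior = np.full(len(lik), 1 / len(lik))  # Pr(M_i): flat prior\nevidence = np.sum(lik * prior)  # Pr(O)\nposterior = lik * prior / evidence  # Pr(M_i | O)\n\n# PIP_j = sum of posteriors of models including variant j (here: model j only)\nfor j, pip in enumerate(posterior):\n    print(f\"variant {j}: PIP = {pip:.3f}\")\n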
Credible sets
A credible set refers to the minimum set of variants that contains all causal SNPs with probability \\(\u03b1\\). (Under the single-causal-variant-per-locus assumption, the credible set is calculated by ranking variants based on their posterior probabilities, and then summing these until the cumulative sum is \\(>\u03b1\\)). We usually report 95% credible sets (\u03b1=95%) for fine-mapping analysis.
Commonly used tools for fine-mapping
Methods assuming only one causal variant in the locus
Methods assuming multiple causal variants in the locus
Methods assuming a small number of larger causal effects with a large number of infinitesimal effects
Methods for Cross-ancestry fine-mapping
You can check here for more information.
In this tutorial, we will introduce SuSiE as an example. SuSiE stands for the “Sum of Single Effects” model.
The key idea behind SuSiE is:
\\[b = \\sum_{l=1}^L b_l \\]where each vector \\(b_l = (b_{l1}, \u2026, b_{lJ})^T\\) is a so-called single effect vector (a vector with only one non-zero element). L is the upper bound of number of causal variants. And this model could be fitted using Iterative Bayesian Stepwise Selection (IBSS).
For fine-mapping with summary statistics using SuSiE (SuSiE-RSS), IBSS was modified (IBSS-ss) to take sufficient statistics (which can be computed from other combinations of summary statistics) as input. SuSiE then approximates the sufficient statistics to run fine-mapping.
Quote
For details of SuSiE and SuSiE-RSS, please check : Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the \u201cSum of Single Effects\u201d model. PLoS Genetics, 18(7), e1010299. Link
"},{"location":"12_fine_mapping/#file-preparation","title":"File Preparation","text":"Using python to check novel loci and extract the files.
import gwaslab as gl\nimport pandas as pd\nimport numpy as np\n\nsumstats = gl.Sumstats(\"../06_Association_tests/1kgeas.B1.glm.firth\",fmt=\"plink2\")\n...\n\nsumstats.basic_check()\n...\n\nsumstats.get_lead()\n\nFri Jan 13 23:31:43 2023 Start to extract lead variants...\nFri Jan 13 23:31:43 2023 -Processing 1122285 variants...\nFri Jan 13 23:31:43 2023 -Significance threshold : 5e-08\nFri Jan 13 23:31:43 2023 -Sliding window size: 500 kb\nFri Jan 13 23:31:44 2023 -Found 59 significant variants in total...\nFri Jan 13 23:31:44 2023 -Identified 3 lead variants!\nFri Jan 13 23:31:44 2023 Finished extracting lead variants successfully!\n\nSNPID CHR POS EA NEA SE Z P OR N STATUS\n110723 2:55574452:G:C 2 55574452 C G 0.160948 -5.98392 2.178320e-09 0.381707 503 9960099\n424615 6:29919659:T:C 6 29919659 T C 0.155457 -5.89341 3.782970e-09 0.400048 503 9960099\n635128 9:36660672:A:G 9 36660672 G A 0.160275 5.63422 1.758540e-08 2.467060 503 9960099\n
We will perform fine-mapping for the first significant locus, whose lead variant is 2:55574452:G:C
. # filter the variants in this locus.\n\nlocus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452')\nlocus.fill_data(to_fill=[\"BETA\"])\nlocus.harmonize(basic_check=False, ref_seq=\"/Users/he/mydata/Reference/Genome/human_g1k_v37.fasta\")\nlocus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\n
check in terminal:
head sig_locus.tsv\nSNPID CHR POS EA NEA BETA SE Z P OR N STATUS\n2:54535206:C:T 2 54535206 T C 0.30028978 0.142461 2.10786 0.0350429 1.35025 503 9960099\n2:54536167:C:G 2 54536167 G C 0.14885099 0.246871 0.602952 0.546541 1.1605 503 9960099\n2:54539096:A:G 2 54539096 G A -0.0038474211 0.288489 -0.0133355 0.98936 0.99616 503 9960099\n2:54540264:G:A 2 54540264 A G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540614:G:T 2 54540614 T G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540621:A:G 2 54540621 G A -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540970:T:C 2 54540970 C T -0.049506452 0.149053 -0.332144 0.739781 0.951699 503 9960099\n2:54544229:T:C 2 54544229 C T -0.14338203 0.151172 -0.948468 0.342891 0.866423 503 9960099\n2:54545593:T:C 2 54545593 C T -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n\nhead sig_locus.snplist\n2:54535206:C:T\n2:54536167:C:G\n2:54539096:A:G\n2:54540264:G:A\n2:54540614:G:T\n2:54540621:A:G\n2:54540970:T:C\n2:54544229:T:C\n2:54545593:T:C\n2:54546032:C:G\n
"},{"location":"12_fine_mapping/#ld-matrix-calculation","title":"LD Matrix Calculation","text":"Example
#!/bin/bash\n\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\n\n# LD r matrix\nplink \\\n --bfile ${plinkFile} \\\n --keep-allele-order \\\n --r square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt\n\n# LD r2 matrix\nplink \\\n --bfile ${plinkFile} \\\n --keep-allele-order \\\n --r2 square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt_r2\n
Take a look at the LD matrix (first 5 rows and columns) head -5 sig_locus_mt.ld | cut -f 1-5\n1 -0.145634 0.252616 -0.0876317 -0.0876317\n-0.145634 1 -0.0916734 -0.159635 -0.159635\n0.252616 -0.0916734 1 0.452333 0.452333\n-0.0876317 -0.159635 0.452333 1 1\n-0.0876317 -0.159635 0.452333 1 1\n\nhead -5 sig_locus_mt_r2.ld | cut -f 1-5\n1 0.0212091 0.0638148 0.00767931 0.00767931\n0.0212091 1 0.00840401 0.0254833 0.0254833\n0.0638148 0.00840401 1 0.204605 0.204605\n0.00767931 0.0254833 0.204605 1 1\n0.00767931 0.0254833 0.204605 1 1\n
Heatmap of the LD matrix: "},{"location":"12_fine_mapping/#fine-mapping-with-summary-statistics-using-susier","title":"Fine-mapping with summary statistics using SusieR","text":"Note
install.packages(\"susieR\")\nlibrary(susieR)\n\n# Fine-mapping with summary statistics\nfitted_rss2 = susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, R = R, n = n, L = 10)\n
R
: a p
x p
LD r matrix. N
: Sample size. bhat
: Alternative summary data giving the estimated effects (a vector of length p
). This, together with shat, may be provided instead of z. shat
: Alternative summary data giving the standard errors of the estimated effects (a vector of length p
). This, together with bhat, may be provided instead of z. L
: Maximum number of non-zero effects in the susie regression model. (defaul : L = 10
)
Quote
For deatils, please check SusieR tutorial - Fine-mapping with susieR using summary statistics
Use susieR in jupyter notebook (with Python):
Please check : https://github.com/Cloufield/GWASTutorial/blob/main/12_fine_mapping/finemapping_susie.ipynb
"},{"location":"12_fine_mapping/#reference","title":"Reference","text":"Heritability is a term used in genetics to describe how much phenotypic variation can be explained by genetic variation.
For any phenotype, its variation \\(Var(P)\\) can be modeled as the combination of genetic effects \\(Var(G)\\) and environmental effects \\(Var(E)\\).
\\[ Var(P) = Var(G) + Var(E) \\]"},{"location":"13_heritability/#broad-sense-heritability","title":"Broad-sense Heritability","text":"The broad-sense heritability \\(H^2_{broad-sense}\\) is mathmatically defined as :
\\[ H^2_{broad-sense} = {Var(G)\\over{Var(P)}} \\]"},{"location":"13_heritability/#narrow-sense-heritability","title":"Narrow-sense Heritability","text":"Genetic effects \\(Var(G)\\) is composed of multiple effects including additive effects \\(Var(A)\\), dominant effects, recessive effects, epistatic effects and so forth.
Narrrow-sense heritability is defined as:
\\[ h^2_{narrow-sense} = {Var(A)\\over{Var(P)}} \\]"},{"location":"13_heritability/#snp-heritability","title":"SNP Heritability","text":"SNP heritability \\(h^2_{SNP}\\) : the proportion of phenotypic variance explained by tested SNPs in a GWAS.
Common methods to estimate SNP heritability includes:
Issue for binary traits :
The scale issue for binary traits
Conversion formula (Equation 23 from Lee. 2011):
\\[ h^2_{liability-scale} = h^2_{observed-scale} * {{K(1-K)}\\over{Z^2}} * {{K(1-K)}\\over{P(1-P)}} \\]scipy.stats.norm.pdf(T, loc=0, scale=1)
.scipy.stats.norm.ppf(1 - K, loc=0, scale=1)
or scipy.stats.norm.isf(K)
.The basic model behind GCTA-GREML is the linear mixed model (LMM):
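Putting the conversion formula and the scipy calls above together, a minimal sketch with toy values:
# Minimal sketch: observed-scale to liability-scale h2 conversion\nfrom scipy import stats\n\ndef h2_liability(h2_obs, K, P):\n    # K: population prevalence; P: proportion of cases in the sample\n    T = stats.norm.isf(K)  # liability threshold\n    Z = stats.norm.pdf(T)  # standard normal density at the threshold\n    return h2_obs * (K * (1 - K) / Z**2) * (K * (1 - K) / (P * (1 - P)))\n\nprint(h2_liability(h2_obs=0.30, K=0.01, P=0.50))  # toy values\n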
\\[y = X\\beta + Wu + e\\] \\[ Var(y) = V = WW^{'}\\delta^2_u + I \\delta^2_e\\]GCTA defines \\(A = WW^{'}/N\\) and \\(\\delta^2_g\\) as the variance explained by SNPs.
So the oringinal model can be written as:
\\[y = X\\beta + g + e\\]Quote
For details, please check Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82. link.
"},{"location":"14_gcta_greml/#donwload","title":"Donwload","text":"Download the version of GCTA for your system from : https://yanglab.westlake.edu.cn/software/gcta/#Download
Example
wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip\nunzip gcta-1.94.1-linux-kernel-3-x86_64.zip\ncd gcta-1.94.1-linux-kernel-3-x86_64.zip\n\n./gcta-1.94.1\n*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 12:22:19 JST on Sun Jan 15 2023.\nHostname: Home-Desktop\n\nError: no analysis has been launched by the option(s)\nPlease see online documentation at https://yanglab.westlake.edu.cn/software/gcta/\n
Tip
Add GCTA to your environment
"},{"location":"14_gcta_greml/#make-grm","title":"Make GRM","text":"#!/bin/bash\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\ngcta \\\n --bfile ${plinkFile} \\\n --autosome \\\n --maf 0.01 \\\n --make-grm \\\n --out 1kg_eas\n
*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:21:24 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nOptions:\n\n--bfile ../04_Data_QC/sample_data.clean\n--autosome\n--maf 0.01\n--make-grm\n--out 1kg_eas\n\nNote: GRM is computed using the SNPs on the autosomes.\nReading PLINK FAM file from [../04_Data_QC/sample_data.clean.fam]...\n500 individuals to be included from FAM file.\n500 individuals to be included. 0 males, 0 females, 500 unknown.\nReading PLINK BIM file from [../04_Data_QC/sample_data.clean.bim]...\n1224104 SNPs to be included from BIM file(s).\nThreshold to filter variants: MAF > 0.010000.\nComputing the genetic relationship matrix (GRM) v2 ...\nSubset 1/1, no. subject 1-500\n 500 samples, 1224104 markers, 125250 GRM elements\nIDs for the GRM file have been saved in the file [1kg_eas.grm.id]\nComputing GRM...\n 100% finished in 7.4 sec\n1224104 SNPs have been processed.\n Used 1128732 valid SNPs.\nThe GRM computation is completed.\nSaving GRM...\nGRM has been saved in the file [1kg_eas.grm.bin]\nNumber of SNPs in each pair of individuals has been saved in the file [1kg_eas.grm.N.bin]\n\nAnalysis finished at 17:21:32 JST on Tue Dec 26 2023\nOverall computational time: 8.51 sec.\n
"},{"location":"14_gcta_greml/#estimation","title":"Estimation","text":"#!/bin/bash\n\n#the grm we calculated in step1\nGRM=1kg_eas\n\n# phenotype file\nphenotypeFile=../01_Dataset/1kgeas_binary_gcta.txt\n\n# disease prevalence used for conversion to liability-scale heritability\nprevalence=0.5\n\n# use 5PCs as covariates \nawk '{print $1,$2,$5,$6,$7,$8,$9}' ../05_PCA/plink_results_projected.sscore > 5PCs.txt\n\ngcta \\\n --grm ${GRM} \\\n --pheno ${phenotypeFIile} \\\n --prevalence ${prevalence} \\\n --qcovar 5PCs.txt \\\n --reml \\\n --out 1kg_eas\n
"},{"location":"14_gcta_greml/#results","title":"Results","text":"Warning
This is just to show the analysis pipeline. The trait was simulated under an unreal condition (effect size is extremely large) so the result is meaningless here.
For real analysis, you need a larger sample size to get robust estimation. Please see the GCTA FAQ
*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:36:37 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nAccepted options:\n--grm 1kg_eas\n--pheno ../01_Dataset/1kgeas_binary_gcta.txt\n--prevalence 0.5\n--qcovar 5PCs.txt\n--reml\n--out 1kg_eas\n\nNote: This is a multi-thread program. You could specify the number of threads by the --thread-num option to speed up the computation if there are multiple processors in your machine.\n\nReading IDs of the GRM from [1kg_eas.grm.id].\n500 IDs are read from [1kg_eas.grm.id].\nReading the GRM from [1kg_eas.grm.bin].\nGRM for 500 individuals are included from [1kg_eas.grm.bin].\nReading phenotypes from [../01_Dataset/1kgeas_binary_gcta.txt].\nNon-missing phenotypes of 503 individuals are included from [../01_Dataset/1kgeas_binary_gcta.txt].\nReading quantitative covariate(s) from [5PCs.txt].\n5 quantitative covariate(s) of 501 individuals are included from [5PCs.txt].\nAssuming a disease phenotype for a case-control study: 248 cases and 250 controls\n5 quantitative variable(s) included as covariate(s).\n498 individuals are in common in these files.\n\nPerforming REML analysis ... (Note: may take hours depending on sample size).\n498 observations, 6 fixed effect(s), and 2 variance component(s)(including residual variance).\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.12498 0.124846\nlogL: 95.34\nRunning AI-REML algorithm ...\nIter. logL V(G) V(e)\n1 95.34 0.14264 0.10708\n2 95.37 0.18079 0.06875\n3 95.40 0.18071 0.06888\n4 95.40 0.18071 0.06888\nLog-likelihood ratio converged.\n\nCalculating the logLikelihood for the reduced model ...\n(variance component 1 is dropped from the model)\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.24901\nlogL: 94.78319\nRunning AI-REML algorithm ...\nIter. logL V(e)\n1 94.79 0.24900\n2 94.79 0.24899\nLog-likelihood ratio converged.\n\nSummary result of REML analysis:\nSource Variance SE\nV(G) 0.180708 0.164863\nV(e) 0.068882 0.162848\nVp 0.249590 0.016001\nV(G)/Vp 0.724021 0.654075\nThe estimate of variance explained on the observed scale is transformed to that on the underlying liability scale:\n(Proportion of cases in the sample = 0.497992; User-specified disease prevalence = 0.500000)\nV(G)/Vp_L 1.137308 1.027434\n\nSampling variance/covariance of the estimates of variance components:\n2.717990e-02 -2.672171e-02\n-2.672171e-02 2.651955e-02\n\nSummary result of REML analysis has been saved in the file [1kg_eas.hsq].\n\nAnalysis finished at 17:36:38 JST on Tue Dec 26 2023\nOverall computational time: 0.08 sec.\n
"},{"location":"14_gcta_greml/#reference","title":"Reference","text":"Winner's curse refers to the phenomenon that genetic effects are systematically overestimated by thresholding or selection process in genetic association studies.
Winner's curse in auctions
This term was initially used to describe a phenomenon that occurs in auctions. The winning bid is very likely to overestimate the intrinsic value of an item even if all the bids are unbiased (the auctioned item is of equal value to all bidders). The thresholding process in GWAS resembles auctions, where the lead variants are the winning bids.
Reference:
The asymptotic distribution of \\(\\beta_{Observed}\\) is:
\\[\\beta_{Observed} \\sim N(\\beta_{True},\\sigma^2)\\]An example of distribution of \\(\\beta_{Observed}\\)
It is equivalent to:
\\[{{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}} \\sim N(0,1)\\]An example of distribution of \\({{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}}\\)
We can obtain the asymptotic sampling distribution (which is a truncated normal distribution) for \\(\\beta_{Observed}\\) by:
\\[f(x,\\beta_{True}) ={{1}\\over{\\sigma}} {{\\phi({{{x - \\beta_{True}}\\over{\\sigma}}})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]when
\\[|{{x}\\over{\\sigma}}|\\geq c\\]From the asymptotic sampling distribution, the expectation of effect sizes for the selected variants can then be approximated by:
\\[ E(\\beta_{Observed}; \\beta_{True}) = \\beta_{True} + \\sigma {{\\phi({{{\\beta_{True}}\\over{\\sigma}}-c}) - \\phi({{{-\\beta_{True}}\\over{\\sigma}}-c})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]Derivation of this equation can be found in the Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.
Reference:
Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html
"},{"location":"16_mendelian_randomization/","title":"Mendelian randomization","text":""},{"location":"16_mendelian_randomization/#mendelian-randomization-introduction","title":"Mendelian randomization introduction","text":"Comparison between RCT and MR
"},{"location":"16_mendelian_randomization/#fundamental-assumption-gene-environment-equivalence","title":"Fundamental assumption: gene-environment equivalence","text":"(cited from George Davey Smith Mendelian Randomization - 25th April 2024)
The fundamental assumption of mendelian randomization (MR) is of gene-environment equivalence. MR reflects the phenocopy/ genocopy dialectic (Goldschmidt, Schmalhausen). The idea here is that all environmental effects can be mimicked by one or several mutations. (Zuckerkandl and Villet, PNAS 1988)
Gene-environment equivalence
If we consider BMI as the outcome, let's think about whether genetic variants related to the following exposures meet the gene-environment equivalence assumption:
Instrumental variable (IV) can be defined as a variable that is correlated with the exposure X and uncorrelated with the error \\(\\epsilon\\) in the following regression:
\\[ Y = X\\beta + \\epsilon \\]Key Assumptions
Assumptions Description Relevance Instrumental variables are strongly associated with the exposure.(IVs are not independent of X) Exclusion restriction Instrumental variables do not affect the outcome except through the exposure.(IV is independent of Y, conditional on X and C) Independence There are no confounders of the instrumental variables and the outcome.(IV is independent of C) Monotonicity Variants affect the exposure in the same direction for all individuals No assortative mating Assortative mating might cause bias in MR"},{"location":"16_mendelian_randomization/#two-stage-least-squares-2sls","title":"Two-stage least-squares (2SLS)","text":"\\[ X = \\mu_1 + \\beta_{IV} IV + \\epsilon_1 \\] \\[ Y = \\mu_2 + \\beta_{2SLS} \\hat{X} + \\epsilon_2 \\]"},{"location":"16_mendelian_randomization/#two-sample-mr","title":"Two-sample MR","text":"Two-sample MR refers to the approach that the genetic effects of the instruments on the exposure can be estimated in an independent sample other than that used to estimate effects between instruments on the outcome. As more and more GWAS summary statistics become publicly available, the scope of MR also expands with Two-sample MR methods.
\\[ \\hat{\\beta}_{X,Y} = {{\\hat{\\beta}_{IV,Y}}\\over{\\hat{\\beta}_{IV,X}}} \\]Caveats
For two-sample MR, there is an additional key assumption:
The two samples used for MR are from the same underlying populations. (The effect size of instruments on exposure should be the same in both samples.)
Therefore, for two-sample MR, we usually use datasets from similar non-overlapping populations in terms of not only ancestry but also contextual factors.
"},{"location":"16_mendelian_randomization/#iv-selection","title":"IV selection","text":"One of the first things to do when you plan to perform any type of MR is to check the associations of instrumental variables with the exposure to avoid bias caused by weak IVs.
The most commonly used method here is the F-statistic, which tests the association of instrumental variables with the exposure.
"},{"location":"16_mendelian_randomization/#practice","title":"Practice","text":"In this tutorial, we will walk you through how to perform a minimal TwoSampleMR analysis. We will use the R package TwoSampleMR, which provides easy-to-use functions for formatting, clumping and harmonizing GWAS summary statistics.
This package integrates a variety of commonly used MR methods for analysis, including:
> mr_method_list()\n obj\n1 mr_wald_ratio\n2 mr_two_sample_ml\n3 mr_egger_regression\n4 mr_egger_regression_bootstrap\n5 mr_simple_median\n6 mr_weighted_median\n7 mr_penalised_weighted_median\n8 mr_ivw\n9 mr_ivw_radial\n10 mr_ivw_mre\n11 mr_ivw_fe\n12 mr_simple_mode\n13 mr_weighted_mode\n14 mr_weighted_mode_nome\n15 mr_simple_mode_nome\n16 mr_raps\n17 mr_sign\n18 mr_uwr\n\n name PubmedID\n1 Wald ratio\n2 Maximum likelihood\n3 MR Egger 26050253\n4 MR Egger (bootstrap) 26050253\n5 Simple median\n6 Weighted median\n7 Penalised weighted median\n8 Inverse variance weighted\n9 IVW radial\n10 Inverse variance weighted (multiplicative random effects)\n11 Inverse variance weighted (fixed effects)\n12 Simple mode\n13 Weighted mode\n14 Weighted mode (NOME)\n15 Simple mode (NOME)\n16 Robust adjusted profile score (RAPS)\n17 Sign concordance test\n18 Unweighted regression\n
"},{"location":"16_mendelian_randomization/#inverse-variance-weighted-fixed-effects","title":"Inverse variance weighted (fixed effects)","text":"Assumption: the underlying 'true' effect is fixed across variants
Weight for the effect of ith variant:
\\[W_i = {1 \\over Var(\\beta_i)}\\]Effect size:
\\[\\beta = {{\\sum_{i=1}^N{w_i \\beta_i}}\\over{\\sum_{i=1}^Nw_i}}\\]SE:
\\[SE = {\\sqrt{{1}\\over{\\sum_{i=1}^Nw_i}}}\\]"},{"location":"16_mendelian_randomization/#file-preparation","title":"File Preparation","text":"To perform two-sample MR analysis, we need summary statistics for exposure and outcome generated from independent populations with the same ancestry.
In this tutorial, we will use sumstats from Biobank Japan pheweb and KoGES pheweb.
wget -O bbj_t2d.zip https://pheweb.jp/download/T2D
wget -O koges_bmi.txt.gz https://koges.leelabsg.org/download/KoGES_BMI
First, to use TwosampleMR, we need R>= 4.1. To install the package, run:
library(remotes)\ninstall_github(\"MRCIEU/TwoSampleMR\")\n
"},{"location":"16_mendelian_randomization/#loading-package","title":"Loading package","text":"library(TwoSampleMR)\n
"},{"location":"16_mendelian_randomization/#reading-exposure-sumstats","title":"Reading exposure sumstats","text":"#format exposures dataset\n\nexp_raw <- fread(\"koges_bmi.txt.gz\")\n
"},{"location":"16_mendelian_randomization/#extracting-instrumental-variables","title":"Extracting instrumental variables","text":"# select only significant variants\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_dat <- format_data( exp_raw,\n type = \"exposure\",\n snp_col = \"rsids\",\n beta_col = \"beta\",\n se_col = \"sebeta\",\n effect_allele_col = \"alt\",\n other_allele_col = \"ref\",\n eaf_col = \"af\",\n pval_col = \"pval\",\n)\n
"},{"location":"16_mendelian_randomization/#clumping-exposure-variables","title":"Clumping exposure variables","text":"clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\") \n
"},{"location":"16_mendelian_randomization/#outcome","title":"outcome","text":"out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\"))\nout_dat <- format_data( out_raw,\n type = \"outcome\",\n snp_col = \"SNPID\",\n beta_col = \"BETA\",\n se_col = \"SE\",\n effect_allele_col = \"Allele2\",\n other_allele_col = \"Allele1\",\n pval_col = \"p.value\",\n)\n
"},{"location":"16_mendelian_randomization/#harmonizing-data","title":"Harmonizing data","text":"harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
"},{"location":"16_mendelian_randomization/#perform-mr-analysis","title":"Perform MR analysis","text":"res <- mr(harmonized_data)\n\nid.exposure id.outcome outcome exposure method nsnp b se pval\n<chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure MR Egger 28 1.3337580 0.69485260 6.596064e-02\n9J8pv4 IyUv6b outcome exposure Weighted median 28 0.6298980 0.09401352 2.083081e-11\n9J8pv4 IyUv6b outcome exposure Inverse variance weighted 28 0.5598956 0.23225806 1.592361e-02\n9J8pv4 IyUv6b outcome exposure Simple mode 28 0.6097842 0.15180476 4.232158e-04\n9J8pv4 IyUv6b outcome exposure Weighted mode 28 0.5946778 0.12820220 8.044488e-05\n
"},{"location":"16_mendelian_randomization/#sensitivity-analysis","title":"Sensitivity analysis","text":""},{"location":"16_mendelian_randomization/#heterogeneity","title":"Heterogeneity","text":"Test if there is heterogeneity among the causal effect of x on y estimated from each variants.
mr_heterogeneity(harmonized_data)\n\nid.exposure id.outcome outcome exposure method Q Q_df Q_pval\n<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure MR Egger 670.7022 26 1.000684e-124\n9J8pv4 IyUv6b outcome exposure Inverse variance weighted 706.6579 27 1.534239e-131\n
"},{"location":"16_mendelian_randomization/#horizontal-pleiotropy","title":"Horizontal Pleiotropy","text":"Intercept in MR-Egger
mr_pleiotropy_test(harmonized_data)\n\nid.exposure id.outcome outcome exposure egger_intercept se pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure -0.03603697 0.0305241 0.2484472\n
"},{"location":"16_mendelian_randomization/#single-snp-mr-and-leave-one-out-mr","title":"Single SNP MR and leave-one-out MR","text":"Single SNP MR
res_single <- mr_singlesnp(harmonized_data)\nres_single\n\nexposure outcome id.exposure id.outcome samplesize SNP b se p\n<chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <dbl> <dbl>\n1 exposure outcome 9J8pv4 IyUv6b NA rs10198356 0.6323140 0.2082837 2.398742e-03\n2 exposure outcome 9J8pv4 IyUv6b NA rs10209994 0.9477808 0.3225814 3.302164e-03\n3 exposure outcome 9J8pv4 IyUv6b NA rs10824329 0.6281765 0.3246214 5.297739e-02\n4 exposure outcome 9J8pv4 IyUv6b NA rs10938397 1.2376316 0.2775854 8.251150e-06\n5 exposure outcome 9J8pv4 IyUv6b NA rs11066132 0.6024303 0.2232401 6.963693e-03\n6 exposure outcome 9J8pv4 IyUv6b NA rs12522139 0.2905201 0.2890240 3.148119e-01\n7 exposure outcome 9J8pv4 IyUv6b NA rs12591730 0.8930490 0.3076687 3.700413e-03\n8 exposure outcome 9J8pv4 IyUv6b NA rs13013021 1.4867889 0.2207777 1.646925e-11\n9 exposure outcome 9J8pv4 IyUv6b NA rs1955337 0.5442640 0.2994146 6.910079e-02\n10 exposure outcome 9J8pv4 IyUv6b NA rs2076308 1.1176226 0.2657969 2.613132e-05\n11 exposure outcome 9J8pv4 IyUv6b NA rs2278557 0.6238587 0.2968184 3.556906e-02\n12 exposure outcome 9J8pv4 IyUv6b NA rs2304608 1.5054682 0.2968905 3.961740e-07\n13 exposure outcome 9J8pv4 IyUv6b NA rs2531995 1.3972908 0.3130157 8.045689e-06\n14 exposure outcome 9J8pv4 IyUv6b NA rs261967 1.5303384 0.2921192 1.616714e-07\n15 exposure outcome 9J8pv4 IyUv6b NA rs35332469 -0.2307314 0.3479219 5.072217e-01\n16 exposure outcome 9J8pv4 IyUv6b NA rs35560038 -1.5730870 0.2018968 6.619637e-15\n17 exposure outcome 9J8pv4 IyUv6b NA rs3755804 0.5314915 0.2325073 2.225933e-02\n18 exposure outcome 9J8pv4 IyUv6b NA rs4470425 0.6948046 0.3079944 2.407689e-02\n19 exposure outcome 9J8pv4 IyUv6b NA rs476828 1.1739083 0.1568550 7.207355e-14\n20 exposure outcome 9J8pv4 IyUv6b NA rs4883723 0.5479721 0.2855004 5.494141e-02\n21 exposure outcome 9J8pv4 IyUv6b NA rs509325 0.5491040 0.1598196 5.908641e-04\n22 exposure outcome 9J8pv4 IyUv6b NA rs55872725 1.3501891 0.1259791 8.419325e-27\n23 exposure outcome 9J8pv4 IyUv6b NA rs6089309 0.5657525 0.3347009 9.096620e-02\n24 exposure outcome 9J8pv4 IyUv6b NA rs6265 0.6457693 0.1901871 6.851804e-04\n25 exposure outcome 9J8pv4 IyUv6b NA rs6736712 0.5606962 0.3448784 1.039966e-01\n26 exposure outcome 9J8pv4 IyUv6b NA rs7560832 0.6032080 0.2904972 3.785077e-02\n27 exposure outcome 9J8pv4 IyUv6b NA rs825486 -0.6152759 0.3500334 7.878772e-02\n28 exposure outcome 9J8pv4 IyUv6b NA rs9348441 -4.9786332 0.2572782 1.992909e-83\n29 exposure outcome 9J8pv4 IyUv6b NA All - Inverse variance weighted 0.5598956 0.2322581 1.592361e-02\n30 exposure outcome 9J8pv4 IyUv6b NA All - MR Egger 1.3337580 0.6948526 6.596064e-02\n
leave-one-out MR
res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n\nexposure outcome id.exposure id.outcome samplesize SNP b se p\n<chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <dbl> <dbl>\n1 exposure outcome 9J8pv4 IyUv6b NA rs10198356 0.5562834 0.2424917 2.178871e-02\n2 exposure outcome 9J8pv4 IyUv6b NA rs10209994 0.5520576 0.2388122 2.079526e-02\n3 exposure outcome 9J8pv4 IyUv6b NA rs10824329 0.5585335 0.2390239 1.945341e-02\n4 exposure outcome 9J8pv4 IyUv6b NA rs10938397 0.5412688 0.2388709 2.345460e-02\n5 exposure outcome 9J8pv4 IyUv6b NA rs11066132 0.5580606 0.2417275 2.096381e-02\n6 exposure outcome 9J8pv4 IyUv6b NA rs12522139 0.5667102 0.2395064 1.797373e-02\n7 exposure outcome 9J8pv4 IyUv6b NA rs12591730 0.5524802 0.2390990 2.085075e-02\n8 exposure outcome 9J8pv4 IyUv6b NA rs13013021 0.5189715 0.2386808 2.968017e-02\n9 exposure outcome 9J8pv4 IyUv6b NA rs1955337 0.5602635 0.2394505 1.929468e-02\n10 exposure outcome 9J8pv4 IyUv6b NA rs2076308 0.5431355 0.2394403 2.330758e-02\n11 exposure outcome 9J8pv4 IyUv6b NA rs2278557 0.5583634 0.2394924 1.972992e-02\n12 exposure outcome 9J8pv4 IyUv6b NA rs2304608 0.5372557 0.2377325 2.382639e-02\n13 exposure outcome 9J8pv4 IyUv6b NA rs2531995 0.5419016 0.2379712 2.277590e-02\n14 exposure outcome 9J8pv4 IyUv6b NA rs261967 0.5358761 0.2376686 2.415093e-02\n15 exposure outcome 9J8pv4 IyUv6b NA rs35332469 0.5735907 0.2378345 1.587739e-02\n16 exposure outcome 9J8pv4 IyUv6b NA rs35560038 0.6734906 0.2217804 2.391474e-03\n17 exposure outcome 9J8pv4 IyUv6b NA rs3755804 0.5610215 0.2413249 2.008503e-02\n18 exposure outcome 9J8pv4 IyUv6b NA rs4470425 0.5568993 0.2392632 1.993549e-02\n19 exposure outcome 9J8pv4 IyUv6b NA rs476828 0.5037555 0.2443224 3.922224e-02\n20 exposure outcome 9J8pv4 IyUv6b NA rs4883723 0.5602050 0.2397325 1.945000e-02\n21 exposure outcome 9J8pv4 IyUv6b NA rs509325 0.5608429 0.2468506 2.308693e-02\n22 exposure outcome 9J8pv4 IyUv6b NA rs55872725 0.4419446 0.2454771 7.180543e-02\n23 exposure outcome 9J8pv4 IyUv6b NA rs6089309 0.5597859 0.2388902 1.911519e-02\n24 exposure outcome 9J8pv4 IyUv6b NA rs6265 0.5547068 0.2436910 2.282978e-02\n25 exposure outcome 9J8pv4 IyUv6b NA rs6736712 0.5598815 0.2387602 1.902944e-02\n26 exposure outcome 9J8pv4 IyUv6b NA rs7560832 0.5588113 0.2396229 1.969836e-02\n27 exposure outcome 9J8pv4 IyUv6b NA rs825486 0.5800026 0.2367545 1.429330e-02\n28 exposure outcome 9J8pv4 IyUv6b NA rs9348441 0.7378967 0.1366838 6.717515e-08\n29 exposure outcome 9J8pv4 IyUv6b NA All 0.5598956 0.2322581 1.592361e-02\n
"},{"location":"16_mendelian_randomization/#visualization","title":"Visualization","text":""},{"location":"16_mendelian_randomization/#scatter-plot","title":"Scatter plot","text":"res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
"},{"location":"16_mendelian_randomization/#single-snp","title":"Single SNP","text":"res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
"},{"location":"16_mendelian_randomization/#leave-one-out","title":"Leave-one-out","text":"res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
"},{"location":"16_mendelian_randomization/#funnel-plot","title":"Funnel plot","text":"res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
"},{"location":"16_mendelian_randomization/#mr-steiger-directionality-test","title":"MR Steiger directionality test","text":"MR Steiger directionality test is a method to test the causal direction.
Steiger test: test whether the SNP-outcome correlation is greater than the SNP-exposure correlation.
harmonized_data$\"r.outcome\" <- get_r_from_lor(\n harmonized_data$\"beta.outcome\",\n harmonized_data$\"eaf.outcome\",\n 45383,\n 132032,\n 0.26,\n model = \"logit\",\n correction = FALSE\n)\n\nout <- directionality_test(harmonized_data)\nout\n\nid.exposure id.outcome exposure outcome snp_r2.exposure snp_r2.outcome correct_causal_direction steiger_pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <lgl> <dbl>\nrvi6Om ETcv15 BMI T2D 0.02125453 0.005496427 TRUE NA\n
Reference: Hemani, G., Tilling, K., & Davey Smith, G. (2017). Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS genetics, 13(11), e1007081.
"},{"location":"16_mendelian_randomization/#mr-base-web-app","title":"MR-Base (web app)","text":"MR-Base web app
"},{"location":"16_mendelian_randomization/#strobe-mr","title":"STROBE-MR","text":"Before reporting any MR results, please check the STROBE-MR Checklist first, which consists of 20 things that should be addressed when reporting a mendelian randomization study.
Coloc
uses the assumption of 0 or 1 causal variant in each trait, and tests for whether they share the same causal variant.
Note
Actually such a assumption is different from fine-mapping. In fine-mapping, the aim is to find the putative causal variants, which is determined at birth. In colocalization, the aim is to find the \"signal overlapping\" to support the causality inference, like eQTL --> A trait. It is possible that the causal variants are different in two traits.
Datasets used:
coloc
requires \"beta\", \"varbeta\", and \"snp\". For quantitative traits, the trait standard deviation \"sdY\" is required to estimate the scale of estimated beta.Result interpretation:
Basically, five configurations are calculated,
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf \n## 1.73e-08 7.16e-07 2.61e-05 8.20e-05 1.00e+00 \n## [1] \"PP abf for shared variant: 100%\"\n
\\(H_0\\): neither trait has a genetic association in the region
\\(H_1\\): only trait 1 has a genetic association in the region
\\(H_2\\): only trait 2 has a genetic association in the region
\\(H_3\\): both traits are associated, but with different causal variants
\\(H_4\\): both traits are associated and share a single causal variant
PP.H4.abf
is the posterior probability that two traits share a same causal variant.
Then based on H4
is true, a 95% credible set could be constructed (as a shared causal variant does not necessarily mean a specific variant).
o <- order(my.res$results$SNP.PP.H4,decreasing=TRUE)\ncs <- cumsum(my.res$results$SNP.PP.H4[o])\nw <- which(cs > 0.95)[1]\nmy.res$results[o,][1:w,]$snp\n
References:
Coloc: a package for colocalisation analyses
"},{"location":"17_colocalization/#coloc-assuming-multiple-causal-variants-or-multiple-signals","title":"Coloc assuming multiple causal variants or multiple signals","text":"When the single-causal variant assumption is violeted, several ways could be used to relieve it.
Assuming multiple causal variants in SuSiE-Coloc pipeline. In this pipeline, putative causal variants are fine-mapped, then each signal is passed to the coloc engine.
Conditioning analysis using GCTA-COJO-Coloc pipeline. In this pipeline, signals are segregated, then passed to the coloc engine.
Many other strategies and pipelines are available for colocalization and prioritize the variants/genes/traits. For example: * HyPrColoc * OpenTargets *
"},{"location":"18_Conditioning_analysis/","title":"Conditioning analysis","text":"Multiple association signals could exist in one locus, especially when observing complex LD structures in the regional plot. Conditioning on one signal allows the separation of independent signals.
Several ways to perform the conditioning analysis:
First, extract the individual genotype (dosage) to the text file. Then add it to covariates.
plink2 \\\n --pfile chr1.dose.Rsq0.3 vzs \\\n --extract chr1.list \\\n --threads 1 \\\n --export A \\\n --out genotype/chr1\n
The exported format could be found in Export non-PLINK 2 fileset.
Note
Major allele dosage would be outputted. If adding ref-first
, REF allele would be outputted. It does not matter as a covariate.
Then just paste it to the covariates table and run the association test.
Note
Some association test software will also provide options for condition analysis. For example, in PLINK, you can use --condition <variant ID>
for condition analysis. You can simply provide a list of variant IDs to run the condition analysis.
If raw genotypes and phenotypes are not available, GCTA-COJO performs conditioning analysis using sumstats and external LD reference.
cojo-top-SNPs 10
will perform a step-wise model selection to select 10 independently associated SNPs (including non-significant ones).
gcta \\\n --bfile chr1 \\\n --chr 1 \\\n --maf 0.001 \\\n --cojo-file chr1_cojo.input \\\n --cojo-top-SNPs 10 \\\n --extract-region-bp 1 152383617 5000 \\\n --out chr1_cojo.output\n
Note
bfile
is used to generate LD. A size of > 4000 unrelated samples is suggested. Estimation of LD in GATC is based on the hard-call genotype.
Input file format less chr1_cojo.input
:
ID ALLELE1 ALLELE0 A1FREQ BETA SE P N\nchr1:11171:CCTTG:C C CCTTG 0.0831407 -0.0459889 0.0710074 0.5172 180590\nchr1:13024:G:A A G 1.63957e-05 -3.2714 3.26302 0.3161 180590\n
Here A1
is the effect allele. Then --cojo-cond
could be used to generate new sumstats conditioned on the above-selected variant(s).
Reference:
In meiosis, homologous chromosomes are recombined. Recombination rates at different DNA regions are not equal. The fragments can be detected after tens of generations, causing Linkage disequilibrium, which refers to the non-random association of alleles of different loci.
Factors affecting LD
Suppose we have two SNPs whose alleles are \\(A/a\\) and \\(B/b\\).
The haplotype frequencies are:
Haplotype Frequency AB \\(p_{AB}\\) Ab \\(p_{Ab}\\) aB \\(p_{aB}\\) ab \\(p_{ab}\\)The allele frequencies are:
Allele Frequency A \\(p_A=p_{AB}+p_{Ab}\\) a \\(p_A=p_{aB}+p_{ab}\\) B \\(p_A=p_{AB}+p_{aB}\\) b \\(p_A=p_{Ab}+p_{ab}\\)D : the level of LD between A and B can be estimated using coefficient of linkage disequilibrium (D), which is defined as:
\\[D_{AB} = p_{AB} - p_Ap_B\\]If A and B are in linkage equilibrium, we can get
\\[D_{AB} = p_{AB} - p_Ap_B = 0\\]which means the coefficient of linkage disequilibrium is 0 in this case.
D can be calculated for each pair of alleles and their relationships can be expressed as:
\\[D_{AB} = -D_{Ab} = -D_{aB} = D_{ab} \\]So we can simply denote \\(D = D_{AB}\\), and the relationship between haplotype frequencies and allele frequencies can be summarized in the following table.
Allele A a Total B \\(p_{AB}=p_Ap_B+D\\) \\(p_{aB}=p_ap_B-D\\) \\(p_B\\) b \\(p_{AB}=p_Ap_b-D\\) \\(p_{AB}=p_ap_b+D\\) \\(p_b\\) Total \\(p_A\\) \\(p_a\\) 1The range of possible values of D depends on the allele frequencies, which is not suitable for comparison between different pairs of alleles.
Lewontin suggested a method for the normalization of D :
\\[D_{normalized} = {{D}\\over{D_{max}}}\\]where
\\[ D_{max} = \\begin{cases} max\\{-p_Ap_B, -(1-p_A)(1-p_B)\\} & \\text{when } D \\lt 0 \\\\ min\\{ p_A(1-p_B), p_B(1-p_A) \\} & \\text{when } D \\gt 0 \\\\ \\end{cases} \\]It measures how much proportion of the haplotypes had undergone recombination.
In practice, the most commonly used alternative metric to \\(D_{normalized}\\) is \\(r^2\\), the correlation coefficient, which can be obtained by:
\\[ r^2 = {{D^2}\\over{p_A(1-p_A)p_B(1-p_B)}} \\]Reference: Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485.
"},{"location":"19_ld/#ld-calculation-using-software","title":"LD Calculation using software","text":""},{"location":"19_ld/#ldstore2","title":"LDstore2","text":"LDstore2: http://www.christianbenner.com/#
Reference: Benner, C. et al. Prospects of fine-papping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
"},{"location":"19_ld/#plink-ld","title":"PLINK LD","text":"Please check Calculate LD using PLINK.
"},{"location":"19_ld/#ld-lookup-using-ldlink","title":"LD Lookup using LDlink","text":"LDlink
LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.
https://ldlink.nci.nih.gov/?tab=home
Reference: Machiela, M. J., & Chanock, S. J. (2015). LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31(21), 3555-3557.
LDlink is a very useful tool for quick lookups of any information related to LD.
"},{"location":"19_ld/#ldlink-ldpair","title":"LDlink-LDpair","text":"LDpair
"},{"location":"19_ld/#ldlink-ldproxy","title":"LDlink-LDproxy","text":"LDproxy for rs671
"},{"location":"19_ld/#query-in-batch-using-ldlink-api","title":"Query in batch using LDlink API","text":"LDlink provides API for queries using command line.
You need to register and get a token first.
https://ldlink.nci.nih.gov/?tab=apiaccess
Query LD proxies for variants using LDproxy API
curl -k -X GET 'https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs3&pop=MXL&r2_d=r2&window=500000& genome_build=grch37&token=faketoken123'\n
"},{"location":"19_ld/#ldlinkr","title":"LDlinkR","text":"There is also a related R package for LDlink.
Query LD proxies for variants using LDlinkR
install.packages(\"LDlinkR\")\n\nlibrary(LDlinkR)\n\nmy_proxies <- LDproxy(snp = \"rs671\", \n pop = \"EAS\", \n r2d = \"r2\", \n token = \"YourTokenHere123\",\n genome_build = \"grch38\"\n )\n
Reference: Myers, T. A., Chanock, S. J., & Machiela, M. J. (2020). LDlinkR: an R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Frontiers in genetics, 11, 157.
"},{"location":"19_ld/#ld-pruning","title":"LD-pruning","text":"Please check LD-pruning
"},{"location":"19_ld/#ld-clumping","title":"LD-clumping","text":"Please check LD-clumping
"},{"location":"19_ld/#ld-score","title":"LD score","text":"Definition: https://cloufield.github.io/GWASTutorial/08_LDSC/#ld-score
"},{"location":"19_ld/#ldsc","title":"LDSC","text":"LD score can be estimated with LDSC using PLINK format genotype data as the reference panel.
plinkPrefix=chr22\n\npython ldsc.py \\\n --bfile ${plinkPrefix}\n --l2 \\\n --ld-wind-cm 1\\\n --out ${plinkPrefix}\n
Check here for details.
"},{"location":"19_ld/#gcta","title":"GCTA","text":"GCTA also provides a function to estimate LD scores using PLINK format genotype data.
plinkPrefix=chr22\n\ngcta64 \\\n --bfile ${plinkPrefix} \\\n --ld-score \\\n --ld-wind 1000 \\\n --ld-rsq-cutoff 0.01 \\\n --out ${plinkPrefix}\n
Check here for details.
"},{"location":"19_ld/#ld-score-regression","title":"LD score regression","text":"Please check LD score regression
"},{"location":"19_ld/#reference","title":"Reference","text":"This table shows the relationship between the null hypothesis \\(H_0\\) and the results of a statistical test (whether or not to reject the null hypothesis \\(H_0\\) ).
H0 is True H0 is False Do Not Reject True negative : \\(1 - \\alpha\\) Type II error (false negative) : \\(\\beta\\) Reject Type I error (false positive) : \\(\\alpha\\) True positive : \\(1 - \\beta\\)\\(\\alpha\\) : significance level
By definition, the statistical power of a test refers to the probability that the test will correctly reject the null hypothesis, namely the True positive rate in the table above.
\\(Power = Pr ( Reject\\ | H_0\\ is\\ False) = 1 - \\beta\\)
Power
Factors affecting power
NCP describes the degree of difference between the alternative hypothesis \\(H_1\\) and the null hypothesis \\(H_0\\) values.
Consider a simple linear regression model:
\\[y = \\mu +\\beta x + \\epsilon\\]The variance of the error term:
\\[\\sigma^2 = Var(y) - Var(x)\\beta^2\\]Usually, the phenotypic variance that a single SNP could explain is very limited, so we can approximate \\(\\sigma^2\\) by:
\\[ \\sigma^2 \\thickapprox Var(y)\\]Under Hardy-Weinberg equilibrium, we can get:
\\[Var(x) = 2f(1-f)\\]So the Non-centrality parameter(NCP) \\(\\lambda\\) for \\(\\chi^2\\) distribution with degree of freedom 1:
\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2\\]"},{"location":"20_power_analysis/#power-for-quantitative-traits","title":"Power for quantitative traits","text":"\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2 \\thickapprox N \\times {{Var(x)\\beta^2}\\over{\\sigma^2}} \\thickapprox N \\times {{2f(1-f) \\beta^2 }\\over {Var(y)}} \\]Significance threshold: \\(C = CDF_{\\chi^2}^{-1}(1 - \\alpha,df=1)\\)
Denote :
Null hypothesis : \\(P_{case} = P_{control}\\)
To test whether one proportion \\(P_{case}\\) equals the other proportion \\(P_{control}\\), the test statistic is:
\\[z = {{P_{case} - P_{control}}\\over {\\sqrt{ {{P_{case}(1 - P_{case})}\\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\\over{2N_{control}}} }}}\\]Significance threshold: \\(C = \\Phi^{-1}(1 - \\alpha / 2 )\\)
\\[ Power = Pr(|Z|>C) = 1 - \\Phi(-C-z) + \\Phi(C-z)\\]GAS power calculator
GAS power calculator implemented this method, and you can easily calculate the power using their website
"},{"location":"20_power_analysis/#reference","title":"Reference:","text":"Most variants identified in GWAS are located in regulatory regions, and these genetic variants could potentially affect complex traits through gene expression.
However, due to the limitation of samples and high cost, it is difficult to measure gene expression at a large scale. Consequently, many expression-trait associations have not been detected, especially for those with small effect sizes.
To address these issues, alternative approaches have been proposed and transcriptome-wide association study (TWAS) has become a common and easy-to-perform approach to identify genes whose expression is significantly associated with complex traits in individuals without directly measured expression levels.
GWAS and TWAS
"},{"location":"21_twas/#definition","title":"Definition","text":"TWAS is a method to identify significant expression-trait associations using expression imputation from genetic data or summary statistics.
Individual-level and summary-level TWAS
"},{"location":"21_twas/#fusion","title":"FUSION","text":"In this tutorial, we will introduce FUSION, which is one of the most commonly used tools for performing transcriptome-wide association studies (TWAS) using summary-level data.
url : http://gusevlab.org/projects/fusion/
FUSION trains predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. (http://gusevlab.org/projects/fusion/)
Quote
Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.
"},{"location":"21_twas/#algorithm-for-imputing-expression-into-gwas-summary-statistics","title":"Algorithm for imputing expression into GWAS summary statistics","text":"ImpG-Summary algorithm was extended to impute the Z scores for the cis genetic component of expression.
FUSION statistical model
\\(Z\\) : a vector of standardized effect sizes (z scores) of SNPs for the target trait at a given locus
We impute the Z score of the expression and trait as a linear combination of elements of \\(Z\\) with weights \\(W\\).
\\[ W = \\Sigma_{e,s}\\Sigma_{s,s}^{-1} \\]\\(\\Sigma_{e,s}\\) : covariance matrix between all SNPs and gene expression
\\(\\Sigma_{s,s}\\) : covariance among all SNPs (LD)
Both \\(\\Sigma_{e,s}\\) and \\(\\Sigma_{s,s}\\) are estimated from reference datsets.
\\[ Z \\sim N(0, \\Sigma_{s,s} ) \\]The variance of \\(WZ\\) (imputed z score of expression and trait)
\\[ Var(WZ) = W\\Sigma_{s,s}W^t \\]The imputation Z score can be obtained by:
\\[ {{WZ}\\over{W\\Sigma_{s,s}W^t}^{1/2}} \\]ImpG-Summary algorithm
Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., ... & Price, A. L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.
"},{"location":"21_twas/#installation","title":"Installation","text":"Download FUSION from github and install
wget https://github.com/gusevlab/fusion_twas/archive/master.zip\nunzip master.zip\ncd fusion_twas-master\n
Download and unzip the LD reference data (1000 genome)
wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2\ntar xjvf LDREF.tar.bz2\n
Download and unzip plink2R
wget https://github.com/gabraham/plink2R/archive/master.zip\nunzip master.zip\n
Install R packages
# R >= 4.0\nR\n\ninstall.packages(c('optparse','RColorBrewer'))\ninstall.packages('plink2R-master/plink2R/',repos=NULL)\n
"},{"location":"21_twas/#example","title":"Example","text":"FUSION framework
Input:
Input GWAS sumstats format
Example:
SNP A1 A2 N CHISQ Z\nrs6671356 C T 70100.0 0.172612905312 0.415467092935\nrs6604968 G A 70100.0 0.291125788806 0.539560736902\nrs4970405 A G 70100.0 0.102204513891 0.319694407037\nrs12726255 G A 70100.0 0.312418295691 0.558943911042\nrs4970409 G A 70100.0 0.0524226849517 0.228960007319\n
Get sample sumstats and weights
wget https://data.broadinstitute.org/alkesgroup/FUSION/SUM/PGC2.SCZ.sumstats\n\nmkdir WEIGHTS\ncd WEIGHTS\nwget https://data.broadinstitute.org/alkesgroup/FUSION/WGT/GTEx.Whole_Blood.tar.bz2\ntar xjf GTEx.Whole_Blood.tar.bz2\n
WEIGHTS
files in each WEIGHTS folder
RDat weight files for each gene in a tissue type
GTEx.Whole_Blood.ENSG00000002549.8.LAP3.wgt.RDat GTEx.Whole_Blood.ENSG00000166394.10.CYB5R2.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002822.11.MAD1L1.wgt.RDat GTEx.Whole_Blood.ENSG00000166435.11.XRRA1.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002919.10.SNX11.wgt.RDat GTEx.Whole_Blood.ENSG00000166436.11.TRIM66.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002933.3.TMEM176A.wgt.RDat GTEx.Whole_Blood.ENSG00000166444.13.ST5.wgt.RDat\nGTEx.Whole_Blood.ENSG00000003137.4.CYP26B1.wgt.RDat GTEx.Whole_Blood.ENSG00000166471.6.TMEM41B.wgt.RDat\n...\n
Expression imputation
Rscript FUSION.assoc_test.R \\\n--sumstats PGC2.SCZ.sumstats \\\n--weights ./WEIGHTS/GTEx.Whole_Blood.pos \\\n--weights_dir ./WEIGHTS/ \\\n--ref_ld_chr ./LDREF/1000G.EUR. \\\n--chr 22 \\\n--out PGC2.SCZ.22.dat\n
Results
head PGC2.SCZ.22.dat\nPANEL FILE ID CHR P0 P1 HSQ BEST.GWAS.ID BEST.GWAS.Z EQTL.ID EQTL.R2 EQTL.Z EQTL.GWAS.Z NSNP NWGT MODEL MODELCV.R2 MODELCV.PV TWAS.Z TWAS.P\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000273311.1.DGCR11.wgt.RDat DGCR11 22 19033675 19035888 0.0551 rs2238767 -2.98 rs2283641 0.013728 4.33 2.5818 408 1 top1 0.014 0.018 2.5818 9.83e-03\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000100075.5.SLC25A1.wgt.RDat SLC25A1 22 19163095 19166343 0.0740 rs2238767 -2.98 rs762523 0.080367 5.36 -1.8211 406 1 top1 0.08 7.2e-08 -1.821 6.86e-02\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000070371.11.CLTCL1.wgt.RDat CLTCL1 22 19166986 19279239 0.1620 rs4819843 3.04 rs809901 0.072193 5.53 -1.9928 456 19 enet 0.085 2.8e-08 -1.880 6.00e-02\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000232926.1.AC000078.5.wgt.RDat AC000078.5 22 19874812 19875493 0.2226 rs5748555 -3.15 rs13057784 0.052796 5.60 -0.1652 514 44 enet 0.099 2e-09 0.0524 9.58e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000185252.13.ZNF74.wgt.RDat ZNF74 22 20748405 20762745 0.1120 rs595272 4.09 rs1005640 0.001422 3.44 -1.3677 301 8 enet 0.008 0.054 -0.8550 3.93e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000099940.7.SNAP29.wgt.RDat SNAP29 22 21213771 21245506 0.1286 rs595272 4.09 rs4820575 0.061763 5.94 -1.1978 416 27 enet 0.079 9.4e-08 -1.0354 3.00e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000272600.1.AC007308.7.wgt.RDat AC007308.7 22 21243494 21245502 0.2076 rs595272 4.09 rs165783 0.100625 6.79 -0.8871 408 12 lasso 0.16 5.4e-1 -1.2049 2.28e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000183773.11.AIFM3.wgt.RDat AIFM3 22 21319396 21335649 0.0676 rs595272 4.09 rs565979 0.036672 4.50 -0.4474 362 1 top1 0.037 0.00024 -0.4474 6.55e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000230513.1.THAP7-AS1.wgt.RDat THAP7-AS1 22 21356175 21357118 0.2382 rs595272 4.09 rs2239961 0.105307 -7.04 -0.3783 347 5 lasso 0.15 7.6e-1 0.2292 8.19e-01\n
Descriptions of the output (cited from http://gusevlab.org/projects/fusion/ )
Column number Column header Value Usage 1 FILE … Full path to the reference weight file used 2 ID FAM109B Feature/gene identifier, taken from --weights file 3 CHR 22 Chromosome 4 P0 42470255 Gene start (from --weights) 5 P1 42475445 Gene end (from --weights) 6 HSQ 0.0447 Heritability of the gene 7 BEST.GWAS.ID rs1023500 rsID of the most significant GWAS SNP in locus 8 BEST.GWAS.Z -5.94 Z-score of the most significant GWAS SNP in locus 9 EQTL.ID rs5758566 rsID of the best eQTL in the locus 10 EQTL.R2 0.058680 cross-validation R2 of the best eQTL in the locus 11 EQTL.Z -5.16 Z-score of the best eQTL in the locus 12 EQTL.GWAS.Z -5.0835 GWAS Z-score for this eQTL 13 NSNP 327 Number of SNPs in the locus 14 MODEL lasso Best performing model 15 MODELCV.R2 0.058870 cross-validation R2 of the best performing model 16 MODELCV.PV 3.94e-06 cross-validation P-value of the best performing model 17 TWAS.Z 5.1100 TWAS Z-score (our primary statistic of interest) 18 TWAS.P 3.22e-07 TWAS P-value"},{"location":"21_twas/#limitations","title":"Limitations","text":"Significant loci identified in TWAS also contain multiple trait-associated genes. GWAS often identifies multiple variants in LD. Similarly, TWAS frequently identifies multiple genes in a locus.
Co-regulation may cause false positive results. Just like SNPs are correlated due to LD, gene expressions are often correlated due to co-regulation.
Sometimes even when co-regulation is not captured, the shared variants (or variants in strong LD) in different expression prediction models may cause false positive results.
Predicted expression accounts for only a limited portion of total gene expression. Total expression is affected not only by genetic components like cis-eQTLs but also by other factors like environmental and technical components.
Other factors. For example, the window size for selecting variants may affect association results.
TWAS aims to test the relationship of the phenotype with the genetic component of gene expression. However, under the current framework, TWAS only tests the relationship of the phenotype with the predicted gene expression, without accounting for the uncertainty in that prediction. The key point here is that the current framework omits from the analysis the fact that the gene expression data are themselves the result of a sampling process.
\"Consequently, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level.\"
Quote
de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08.
"},{"location":"21_twas/#reference","title":"Reference","text":"Overview of REGENIE
Reference: https://rgcgithub.github.io/regenie/overview/
"},{"location":"32_whole_genome_regression/#whole-genome-model","title":"Whole genome model","text":""},{"location":"32_whole_genome_regression/#stacked-regressions","title":"Stacked regressions","text":""},{"location":"32_whole_genome_regression/#firth-correction","title":"Firth correction","text":""},{"location":"32_whole_genome_regression/#tutorial","title":"Tutorial","text":""},{"location":"32_whole_genome_regression/#installation","title":"Installation","text":"Please check here
"},{"location":"32_whole_genome_regression/#step1","title":"Step1","text":"Sample codes for running step 1
plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\n# revise the header of covariate file\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n --step 1 \\\n --bed ${plinkFile} \\\n --extract ${extract} \\\n --phenoFile ${phenoFile} \\\n --covarFile ${covarFile} \\\n --covarColList ${covarList} \\\n --bt \\\n --bsize 1000 \\\n --lowmem \\\n --lowmem-prefix tmpdir/regenie_tmp_preds \\\n --out 1kg_eas_step1_BT\n
"},{"location":"32_whole_genome_regression/#step2","title":"Step2","text":"Sample codes for running step 2
plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n --step 2 \\\n --bed ${plinkFile} \\\n --ref-first \\\n --phenoFile ${phenoFile} \\\n --covarFile ${covarFile} \\\n --covarColList ${covarList} \\\n --bt \\\n --bsize 400 \\\n --firth --approx --pThresh 0.01 \\\n --pred 1kg_eas_step1_BT_pred.list \\\n --out 1kg_eas_step1_BT\n
"},{"location":"32_whole_genome_regression/#visualization","title":"Visualization","text":""},{"location":"32_whole_genome_regression/#reference","title":"Reference","text":"Risk: the probability that a subject within a population will develop a given disease, or other health outcome, over a specified follow-up period.
\\[ R = {{E}\\over{E + N}} \\]Odds: the likelihood of a new event occurring rather than not occurring. It is the probability that an event will occur divided by the probability that the event will not occur.
\\[ Odds = {E \\over N } \\]"},{"location":"55_measure_of_effect/#hazard","title":"Hazard","text":"Hazard function \\(h(t)\\): the event rate at time \\(t\\) conditional on survival until time \\(t\\) (namely, \\(T\u2265t\\))
\[ h(t) = Pr(t \le T < t+1 | T \ge t) \]T is a discrete random variable indicating the time of occurrence of the event.
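As a toy illustration (hypothetical event times, assuming complete follow-up), the discrete-time hazard can be computed in R as the number of events at \(t\) divided by the number still at risk at \(t\):
event_time <- c(1, 2, 2, 3, 5)                            # hypothetical event times\nat_risk <- sapply(1:5, function(t) sum(event_time >= t))  # subjects with T >= t\nevents  <- sapply(1:5, function(t) sum(event_time == t))  # events occurring at t\nevents / at_risk                                          # h(t) for t = 1..5\n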
"},{"location":"55_measure_of_effect/#relative-risk-rr-and-odds-ratio-or","title":"Relative risk (RR) and Odds ratio (OR)","text":""},{"location":"55_measure_of_effect/#22-contingency-table","title":"2\u00d72 Contingency Table","text":"Intervention I Control C Events E IE CE Non-events N IN CN"},{"location":"55_measure_of_effect/#relative-risk-rr","title":"Relative risk (RR)","text":"RR: relative risk (risk ratio), usually used in cohort studies.
\[ RR = {{R_{Intervention}}\over{R_{control}}}={{IE/(IE+IN)}\over{CE/(CE+CN)}} \]"},{"location":"55_measure_of_effect/#odds-ratio-or","title":"Odds ratio (OR)","text":"OR: usually used in case-control studies.
\[ OR = {{Odds_{Intervention}}\over{Odds_{control}}}={{IE/IN}\over{CE/CN}} = {{IE * CN}\over{CE * IN}} \]When the event occurs in less than 10% of the unexposed population, the OR provides a reasonable approximation of the RR.
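A minimal R sketch computing RR and OR from the 2×2 table above, with made-up counts for illustration:
IE <- 30; IN <- 70   # intervention group: events / non-events\nCE <- 10; CN <- 90   # control group: events / non-events\n(IE/(IE+IN)) / (CE/(CE+CN))   # RR = 3\n(IE*CN) / (CE*IN)             # OR = 3.86\n
Here the control-group risk (10%) is right at the boundary where the OR starts to noticeably overstate the RR.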
"},{"location":"55_measure_of_effect/#hazard-ratios-hr","title":"Hazard ratios (HR)","text":"Hazard ratios (relative hazard) are usually estimated from Cox proportional hazards model:
\\[ h_i(t) = h_0(t) \\times e^{\\beta_0 + \\beta_1X_{i1} + ... + \\beta_nX_{in} } = h_0(t) \\times e^{X_i\\beta } \\]HR: the ratio of the hazard rates corresponding to the conditions characterised by two distinct levels of a treatment variable of interest.
\\[ HR = {{h(t | X_i)}\\over{h(t|X_j)}} = {{h_0(t) \\times e^{X_i\\beta }}\\over{h_0(t) \\times e^{X_j\\beta }}} = e^{(X_i-X_j)\\beta} \\]"},{"location":"60_awk/","title":"AWK","text":""},{"location":"60_awk/#awk-introduction","title":"AWK Introduction","text":"'awk' is one of the most powerful text processing tools for tabular text files.
"},{"location":"60_awk/#awk-syntax","title":"AWK syntax","text":"awk OPTION 'CONDITION {PROCESS}' FILENAME\n
Some special variables in awk:
$0
: all columns$n
: column n. For example, $1 means the first column. $4 means column 4.NR
: Row number.Using the sample sumstats, we will demonstrate some simple but useful one-liners.
# sample sumstats\nhead ../02_Linux_basics/sumstats.txt \n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"60_awk/#example-1","title":"Example 1","text":"Select variants on chromosome 2 (keeping the headers)
awk 'NR==1 || $1==2 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n2 22398 2:22398:C:T C T T ADD 503 1.287540.161017 1.56962 0.116503 .\n2 24839 2:24839:C:T C T T ADD 503 1.318170.179754 1.53679 0.124344 .\n2 26844 2:26844:C:T C T T ADD 503 1.3173 0.161302 1.70851 0.0875413 .\n2 28786 2:28786:T:C T C C ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 30091 2:30091:C:G C G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 30762 2:30762:A:G A G A ADD 503 1.099560.158614 0.598369 0.549594 .\n2 34503 2:34503:G:T G T T ADD 503 1.323720.179789 1.55988 0.118789 .\n2 39340 2:39340:A:G A G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 55237 2:55237:T:C T C C ADD 503 1.314860.161988 1.68983 0.0910614 .\n
The NR
here means row number. The condition here NR==1 || $1==2
means if it is the first row or the first column is equal to 2, conduct the process print $0
, which mean print all columns.
Select all genome-wide significant variants (p<5e-8)
awk 'NR==1 || $13 <5e-8 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"60_awk/#example-3","title":"Example 3","text":"Create a bed-like format for annotation
awk 'NR>1 {print $1,$2,$2,$4,$5}' ../02_Linux_basics/sumstats.txt | head\n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
"},{"location":"60_awk/#awk-workflow","title":"AWK workflow","text":"The workflow of awk can be summarized in the following figure:
awk workflow
"},{"location":"60_awk/#awk-variables","title":"AWK variables","text":"Frequently used awk variables
Variable Desciption NR The number of input records NF The number of input fields FS The input field separator. The default value is\" \"
OFS The output field separator. The default value is \" \"
RS The input record separator. The default value is \"\\n\"
ORS The output record separator.The default value is \"\\n\"
FILENAME The name of the current input file. FNR The current record number in the current file Handle csv and tsv files
head ../03_Data_formats/sample_data.csv\n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
awk -v FS=',' -v OFS=\"\\t\" '{print $1,$2}' sample_data.csv\n#CHROM POS\n1 13273\n1 14599\n1 14604\n1 14930\n1 69897\n1 86331\n1 91581\n1 122872\n1 135163\n
convert csv to tsv
awk 'BEGIN { FS=\",\"; OFS=\"\\t\" } {$1=$1; print}' sample_data.csv\n
Skip and replace headers
awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"CHR\\tPOS\"} NR>1 {print $1,$2}' sample_data.csv\n\nCHR POS\n1 13273\n1 14599\n1 14604\n1 14930\n1 69897\n1 86331\n1 91581\n1 122872\n1 135163\n
Extract a line
awk 'NR==4' sample_data.csv\n\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n
Print the last two columns
awk -v FS=',' '{print $(NF-1),$(NF)}' sample_data.csv\nP ERRCODE\n0.305961 .\n0.0104299 .\n0.0104299 .\n0.0269602 .\n0.0188466 .\n0.102694 .\n0.522847 .\n0.703856 .\n0.155079 .\n
"},{"location":"60_awk/#awk-operators","title":"AWK operators","text":"Arithmetic Operators
Arithmetic Operators Desciption+
add -
subtract *
multiply \\
divide %
modulus division **
x**y : x raised to the y-th power Logical Operators
Logical Operators Desciption\\|\\|
or &&
and !
not"},{"location":"60_awk/#awk-functions","title":"AWK functions","text":"Numeric functions in awk
Convert OR and P to BETA and -log10(P)
awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"SNPID\\tBETA\\tMLOG10P\"}NR>1{print $3,log($10),-log($13)/log(10)}' sample_data.csv\nSNPID BETA MLOG10P\n1:13273:G:C -0.287458 0.514334\n1:14599:T:A 0.593172 1.98172\n1:14604:A:G 0.593172 1.98172\n1:14930:A:G 0.531446 1.56928\n1:69897:T:C 0.457438 1.72477\n1:86331:A:G 0.385303 0.988455\n1:91581:G:A -0.0785866 0.281625\n1:122872:T:G 0.0687142 0.152516\n1:135163:C:T -0.339927 0.809447\n
String manipulating functions in awk
$ awk --help\nUsage: awk [POSIX or GNU style options] -f progfile [--] file ...\nUsage: awk [POSIX or GNU style options] [--] 'program' file ...\nPOSIX options: GNU long options: (standard)\n -f progfile --file=progfile\n -F fs --field-separator=fs\n -v var=val --assign=var=val\nShort options: GNU long options: (extensions)\n -b --characters-as-bytes\n -c --traditional\n -C --copyright\n -d[file] --dump-variables[=file]\n -D[file] --debug[=file]\n -e 'program-text' --source='program-text'\n -E file --exec=file\n -g --gen-pot\n -h --help\n -i includefile --include=includefile\n -l library --load=library\n -L[fatal|invalid] --lint[=fatal|invalid]\n -M --bignum\n -N --use-lc-numeric\n -n --non-decimal-data\n -o[file] --pretty-print[=file]\n -O --optimize\n -p[file] --profile[=file]\n -P --posix\n -r --re-interval\n -S --sandbox\n -t --lint-old\n -V --version\n\nTo report bugs, see node `Bugs' in `gawk.info', which is\nsection `Reporting Problems and Bugs' in the printed version.\n\ngawk is a pattern scanning and processing language.\nBy default it reads standard input and writes standard output.\n\nExamples:\n gawk '{ sum += $1 }; END { print sum }' file\n gawk -F: '{ print $1 }' /etc/passwd\n
"},{"location":"60_awk/#reference","title":"Reference","text":"sed
is also one of the most commonly used test-editing command in Linux, which is short for stream editor. sed
command edits the text from standard input in a line-by-line approach.
sed [OPTIONS] PROCESS [FILENAME]\n
"},{"location":"61_sed/#examples","title":"Examples","text":""},{"location":"61_sed/#sample-input","title":"sample input","text":"head ../02_Linux_basics/sumstats.txt\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"61_sed/#example-1-replacing-strings","title":"Example 1: Replacing strings","text":"s
for substitute g
for global
Replacing strings
\"Replace the separator from :
to _
\"
head 02_Linux_basics/sumstats.txt | sed 's/:/_/g'\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1_13273_G_C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1_14599_T_A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1_14604_A_G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1_14930_A_G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1_69897_T_C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1_86331_A_G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1_91581_G_A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1_122872_T_G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1_135163_C_T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"61_sed/#example-2-delete-headerthe-first-line","title":"Example 2: Delete header(the first line)","text":"-d
for deletion
Delete header(the first line)
head 02_Linux_basics/sumstats.txt | sed '1d'\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"69_resources/","title":"Resources","text":""},{"location":"69_resources/#sandbox","title":"Sandbox","text":"Sandbox provides tutorials for you to learn how to use bioinformatics tools right from your browser. Everything runs in a sandbox, so you can experiment all you want.
explainshell is a tool (with a web interface) capable of parsing man pages, extracting options and explain a given command-line by matching each argument to the relevant help text in the man page.
R can be downloaded from its official website CRAN (The Comprehensive R Archive Network).
CRAN
https://cran.r-project.org/
"},{"location":"75_R_basics/#install-r-using-conda","title":"Install R using conda","text":"It is convenient to use conda to manage your R environment.
conda install -c conda-forge r-base=4.x.x\n
"},{"location":"75_R_basics/#ide-for-r-positrstudio","title":"IDE for R: Posit(Rstudio)","text":"Posit(Rstudio) is one of the most commonly used Integrated development environment(IDE) for R.
https://posit.co/
"},{"location":"75_R_basics/#use-r-in-interactive-mode","title":"Use R in interactive mode","text":"R\n
"},{"location":"75_R_basics/#run-r-script","title":"Run R script","text":"Rscript mycode.R\n
"},{"location":"75_R_basics/#installing-and-using-r-packages","title":"Installing and Using R packages","text":"install.packages(\"package_name\")\n\nlibrary(package_name)\n
"},{"location":"75_R_basics/#basic-syntax","title":"Basic syntax","text":""},{"location":"75_R_basics/#assignment-and-evaluation","title":"Assignment and Evaluation","text":"> x <- 1\n\n> x\n[1] 1\n\n> print(x)\n[1] 1\n
"},{"location":"75_R_basics/#data-types","title":"Data types","text":""},{"location":"75_R_basics/#atomic-data-types","title":"Atomic data types","text":"logical, integer, real, complex, string (or character)
Atomic data types Description Examples logical booleanTRUE
, FALSE
integer integer 1
,2
numeric float number 0.01
complex complex number 1+0i
string string or chracter abc
"},{"location":"75_R_basics/#vectors","title":"Vectors","text":"myvector <- c(1,2,3)\nmyvector < 1:3\n\nmyvector <- c(TRUE,FALSE)\nmyvector <- c(0.01, 0.02)\nmyvector <- c(1+0i, 2+3i)\nmyvector <- c(\"a\",\"bc\")\n
"},{"location":"75_R_basics/#matrices","title":"Matrices","text":"> mymatrix <- matrix(1:6, nrow = 2, ncol = 3)\n> mymatrix\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n\n> ncol(mymatrix)\n[1] 3\n> nrow(mymatrix)\n[1] 2\n> dim(mymatrix)\n[1] 2 3\n> length(mymatrix)\n[1] 6\n
"},{"location":"75_R_basics/#list","title":"List","text":"list()
is a special vector-like data type that can contain different data types.
> mylist <- list(1, 0.02, \"a\", FALSE, c(1,2,3), matrix(1:6,nrow=2,ncol=3))\n> mylist\n[[1]]\n[1] 1\n\n[[2]]\n[1] 0.02\n\n[[3]]\n[1] \"a\"\n\n[[4]]\n[1] FALSE\n\n[[5]]\n[1] 1 2 3\n\n[[6]]\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n
"},{"location":"75_R_basics/#dataframe","title":"Dataframe","text":"> df <- data.frame(score = c(90,80,70,60), rank = c(\"a\", \"b\", \"c\", \"d\"))\n> df\n score rank\n1 90 a\n2 80 b\n3 70 c\n4 60 d\n
"},{"location":"75_R_basics/#subsetting","title":"Subsetting","text":"myvector\n[1] 1 2 3\n> myvector[0]\ninteger(0)\n> myvector[1]\n[1] 1\nmyvector[1:2]\n[1] 1 2\n> myvector[-1]\n[1] 2 3\n> myvector[-1:-2]\n[1] 3\n
> mymatrix\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n> mymatrix[0]\ninteger(0)\n> mymatrix[1]\n[1] 1\n> mymatrix[1,]\n[1] 1 3 5\n> mymatrix[1,2]\n[1] 3\n> mymatrix[1:2,2]\n[1] 3 4\n> mymatrix[,2]\n[1] 3 4\n
> df\n score rank\n1 90 a\n2 80 b\n3 70 c\n4 60 d\n> df[score]\nError in `[.data.frame`(df, score) : object 'score' not found\n> df[[score]]\nError in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :\n object 'score' not found\n> df[[\"score\"]]\n[1] 90 80 70 60\n> df[\"score\"]\n score\n1 90\n2 80\n3 70\n4 60\n> df[1, \"score\"]\n[1] 90\n> df[1:2, \"score\"]\n[1] 90 80\n> df[1:2,2]\n[1] \"a\" \"b\"\n> df[1:2,1]\n[1] 90 80\n> df[,c(\"rank\",\"score\")]\n rank score\n1 a 90\n2 b 80\n3 c 70\n4 d 60\n
"},{"location":"75_R_basics/#data-input-and-output","title":"Data Input and Output","text":"mydata <- read.table(\"data.txt\", header=T)\n\nwrite.table(mydata, \"data.txt\")\n
"},{"location":"75_R_basics/#control-flow","title":"Control flow","text":""},{"location":"75_R_basics/#if","title":"if","text":"if (x > y){\n print (\"x\")\n} else if (x < y){\n print (\"y\")\n} else {\n print(\"tie\")\n}\n
"},{"location":"75_R_basics/#for","title":"for","text":"> for (x in 1:5) {\n print(x)\n}\n\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n
"},{"location":"75_R_basics/#while","title":"while","text":"x<-0\nwhile (x<5)\n{\n x<-x+1\n print(\"Hello world\")\n}\n\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n
"},{"location":"75_R_basics/#functions","title":"Functions","text":"myfunction <- function(x){\n // actual code here\n return(result)\n}\n\n> my_add_function <- function(x,y){\n c = x + y\n return(c)\n}\n> my_add_function(1,3)\n[1] 4\n
"},{"location":"75_R_basics/#statistical-functions","title":"Statistical functions","text":""},{"location":"75_R_basics/#normal-distribution","title":"Normal distribution","text":"Function Description dnorm(x, mean = 0, sd = 1, log = FALSE) probability density function pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) cumulative density function qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) quantile function rnorm(n, mean = 0, sd = 1) generate random values from normal distribution > dnorm(1.96)\n[1] 0.05844094\n\n> pnorm(1.96)\n[1] 0.9750021\n\n> pnorm(1.96, lower.tail=FALSE)\n[1] 0.0249979\n\n> qnorm(0.975)\n[1] 1.959964\n\n> rnorm(10)\n [1] -0.05595019 0.83176199 0.58362601 -0.89434812 0.85722843 0.96199308\n [7] 0.47782706 -0.46322066 0.03525421 -1.00715141\n
"},{"location":"75_R_basics/#chi-square-distribution","title":"Chi-square distribution","text":"Function Description dchisq(x, df, ncp = 0, log = FALSE) probability density function pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) cumulative density function qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) quantile function rchisq(n, df, ncp = 0) generate random values from normal distribution"},{"location":"75_R_basics/#regression","title":"Regression","text":"lm(formula, data, subset, weights, na.action,\n method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n singular.ok = TRUE, contrasts = NULL, offset, \u2026)\n\n# linear regression\nresults <- lm(formula = y ~ x1 + x2)\n\n# logistic regression\nresults <- lm(formula = y ~ x1 + x2, family = \"binomial\")\n
Reference: - https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
"},{"location":"76_R_resources/","title":"R Resources","text":"Conda is an open-source package and environment management system.
It is a very handy tool when you need to manage python packages.
"},{"location":"80_anaconda/#download","title":"Download","text":"https://www.anaconda.com/products/distribution
For example, download the latest linux version:
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh\n
"},{"location":"80_anaconda/#install","title":"Install","text":"# give it permission to execute\nchmod +x Anaconda3-2021.11-Linux-x86_64.sh \n\n# install\nbash ./Anaconda3-2021.11-Linux-x86_64.sh\n
Follow the instructions on : https://docs.anaconda.com/anaconda/install/linux/
If everything goes well, then you can see the (base)
before the prompt, which indicate the base environment:
(base) [heyunye@gc019 ~]$\n
For how to use conda, please check : https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html
Examples:
# install a specific version of python package\nconda install pandas==1.5.2\n\n#create a new python 3.9 virtual environment with the name \"mypython39\"\nconda create -n mypython39 python=3.9\n\n#use environment.yml to create a virtual environment\nconda env create --file environment.yml\n\n# activate a virtual environment called ldsc\nconda activate ldsc\n\n# change back to base environment\nconda deactivate\n\n# list all packages in your current environment \nconda list\n\n# list all your current environments \nconda env list\n
"},{"location":"81_jupyter_notebook/","title":"Jupyter notebook","text":"Usyally, the conda will install the jupyter notebook (and the ipykernel) by default.
If not, using conda to install it:
conda install jupyter\n
"},{"location":"81_jupyter_notebook/#using-jupyter-notebook-on-a-local-or-remote-server","title":"Using Jupyter notebook on a local or remote server","text":""},{"location":"81_jupyter_notebook/#using-the-default-configuration","title":"Using the default configuration","text":""},{"location":"81_jupyter_notebook/#local-machine","title":"Local machine","text":"You could open it in the Anaconda interface or some other IDE.
If using the terminal, just typing:
jupyter-lab --port 9000 & \n
Then open the link in the browser.
http://localhost:9000/lab?token=???\nhttp://127.0.0.1:9000/lab?token=???\n
"},{"location":"81_jupyter_notebook/#remote-server","title":"Remote server","text":"Start in the command line of the remote server, adding a port.
jupyter-lab --ip 0.0.0.0 --port 9000 --no-browser &\n
It will generate an address the same as above. Then, on the local machine, using ssh to listen to the port.
ssh -NfL localhost:9000:localhost:9000 user@host\n
Note that the localhost:9000:localhost:9000
is localmachine:localport:remotemachine:remotehost
and user@host
is the user id and address of the remote server. When this is finished, open the above in the browser.
"},{"location":"81_jupyter_notebook/#using-customized-configuration","title":"Using customized configuration","text":"Steps:
Create a jupyter notebook configuration file if there is no such file
jupyter notebook --generate-config\n
The file is usually stored at:
~/.jupyter/jupyter_notebook_config.py\n
What the first few lines of Configuration file look like:
head ~/.jupyter/jupyter_notebook_config.py\n# Configuration file for jupyter-notebook.\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
"},{"location":"81_jupyter_notebook/#add-the-port-information","title":"Add the port information","text":"Simply add c.NotebookApp.port =8889
to the configuration file and then save. Note: you can change the port you want to use.
# Configuration file for jupyter-notebook.\n\nc.NotebookApp.port = 8889\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
"},{"location":"81_jupyter_notebook/#run-jupyter-notebook-server-on-remote-host","title":"Run jupyter notebook server on remote host","text":"On host side, set up the jupyter notebook server:
jupyter notebook\n
"},{"location":"81_jupyter_notebook/#use-ssh-tunnel-to-connect-to-the-remote-server-from-your-local-machine","title":"Use ssh tunnel to connect to the remote server from your local machine","text":"On your local machine, use ssh tunnel to connect to the jupyter notebook server:
ssh -N -f -L localhost:8889:localhost:8889 username@your_remote_host_name\n
"},{"location":"81_jupyter_notebook/#use-jupyter-notebook-in-your-browser","title":"Use jupyter notebook in your browser","text":"Then you can access juptyer notebook on your local browser using the link generated by jupyter notebook server. http://127.0.0.1:8889/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In this section, we will briefly demostrate how to install a linux subsystem on windows.
"},{"location":"82_windows_linux_subsystem/#official-documents","title":"Official Documents","text":"\"You must be running Windows 10 version 2004 and higher (Build 19041 and higher) or Windows 11.\"
"},{"location":"82_windows_linux_subsystem/#steps","title":"Steps","text":"Step 3 : Reboot
Step 4 : Run the subsystem
Git is very powerful version control software. Git can track the changes in all the files of your projects and allow collarboration of multiple contributors.
For details, please check: https://git-scm.com/
"},{"location":"83_git_and_github/#github","title":"Github","text":"Github is an online platform, offering a cloud-based Git repository.
https://github.com/
"},{"location":"83_git_and_github/#create-a-new-id","title":"Create a new id","text":"Github signup page:
https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home
"},{"location":"83_git_and_github/#clone-a-repository","title":"Clone a repository","text":"Syntax: git colne <the url you just copied>
Example: git clone https://github.com/Cloufield/GWASTutorial.git
git pull
$ git config --global user.name \"myusername\"\n$ git config --global user.email myusername@myemail.com\n
"},{"location":"83_git_and_github/#create-access-tokens","title":"Create access tokens","text":"Please see github official documents on how to create a personal token:
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
Useful Resources
SSH stands for Secure Shell Protocol, which enables you to connect to remote server safely.
"},{"location":"84_ssh/#login-to-remote-server","title":"Login to remote server","text":"ssh <username>@<host>\n
Before you login in, you need to generate keys for ssh connection:
"},{"location":"84_ssh/#keys","title":"Keys","text":"ssh-keygen -t rsa -b 4096\n
You will get two keys, a public one and a private one. ~/.ssh/id_rsa.pub
~/.ssh/id_rsa
Warning
Don't share your private key with others.
What you need to do is just add you local public key to ~/.ssh/authorized_keys
on host server.
Suppose you are using a local machine:
Donwload files from remote host to local machine
scp <username>@<host>:remote_path local_path\n
Upload files from local machine to remote host
scp local_path <username>@<host>:remote_path\n
Info
-r
: copy recursively. This option is needed when you want to transfer an entire directory.
Example
Copy the local work directory to remote home directory
$ scp -r /home/gwaslab/work gwaslab@remote.com:/home/gwaslab \n
"},{"location":"84_ssh/#ssh-tunneling","title":"SSH Tunneling","text":"Quote
In this forwarding type, the SSH client listens on a given port and tunnels any connection to that port to the specified port on the remote SSH server, which then connects to a port on the destination machine. The destination machine can be the remote SSH server or any other machine. https://linuxize.com/post/how-to-setup-ssh-tunneling/
-L
: Local port forwarding
ssh -L [local_IP:]local_PORT:destination:destination_PORT <username>@<host>\n
"},{"location":"84_ssh/#other-ssh-options","title":"other SSH options","text":"-f
: send to background.-p
: port for connenction (default:22).-N
: not to execute any commands on the remote host. (so you will not open a remote shell but just forward ports.)(If needed) Try to use job scheduling system to run a simple script:
Two of the most commonly used job scheduling systems:
In this self-learning module, we would like you to put your hands on the 1000 Genome Project data and apply the skills you have learned to this mini-project.
Aim
Aim:
Here is a brief overview of this mini project.
The ultimate goal of this assignment is simple, which is to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.
Tip
Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.
"},{"location":"95_Assignment/#1-download-the-publicly-available-1000-genome-vcf","title":"1. Download the publicly available 1000 Genome VCF","text":"Download the files we need from 1000 Genomes Project FTP site:
Tip
Note
If it takes too long or if you are using your local laptop, you can just download the files for chr1.
Sample shell script for downloading the files
#!/bin/bash\nfor chr in $(seq 1 22) #Note: If it takes too long, you can download just chr1.\ndo\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi\ndone\n\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai\n\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed\n
"},{"location":"95_Assignment/#2-re-align-normalize-and-remove-duplication","title":"2. Re-align, normalize and remove duplication","text":"We need to use bcftools to process the raw vcf files.
Install bcftools
http://www.htslib.org/download/
Since the variants are not normalized and also have many duplications, we need to clean the vcf files.
Re-align with the reference genome, normalize variants and remove duplications
#!/bin/bash\nfor chr in $(seq 1 22)\ndo\n bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \\\n ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \\\n bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \\\n bcftools norm -Ob --rm-dup both \\\n > ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \n bcftools index ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf\ndone\n
"},{"location":"95_Assignment/#3-convert-vcf-files-to-plink-binary-format","title":"3. Convert VCF files to plink binary format","text":"Example
#!/bin/bash\nfor chr in $(seq 1 22)\ndo\nplink \\\n --bcf ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \\\n --keep-allele-order \\\n --vcf-idspace-to _ \\\n --const-fid \\\n --allow-extra-chr 0 \\\n --split-x b37 no-fail \\\n --make-bed \\\n --out ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes\ndone\n
"},{"location":"95_Assignment/#4-using-snps-only-in-strict-masks","title":"4. Using SNPs only in strict masks","text":"Strict masks are in this directory.
Strict mask
The overlapped region with this mask is \u201ccallable\u201d (or credible variant calls). This mask was developed in the 1KG main paper and it is well explained in https://www.biostars.org/p/219634/
Tip
Use plink --make-set
option with the BED
files to extract SNPs in the strict mask.
Tip
Use PLINK.
QC: only SNPs (exclude indels), MAF>0.1
Pruning: plink --indep-pariwise
Tip
plink --pca
Draw PC1 - PC2 plot and color each individual by ancestry information (from ALL.panel file). Interpret the result.
Tip
You can use R, python, or any other tools you like (even Excel can do the job.)
(If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)
"},{"location":"95_Assignment/#checklist","title":"Checklist","text":"Note
(Just an example, there is no need to strictly follow this.)
Fundamental Exercise II
","text":"This tutorial is provided by the Laboratory of Complex Trait Genomics (Kamatani Lab) in the Deparment of Computational Biology and Medical Sciences at the Univerty of Tokyo. This tutorial is designed for the graduate course Fundamental Exercise II
.
This repository is currently maintained by Yunye He.
If you have any questions or suggestions, please feel free to contact gwaslab@gmail.com.
Enjoy this real \"Manhattan plot\"!
"},{"location":"Imputation/","title":"Imputation","text":"The missing data imputation is not a task specific to genetic studies. By comparing the genotyping array (generally 500k\u20131M markers) to the reference panel (WGSed), missing markers on the array are filled. The tabular data imputation methods could be used to impute the genotype data. However, haplotypes are coalesced from the ancestors, and the recombination events during gametogenesis, each individual's haplotype is a mosaic of all haplotypes in a population. Given these properties, hidden Markov model (HMM) based methods usually outperform tabular data-based ones.
This HMM was first described in Li & Stephens 2003. Here we will not go through tools over the past 20 years. We will introduce the concept and the usage of Minimac.
"},{"location":"Imputation/#figure-illustration","title":"Figure illustration","text":"In the figure, each row in the above panel represents a reference haplotype. The middle panel shows the genotyping array. Genotyped markers are squared and WGS-only markers are circled. The two colors represent the ref and alt alleles. You could also think they represent different haplotype fragments. The red triangles indicate the recombination hot spots, which a crossover between the reference haplotypes is more likely to happen.
Given the genotyped marker, matching probabilities are calculated for all potential paths through reference haplotypes. Then, in this example (the real case is not this simple), we assumed at each recombination hotspot, there is a free recombination. You will see that all paths chained by dark blue match 2 of the 4 genotyped markers. So these paths have equal probability.
Finally, missing markers are filled with the probability-weighted alleles on each path. For the left three circles, two paths are cyan and one path is orange, the imputation result will be 1/3 orange and 2/3 cyan.
"},{"location":"Imputation/#how-to-do-imputation","title":"How to do imputation","text":"The simplest way is to use the Michigan or TOPMed imputation server, if you don't have resources of WGS data. Just make your vcf, submit it to the server, and select the favored reference panel. There are built-in phasing, liftover, and QC on the server, but we would strongly suggest checking the data and doing these steps by yourself. For example:
Another way is to run the job locally. Recent tools are memory and computation efficient, you may run it in a small in-house server or even PC.
A typical workflow of Minimac is:
Parameter estimation (this step will create a m3vcf reference panel file):
Minimac3 \\\n --refHaps ./phased_reference.vcf.gz \\\n --processReference \\\n --prefix ./phased_reference \\\n --log\n
Imputation:
minimac4 \\\n --refHaps ./phased_reference.m3vcf \\\n --haps ./phased_target.vcf.gz \\\n --prefix ./result \\\n --format GT,DS,HDS,GP,SD \\\n --meta \\\n --log \\\n --cpus 10\n
Details of the options.
"},{"location":"Imputation/#after-imputation","title":"After imputation","text":"The output is a vcf file. First, we need to examine the imputation quality. It can be a long long story and I will not explain it in detail. Most of the time, when the following criteria meet,
The standard imputation quality metric, named Rsq
, efficiently discriminates the well-imputed variants at a threshold 0.7 (may loosen it to 0.3 to allow more variants in the GWAS).
Three types of genotypes are widely used in GWAS -- best-guess genotype, allelic dosage, and genotype probability. Using Dosage (DS) keeps the dataset smallest while most association test software only requires this information.
"},{"location":"PRS_evaluation/","title":"Polygenic risk scores evaluation","text":""},{"location":"PRS_evaluation/#regressions-for-evaluation-of-prs","title":"Regressions for evaluation of PRS","text":"\\[Phenotype \\sim PRS_{phenotype} + Covariates\\] \\[logit(P) \\sim PRS_{phenotype} + Covariates\\]Covariates usually include sex, age and top 10 PCs.
"},{"location":"PRS_evaluation/#evaluation","title":"Evaluation","text":""},{"location":"PRS_evaluation/#roc-aic-auc-and-c-index","title":"ROC, AIC, AUC, and C-index","text":"ROC
ROC: receiver operating characteristic curve shows the performance of a classification model at all thresholds.
AUC
AUC: area under the ROC Curve, a common measure for the performance of a classification model.
AIC
Akaike Information Criterion (AIC): a measure for comparison of different statistical models.
\\[AIC = 2k - 2ln(\\hat{L})\\]C-index
C-index: Harrell\u2019s C-index (concordance index), which is a metric to evaluate the predictive performance of models and is commonly used in survival analysis. It is a measure of the probability that the predicted scores \\(M_i\\) and \\(M_j\\) by a model of two randomly selected individuals \\(i\\) and \\(j\\), have the reverse relative order as their true event times \\(T_i, T_j\\).
\\[ C = Pr (M_j > M_i | T_j < T_i) \\]Interpretation: Individuals with higher scores should have higher risks of the disease events
Reference: Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., & Rosati, R. A. (1982). Evaluating the yield of medical tests. Jama, 247(18), 2543-2546.
Reference: Longato, E., Vettoretti, M., & Di Camillo, B. (2020). A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. Journal of Biomedical Informatics, 108, 103496.
Coefficient of determination
\\(R^2\\) : coefficient of determination, which measures the amount of variance explained by the regression model.
In linear regression:
\\[ R^2 = 1 - {{RSS}\\over{TSS}} \\]Pseudo-R2 (Nagelkerke)
In logistic regression,
One of the most commonly used Pseudo-R2 for PRS analysis is Nagelkerke's \\(R^2\\)
\\[R^2_{Nagelkerke} = {{1 - ({{L_0}\\over{L_M}})^{2/n}}\\over{1 - L_0^{2/n}}}\\]R2 on liability scale
\\(R^2\\) on the liability scale for ascertained case-control studies
\\[ R^2_l = {{R_o^2 C}\\over{1 + R_o^2 \\theta C }} \\]\\(\\theta = m {{P-K}\\over{1-K}} ( m{{P-K}\\over{1-K}} - t)\\)
\\(K\\) : population disease prevalence
Reference : Lee, S. H., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2012). A better coefficient of determination for genetic profile analysis. Genetic epidemiology, 36(3), 214-224.
The authors also provided R codes for calculation (removed unrelated codes for simplicity)
# R2 on the liability scale using the transformation\n\nnt = total number of the sample\nncase = number of cases\nncont = number of controls\nthd = the threshold on the normal distribution which truncates the proportion of disease prevalence\nK = population prevalence\nP = proportion of cases in the case-control samples\n\n#threshold\nthd = -qnorm(K,0,1)\n\n#value of standard normal density function at thd\nzv = dnorm(thd) \n\n#mean liability for case\nmv = zv/K \n\n#linear model\nlmv = lm(y\u223cg) \n\n#R20 : R2 on the observed scale\nR2O = var(lmv$fitted.values)/(ncase/nt*ncont/nt)\n\n# calculate correction factors\ntheta = mv*(P-K)/(1-K)*(mv*(P-K)/(1-K)-thd) \ncv = K*(1-K)/zv^2*K*(1-K)/(P*(1-P)) \n\n# convert to R2 on the liability scale\nR2 = R2O*cv/(1+R2O*theta*cv)\n
"},{"location":"PRS_evaluation/#bootstrap-confidence-interval-methods-for-r2","title":"Bootstrap Confidence Interval Methods for R2","text":"Bootstrap is a commonly used resampling method to generate a sampling distribution from the known sample dataset. It repeatedly takes random samples with replacement from the known sample dataset.
Steps:
The percentile bootstrap interval is then defined as the interval between \\(100 \\times \\alpha /2\\) and \\(100 \\times (1 - \\alpha /2)\\) percentiles of the parameters estimated by bootstrapping. We can use this method to estimate the bootstrap interval for \\(R^2\\).
"},{"location":"PRS_evaluation/#reference","title":"Reference","text":"Human genome is diploid. Distribution of variants between homologous chromosomes can affect the interpretation of genotype data, such as allele specific expression, context-informed annotation, loss-of-function compound heterozygous events.
Example
( SHAPEIT5 )
In the above illustration, when LoF variants are on both copies of a gene, the gene is thought knocked out
Trio data and long read sequencing can solve the haplotyping problem. That is not always possible. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are independent, unphased genotype could be constructed by the addition of two haplotypes.
Recent methods had incopoorates long IBD sharing, local haplotypes, etc, to make it tractable for large datasets. You could read the following methods if you are interested.
In most of the cases, phasing is just a pre-step of imputation, and we do not care about how the phasing goes. But there are several considerations, like reference-based or reference-free, large and small sample size, rare variants cutoff. There is no single method that could best fit all cases.
Here I show one example using EAGLE2.
eagle \\\n --vcf=target.vcf.gz \\\n --geneticMapFile=genetic_map_hg19_withX.txt.gz \\\n --chrom=19 \\\n --outPrefix=target.eagle \\\n --numThreads=10\n
"},{"location":"TwoSampleMR/","title":"TwoSampleMR Tutorial","text":"In\u00a0[1]: Copied! library(data.table)\nlibrary(TwoSampleMR)\nlibrary(data.table) library(TwoSampleMR)
TwoSampleMR version 0.5.6 \n[>] New: Option to use non-European LD reference panels for clumping etc\n[>] Some studies temporarily quarantined to verify effect allele\n[>] See news(package='TwoSampleMR') and https://gwas.mrcieu.ac.uk for further details\n\n\nIn\u00a0[2]: Copied!
exp_raw <- fread(\"koges_bmi.txt.gz\")\n\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_raw$phenotype <- \"BMI\"\n\nexp_raw$n <- 72282\n\nexp_dat <- format_data( exp_raw,\n type = \"exposure\",\n snp_col = \"rsids\",\n beta_col = \"beta\",\n se_col = \"sebeta\",\n effect_allele_col = \"alt\",\n other_allele_col = \"ref\",\n eaf_col = \"af\",\n pval_col = \"pval\",\n phenotype_col = \"phenotype\",\n samplesize_col= \"n\"\n)\nclumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")\nexp_raw <- fread(\"koges_bmi.txt.gz\") exp_raw <- subset(exp_raw,exp_raw$pval<5e-8) exp_raw$phenotype <- \"BMI\" exp_raw$n <- 72282 exp_dat <- format_data( exp_raw, type = \"exposure\", snp_col = \"rsids\", beta_col = \"beta\", se_col = \"sebeta\", effect_allele_col = \"alt\", other_allele_col = \"ref\", eaf_col = \"af\", pval_col = \"pval\", phenotype_col = \"phenotype\", samplesize_col= \"n\" ) clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")
Warning message in .fun(piece, ...):\n\u201cDuplicated SNPs present in exposure data for phenotype 'BMI. Just keeping the first instance:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nrs4665740\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nrs7201608\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\u201d\nAPI: public: http://gwas-api.mrcieu.ac.uk/\n\nPlease look at vignettes for options on running this locally if you need to run many instances of this command.\n\nClumping rvi6Om, 2452 variants, using EAS population reference\n\nRemoving 2420 of 2452 variants due to LD with other variants or absence from LD reference panel\n\nIn\u00a0[16]: Copied!
out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\"))\n\nout_raw$phenotype <- \"T2D\"\n\nout_dat <- format_data( out_raw,\n type = \"outcome\",\n snp_col = \"SNPID\",\n beta_col = \"BETA\",\n se_col = \"SE\",\n effect_allele_col = \"Allele2\",\n other_allele_col = \"Allele1\",\n pval_col = \"p.value\",\n phenotype_col = \"phenotype\",\n samplesize_col= \"n\",\n eaf_col=\"AF_Allele2\"\n)\nout_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\", select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\")) out_raw$phenotype <- \"T2D\" out_dat <- format_data( out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", se_col = \"SE\", effect_allele_col = \"Allele2\", other_allele_col = \"Allele1\", pval_col = \"p.value\", phenotype_col = \"phenotype\", samplesize_col= \"n\", eaf_col=\"AF_Allele2\" )
Warning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201ceffect_allele column has some values that are not A/C/T/G or an indel comprising only these characters or D/I. These SNPs will be excluded.\u201d\nWarning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201cThe following SNP(s) are missing required information for the MR tests and will be excluded\n1:1142714:t:<cn0>\n1:4288465:t:<ins:me:alu>\n1:4882232:t:<cn0>\n[... several hundred further chromosome-1 structural variants (<cn0>, <cn2>, <inv>, <ins:me:alu>, <ins:me:line1>, <ins:me:sva>) omitted ...]\n1:205178526:t:<inv>\u201d\nIn\u00a0[17]: Copied!
harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\nharmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)
Harmonising BMI (rvi6Om) and T2D (ETcv15)\n\nIn\u00a0[18]: Copied!
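A note on the `action` argument (our summary of the TwoSampleMR documentation, not part of the original output): `action = 1` assumes all alleles are reported on the forward strand, `action = 2` attempts to infer the strand of palindromic SNPs from allele frequencies, and `action = 3` drops palindromic SNPs entirely. Use `action = 1` only when you are confident that both datasets use consistent strand and allele coding.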
harmonized_data\nharmonized_data A data.frame: 28 \u00d7 29 SNPeffect_allele.exposureother_allele.exposureeffect_allele.outcomeother_allele.outcomebeta.exposurebeta.outcomeeaf.exposureeaf.outcomeremove\u22efpval.exposurese.exposuresamplesize.exposureexposuremr_keep.exposurepval_origin.exposureid.exposureactionmr_keepsamplesize.outcome <chr><chr><chr><chr><chr><dbl><dbl><dbl><dbl><lgl>\u22ef<dbl><dbl><dbl><chr><lgl><chr><chr><dbl><lgl><lgl> 1rs10198356GAGA 0.044 0.0278218160.4500.46949841FALSE\u22ef1.5e-170.005172282BMITRUEreportedrvi6Om1TRUENA 2rs10209994CACA 0.030 0.0284334240.6400.65770918FALSE\u22ef2.0e-080.005472282BMITRUEreportedrvi6Om1TRUENA 3rs10824329AGAG 0.029 0.0182171190.5100.56240335FALSE\u22ef1.7e-080.005172282BMITRUEreportedrvi6Om1TRUENA 4rs10938397GAGA 0.036 0.0445547360.2800.29915686FALSE\u22ef1.0e-100.005672282BMITRUEreportedrvi6Om1TRUENA 5rs11066132TCTC-0.053-0.0319288060.1600.24197159FALSE\u22ef1.0e-130.007172282BMITRUEreportedrvi6Om1TRUENA 6rs12522139GTGT-0.037-0.0107492430.2700.24543922FALSE\u22ef1.8e-100.005772282BMITRUEreportedrvi6Om1TRUENA 7rs12591730AGAG 0.037 0.0330428120.2200.25367536FALSE\u22ef1.5e-080.006572282BMITRUEreportedrvi6Om1TRUENA 8rs13013021TCTC 0.070 0.1040752230.9070.90195307FALSE\u22ef1.9e-150.008872282BMITRUEreportedrvi6Om1TRUENA 9rs1955337 TGTG 0.036 0.0195935030.3000.24112816FALSE\u22ef7.4e-110.005672282BMITRUEreportedrvi6Om1TRUENA 10rs2076308 CGCG 0.037 0.0413520380.3100.31562874FALSE\u22ef3.4e-110.005572282BMITRUEreportedrvi6Om1TRUENA 11rs2278557 GCGC 0.034 0.0212111960.3200.29052039FALSE\u22ef7.4e-100.005572282BMITRUEreportedrvi6Om1TRUENA 12rs2304608 ACAC 0.031 0.0466695150.4700.44287320FALSE\u22ef1.1e-090.005172282BMITRUEreportedrvi6Om1TRUENA 13rs2531995 TCTC 0.031 0.0433160150.3700.33584772FALSE\u22ef5.2e-090.005372282BMITRUEreportedrvi6Om1TRUENA 14rs261967 CACA 0.032 0.0489708280.4400.39718313FALSE\u22ef3.5e-100.005172282BMITRUEreportedrvi6Om1TRUENA 15rs35332469CTCT-0.035 0.0080755980.2200.17678428FALSE\u22ef3.6e-080.006372282BMITRUEreportedrvi6Om1TRUENA 16rs35560038TATA-0.047 0.0739350890.5900.61936434FALSE\u22ef1.4e-190.005272282BMITRUEreportedrvi6Om1TRUENA 17rs3755804 TCTC 0.043 0.0228541340.2800.30750660FALSE\u22ef1.5e-140.005672282BMITRUEreportedrvi6Om1TRUENA 18rs4470425 ACAC-0.030-0.0208441370.4500.44152032FALSE\u22ef4.9e-090.005172282BMITRUEreportedrvi6Om1TRUENA 19rs476828 CTCT 0.067 0.0786518590.2700.25309742FALSE\u22ef2.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 20rs4883723 AGAG 0.039 0.0213709100.2800.22189601FALSE\u22ef8.3e-120.005772282BMITRUEreportedrvi6Om1TRUENA 21rs509325 GTGT 0.065 0.0356917590.2800.26816326FALSE\u22ef7.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 22rs55872725TCTC 0.090 0.1215170230.1200.20355108FALSE\u22ef1.8e-310.007772282BMITRUEreportedrvi6Om1TRUENA 23rs6089309 CTCT-0.033-0.0186698330.7000.65803267FALSE\u22ef3.5e-090.005672282BMITRUEreportedrvi6Om1TRUENA 24rs6265 TCTC-0.049-0.0316426960.4600.40541994FALSE\u22ef6.1e-220.005172282BMITRUEreportedrvi6Om1TRUENA 25rs6736712 GCGC-0.053-0.0297168990.9170.93023505FALSE\u22ef2.1e-080.009572282BMITRUEreportedrvi6Om1TRUENA 26rs7560832 CACA-0.150-0.0904811950.0120.01129784FALSE\u22ef2.0e-090.025072282BMITRUEreportedrvi6Om1TRUENA 27rs825486 TCTC-0.031 0.0190735540.6900.75485104FALSE\u22ef3.1e-080.005672282BMITRUEreportedrvi6Om1TRUENA 28rs9348441 ATAT-0.036 0.1792307940.4700.42502848FALSE\u22ef1.3e-120.005172282BMITRUEreportedrvi6Om1TRUENA In\u00a0[6]: Copied!
res <- mr(harmonized_data)\nres <- mr(harmonized_data)
Analysing 'rvi6Om' on 'hff6sO'\n\nIn\u00a0[7]: Copied!
res\nres\n\nA data.frame: 5 \u00d7 9\n\nid.exposure id.outcome outcome exposure method nsnp b se pval\n<chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>\nrvi6Om hff6sO T2D BMI MR Egger 28 1.3337580 0.69485260 6.596064e-02\nrvi6Om hff6sO T2D BMI Weighted median 28 0.6298980 0.08516315 1.399605e-13\nrvi6Om hff6sO T2D BMI Inverse variance weighted 28 0.5598956 0.23225806 1.592361e-02\nrvi6Om hff6sO T2D BMI Simple mode 28 0.6097842 0.13305429 9.340189e-05\nrvi6Om hff6sO T2D BMI Weighted mode 28 0.5946778 0.12680355 7.011481e-05\nIn\u00a0[8]: Copied!
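For reference, these methods build on the per-SNP Wald ratio. Writing $\hat\beta_{X_j}$ and $\hat\beta_{Y_j}$ for the SNP-exposure and SNP-outcome effects, the ratio estimate and the inverse-variance weighted (IVW) pooled estimate are (standard definitions, using the first-order standard error):

$$
\hat\theta_j=\frac{\hat\beta_{Y_j}}{\hat\beta_{X_j}},\qquad
\operatorname{se}(\hat\theta_j)\approx\frac{\operatorname{se}(\hat\beta_{Y_j})}{\lvert\hat\beta_{X_j}\rvert},\qquad
\hat\theta_{\mathrm{IVW}}=\frac{\sum_j \hat\theta_j\,\operatorname{se}(\hat\theta_j)^{-2}}{\sum_j \operatorname{se}(\hat\theta_j)^{-2}}.
$$

Here the IVW estimate over the 28 instruments is 0.56 on the log-odds scale of T2D per unit of the exposure effect.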
mr_heterogeneity(harmonized_data)\nmr_heterogeneity(harmonized_data)\n\nA data.frame: 2 \u00d7 8\n\nid.exposure id.outcome outcome exposure method Q Q_df Q_pval\n<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\nrvi6Om hff6sO T2D BMI MR Egger 670.7022 26 1.000684e-124\nrvi6Om hff6sO T2D BMI Inverse variance weighted 706.6579 27 1.534239e-131\nIn\u00a0[9]: Copied!
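The heterogeneity statistic reported here is Cochran's Q, which compares the per-SNP ratio estimates with the pooled estimate:

$$
Q=\sum_{j=1}^{J}\frac{\left(\hat\theta_j-\hat\theta_{\mathrm{pooled}}\right)^{2}}{\operatorname{se}(\hat\theta_j)^{2}},
$$

distributed as $\chi^2_{J-1}$ (IVW) or $\chi^2_{J-2}$ (MR Egger) under homogeneity. The extremely small Q p-values above indicate strong heterogeneity across the 28 instruments, so the outlier-robust estimators (weighted median and mode) deserve extra attention.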
mr_pleiotropy_test(harmonized_data)\nmr_pleiotropy_test(harmonized_data)\n\nA data.frame: 1 \u00d7 7\n\nid.exposure id.outcome outcome exposure egger_intercept se pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\nrvi6Om hff6sO T2D BMI -0.03603697 0.0305241 0.2484472\nIn\u00a0[10]: Copied!
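MR Egger assesses directional pleiotropy by fitting a weighted regression of the SNP-outcome effects on the SNP-exposure effects with a free intercept:

$$
\hat\beta_{Y_j}=\alpha+\theta\,\hat\beta_{X_j}+\varepsilon_j,
$$

where an intercept $\alpha$ different from zero suggests average directional pleiotropy. Here $\hat\alpha=-0.036$ with $p=0.248$, so there is no strong evidence of directional pleiotropy.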
res_single <- mr_singlesnp(harmonized_data)\nres_single <- mr_singlesnp(harmonized_data) In\u00a0[11]: Copied!
res_single\nres_single A data.frame: 30 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs10198356 0.63231400.20828372.398742e-03 2BMIT2Drvi6Omhff6sONArs10209994 0.94778080.32258143.302164e-03 3BMIT2Drvi6Omhff6sONArs10824329 0.62817650.32462145.297739e-02 4BMIT2Drvi6Omhff6sONArs10938397 1.23763160.27758548.251150e-06 5BMIT2Drvi6Omhff6sONArs11066132 0.60243030.22324016.963693e-03 6BMIT2Drvi6Omhff6sONArs12522139 0.29052010.28902403.148119e-01 7BMIT2Drvi6Omhff6sONArs12591730 0.89304900.30766873.700413e-03 8BMIT2Drvi6Omhff6sONArs13013021 1.48678890.22077771.646925e-11 9BMIT2Drvi6Omhff6sONArs1955337 0.54426400.29941466.910079e-02 10BMIT2Drvi6Omhff6sONArs2076308 1.11762260.26579692.613132e-05 11BMIT2Drvi6Omhff6sONArs2278557 0.62385870.29681843.556906e-02 12BMIT2Drvi6Omhff6sONArs2304608 1.50546820.29689053.961740e-07 13BMIT2Drvi6Omhff6sONArs2531995 1.39729080.31301578.045689e-06 14BMIT2Drvi6Omhff6sONArs261967 1.53033840.29211921.616714e-07 15BMIT2Drvi6Omhff6sONArs35332469 -0.23073140.34792195.072217e-01 16BMIT2Drvi6Omhff6sONArs35560038 -1.57308700.20189686.619637e-15 17BMIT2Drvi6Omhff6sONArs3755804 0.53149150.23250732.225933e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.69480460.30799442.407689e-02 19BMIT2Drvi6Omhff6sONArs476828 1.17390830.15685507.207355e-14 20BMIT2Drvi6Omhff6sONArs4883723 0.54797210.28550045.494141e-02 21BMIT2Drvi6Omhff6sONArs509325 0.54910400.15981965.908641e-04 22BMIT2Drvi6Omhff6sONArs55872725 1.35018910.12597918.419325e-27 23BMIT2Drvi6Omhff6sONArs6089309 0.56575250.33470099.096620e-02 24BMIT2Drvi6Omhff6sONArs6265 0.64576930.19018716.851804e-04 25BMIT2Drvi6Omhff6sONArs6736712 0.56069620.34487841.039966e-01 26BMIT2Drvi6Omhff6sONArs7560832 0.60320800.29049723.785077e-02 27BMIT2Drvi6Omhff6sONArs825486 -0.61527590.35003347.878772e-02 28BMIT2Drvi6Omhff6sONArs9348441 -4.97863320.25727821.992909e-83 29BMIT2Drvi6Omhff6sONAAll - Inverse variance weighted 0.55989560.23225811.592361e-02 30BMIT2Drvi6Omhff6sONAAll - MR Egger 1.33375800.69485266.596064e-02 In\u00a0[12]: Copied!
res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\nres_loo <- mr_leaveoneout(harmonized_data) res_loo A data.frame: 29 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs101983560.55628340.24249172.178871e-02 2BMIT2Drvi6Omhff6sONArs102099940.55205760.23881222.079526e-02 3BMIT2Drvi6Omhff6sONArs108243290.55853350.23902391.945341e-02 4BMIT2Drvi6Omhff6sONArs109383970.54126880.23887092.345460e-02 5BMIT2Drvi6Omhff6sONArs110661320.55806060.24172752.096381e-02 6BMIT2Drvi6Omhff6sONArs125221390.56671020.23950641.797373e-02 7BMIT2Drvi6Omhff6sONArs125917300.55248020.23909902.085075e-02 8BMIT2Drvi6Omhff6sONArs130130210.51897150.23868082.968017e-02 9BMIT2Drvi6Omhff6sONArs1955337 0.56026350.23945051.929468e-02 10BMIT2Drvi6Omhff6sONArs2076308 0.54313550.23944032.330758e-02 11BMIT2Drvi6Omhff6sONArs2278557 0.55836340.23949241.972992e-02 12BMIT2Drvi6Omhff6sONArs2304608 0.53725570.23773252.382639e-02 13BMIT2Drvi6Omhff6sONArs2531995 0.54190160.23797122.277590e-02 14BMIT2Drvi6Omhff6sONArs261967 0.53587610.23766862.415093e-02 15BMIT2Drvi6Omhff6sONArs353324690.57359070.23783451.587739e-02 16BMIT2Drvi6Omhff6sONArs355600380.67349060.22178042.391474e-03 17BMIT2Drvi6Omhff6sONArs3755804 0.56102150.24132492.008503e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.55689930.23926321.993549e-02 19BMIT2Drvi6Omhff6sONArs476828 0.50375550.24432243.922224e-02 20BMIT2Drvi6Omhff6sONArs4883723 0.56020500.23973251.945000e-02 21BMIT2Drvi6Omhff6sONArs509325 0.56084290.24685062.308693e-02 22BMIT2Drvi6Omhff6sONArs558727250.44194460.24547717.180543e-02 23BMIT2Drvi6Omhff6sONArs6089309 0.55978590.23889021.911519e-02 24BMIT2Drvi6Omhff6sONArs6265 0.55470680.24369102.282978e-02 25BMIT2Drvi6Omhff6sONArs6736712 0.55988150.23876021.902944e-02 26BMIT2Drvi6Omhff6sONArs7560832 0.55881130.23962291.969836e-02 27BMIT2Drvi6Omhff6sONArs825486 0.58000260.23675451.429330e-02 28BMIT2Drvi6Omhff6sONArs9348441 0.73789670.13668386.717515e-08 29BMIT2Drvi6Omhff6sONAAll 0.55989560.23225811.592361e-02 In\u00a0[29]: Copied!
harmonized_data$\"r.outcome\" <- get_r_from_lor(\n harmonized_data$\"beta.outcome\",\n harmonized_data$\"eaf.outcome\",\n 45383,\n 132032,\n 0.26,\n model = \"logit\",\n correction = FALSE\n)\nharmonized_data$\"r.outcome\" <- get_r_from_lor( harmonized_data$\"beta.outcome\", harmonized_data$\"eaf.outcome\", 45383, 132032, 0.26, model = \"logit\", correction = FALSE ) In\u00a0[34]: Copied!
out <- directionality_test(harmonized_data)\nout\nout <- directionality_test(harmonized_data) out
r.exposure and/or r.outcome not present.\n\nCalculating approximate SNP-exposure and/or SNP-outcome correlations, assuming all are quantitative traits. Please pre-calculate r.exposure and/or r.outcome using get_r_from_lor() for any binary traits\n\nA data.frame: 1 \u00d7 8\n\nid.exposure id.outcome exposure outcome snp_r2.exposure snp_r2.outcome correct_causal_direction steiger_pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <lgl> <dbl>\nrvi6Om ETcv15 BMI T2D 0.02125453 0.005496427 TRUE NA\nIn\u00a0[\u00a0]: Copied!
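The Steiger directionality test checks whether the instruments explain more variance in the exposure than in the outcome, as expected when the assumed direction (BMI to T2D) is correct. From the output above,

$$
r^2_{\text{exposure}}\approx 0.0213 \;>\; r^2_{\text{outcome}}\approx 0.0055,
$$

so `correct_causal_direction` is TRUE. `steiger_pval` is NA here, most likely because `r.exposure` was not pre-calculated (only `r.outcome` was filled in with `get_r_from_lor()` above).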
res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\nres <- mr(harmonized_data) p1 <- mr_scatter_plot(res, harmonized_data) p1[[1]] In\u00a0[\u00a0]: Copied!
res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\nres_single <- mr_singlesnp(harmonized_data) p2 <- mr_forest_plot(res_single) p2[[1]] In\u00a0[\u00a0]: Copied!
res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\nres_loo <- mr_leaveoneout(harmonized_data) p3 <- mr_leaveoneout_plot(res_loo) p3[[1]] In\u00a0[\u00a0]: Copied!
res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\nres_single <- mr_singlesnp(harmonized_data) p4 <- mr_funnel_plot(res_single) p4[[1]]
\n"},{"location":"Visualization/","title":"Visualization by gwaslab","text":"In\u00a0[2]: Copied!
import gwaslab as gl\nimport gwaslab as gl In\u00a0[3]: Copied!
sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")\nsumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")
Tue Dec 26 15:56:49 2023 GWASLab v3.4.22 https://cloufield.github.io/gwaslab/\nTue Dec 26 15:56:49 2023 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\nTue Dec 26 15:56:49 2023 Start to load format from formatbook....\nTue Dec 26 15:56:49 2023 -plink2 format meta info:\nTue Dec 26 15:56:49 2023 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear\nTue Dec 26 15:56:49 2023 - format_source : https://www.cog-genomics.org/plink/2.0/formats\nTue Dec 26 15:56:49 2023 - format_version : Alpha 3.3 final (3 Jun)\nTue Dec 26 15:56:49 2023 - last_check_date : 20220806\nTue Dec 26 15:56:49 2023 -plink2 to gwaslab format dictionary:\nTue Dec 26 15:56:49 2023 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\nTue Dec 26 15:56:49 2023 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\nTue Dec 26 15:56:49 2023 Start to initiate from file :1kgeas.B1.glm.firth\nTue Dec 26 15:56:50 2023 -Reading columns : REF,ID,ALT,POS,OR,LOG(OR)_SE,Z_STAT,OBS_CT,A1,#CHROM,P,A1_FREQ\nTue Dec 26 15:56:50 2023 -Renaming columns to : REF,SNPID,ALT,POS,OR,SE,Z,N,EA,CHR,P,EAF\nTue Dec 26 15:56:50 2023 -Current Dataframe shape : 1128732 x 12\nTue Dec 26 15:56:50 2023 -Initiating a status column: STATUS ...\nTue Dec 26 15:56:50 2023 NEA not available: assigning REF to NEA...\nTue Dec 26 15:56:50 2023 -EA,REF and ALT columns are available: assigning NEA...\nTue Dec 26 15:56:50 2023 -For variants with EA == ALT : assigning REF to NEA ...\nTue Dec 26 15:56:50 2023 -For variants with EA != ALT : assigning ALT to NEA ...\nTue Dec 26 15:56:50 2023 Start to reorder the columns...\nTue Dec 26 15:56:50 2023 -Current Dataframe shape : 1128732 x 14\nTue Dec 26 15:56:50 2023 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\nTue Dec 26 15:56:50 2023 Finished sorting columns successfully!\nTue Dec 26 15:56:50 2023 -Column: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \nTue Dec 26 15:56:50 2023 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\nTue Dec 26 15:56:50 2023 Finished loading data successfully!\nIn\u00a0[4]: Copied!
sumstats.data\nsumstats.data Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 0 1:15774:G:A 1 15774 A G 0.028283 NaN NaN NaN NaN 495 9999999 G A 1 1:15777:A:G 1 15777 G A 0.073737 NaN NaN NaN NaN 495 9999999 A G 2 1:57292:C:T 1 57292 T C 0.104675 NaN NaN NaN NaN 492 9999999 C T 3 1:77874:G:A 1 77874 A G 0.019153 0.462750 0.249299 0.803130 1.122280 496 9999999 G A 4 1:87360:C:T 1 87360 T C 0.023139 NaN NaN NaN NaN 497 9999999 C T ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1128727 22:51217954:G:A 22 51217954 A G 0.033199 NaN NaN NaN NaN 497 9999999 G A 1128728 22:51218377:G:C 22 51218377 C G 0.033333 0.362212 -0.994457 0.320000 0.697534 495 9999999 G C 1128729 22:51218615:T:A 22 51218615 A T 0.033266 0.362476 -1.029230 0.303374 0.688618 496 9999999 T A 1128730 22:51222100:G:T 22 51222100 T G 0.039157 NaN NaN NaN NaN 498 9999999 G T 1128731 22:51239678:G:T 22 51239678 T G 0.034137 NaN NaN NaN NaN 498 9999999 G T
1128732 rows \u00d7 14 columns
In\u00a0[5]: Copied!sumstats.get_lead(sig_level=5e-8)\nsumstats.get_lead(sig_level=5e-8)
Tue Dec 26 15:56:51 2023 Start to extract lead variants...\nTue Dec 26 15:56:51 2023 -Processing 1128732 variants...\nTue Dec 26 15:56:51 2023 -Significance threshold : 5e-08\nTue Dec 26 15:56:51 2023 -Sliding window size: 500 kb\nTue Dec 26 15:56:51 2023 -Found 43 significant variants in total...\nTue Dec 26 15:56:51 2023 -Identified 4 lead variants!\nTue Dec 26 15:56:51 2023 Finished extracting lead variants successfully!\nOut[5]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 54904 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A 113179 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T 549726 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G 1088750 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C In\u00a0[9]: Copied!
sumstats.plot_mqq(skip=2,anno=True)\nsumstats.plot_mqq(skip=2,anno=True)
Tue Dec 26 15:59:17 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:59:17 2023 -Genomic coordinates version: 99...\nTue Dec 26 15:59:17 2023 -WARNING!!! Genomic coordinates version is unknown...\nTue Dec 26 15:59:17 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:59:17 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:59:17 2023 -Plot layout mode is : mqq\nTue Dec 26 15:59:17 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:59:17 2023 Start conversion and sanity check:\nTue Dec 26 15:59:17 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:59:17 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:59:17 2023 -Removed 220793 variants with nan in P column ...\nTue Dec 26 15:59:17 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:59:17 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:59:17 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:59:17 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:59:17 2023 Finished data conversion and sanity check.\nTue Dec 26 15:59:17 2023 Start to create manhattan plot with 6866 variants:\nTue Dec 26 15:59:17 2023 -Found 4 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:17 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:17 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:59:17 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:17 2023 Start to create QQ plot with 6866 variants:\nTue Dec 26 15:59:17 2023 Expected range of P: (0,1.0)\nTue Dec 26 15:59:17 2023 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908\nTue Dec 26 15:59:17 2023 Finished creating QQ plot successfully!\nTue Dec 26 15:59:17 2023 -Skip saving figures!\nOut[9]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[6]: Copied!
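The QQ-plot log above reports the genomic inflation factor $\lambda_{GC}$, defined as the ratio of the median observed association test statistic to its expected value under the null (the median of a $\chi^2_1$ distribution, approximately 0.455):

$$
\lambda_{GC}=\frac{\operatorname{median}\left(\chi^2_{\text{observed}}\right)}{\operatorname{median}\left(\chi^2_{1}\right)}\approx\frac{\operatorname{median}\left(\chi^2_{\text{observed}}\right)}{0.455}.
$$

The value of 0.98908 is close to 1, suggesting little genome-wide inflation from population stratification or other confounding.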
sumstats.basic_check()\nsumstats.basic_check()
Tue Dec 27 23:08:13 2022 Start to check IDs...\nTue Dec 27 23:08:13 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:13 2022 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)\nTue Dec 27 23:08:14 2022 Finished checking IDs successfully!\nTue Dec 27 23:08:14 2022 Start to fix chromosome notation...\nTue Dec 27 23:08:14 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:17 2022 -Vairants with standardized chromosome notation: 1122299\nTue Dec 27 23:08:19 2022 -All CHR are already fixed...\nTue Dec 27 23:08:21 2022 Finished fixing chromosome notation successfully!\nTue Dec 27 23:08:21 2022 Start to fix basepair positions...\nTue Dec 27 23:08:21 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:21 2022 -Converting to Int64 data type ...\nTue Dec 27 23:08:22 2022 -Position upper_bound is: 250,000,000\nTue Dec 27 23:08:24 2022 -Remove outliers: 0\nTue Dec 27 23:08:24 2022 -Converted all position to datatype Int64.\nTue Dec 27 23:08:24 2022 Finished fixing basepair position successfully!\nTue Dec 27 23:08:24 2022 Start to fix alleles...\nTue Dec 27 23:08:24 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:25 2022 -Detected 0 variants with alleles that contain bases other than A/C/T/G .\nTue Dec 27 23:08:25 2022 -Converted all bases to string datatype and UPPERCASE.\nTue Dec 27 23:08:27 2022 Finished fixing allele successfully!\nTue Dec 27 23:08:27 2022 Start sanity check for statistics ...\nTue Dec 27 23:08:27 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:27 2022 -Checking if 0 <=N<= inf ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad N.\nTue Dec 27 23:08:27 2022 -Checking if -37.5 <Z< 37.5 ...\nTue Dec 27 23:08:27 2022 -Removed 14 variants with bad Z.\nTue Dec 27 23:08:27 2022 -Checking if 5e-300 <= P <= 1 ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad P.\nTue Dec 27 23:08:27 2022 -Checking if 0 <SE< inf ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad SE.\nTue Dec 27 23:08:27 2022 -Checking if -10 <log(OR)< 10 ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad OR.\nTue Dec 27 23:08:27 2022 -Checking STATUS...\nTue Dec 27 23:08:28 2022 -Coverting STAUTUS to interger.\nTue Dec 27 23:08:28 2022 -Removed 14 variants with bad statistics in total.\nTue Dec 27 23:08:28 2022 Finished sanity check successfully!\nTue Dec 27 23:08:28 2022 Start to normalize variants...\nTue Dec 27 23:08:28 2022 -Current Dataframe shape : 1122285 x 11\nTue Dec 27 23:08:29 2022 -No available variants to normalize..\nTue Dec 27 23:08:29 2022 Finished normalizing variants successfully!\nIn\u00a0[7]: Copied!
sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\")\n#2:55513738\nsumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\") #2:55513738
Tue Dec 26 15:58:10 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:10 2023 -Genomic coordinates version: 19...\nTue Dec 26 15:58:10 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:10 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:58:10 2023 -Plot layout mode is : r\nTue Dec 26 15:58:10 2023 -Region to plot : chr2:54513738-56513738.\nTue Dec 26 15:58:10 2023 -Extract SNPs in region : chr2:54513738-56513738...\nTue Dec 26 15:58:10 2023 -Extract SNPs in specified regions: 865\nTue Dec 26 15:58:10 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:10 2023 Start conversion and sanity check:\nTue Dec 26 15:58:10 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:10 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:10 2023 -Removed 160 variants with nan in P column ...\nTue Dec 26 15:58:10 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:10 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:10 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:11 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:11 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:11 2023 Start to create manhattan plot with 705 variants:\nTue Dec 26 15:58:11 2023 -Extracting lead variant...\nTue Dec 26 15:58:11 2023 -Loading gtf files from:default\n
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
Tue Dec 26 15:58:40 2023 -plotting gene track..\nTue Dec 26 15:58:40 2023 -Finished plotting gene track..\nTue Dec 26 15:58:40 2023 -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:58:40 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:58:40 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:58:40 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:58:40 2023 -Skip saving figures!\nOut[7]:
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[8]: Copied!
gl.download_ref(\"1kg_eas_hg19\")\ngl.download_ref(\"1kg_eas_hg19\")
Tue Dec 27 22:44:52 2022 Start to download 1kg_eas_hg19 ...\nTue Dec 27 22:44:52 2022 -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 27 22:52:33 2022 -Updating record in config file...\nTue Dec 27 22:52:35 2022 -Updating record in config file...\nTue Dec 27 22:52:35 2022 -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi\nTue Dec 27 22:52:35 2022 Downloaded 1kg_eas_hg19 successfully!\nIn\u00a0[8]: Copied!
sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")\nsumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")
Tue Dec 26 15:58:41 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:41 2023 -Genomic coordinates version: 19...\nTue Dec 26 15:58:41 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:41 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:58:41 2023 -Plot layout mode is : r\nTue Dec 26 15:58:41 2023 -Region to plot : chr2:54531536-56731536.\nTue Dec 26 15:58:41 2023 -Checking prefix for chromosomes in vcf files...\nTue Dec 26 15:58:41 2023 -No prefix for chromosomes in the VCF files.\nTue Dec 26 15:58:41 2023 -Extract SNPs in region : chr2:54531536-56731536...\nTue Dec 26 15:58:41 2023 -Extract SNPs in specified regions: 967\nTue Dec 26 15:58:41 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:41 2023 Start conversion and sanity check:\nTue Dec 26 15:58:41 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:41 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:41 2023 -Removed 172 variants with nan in P column ...\nTue Dec 26 15:58:41 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:41 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:41 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:41 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:41 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:41 2023 Start to load reference genotype...\nTue Dec 26 15:58:41 2023 -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 26 15:58:43 2023 -Retrieving index...\nTue Dec 26 15:58:43 2023 -Ref variants in the region: 71908\nTue Dec 26 15:58:43 2023 -Matching variants using POS, NEA, EA ...\nTue Dec 26 15:58:43 2023 -Calculating Rsq...\nTue Dec 26 15:58:43 2023 Finished loading reference genotype successfully!\nTue Dec 26 15:58:43 2023 Start to create manhattan plot with 795 variants:\nTue Dec 26 15:58:43 2023 -Extracting lead variant...\nTue Dec 26 15:58:44 2023 -Loading gtf files from:default\n
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
Tue Dec 26 15:59:12 2023 -plotting gene track..\nTue Dec 26 15:59:12 2023 -Finished plotting gene track..\nTue Dec 26 15:59:13 2023 -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:13 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:13 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:59:13 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:13 2023 -Skip saving figures!\nOut[8]:
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[\u00a0]: Copied!
\n"},{"location":"Visualization/#visualization-by-gwaslab","title":"Visualization by gwaslab\u00b6","text":""},{"location":"Visualization/#import-gwaslab-package","title":"Import gwaslab package\u00b6","text":""},{"location":"Visualization/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"Visualization/#check-the-lead-variants-in-significant-loci","title":"Check the lead variants in significant loci\u00b6","text":""},{"location":"Visualization/#create-mahattan-plot","title":"Create mahattan plot\u00b6","text":""},{"location":"Visualization/#qc-check","title":"QC check\u00b6","text":""},{"location":"Visualization/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"Visualization/#create-regional-plot-with-ld-information","title":"Create regional plot with LD information\u00b6","text":""},{"location":"finemapping_susie/","title":"Finemapping using susieR","text":"In\u00a0[1]: Copied!
import gwaslab as gl\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport gwaslab as gl import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt In\u00a0[2]: Copied!
sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")\nsumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")
2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/\n2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\n2024/04/18 10:40:48 Start to load format from formatbook....\n2024/04/18 10:40:48 -plink2 format meta info:\n2024/04/18 10:40:48 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear\n2024/04/18 10:40:48 - format_source : https://www.cog-genomics.org/plink/2.0/formats\n2024/04/18 10:40:48 - format_version : Alpha 3.3 final (3 Jun)\n2024/04/18 10:40:48 - last_check_date : 20220806\n2024/04/18 10:40:48 -plink2 to gwaslab format dictionary:\n2024/04/18 10:40:48 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\n2024/04/18 10:40:48 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\n2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz\n2024/04/18 10:40:49 -Reading columns : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID\n2024/04/18 10:40:49 -Renaming columns to : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID\n2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 12\n2024/04/18 10:40:49 -Initiating a status column: STATUS ...\n2024/04/18 10:40:49 #WARNING! Version of genomic coordinates is unknown...\n2024/04/18 10:40:49 NEA not available: assigning REF to NEA...\n2024/04/18 10:40:49 -EA,REF and ALT columns are available: assigning NEA...\n2024/04/18 10:40:49 -For variants with EA == ALT : assigning REF to NEA ...\n2024/04/18 10:40:49 -For variants with EA != ALT : assigning ALT to NEA ...\n2024/04/18 10:40:49 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:49 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:49 Finished reordering the columns.\n2024/04/18 10:40:49 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:40:49 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\n2024/04/18 10:40:49 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:40:50 -Current Dataframe memory usage: 106.06 MB\n2024/04/18 10:40:50 Finished loading data successfully!\nIn\u00a0[3]: Copied!
sumstats.basic_check()\nsumstats.basic_check()
2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43\n2024/04/18 10:40:50 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:50 -Checking SNPID data type...\n2024/04/18 10:40:50 -Converting SNPID to pd.string data type...\n2024/04/18 10:40:50 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)\n2024/04/18 10:40:51 Finished checking SNPID/rsID.\n2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43\n2024/04/18 10:40:51 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:51 -Checking CHR data type...\n2024/04/18 10:40:51 -Variants with standardized chromosome notation: 1128732\n2024/04/18 10:40:51 -All CHR are already fixed...\n2024/04/18 10:40:52 Finished fixing chromosome notation (CHR).\n2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43\n2024/04/18 10:40:52 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB\n2024/04/18 10:40:52 -Converting to Int64 data type ...\n2024/04/18 10:40:53 -Position bound:(0 , 250,000,000)\n2024/04/18 10:40:53 -Removed outliers: 0\n2024/04/18 10:40:53 Finished fixing basepair positions (POS).\n2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43\n2024/04/18 10:40:53 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:53 -Converted all bases to string datatype and UPPERCASE.\n2024/04/18 10:40:53 -Variants with bad EA : 0\n2024/04/18 10:40:54 -Variants with bad NEA : 0\n2024/04/18 10:40:54 -Variants with NA for EA or NEA: 0\n2024/04/18 10:40:54 -Variants with same EA and NEA: 0\n2024/04/18 10:40:54 -Detected 0 variants with alleles that contain bases other than A/C/T/G .\n2024/04/18 10:40:55 Finished fixing alleles (EA and NEA).\n2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43\n2024/04/18 10:40:55 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:55 -Comparison tolerance for floats: 1e-07\n2024/04/18 10:40:55 -Checking if 0 <= N <= 2147483647 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na N.\n2024/04/18 10:40:55 -Checking if -1e-07 < EAF < 1.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na EAF.\n2024/04/18 10:40:55 -Checking if -9999.0000001 < Z < 9999.0000001 ...\n2024/04/18 10:40:55 -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ...\n2024/04/18 10:40:55 -Examples of invalid values (Z): NA,NA,NA,NA,NA ...\n2024/04/18 10:40:55 -Removed 220793 variants with bad/na Z.\n2024/04/18 10:40:55 -Checking if -1e-07 < P < 1.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na P.\n2024/04/18 10:40:55 -Checking if -1e-07 < SE < inf ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na SE.\n2024/04/18 10:40:55 -Checking if -100.0000001 < OR < 100.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na OR.\n2024/04/18 10:40:55 -Checking STATUS and converting STATUS to categories....\n2024/04/18 10:40:56 -Removed 220793 variants with bad statistics in total.\n2024/04/18 10:40:56 -Data types for each column:\n2024/04/18 10:40:56 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:40:56 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:40:56 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:40:56 Finished sanity check for statistics.\n2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43\n2024/04/18 10:40:56 -Current 
Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)\n2024/04/18 10:40:56 -No availalbe columns for data consistency checking...Skipping...\n2024/04/18 10:40:56 Finished checking data consistency across columns.\n2024/04/18 10:40:56 Start to normalize indels...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 -No available variants to normalize..\n2024/04/18 10:40:56 Finished normalizing variants successfully!\n2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 Finished sorting coordinates.\n2024/04/18 10:40:56 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:56 Finished reordering the columns.\n
Note: 220793 variants were removed because of missing (NA) Z values. This is caused by FIRTH_CONVERGE_FAIL when performing the GWAS with PLINK2: variants for which Firth logistic regression does not converge are reported with NA statistics.
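As a quick check (a minimal sketch of ours, assuming the same PLINK2 output file as above), variants with FIRTH_CONVERGE_FAIL carry NA in the Z_STAT column of the raw file, so the removed count can be verified directly:

```python
import pandas as pd

# Read the raw PLINK2 Firth GWAS output (pandas decompresses .gz transparently)
raw = pd.read_csv("1kgeas.B1.glm.firth.gz", sep="\t")

# Variants where Firth logistic regression failed to converge have NA statistics
print(raw["Z_STAT"].isna().sum())  # should match the 220793 variants removed above
```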
In\u00a0[4]: Copied!sumstats.get_lead()\nsumstats.get_lead()
2024/04/18 10:40:56 Start to extract lead variants...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56 -Processing 907939 variants...\n2024/04/18 10:40:56 -Significance threshold : 5e-08\n2024/04/18 10:40:56 -Sliding window size: 500 kb\n2024/04/18 10:40:56 -Using P for extracting lead variants...\n2024/04/18 10:40:56 -Found 43 significant variants in total...\n2024/04/18 10:40:56 -Identified 4 lead variants!\n2024/04/18 10:40:56 Finished extracting lead variants.\nOut[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 44298 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9960099 G A 91266 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9960099 C T 442239 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9960099 T G 875859 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9960099 T C In\u00a0[5]: Copied!
sumstats.plot_mqq()\nsumstats.plot_mqq()
2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43:\n2024/04/18 10:40:57 -Genomic coordinates version: 99...\n2024/04/18 10:40:57 #WARNING! Genomic coordinates version is unknown.\n2024/04/18 10:40:57 -Genome-wide significance level to plot is set to 5e-08 ...\n2024/04/18 10:40:57 -Raw input contains 907939 variants...\n2024/04/18 10:40:57 -MQQ plot layout mode is : mqq\n2024/04/18 10:40:57 Finished loading specified columns from the sumstats.\n2024/04/18 10:40:57 Start data conversion and sanity check:\n2024/04/18 10:40:57 -Removed 0 variants with nan in CHR or POS column ...\n2024/04/18 10:40:57 -Removed 0 variants with CHR <=0...\n2024/04/18 10:40:57 -Removed 0 variants with nan in P column ...\n2024/04/18 10:40:57 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\n2024/04/18 10:40:57 -Sumstats P values are being converted to -log10(P)...\n2024/04/18 10:40:57 -Sanity check: 0 na/inf/-inf variants will be removed...\n2024/04/18 10:40:57 -Converting data above cut line...\n2024/04/18 10:40:57 -Maximum -log10(P) value is 14.772946706439042 .\n2024/04/18 10:40:57 Finished data conversion and sanity check.\n2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants...\n2024/04/18 10:40:58 -Creating background plot...\n2024/04/18 10:40:59 Finished creating MQQ plot successfully!\n2024/04/18 10:40:59 Start to extract variants for annotation...\n2024/04/18 10:40:59 -Found 4 significant variants with a sliding window size of 500 kb...\n2024/04/18 10:40:59 Finished extracting variants for annotation...\n2024/04/18 10:40:59 Start to process figure arts.\n2024/04/18 10:40:59 -Processing X ticks...\n2024/04/18 10:40:59 -Processing X labels...\n2024/04/18 10:40:59 -Processing Y labels...\n2024/04/18 10:40:59 -Processing Y tick lables...\n2024/04/18 10:40:59 -Processing Y labels...\n2024/04/18 10:40:59 -Processing lines...\n2024/04/18 10:40:59 Finished processing figure arts.\n2024/04/18 10:40:59 Start to annotate variants...\n2024/04/18 10:40:59 -Skip annotating\n2024/04/18 10:40:59 Finished annotating variants.\n2024/04/18 10:40:59 Start to create QQ plot with 907939 variants:\n2024/04/18 10:40:59 -Plotting all variants...\n2024/04/18 10:40:59 -Expected range of P: (0,1.0)\n2024/04/18 10:40:59 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908\n2024/04/18 10:40:59 -Processing Y tick lables...\n2024/04/18 10:40:59 Finished creating QQ plot successfully!\n2024/04/18 10:40:59 Start to save figure...\n2024/04/18 10:40:59 -Skip saving figure!\n2024/04/18 10:40:59 Finished saving figure...\n2024/04/18 10:40:59 Finished creating plot successfully!\nOut[5]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)In\u00a0[6]: Copied!
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')\nlocus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 Finished filtering values.\nIn\u00a0[7]: Copied!
locus.fill_data(to_fill=[\"BETA\"])\nlocus.fill_data(to_fill=[\"BETA\"])
2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43\n2024/04/18 10:41:06 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:41:06 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:41:06 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:41:06 -Overwrite mode: False\n2024/04/18 10:41:06 -Skipping columns: []\n2024/04/18 10:41:06 -Filling columns: ['BETA']\n2024/04/18 10:41:06 - Filling Columns iteratively...\n2024/04/18 10:41:06 - Filling BETA value using OR column...\n2024/04/18 10:41:06 Finished filling data using existing columns.\n2024/04/18 10:41:06 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:06 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:06 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:06 Finished reordering the columns.\nIn\u00a0[8]: Copied!
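`fill_data` derives BETA from the existing OR column: for a binary outcome, the effect size on the log-odds scale is simply the log odds ratio,

$$
\beta=\ln(\mathrm{OR}),\qquad \mathrm{OR}=e^{\beta}.
$$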
locus.data\nlocus.data Out[8]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 91067 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960099 A T 91068 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960099 G A 91069 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960099 G A 91070 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960099 A C 91071 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960099 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 91441 2:56004219:G:T 2 56004219 G T 0.171717 0.148489 0.169557 0.875763 0.381159 1.160080 495 9960099 G T 91442 2:56007034:T:C 2 56007034 T C 0.260121 0.073325 0.145565 0.503737 0.614446 1.076080 494 9960099 T C 91443 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960099 C G 91444 2:56009480:A:T 2 56009480 A T 0.157258 0.135667 0.177621 0.763784 0.444996 1.145300 496 9960099 A T 91445 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960099 C T
379 rows \u00d7 15 columns
In\u00a0[9]: Copied!locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")\nlocus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")
2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43\n2024/04/18 10:41:07 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:07 -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\n2024/04/18 10:41:07 -Loading fasta records:2 \n2024/04/18 10:41:19 -Checking records\n2024/04/18 10:41:19 -Building numpy fasta records from dict\n2024/04/18 10:41:20 -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )\n2024/04/18 10:41:20 -Checking records for ( len(NEA) > 4 or len(EA) > 4 )\n2024/04/18 10:41:20 -Finished checking records\n2024/04/18 10:41:20 -Variants allele on given reference sequence : 264\n2024/04/18 10:41:20 -Variants flipped : 115\n2024/04/18 10:41:20 -Raw Matching rate : 100.00%\n2024/04/18 10:41:20 -Variants inferred reverse_complement : 0\n2024/04/18 10:41:20 -Variants inferred reverse_complement_flipped : 0\n2024/04/18 10:41:20 -Both allele on genome + unable to distinguish : 0\n2024/04/18 10:41:20 -Variants not on given reference sequence : 0\n2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence.\n2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43\n2024/04/18 10:41:20 -Flipping 115 variants...\n2024/04/18 10:41:20 -Swapping column: NEA <=> EA...\n2024/04/18 10:41:20 -Flipping column: BETA = - BETA...\n2024/04/18 10:41:20 -Flipping column: Z = - Z...\n2024/04/18 10:41:20 -Flipping column: EAF = 1 - EAF...\n2024/04/18 10:41:20 -Flipping column: OR = 1 / OR...\n2024/04/18 10:41:20 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x\n2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code.\n2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Finished sorting coordinates.\n2024/04/18 10:41:20 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB\n2024/04/18 10:41:20 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:20 Finished reordering the columns.\nOut[9]:
<gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>In\u00a0[10]: Copied!
locus.data\nlocus.data Out[10]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T
379 rows \u00d7 15 columns
In\u00a0[11]: Copied!locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\nlocus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None) locus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None) In\u00a0[12]: Copied!
!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt_r2\n!plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r square \\ --extract sig_locus.snplist \\ --out sig_locus_mt !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract sig_locus.snplist \\ --out sig_locus_mt_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to sig_locus_mt.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract sig_locus.snplist\n --keep-allele-order\n --out sig_locus_mt\n --r square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r square to sig_locus_mt.ld ... 0% [processingwriting] done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to sig_locus_mt_r2.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract sig_locus.snplist\n --keep-allele-order\n --out sig_locus_mt_r2\n --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt_r2.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting] done.\nIn\u00a0[13]: Copied!
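As a sanity check (our addition, reusing the output file names from the commands above), the matrix written by `--r square` should be 379 × 379, with rows and columns in the same order as `sig_locus.snplist` and a unit diagonal:

```python
import numpy as np

# PLINK's --r square writes a whitespace-delimited n x n matrix of signed r values
ld_check = np.loadtxt("sig_locus_mt.ld")
print(ld_check.shape)                       # expected: (379, 379)
print(np.allclose(np.diag(ld_check), 1.0))  # True if no variant is monomorphic
```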
import rpy2\nimport rpy2.robjects as ro\nfrom rpy2.robjects.packages import importr\nimport rpy2.robjects.numpy2ri as numpy2ri\nnumpy2ri.activate()\nimport rpy2 import rpy2.robjects as ro from rpy2.robjects.packages import importr import rpy2.robjects.numpy2ri as numpy2ri numpy2ri.activate()
INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY\nINFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R\nINFO:rpy2.situation:R library path: \nINFO:rpy2.situation:LD_LIBRARY_PATH: \nINFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save\nINFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.\nIn\u00a0[14]: Copied!
df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\")\ndf\ndf = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\") df Out[14]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T
379 rows \u00d7 15 columns
In\u00a0[15]: Copied!# import susieR as object\nsusieR = importr('susieR')\n# import susieR as object susieR = importr('susieR') In\u00a0[16]: Copied!
# convert pd.DataFrame to numpy\nld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None)\nR_df = ld.values\nld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None)\nR_df2 = ld2.values\n# convert pd.DataFrame to numpy ld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None) R_df = ld.values ld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None) R_df2 = ld2.values In\u00a0[17]: Copied!
R_df\nR_df Out[17]:
array([[ 1.00000e+00, 9.58562e-01, -3.08678e-01, ..., 1.96204e-02,\n -3.54602e-04, -7.14868e-03],\n [ 9.58562e-01, 1.00000e+00, -2.97617e-01, ..., 2.47755e-02,\n -1.49234e-02, -7.00509e-03],\n [-3.08678e-01, -2.97617e-01, 1.00000e+00, ..., -3.49335e-02,\n -1.37163e-02, -2.12828e-02],\n ...,\n [ 1.96204e-02, 2.47755e-02, -3.49335e-02, ..., 1.00000e+00,\n 5.26193e-02, -3.09069e-02],\n [-3.54602e-04, -1.49234e-02, -1.37163e-02, ..., 5.26193e-02,\n 1.00000e+00, -3.01142e-01],\n [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,\n -3.01142e-01, 1.00000e+00]])In\u00a0[18]: Copied!
```python
# create the figure directly with plt.subplots (a separate plt.figure call is not needed)
fig, ax = plt.subplots(ncols=2, figsize=(20,10))
sns.heatmap(data=R_df, cmap="Spectral", ax=ax[0])
sns.heatmap(data=R_df2, ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
```
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>
Reference: https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#fine-mapping-with-susier-using-summary-statistics
```python
ro.r('set.seed(123)')
fit = susieR.susie_rss(
    bhat = df["BETA"].values.reshape((len(R_df),1)),
    shat = df["SE"].values.reshape((len(R_df),1)),
    R = R_df,
    L = 10,
    n = 503
)
```
```python
# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage=0.95, min_abs_corr=0.5, Xcorr=R_df)[0])
```
```
$L1
[1] 200 218 221 224
```
We found one credible set here (L1), containing four variants.
```python
# add the information to the dataframe for plotting
df["cs"] = 0
n_cs = len(susieR.susie_get_cs(fit, coverage=0.95, min_abs_corr=0.5, Xcorr=R_df)[0])
for i in range(n_cs):
    cs_index = susieR.susie_get_cs(fit, coverage=0.95, min_abs_corr=0.5, Xcorr=R_df)[0][i]
    # subtract 1 because indices returned from R are 1-based
    df.loc[np.array(cs_index)-1, "cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))
```
```python
fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(15,7), height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p = axes[0].scatter(df["POS"], df[col_to_plot], c=ld[df["P"].idxmin()]**2)

axes[0].scatter(df.loc[df["cs"]==1,"POS"], df.loc[df["cs"]==1,col_to_plot],
                marker='o', s=40, c="None", edgecolors='black', label="Variants in credible set 1")

axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"], df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
                marker='x', s=40, c="red", edgecolors='black', label="Causal")

plt.colorbar(p, label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()

p = axes[1].scatter(df["POS"], df["pip"], c=ld[df["P"].idxmin()]**2)

axes[1].scatter(df.loc[df["cs"]==1,"POS"], df.loc[df["cs"]==1,"pip"],
                marker='o', s=40, c="None", edgecolors='black', label="Variants in credible set 1")

plt.colorbar(p, label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()
```
```
/tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x').  Matplotlib is ignoring the edgecolor in favor of the facecolor.  This behavior may change in the future.
```
<matplotlib.legend.Legend at 0x7fa6a330d5e0>
The causal variant used in the simulation is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. So the credible set we identified does not actually include the bona fide causal variant.
Let's then check the variants in the credible set.
```python
df.loc[np.array(cs_index)-1, :]
```

```
     SNPID           CHR  POS       EA  NEA  EAF       BETA      SE        Z        P             OR        N    STATUS   REF  ALT  cs  pip       MLOG10P
199  2:55513738:C:T  2    55513738  T   C    0.623992  1.219516  0.153159  7.96244  1.686760e-15  3.385550  496  9960019  C    T    1   0.325435  14.772947
217  2:55605943:A:G  2    55605943  G   A    0.685484  1.321987  0.166688  7.93089  2.175840e-15  3.750867  496  9960019  A    G    1   0.267953  14.662373
220  2:55612986:G:C  2    55612986  C   G    0.685223  1.302133  0.166154  7.83691  4.617840e-15  3.677133  494  9960019  G    C    1   0.150449  14.335561
223  2:55622624:G:A  2    55622624  A   G    0.688508  1.324109  0.167119  7.92315  2.315640e-15  3.758833  496  9960019  G    A    1   0.255449  14.635329
```
!echo \"2:55513738:C:T\" > credible.snplist\n!echo \"2:55605943:A:G\" >> credible.snplist\n!echo \"2:55612986:G:C\" >> credible.snplist\n!echo \"2:55620927:G:A\" >> credible.snplist\n!echo \"2:55622624:G:A\" >> credible.snplist\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract credible.snplist \\\n --out credible_r\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract credible.snplist \\\n --out credible_r2\n!echo \"2:55513738:C:T\" > credible.snplist !echo \"2:55605943:A:G\" >> credible.snplist !echo \"2:55612986:G:C\" >> credible.snplist !echo \"2:55620927:G:A\" >> credible.snplist !echo \"2:55622624:G:A\" >> credible.snplist !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r2
```
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to credible_r.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract credible.snplist
  --keep-allele-order
  --out credible_r
  --r square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r.nosex .
--extract: 5 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r square to credible_r.ld ... done.

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to credible_r2.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract credible.snplist
  --keep-allele-order
  --out credible_r2
  --r2 square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r2.nosex .
--extract: 5 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to credible_r2.ld ... done.
```
credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"]\nld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None)\nld.columns=credible_snplist\nld.index=credible_snplist\nld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None)\nld2.columns=credible_snplist\nld2.index=credible_snplist\ncredible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"] ld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None) ld.columns=credible_snplist ld.index=credible_snplist ld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None) ld2.columns=credible_snplist ld2.index=credible_snplist In\u00a0[26]: Copied!
```python
# create the figure directly with plt.subplots (a separate plt.figure call is not needed)
fig, ax = plt.subplots(ncols=2, figsize=(20,10))
sns.heatmap(data=ld,  cmap="Spectral_r", ax=ax[0], center=0)
sns.heatmap(data=ld2, cmap="Spectral_r", ax=ax[1], vmin=0, vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
```
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>
Variants in the credible set are in strong LD with the bona fide causal variant.
This can also happen in real-world analyses. Always be cautious when interpreting fine-mapping results.
"},{"location":"finemapping_susie/#finemapping-using-susier","title":"Finemapping using susieR\u00b6","text":""},{"location":"finemapping_susie/#data-preparation","title":"Data preparation\u00b6","text":""},{"location":"finemapping_susie/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"finemapping_susie/#data-standardization-and-sanity-check","title":"Data standardization and sanity check\u00b6","text":""},{"location":"finemapping_susie/#extract-lead-variants","title":"Extract lead variants\u00b6","text":""},{"location":"finemapping_susie/#create-manhattan-plot-for-checking","title":"Create manhattan plot for checking\u00b6","text":""},{"location":"finemapping_susie/#extract-the-variants-around-255513738ct-for-finemapping","title":"Extract the variants around 2:55513738:C:T for finemapping\u00b6","text":""},{"location":"finemapping_susie/#convert-or-to-beta","title":"Convert OR to BETA\u00b6","text":""},{"location":"finemapping_susie/#align-nea-with-reference-sequence","title":"Align NEA with reference sequence\u00b6","text":""},{"location":"finemapping_susie/#output-the-sumstats-of-this-locus","title":"Output the sumstats of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-plink-to-get-ld-matrix-for-this-locus","title":"Run PLINK to get LD matrix for this locus\u00b6","text":""},{"location":"finemapping_susie/#finemapping","title":"Finemapping\u00b6","text":""},{"location":"finemapping_susie/#load-locus-sumstats","title":"Load locus sumstats\u00b6","text":""},{"location":"finemapping_susie/#import-sumsier","title":"Import sumsieR\u00b6","text":""},{"location":"finemapping_susie/#load-ld-matrix","title":"Load LD matrix\u00b6","text":""},{"location":"finemapping_susie/#visualize-the-ld-structure-of-this-locus","title":"Visualize the LD structure of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-finemapping-use-susier","title":"Run finemapping use susieR\u00b6","text":""},{"location":"finemapping_susie/#extract-credible-sets-and-pip","title":"Extract credible sets and PIP\u00b6","text":""},{"location":"finemapping_susie/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"finemapping_susie/#pitfalls","title":"Pitfalls\u00b6","text":""},{"location":"finemapping_susie/#check-ld-of-the-causal-variant-and-variants-in-the-credible-set","title":"Check LD of the causal variant and variants in the credible set\u00b6","text":""},{"location":"finemapping_susie/#load-ld-and-plot","title":"Load LD and plot\u00b6","text":""},{"location":"plot_PCA/","title":"Plotting PCA","text":"In\u00a0[1]: Copied!import pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd import matplotlib.pyplot as plt import seaborn as sns In\u00a0[2]: Copied!
```python
pca = pd.read_table("../05_PCA/plink_results_projected.sscore", sep="\t")
pca
```

```
     #FID     IID      ALLELE_CT  NAMED_ALLELE_DOSAGE_SUM  PC1_AVG    PC2_AVG    PC3_AVG    PC4_AVG    PC5_AVG    PC6_AVG    PC7_AVG    PC8_AVG    PC9_AVG    PC10_AVG
0    HG00403  HG00403  390256     390256                    0.002903  -0.024865   0.010041   0.009576   0.006943  -0.002223   0.008223  -0.001149   0.003352   0.004375
1    HG00404  HG00404  390696     390696                   -0.000141  -0.027965   0.025389  -0.005825  -0.002747   0.006585   0.011380   0.007777   0.015998   0.017893
2    HG00406  HG00406  388524     388524                    0.007074  -0.031545  -0.004370  -0.001262  -0.011493  -0.005395  -0.006202   0.004524  -0.000871  -0.002280
3    HG00407  HG00407  388808     388808                    0.006840  -0.025073  -0.006527   0.006797  -0.011600  -0.010233   0.013957   0.006187   0.013806   0.008253
4    HG00409  HG00409  391646     391646                    0.000399  -0.029033  -0.018935  -0.001360   0.029044   0.009428  -0.017119  -0.012964   0.025360   0.022907
..   ...      ...      ...        ...                        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
495  NA19087  NA19087  390232     390232                   -0.082261   0.033163   0.045499  -0.011398   0.000027  -0.006525   0.012446  -0.006743  -0.016312   0.023022
496  NA19088  NA19088  391510     391510                   -0.087183   0.043433   0.040188   0.003610  -0.000165   0.002317   0.000117   0.007430  -0.011886   0.007730
497  NA19089  NA19089  391462     391462                   -0.084082   0.036118  -0.036355   0.008738  -0.037523   0.004110   0.008653  -0.000563  -0.001599   0.015941
498  NA19090  NA19090  392880     392880                   -0.073580   0.026163  -0.032193   0.006599  -0.039060   0.000687   0.012213  -0.000485  -0.000336  -0.031283
499  NA19091  NA19091  389664     389664                   -0.081632   0.041455  -0.032200   0.003717  -0.046712   0.015191   0.003119  -0.004906  -0.001811  -0.020752
```
500 rows × 14 columns
```python
ped = pd.read_table("../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel", sep="\t")
ped
```

```
      sample   pop  super_pop  gender  Unnamed: 4  Unnamed: 5
0     HG00096  GBR  EUR        male    NaN         NaN
1     HG00097  GBR  EUR        female  NaN         NaN
2     HG00099  GBR  EUR        female  NaN         NaN
3     HG00100  GBR  EUR        female  NaN         NaN
4     HG00101  GBR  EUR        male    NaN         NaN
...   ...      ...  ...        ...     ...         ...
2499  NA21137  GIH  SAS        female  NaN         NaN
2500  NA21141  GIH  SAS        female  NaN         NaN
2501  NA21142  GIH  SAS        female  NaN         NaN
2502  NA21143  GIH  SAS        female  NaN         NaN
2503  NA21144  GIH  SAS        female  NaN         NaN
```
2504 rows × 6 columns
```python
pcaped = pd.merge(pca, ped, right_on="sample", left_on="IID", how="inner")
pcaped
```

```
     #FID     IID      ALLELE_CT  NAMED_ALLELE_DOSAGE_SUM  PC1_AVG    PC2_AVG    ...  PC10_AVG   sample   pop  super_pop  gender  Unnamed: 4  Unnamed: 5
0    HG00403  HG00403  390256     390256                    0.002903  -0.024865  ...   0.004375  HG00403  CHS  EAS        male    NaN         NaN
1    HG00404  HG00404  390696     390696                   -0.000141  -0.027965  ...   0.017893  HG00404  CHS  EAS        female  NaN         NaN
2    HG00406  HG00406  388524     388524                    0.007074  -0.031545  ...  -0.002280  HG00406  CHS  EAS        male    NaN         NaN
3    HG00407  HG00407  388808     388808                    0.006840  -0.025073  ...   0.008253  HG00407  CHS  EAS        female  NaN         NaN
4    HG00409  HG00409  391646     391646                    0.000399  -0.029033  ...   0.022907  HG00409  CHS  EAS        male    NaN         NaN
..   ...      ...      ...        ...                        ...        ...      ...   ...       ...      ...  ...        ...     ...         ...
495  NA19087  NA19087  390232     390232                   -0.082261   0.033163  ...   0.023022  NA19087  JPT  EAS        female  NaN         NaN
496  NA19088  NA19088  391510     391510                   -0.087183   0.043433  ...   0.007730  NA19088  JPT  EAS        male    NaN         NaN
497  NA19089  NA19089  391462     391462                   -0.084082   0.036118  ...   0.015941  NA19089  JPT  EAS        male    NaN         NaN
498  NA19090  NA19090  392880     392880                   -0.073580   0.026163  ...  -0.031283  NA19090  JPT  EAS        female  NaN         NaN
499  NA19091  NA19091  389664     389664                   -0.081632   0.041455  ...  -0.020752  NA19091  JPT  EAS        male    NaN         NaN
```
500 rows × 20 columns
```python
plt.figure(figsize=(10,10))
sns.scatterplot(data=pcaped, x="PC1_AVG", y="PC2_AVG", hue="pop", s=50)
```
<Axes: xlabel='PC1_AVG', ylabel='PC2_AVG'>"},{"location":"plot_PCA/#plotting-pca","title":"Plotting PCA\u00b6","text":""},{"location":"plot_PCA/#loading-files","title":"loading files\u00b6","text":""},{"location":"plot_PCA/#merge-pca-and-population-information","title":"Merge PCA and population information\u00b6","text":""},{"location":"plot_PCA/#plotting","title":"Plotting\u00b6","text":""},{"location":"prs_tutorial/","title":"PRS Tutorial","text":"In\u00a0[1]: Copied!
```python
import sys
sys.path.insert(0, "/Users/he/work/PRSlink/src")
import prslink as pl
```
```python
a = pl.PRS()
```
a.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")
```
- Dataset shape before loading : (0, 1)
- Loading score data from file: ./1kgeas.0.1.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.1
  - Overlapping IDs:0
- Loading finished successfully!
- Dataset shape after loading : (504, 2)
- Dataset shape before loading : (504, 2)
- Loading score data from file: ./1kgeas.0.05.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.05
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 3)
- Dataset shape before loading : (504, 3)
- Loading score data from file: ./1kgeas.0.2.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.2
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 4)
- Dataset shape before loading : (504, 4)
- Loading score data from file: ./1kgeas.0.3.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.3
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 5)
- Dataset shape before loading : (504, 5)
- Loading score data from file: ./1kgeas.0.4.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.4
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 6)
- Dataset shape before loading : (504, 6)
- Loading score data from file: ./1kgeas.0.5.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.5
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 7)
- Dataset shape before loading : (504, 7)
- Loading score data from file: ./1kgeas.0.001.profile
  - Setting ID:IID
  - Loading score:SCORE
  - Loaded columns: 0.01
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 8)
```
```python
a.add_pheno("../01_Dataset/t2d/1kgeas_t2d.txt", "IID", ["T2D"], types="B", sep="\s+")
```
```
- Dataset shape before loading : (504, 8)
- Loading pheno data from file: ../01_Dataset/t2d/1kgeas_t2d.txt
  - Setting ID:IID
  - Loading pheno:T2D
  - Loaded columns: T2D
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 9)
```
```python
a.add_covar("./1kgeas.eigenvec", "IID", ["PC1","PC2","PC3","PC4","PC5"], sep="\s+")
```
```
- Dataset shape before loading : (504, 9)
- Loading covar data from file: ./1kgeas.eigenvec
  - Setting ID:IID
  - Loading covar:PC1 PC2 PC3 PC4 PC5
  - Loaded columns: PC1 PC2 PC3 PC4 PC5
  - Overlapping IDs:504
- Loading finished successfully!
- Dataset shape after loading : (504, 14)
```
a.data[\"T2D\"] = a.data[\"T2D\"]-1\na.data[\"T2D\"] = a.data[\"T2D\"]-1 In\u00a0[7]: Copied!
```python
a.data
```

```
     IID      0.1        0.05           0.2        0.3            0.4        0.5        0.01       T2D  PC1        PC2        PC3        PC4        PC5
0    HG00403  -0.000061  -2.812450e-05  -0.000019  -2.131690e-05  -0.000024  -0.000022   0.000073  0     0.000107   0.039080   0.021048   0.016633   0.063373
1    HG00404   0.000025   4.460810e-07   0.000041   4.370760e-05   0.000024   0.000018   0.000156  1    -0.001216   0.045148   0.009013   0.028122   0.041474
2    HG00406   0.000011   2.369040e-05  -0.000009   2.928090e-07  -0.000010  -0.000008  -0.000188  0     0.005020   0.044668   0.016583   0.020077  -0.031782
3    HG00407  -0.000133  -1.326670e-04  -0.000069  -5.677710e-05  -0.000062  -0.000057  -0.000744  1     0.005408   0.034132   0.014955   0.003872   0.009794
4    HG00409   0.000010  -3.120730e-07  -0.000012  -1.873660e-05  -0.000025  -0.000023  -0.000367  1    -0.002121   0.031752  -0.048352  -0.043185   0.064674
..   ...       ...        ...            ...        ...            ...        ...        ...       ..    ...        ...        ...        ...        ...
499  NA19087  -0.000042  -6.215880e-05  -0.000038  -1.116230e-05  -0.000019  -0.000018  -0.000397  0    -0.067583  -0.040340   0.015038   0.039039  -0.010774
500  NA19088   0.000085   9.058670e-05   0.000047   2.666260e-05   0.000016   0.000014   0.000723  0    -0.069752  -0.047710   0.028578   0.036714  -0.000906
501  NA19089  -0.000067  -4.767610e-05  -0.000011  -1.393760e-05  -0.000019  -0.000016  -0.000126  0    -0.073989  -0.046706   0.040089  -0.034719  -0.062692
502  NA19090   0.000064   3.989030e-05   0.000022   7.445850e-06   0.000010   0.000003  -0.000149  0    -0.061156  -0.034606   0.032674  -0.016363  -0.065390
503  NA19091   0.000051   4.469220e-05   0.000043   3.089720e-05   0.000019   0.000016   0.000028  0    -0.067749  -0.052950   0.036908  -0.023856  -0.058515
```
504 rows × 14 columns
```python
# set the population prevalence of T2D (used for the liability-scale R2)
a.set_k({"T2D": 0.2})
```
```python
a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols, r2_lia=True)
```
```
- Binary trait: fitting logistic regression...
  - Binary trait: using records with phenotype being 0 or 1...
Optimization terminated successfully.
         Current function value: 0.668348
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.653338
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.657903
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654492
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654413
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.653085
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654681
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.661290
         Iterations 5
```

```
  PHENO  TYPE  PRS   N_CASE  N    BETA      CI_L      CI_U      P         R2_null   R2_full   Delta_R2  AUC_null  AUC_full  Delta_AUC  R2_lia_null  R2_lia_full  Delta_R2_lia  SE
0 T2D    B     0.01  200     502  0.250643  0.064512  0.436773  0.008308  0.010809  0.029616  0.018808  0.536921  0.586821  0.049901   0.010729     0.029826     0.019096      NaN
1 T2D    B     0.05  200     502  0.310895  0.119814  0.501976  0.001428  0.010809  0.038545  0.027736  0.536921  0.601987  0.065066   0.010729     0.038925     0.028196      NaN
2 T2D    B     0.5   200     502  0.367803  0.169184  0.566421  0.000284  0.010809  0.046985  0.036176  0.536921  0.605397  0.068477   0.010729     0.047553     0.036824      NaN
3 T2D    B     0.2   200     502  0.365641  0.169678  0.561604  0.000255  0.010809  0.047479  0.036670  0.536921  0.607318  0.070397   0.010729     0.048079     0.037349      NaN
4 T2D    B     0.3   200     502  0.367788  0.171062  0.564515  0.000248  0.010809  0.047686  0.036877  0.536921  0.608493  0.071573   0.010729     0.048315     0.037585      NaN
5 T2D    B     0.1   200     502  0.374750  0.181520  0.567979  0.000144  0.010809  0.050488  0.039679  0.536921  0.613957  0.077036   0.010729     0.051270     0.040540      NaN
6 T2D    B     0.4   200     502  0.389232  0.189866  0.588597  0.000130  0.010809  0.051145  0.040336  0.536921  0.609238  0.072318   0.010729     0.051845     0.041116      NaN
```
```python
a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)
```
```
Optimization terminated successfully.
         Current function value: 0.668348
         Iterations 5
```
```python
a.plot_prs(a.score_cols)
```
\n"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"GWASTutorial","text":"
Note: this tutorial is being updated to Version 2024
This Github page aims to provide a hands-on tutorial on common analysis in Complex Trait Genomics. This tutorial is designed for the course Fundamental Exercise II
provided by The Laboratory of Complex Trait Genomics at the University of Tokyo. For more information, please see About.
This tutorial covers the minimum skills and knowledge required to perform a typical genome-wide association study (GWAS). The contents are categorized into the following groups. Additionally, for absolute beginners, we also prepared a section on command lines in Linux.
If you have any questions or suggestions, please feel free to let us know in the Issue section of this repository.
"},{"location":"#contents","title":"Contents","text":""},{"location":"#command-lines","title":"Command lines","text":"In these sections, we will briefly introduce the Post-GWAS analyses, which will dig deeper into the GWAS summary statistics. \u00a0
Introductions on GWAS-related issues
504 EAS individuals from 1000 Genomes Project Phase 3 version 5
Url: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Genome build: human_g1k_v37.fasta (hg19)
"},{"location":"01_Dataset/#genotype-data-processing","title":"Genotype Data Processing","text":"plink --mac 2 --max--maf 0.01 --thin 0.02
)plink --maf 0.01 --thin 0.15
)Note
The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
has been included in 01_Dataset
when you clone the repository. There is no need to download it again if you clone this repository.
You can also simply run download_sampledata.sh
in 01_Dataset
and the dataset will be downloaded and decompressed.
./download_sampledata.sh\n
Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.
or you can manually download it from this link.
Unzip the dataset unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
, and you will get the following files:
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
"},{"location":"01_Dataset/#phenotype-simulation","title":"Phenotype Simulation","text":"Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.
gcta \\\n --bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \\\n --simu-cc 250 254 \\\n --simu-causal-loci causal.snplist \\\n --simu-hsq 0.8 \\\n --simu-k 0.5 \\\n --simu-rep 1 \\\n --out 1kgeas_binary\n
$ cat causal.snplist\n2:55620927:G:A 3\n8:97094292:C:T 3\n20:42758834:T:C 3\n7:134326056:G:T 3\n1:167562605:G:A 3\n
Warning
This simulation is just used for showing the analysis pipeline and data format. The trait was simulated under an unreal condition (effect sizes are extremely large) so the result itself is meaningless.
Allele frequency and Effect size
"},{"location":"01_Dataset/#reference","title":"Reference","text":"This section is intended to provide a minimum introduction of the command line in Linux system for handling genomic data. (If you are alreay familiar with Linux commands, it is completely ok to skip this section.)
If you are a beginner with no background in programming, it would be helpful to learn some basic commands before starting any analysis. In this section, we will introduce the most basic commands that enable you to handle genomic files in the terminal on a Linux system.
For Mac users
This tutorial will probably work with no problems. Just simply open your terminal and follow the tutorial. (Note: A few commands might be different on MacOS.)
For Windows users
You can simply install WSL to get a Linux environment. Please check here for how to install WSL.
"},{"location":"02_Linux_basics/#table-of-contents","title":"Table of Contents","text":"man
Main functions of the Linux kernel
Some of the most common linux distributions
Linux and Linus
Linux is named after Linus Benedict Torvalds, who is a legendary Finnish software engineer who lead the development of the Linux kernel. He also developped the amazing version control software - Git.
Reference: https://en.wikipedia.org/wiki/Linux
"},{"location":"02_Linux_basics/#how-do-we-interact-with-computers","title":"How do we interact with computers?","text":"GUI and CUI
Shell
$
is the prompt for bash shell, which indicate that you can type commands after the $
sign.%
, and C shell uses >
as the prompt sign.bash
. Tip
The reason why we want to use CUI for large-scale data analysis is that CUI is better in term of precision, memory usage and processing speed.
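As a quick check of which shell you are using, you can print the SHELL environment variable (a minimal sketch; the exact output path depends on your system):

```bash
# print the login shell of the current user
echo $SHELL
```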
"},{"location":"02_Linux_basics/#overview-of-the-basic-commands-in-linux","title":"Overview of the basic commands in Linux","text":"Unlike clicking and dragging files in Windows or MacOS, in Linux, we usually handle files by typing commands in the terminal.
Here is a list of the basic commands we are going to cover in this brief tutorial:
Basic Linux commands
Function group Commands Description Directoriespwd
, ls
, mkdir
, rmdir
Commands for checking, creating and removing directories Files touch
,cp
,mv
,rm
Commands for creating, copying, moving and removing files Checking files cat
,zcat
,head
,tail
,less
,more
,wc
Commands for inspecting files Archiving and compression tar
,gzip
,gunzip
,zip
,unzip
Commands for Archiving and Compressing files Manipulating text sort
,uniq
,cut
,join
,tr
Commands for manipulating text files Modifying permission chmod
,chown
, chgrp
Commands for changing the permissions of files and directories Links ln
Commands for creating symbolic and hard links Pipe, redirect and others pipe, >
,>>
,*
,.
,..
A group of miscellaneous commands Advance text editing awk
, sed
Commands for more complicated text manipulation and editing"},{"location":"02_Linux_basics/#how-to-check-the-usage-of-a-command-using-man","title":"How to check the usage of a command using man
:","text":"The first command we might want to learn is man
, which shows the manual for a certain command. When you forget how to use a command, you can always use man
to check.
man
: Check the manual of a command (e.g., man chmod
) or --help
option (e.g., chmod --help
)
For example, we want to check the usage of pwd
:
Use man
to get the manual for commands
$ man pwd\n
Then you will see the manual of pwd
in your terminal. PWD(1) User Commands PWD(1)\n\nNAME\n pwd - print name of current/working directory\n\nSYNOPSIS\n pwd [OPTION]...\n\nDESCRIPTION\n Print the full filename of the current working directory.\n....\n
Explain shell
Or you can use this wonderful website to get explanations for your commands.
URL : https://explainshell.com/
"},{"location":"02_Linux_basics/#commands","title":"Commands","text":""},{"location":"02_Linux_basics/#directories","title":"Directories","text":"The first set of commands are: pwd
, cd
, ls
, mkdir
and rmdir
, which are related to directories (like the folders in a Windows system).
pwd
","text":"pwd
: Print working directory, which means printing the path of the current directory (working directory)
Use pwd
to print the current directory you are in
$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
This command prints the absolute path.
An example of Linux file system and file paths
Type Description Example Absolute path path starting from root (the orange path)/home/User3/GWASTutorial/02_Linux_basics/README.md
Relative path path starting from the current directory (the blue path) ./GWASTutorial/02_Linux_basics/README.md
Tip: use readlink
to obtain the absolute path of a file
To get the absolute path of a file, you can use readlink -f [filename]
.
$ readlink -f README.md \n/home/he/work/GWASTutorial/02_Linux_basics/README.md\n
"},{"location":"02_Linux_basics/#cd","title":"cd
","text":"cd
: Change the current working directory.
Use cd
to change directory to 02_Linux_basics
and then print the current directory
$ cd 02_Linux_basics\n$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
"},{"location":"02_Linux_basics/#ls","title":"ls
","text":"ls
: List the contents in the working directory
Some frequently used options for ls
:
-l
: in a list-like format-h
: convert file size into a human readable format (KB,MB,GB...)-a
: list all files (including hidden files, namly those files with a period at the beginning of the filename)Simply list the files and directories in the current directory
$ ls\nREADME.md sumstats.txt\n
List the files and directories with options -lha
$ ls -lha\ndrwxr-xr-x 4 he staff 128B Dec 23 14:07 .\ndrwxr-xr-x 17 he staff 544B Dec 23 12:13 ..\n-rw-r--r-- 1 he staff 0B Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n
Tip: use tree
to visualize the structure of a directory
You can use tree
command to visualize the structure of a directory.
$ tree ./02_Linux_basics/\n./02_Linux_basics/\n\u251c\u2500\u2500 README.md\n\u2514\u2500\u2500 sumstats.txt\n\n0 directories, 2 files\n
"},{"location":"02_Linux_basics/#mkdir-rmdir","title":"mkdir
& rmdir
","text":"mkdir
: Create a new empty directoryrmdir
: Delete an empty directoryMake a directory and delete it
$ mkdir new_directory\n$ ls\nnew_directory README.md sumstats.txt\n$ rmdir new_directory/\n$ ls\nREADME.md sumstats.txt\n
"},{"location":"02_Linux_basics/#manipulating-files","title":"Manipulating files","text":"This set of commands includes: touch
, mv
, rm
and cp
touch
","text":"touch
command is used to create a new empty file.
Create an empty text file called newfile.txt
in this directory
$ ls -l\ntotal 64048\n-rw-r--r-- 1 he staff 0 Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 32790417 Dec 23 14:07 sumstats.txt\n\ntouch newfile.txt\n\n$ touch newfile.txt\n$ ls -l\ntotal 64048\n-rw-r--r-- 1 he staff 0 Oct 17 11:24 README.md\n-rw-r--r-- 1 he staff 0 Dec 23 14:14 newfile.txt\n-rw-r--r-- 1 he staff 32790417 Dec 23 14:07 sumstats.txt\n
"},{"location":"02_Linux_basics/#mv","title":"mv
","text":"mv
has two functions:
The following command will create a new directoru called new_directory
, and move sumstats.txt
into that directory. Just like draggig a file in to a folder in window system.
Move a file to a different directory
# make a new directory\n$ mkdir new_directory\n\n#move sumstats to the new directory\n$ mv sumstats.txt new_directory/\n\n# list the item in new_directory\n$ ls new_directory/\nsumstats.txt\n
Now, let's move it back to the current directory and rename it to sumstats_new.txt
.
Rename a file using mv
$ mv ./new_directory/sumstats.txt ./\n
Note: ./
means the current directory You can also use mv
to rename a file: #rename\n$mv sumstats.txt sumstats_new.txt \n
"},{"location":"02_Linux_basics/#rm","title":"rm
","text":"rm
: Remove files or diretories
Remove a file and a directory
# remove a file\n$rm file\n\n#remove files in a directory (recursive mode)\n$rm -r directory/\n
There is no trash can in Linux command-line interface
If you delete a file with rm
, it will be very difficult to restore it. Please be careful wehn using rm
.
cp
","text":"cp
command is used to copy files or diretories.
Copy a file and a directory
#cp files\n$cp file1 file2\n\n# copy directory\n$cp -r directory1/ directory2/\n
"},{"location":"02_Linux_basics/#links","title":"Links","text":"Symbolic link is like a shortcut on window system, which is a special type of file that points to another file.
It is very useful when you want to organize your tool box or working space.
You can use ln -s pathA pathB
to create such a link.
Create a symbolic link for plink
Let`s create a symbolic link for plink first.
# /home/he/tools/plink/plink is the orinial file\n# /home/he/tools/bin is the path for the symbolic link \nln -s /home/he/tools/plink/plink /home/he/tools/bin\n
And then check the link.
cd /home/he/tools/bin\nls -lha\nlrwxr-xr-x 1 he staff 27B Aug 30 11:30 plink -> /home/he/tools/plink/plink\n
"},{"location":"02_Linux_basics/#archiving-and-compression","title":"Archiving and Compression","text":"Results for millions of variants are usually very large, sometimes >10GB, or consists of multiple files.
To save space and make it easier to transfer, we need to archive and compress these files.
Archiving and Compression
Commoly used commands for archiving and compression:
Extensions Create Extract Functionsfile.gz
gzip
gunzip
compress files.tar
tar -cvf
tar -xvf
archive files.tar.gz
or files.tgz
tar -czvf
tar -xvzf
archive and compress file.zip
zip
unzip
archive and compress Compress and decompress a file using gzip
and gunzip
$ ls -lh\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n\n$ gzip sumstats.txt\n$ ls -lh\n-rw-r--r-- 1 he staff 9.9M Dec 23 14:07 sumstats.txt.gz\n\n$ gunzip sumstats.txt.gz\n$ ls -lh\n-rw-r--r-- 1 he staff 31M Dec 23 14:07 sumstats.txt\n
"},{"location":"02_Linux_basics/#read-and-check-files","title":"Read and check files","text":"We have a group of handy commands to check part of or the entire file, including cat
, zcat
, less
, head
, tail
, wc
cat
","text":"cat
command can print the contents of files or concatenate the files.
Create and then cat
the file a_text_file.txt
$ ls -lha > a_text_file.txt\n$ cat a_text_file.txt \ntotal 32M\ndrwxr-x--- 2 he staff 4.0K Apr 2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr 1 22:20 ..\n-rw-r----- 1 he staff 0 Apr 2 00:37 a_text_file.txt\n-rw-r----- 1 he staff 5.0K Apr 1 22:20 README.md\n-rw-r----- 1 he staff 32M Mar 30 18:17 sumstats.txt\n
Warning
Be careful not to cat
a text file with a huge number of lines. You can try to cat sumstats.txt
and see what happends.
By the way, > a_text_file.txt
here means redirect the output to file a_text_file.txt
.
zcat
","text":"zcat
is similar to cat
, but can only applied to compressed files.
cat
and zcat
a gzipped text file
$ gzip a_text_file.txt \n$ cat a_text_file.txt.gz TGba_text_file. txt\u044f\n@\u0231\u00bbO\ud8ac\udc19v\u0602\ud85e\udca9\u00bc\ud9c3\udce0bq}\udb06\udca4\\\ueee0\u00a4n\u0662\u00aa\uda40\udc2cn\u00bb\u06a1\u01ed\n w5J_\u00bd\ud88d\ude27P\u07c9=\u00ffK\n(\u05a3\u0530\u00a7\u04a4\u0176a\u0786 \u00acM\u00adR\udbb5\udc8am\u00b3\u00fee\u00b8\u00a4\u00bc\u05cdSd\ufff1\u07f2\ub4e4\u00aa\u00adv\n \u5a41 resize: unknown character, exiting.\n\n$ zcat a_text_file.txt.gz \ntotal 32M\ndrwxr-x--- 2 he staff 4.0K Apr 2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr 1 22:20 ..\n-rw-r----- 1 he staff 0 Apr 2 00:37 a_text_file.txt\n-rw-r----- 1 he staff 5.0K Apr 1 22:20 README.md\n-rw-r----- 1 he staff 32M Mar 30 18:17 sumstats.txt\n
gzcat
Use gzcat
instead of zcat
if your device is running MacOS.
head
","text":"head
: Print the first 10 lines.
-n
: option to change the number of lines.
Check the first 10 lines and only the first line of the file sumstats.txt
$ head sumstats.txt \nCHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 319 17 2 1 1 ADD 10000 1.04326 0.0495816 0.854176 0.393008 .\n1 319 22 1 2 2 ADD 10000 1.03347 0.0493972 0.666451 0.505123 .\n1 418 23 1 2 2 ADD 10000 1.02668 0.0498185 0.528492 0.597158 .\n1 537 30 1 2 2 ADD 10000 1.01341 0.0498496 0.267238 0.789286 .\n1 546 31 2 1 1 ADD 10000 1.02051 0.0336786 0.60284 0.546615 .\n1 575 33 2 1 1 ADD 10000 1.09795 0.0818305 1.14199 0.25346 .\n1 752 44 2 1 1 ADD 10000 1.02038 0.0494069 0.408395 0.682984 .\n1 913 50 2 1 1 ADD 10000 1.07852 0.0493585 1.53144 0.12566 .\n1 1356 77 2 1 1 ADD 10000 0.947521 0.0339805 -1.5864 0.112649 .\n\n$ head -n 1 sumstats.txt \nCHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n
"},{"location":"02_Linux_basics/#tail","title":"tail
","text":"Similar to head
, you can use tail
ro check the last 10 lines. -n
works in the same way.
Check the last 10 lines of the file sumstats.txt
$ tail sumstats.txt \n22 99996057 9959945 2 1 1 ADD 10000 1.03234 0.0335547 0.948413 0.342919.\n22 99996465 9959971 2 1 1 ADD 10000 1.04755 0.0337187 1.37769 0.1683 .\n22 99997041 9960013 2 1 1 ADD 10000 1.01942 0.0937548 0.205195 0.837419.\n22 99997608 9960051 2 1 1 ADD 10000 0.969928 0.0397711 -0.767722 0. 442652 .\n22 99997629 9960055 2 1 1 ADD 10000 0.986949 0.0395305 -0.332315 0. 739652 .\n22 99997742 9960061 2 1 1 ADD 10000 0.990829 0.0396614 -0.232298 0. 816307 .\n22 99998121 9960086 2 1 1 ADD 10000 1.04448 0.0335879 1.29555 0.19513 .\n22 99998455 9960106 2 1 1 ADD 10000 0.880953 0.152754 -0.829771 0. 406668 .\n22 99999208 9960146 2 1 1 ADD 10000 0.944604 0.065187 -0.874248 0. 381983 .\n22 99999382 9960164 2 1 1 ADD 10000 0.970509 0.033978 -0.881014 0.37831 .\n
"},{"location":"02_Linux_basics/#wc","title":"wc
","text":"wc
: short for word count, which count the lines, words, and characters in a file.
For example,
Count the lines, words, and characters in sumstats.txt
$ wc sumstats.txt \n 445933 5797129 32790417 sumstats.txt\n
This means that sumstats.txt
has 445933 lines, 5797129 words, and 32790417 characters. "},{"location":"02_Linux_basics/#edit-files","title":"Edit files","text":"Vim is a handy text editor for command line.
Vim - text editor
vim README.md\n
Simple workflow using Vim
vim file_to_edit.txt
i
to enter the INSERT mode.Esc
key to escape the INSERT mode.:wq
to quit and also save the file.Vim is a little bit hard to learn for beginners, but when you get familiar with it, it will be a mighty and convenient tool. For more detailed tutorials on Vim, you can check: https://github.com/iggredible/Learn-Vim
Other common command line text editors
The permissions of a file or directory are represented as a 10-character string (1+3+3+3) :
For example, this represents a directory(the initial d) which is readable, writable and executable for the owner(the first 3: rwx), users in the same group(the 3 characters in the middle: rwx) and others (last 3 characters: rwx).
drwxrwxrwx
-> d (directory or file) rwx (permissions for owner) rwx (permissions for users in the same group) rwx (permissions for other users)
r
readable w
writable x
executable d
directory -
file Command for checking the permissions of files in the current directory: ls -l
Command for changing permissions: chmod
, chown
, chgrp
Syntax:
chmod [3-digit Binary notation] [path]\n
Number notation Permission 3-digit Binary notation 7 rwx
111 6 rw-
110 5 r-x
101 4 r--
100 3 -wx
011 2 -w-
010 1 --x
001 0 ---
000 Change the permissions of the file README.md
to 660
# there is a readme file in the directory, and its permissions are -rw-r----- \n$ ls -lh\ntotal 4.0K\n-rw-r----- 1 he staff 2.1K Feb 24 01:16 README.md\n\n# let's change the permissions to 660, which is a numeric notation of -rw-rw---- based on the table above\n$ chmod 660 README.md \n\n# chack again, and it was changed.\n$ ls -lh\ntotal 4.0K\n-rw-rw---- 1 he staff 2.1K Feb 24 01:16 README.md\n
Note
These commands are very important because we use genome data, which could raise severe ethical and privacy issues if there is data leak.
Warning
Please always be cautious when handling human genomic data.
"},{"location":"02_Linux_basics/#others","title":"Others","text":"There are a group of very handy and flexible commands which will greatly improve your efficiency. These include |
, >
, >>
,*
,.
,..
,~
,and -
.
|
(pipe)","text":"Pipe basically is used to pass the output of the previous command to the next command as input, instead of printing is in terminal. Using pipe you can do very complicated manipulations of the files.
An example of Pipe
cat sumstats.txt | sort | uniq | wc\n
This means (1) print sumstats, (2) sort the output, (3) then keep the unique lines and finally (4) count the lines and words."},{"location":"02_Linux_basics/#_1","title":">
","text":">
redirects output to a new file (if the file already exist, it will be overwritten)
Redirects the output of cat sumstats.txt | sort | uniq | wc
to count.txt
cat sumstats.txt | sort | uniq | wc > count.txt\n
"},{"location":"02_Linux_basics/#_2","title":">>
","text":">>
redirects output to a file by appending to the end of the file (if the file already exist, it will not be overwritten)
Redirects the output of cat sumstats.txt | sort | uniq | wc
to count.txt
by appending
cat sumstats.txt | sort | uniq | wc >> count.txt\n
Other useful commands include :
Command Description Example Code Example code meaning*
represent zero or more characters - - ?
represent a single character - - .
the current directory - - ..
the parent directory of the current directory. cd ..
change to the parent directory of the current directory ~
the home directory cd ~
change to the curent user's home directory -
the last directory you are working in. cd -
change to the last directory you are working in. Wildcards
The asterisk *
and the question mark ?
are called wildcard characters or wildcards in Linux, which are special symbols that can represent other normal characters. Wildcards are especially useful when handling multiple files with similar pattern in their names.
Warning
Be extremely careful when you use rm and *. It is disastrous when you mistakenly type rm *
If you have a lot of commands to run, or if you want to automate some complex manipulations, bash scripts are a good way to address this issue.
We can use vim to create a bash script called hello.sh
A simple example of bash scripts:
Example
hello.sh#!/bin/bash\necho \"Hello, world1\"\necho \"Hello, world2\"\n
#!
is called shebang, which tells the system which interpreter to use to execute the shell script.
Then use chmod
to give it permission to execute.
chmod +x hello.sh \n
Now we can run the srcipt by ./hello.sh
:
./hello.sh\n\"Hello, world1\" \n\"Hello, world2\" \n
"},{"location":"02_Linux_basics/#advanced-text-editing","title":"Advanced text editing","text":"(optional: awk, sed, cut, sort, join, uniq)
cut
: cutting out columns from files.sort
: sorting the lines of a file.uniq
: filter the duplicated lines in a file.join
: join two tabular files based on specified keys.Advanced commands:
awk
: https://cloufield.github.io/GWASTutorial/60_awk/sed
: https://cloufield.github.io/GWASTutorial/61_sed/Git is a powerful version control software and github is a platform where you can share your codes.
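As a quick taste of these utilities before the dedicated awk/sed sections, here is a minimal sketch that counts how many lines each chromosome contributes to the sample summary statistics (assuming the file is tab-delimited; adjust the delimiter with `-d` otherwise):

```bash
# extract the first column (chromosome), sort it, and count unique values
cut -f 1 sumstats.txt | sort | uniq -c
```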
Currently you just need to learn git clone
, which simply downloads an existing repository.
git clone https://github.com/Cloufield/GWASTutorial.git
You can also check here for more information.
Quote
We can use wget [option] [url]
command to download files to local machine.
-O
option specify the file name you want to change for the downloaded file.
Use wget to download the hg19 reference genome from UCSC
# Download hg19 reference genome from UCSC\nwget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n\n# Download hg19 reference genome from UCSC and rename it to my_refgenome.fa.gz\nwget -O my_refgenome.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n
"},{"location":"02_Linux_basics/#exercise","title":"Exercise","text":"The questions are generated by Microsoft Bing!
What is the command to list all files and directories in your current working directory?
ls
cd
pwd
mkdir
What is the command to create a new directory named \u201ctest\u201d?
cd test
pwd test
mkdir test
ls test
What is the command to copy a file named \u201cdata.txt\u201d from your current working directory to another directory named \u201cbackup\u201d?
cp data.txt backup/
mv data.txt backup/
rm data.txt backup/
cat data.txt backup/
What is the command to display the first 10 lines of a file named \u201cresults.csv\u201d?
head results.csv
tail results.csv
less results.csv
more results.csv
What is the command to count the number of lines, words, and characters in a file named \u201creport.txt\u201d?
wc report.txt
count report.txt
size report.txt
stat report.txt
What is the command to search for a pattern in a file named \u201clog.txt\u201d and print only the matching lines?
grep pattern log.txt
find pattern log.txt
locate pattern log.txt
search pattern log.txt
What is the command to sort the contents of a file named \u201cnames.txt\u201d in alphabetical order and save the output to a new file named \u201csorted_names.txt\u201d?
sort names.txt > sorted_names.txt
sort names.txt < sorted_names.txt
sort names.txt >> sorted_names.txt
sort names.txt << sorted_names.txt
What is the command to display the difference between two files named \u201cold_version.py\u201d and \u201cnew_version.py\u201d?
diff old_version.py new_version.py
cmp old_version.py new_version.py
diffy old_version.py new_version.py
compare old_version.py new_version.py
What is the command to change the permissions of a file named \u201cscript.sh\u201d to make it executable by everyone?
chmod +x script.sh
chmod 777 script.sh
chmod ugo+x script.sh
All of the above
What is the command to run a program named \u201cprogram.exe\u201d in the background and redirect its output to a file named \u201coutput.log\u201d?
program.exe & > output.log
program.exe > output.log &
program.exe < output.log &
program.exe & < output.log
This section lists some of the most commonly used formats in complex trait genomic analysis.
"},{"location":"03_Data_formats/#table-of-contents","title":"Table of Contents","text":"Simple text file
.txt
cat sample_text.txt \nLorem ipsum dolor sit amet, consectetur adipiscing elit. In ut sem congue, tristique tortor et, ullamcorper elit. Nulla elementum, erat ac fringilla mattis, nisi tellus euismod dui, interdum laoreet orci velit vel leo. Vestibulum neque mi, pharetra in tempor id, malesuada at ipsum. Duis tellus enim, suscipit sit amet vestibulum in, ultricies vitae erat. Proin consequat id quam sed sodales. Ut a magna non tellus dictum aliquet vitae nec mi. Suspendisse potenti. Vestibulum mauris sem, viverra ac metus sed, scelerisque ornare arcu. Vivamus consequat, libero vitae aliquet tempor, lorem leo mattis arcu, et viverra erat ligula sit amet tortor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Praesent ut massa ac tortor lobortis placerat. Pellentesque aliquam tortor augue, at rutrum magna molestie et. Etiam congue nulla in venenatis congue. Nunc ac felis pharetra, cursus leo et, finibus eros.\n
Random texts are generated using - https://www.lipsum.com/"},{"location":"03_Data_formats/#tsv","title":"tsv","text":"Tab-separated values Tabular data format
.tsv
head sample_data.tsv\n#CHROM POS ID REF ALT A1 FIRTH? TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C N ADD 503 0.750168 0.280794 -1.02373 0.305961 .\n1 14599 1:14599:T:A T A A N ADD 503 1.80972 0.231595 2.56124 0.0104299 .\n1 14604 1:14604:A:G A G G N ADD 503 1.80972 0.231595 2.56124 0.0104299 .\n1 14930 1:14930:A:G A G G N ADD 503 1.70139 0.240245 2.21209 0.0269602 .\n1 69897 1:69897:T:C T C T N ADD 503 1.58002 0.194774 2.34855 0.0188466 .\n1 86331 1:86331:A:G A G G N ADD 503 1.47006 0.236102 1.63193 0.102694 .\n1 91581 1:91581:G:A G A A N ADD 503 0.924422 0.122991 -0.638963 0.522847 .\n1 122872 1:122872:T:G T G G N ADD 503 1.07113 0.180776 0.380121 0.703856 .\n1 135163 1:135163:C:T C T T N ADD 503 0.711822 0.23908 -1.42182 0.155079 .\n
"},{"location":"03_Data_formats/#csv","title":"csv","text":"Comma-separated values Tabular data format
.csv
head sample_data.csv \n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
"},{"location":"03_Data_formats/#data-formats-in-bioinformatics","title":"Data formats in bioinformatics","text":"A typical workflow for generating genotype data for genome-wide association analysis.
"},{"location":"03_Data_formats/#sequence","title":"Sequence","text":""},{"location":"03_Data_formats/#fasta","title":"fasta","text":"text-based format for representing either nucleotide sequences or amino acid (protein) sequences
.fa
or .fasta
>SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n
"},{"location":"03_Data_formats/#fastq","title":"fastq","text":"text-based format for storing both a nucleotide sequence and its corresponding quality scores
.fastq
@SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n+\n!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n
Reference: https://en.wikipedia.org/wiki/FASTQ_format"},{"location":"03_Data_formats/#alingment","title":"Alingment","text":""},{"location":"03_Data_formats/#sambam","title":"SAM/BAM","text":"Sequence Alignment/Map Format is a TAB-delimited text file format consisting of a header section and an alignment section.
.sam
@HD VN:1.6 SO:coordinate\n@SQ SN:ref LN:45\nr001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *\nr002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *\nr003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;\nr004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *\nr003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;\nr001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1\n
Reference : https://samtools.github.io/hts-specs/SAMv1.pdf"},{"location":"03_Data_formats/#variant-and-genotype","title":"Variant and genotype","text":""},{"location":"03_Data_formats/#vcf-vcfgz-vcfgztbi","title":"vcf / vcf.gz / vcf.gz.tbi","text":"VCF is a text file format consisting of meta-information lines, a header line, and then data lines. Each data line contains information about a variant in the genome (and the genotype information on samples for each variant).
.vcf
##fileformat=VCFv4.2\n##fileDate=20090805\n##source=myImputationProgramV3.1\n##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta\n##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>\n##phasing=partial\n##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">\n##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">\n##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">\n##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">\n##FILTER=<ID=q10,Description=\"Quality below 10\">\n##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">\n##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">\n##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">\n#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003\n20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.\n20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3\n20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4\n20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2\n20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3\n
Reference : https://samtools.github.io/hts-specs/VCFv4.2.pdf "},{"location":"03_Data_formats/#plink-format","title":"PLINK format","text":"The figure shows how genotypes are stored in files.
We have 3 parts of information:
And there are different ways (format sets) to represent this information in PLINK1.9 and PLINK2:
.ped
(PLINK/MERLIN/Haploview text pedigree + genotype table)
Original standard text format for sample pedigree information and genotype calls. Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.
.ped
# check the first 16 rows and 16 columns of the ped file\ncut -d \" \" -f 1-16 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped | head\n0 HG00403 0 0 0 -9 G G T T A A G A C C\n0 HG00404 0 0 0 -9 G G T T A A G A T C\n0 HG00406 0 0 0 -9 G G T T A A G A T C\n0 HG00407 0 0 0 -9 G G T T A A A A C C\n0 HG00409 0 0 0 -9 G G T T A A G A C C\n0 HG00410 0 0 0 -9 G G T T A A G A C C\n0 HG00419 0 0 0 -9 G G T T A A A A T C\n0 HG00421 0 0 0 -9 G G T T A A G A C C\n0 HG00422 0 0 0 -9 G G T T A A G A C C\n0 HG00428 0 0 0 -9 G G T T A A G A C C\n0 HG00436 0 0 0 -9 G G A T G A A A C C\n0 HG00437 0 0 0 -9 C G T T A A G A C C\n0 HG00442 0 0 0 -9 G G T T A A G A C C\n0 HG00443 0 0 0 -9 G G T T A A G A C C\n0 HG00445 0 0 0 -9 G G T T A A G A C C\n0 HG00446 0 0 0 -9 C G T T A A G A T C\n
.map
(PLINK text fileset variant information file)
Variant information file accompanying a .ped text pedigree + genotype table. A text file with no header line, and one line per variant with the following 3-4 fields:
.map
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n1 1:13273:G:C 0 13273\n1 1:14599:T:A 0 14599\n1 1:14604:A:G 0 14604\n1 1:14930:A:G 0 14930\n1 1:69897:T:C 0 69897\n1 1:86331:A:G 0 86331\n1 1:91581:G:A 0 91581\n1 1:122872:T:G 0 122872\n1 1:135163:C:T 0 135163\n1 1:233473:C:G 0 233473\n
Reference: https://www.cog-genomics.org/plink/1.9/formats
"},{"location":"03_Data_formats/#bed-fam-bim","title":"bed / fam /bim","text":"bed/fam/bim formats are the binary implementation of ped/map formats. bed/bim/fam files contain the same information as ped/map but are much smaller in size.
-rw-r----- 1 yunye yunye 135M Dec 23 11:45 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed\n-rw-r----- 1 yunye yunye 36M Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n-rw-r----- 1 yunye yunye 9.4K Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n-rw-r--r-- 1 yunye yunye 32M Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n-rw-r--r-- 1 yunye yunye 2.2G Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped\n
.fam
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n0 HG00403 0 0 0 -9\n0 HG00404 0 0 0 -9\n0 HG00406 0 0 0 -9\n0 HG00407 0 0 0 -9\n0 HG00409 0 0 0 -9\n0 HG00410 0 0 0 -9\n0 HG00419 0 0 0 -9\n0 HG00421 0 0 0 -9\n0 HG00422 0 0 0 -9\n0 HG00428 0 0 0 -9\n
.bim
head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n1 1:13273:G:C 0 13273 C G\n1 1:14599:T:A 0 14599 A T\n1 1:14604:A:G 0 14604 G A\n1 1:14930:A:G 0 14930 G A\n1 1:69897:T:C 0 69897 C T\n1 1:86331:A:G 0 86331 G A\n1 1:91581:G:A 0 91581 A G\n1 1:122872:T:G 0 122872 G T\n1 1:135163:C:T 0 135163 T C\n1 1:233473:C:G 0 233473 G C\n
.bed
\"Primary representation of genotype calls at biallelic variants The first three bytes should be 0x6c, 0x1b, and 0x01 in that order. The rest of the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.\"
hexdump -C 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed | head\n00000000 6c 1b 01 ff ff bf bf ff ff ff ef fb ff ff ff fe |l...............|\n00000010 ff ff ff ff fb ff bb ff ff fb af ff ff fe fb ff |................|\n00000020 ff ff ff fe ff ff ff ff ff bf ff ff ef ff ff ef |................|\n00000030 bb ff ff ff ff ff ff ff fa ff ff ff ff ff ff ff |................|\n00000040 ff ff ff fb ff ff ff ff ff ff ff ff ff ff ff ef |................|\n00000050 ff ff ff fb fe ef fe ff ff ff ff eb ff ff fe fe |................|\n00000060 ff ff fe ff bf ff fa fb fb eb be ff ff 3b ff be |.............;..|\n00000070 fe be bf ef fe ff ef ee ff ff bf ea fe bf fe ff |................|\n00000080 bf ff ff ef ff ff ff ff ff fa ff ff eb ff ff ff |................|\n00000090 ff ff fb fe af ff bf ff ff ff ff ff ff ff ff ff |................|\n
Reference: https://www.cog-genomics.org/plink/1.9/formats
"},{"location":"03_Data_formats/#imputation-dosage","title":"Imputation dosage","text":""},{"location":"03_Data_formats/#bgen-bgi","title":"bgen / bgi","text":"Reference: https://www.well.ox.ac.uk/~gav/bgen_format/
"},{"location":"03_Data_formats/#pgenpsampvar","title":"pgen,psam,pvar","text":"Reference: https://www.cog-genomics.org/plink/2.0/formats#pgen
NOTE: pgen
only stores the dosage for each individual (a scalar ranging from 0 to 2). It cannot be converted back to the genotype probabilities (a vector of length 3) or allele probabilities (a 2 x 2 matrix) stored in bgen
.
In this module, we will learn the basics of genotype data QC using PLINK, which is one of the most widely used software packages in complex trait genomics. (Huge thanks to the developers: PLINK1.9 and PLINK2)
"},{"location":"04_Data_QC/#table-of-contents","title":"Table of Contents","text":"To get prepared for genotype QC, we will need to make directories, download software and add the software to your environment path.
First, we will simply create some directories to keep the tools we need to use.
Create directories
cd ~\nmkdir tools\ncd tools\nmkdir bin\nmkdir plink\nmkdir plink2\n
You can download each tool into its corresponding directories.
The bin
directory here is for keeping all the symbolic links to the executable files of each tool.
In this way, it is much easier to manage and organize the paths and tools. We will only add the bin
directory here to the environment path.
Next, go to the Plink webpage to download the software. We will need both PLINK1.9 and PLINK2.
Download PLINK1.9 and PLINK2 from the following webpage to the corresponding directories:
Info
If you are using Mac or Windows, then please download the Mac or Windows version. In this tutorial, we will use a Linux system and the Linux version of PLINK.
Find the suitable version on the PLINK website, right-click and copy the link address.
Download PLINK2 (Linux AVX2 AMD)
cd ~/tools/plink2\nwget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_amd_avx2_20231212.zip\nunzip plink2_linux_amd_avx2_20231212.zip\n
Then do the same for PLINK1.9
Download PLINK1.9 (Linux 64-bit)
cd ~/tools/plink\nwget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip\nunzip plink_linux_x86_64_20231211.zip\n
"},{"location":"04_Data_QC/#create-symbolic-links","title":"Create symbolic links","text":"After downloading and unzipping, we will create symbolic links for the plink binary files, and then move the link to ~/tools/bin/
.
Create symbolic links
cd ~\nln -s ~/tools/plink2/plink2 ~/tools/bin/plink2\nln -s ~/tools/plink/plink ~/tools/bin/plink\n
"},{"location":"04_Data_QC/#add-paths-to-the-environment-path","title":"Add paths to the environment path","text":"Then add ~/tools/bin/
to the environment path.
Example
export PATH=$PATH:~/tools/bin/\n
This command will add the path to your current shell. If you restart the terminal, it will be lost. So you may need to add it to the Bash configuration file. Then run
echo \"export PATH=$PATH:~/tools/bin/\" >> ~/.bashrc\n
This will add a new line at the end of .bashrc
, which will be run every time you open a new bash shell.
All done. Let's test if we installed PLINK successfully or not.
Check if PLINK is installed successfully.
./plink\nPLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\n\nplink <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink --help [flag name(s)...]\n\nCommands include --make-bed, --recode, --flip-scan, --merge-list,\n--write-snplist, --list-duplicate-vars, --freqx, --missing, --test-mishap,\n--hardy, --mendel, --ibc, --impute-sex, --indep-pairphase, --r2, --show-tags,\n--blocks, --distance, --genome, --homozyg, --make-rel, --make-grm-gz,\n--rel-cutoff, --cluster, --pca, --neighbour, --ibs-test, --regress-distance,\n--model, --bd, --gxe, --logistic, --dosage, --lasso, --test-missing,\n--make-perm-pheno, --tdt, --qfam, --annotate, --clump, --gene-report,\n--meta-analysis, --epistasis, --fast-epistasis, and --score.\n\n\"plink --help | more\" describes all functions (warning: long).\n
./plink2\nPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\n\nplink2 <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink2 --help [flag name(s)...]\n\nCommands include --rm-dup list, --make-bpgen, --export, --freq, --geno-counts,\n--sample-counts, --missing, --hardy, --het, --fst, --indep-pairwise, --ld,\n--sample-diff, --make-king, --king-cutoff, --pmerge, --pgen-diff,\n--write-samples, --write-snplist, --make-grm-list, --pca, --glm, --adjust-file,\n--gwas-ssf, --clump, --score, --variant-score, --genotyping-rate, --pgen-info,\n--validate, and --zst-decompress.\n\n\"plink2 --help | more\" describes all functions.\n
Well done. We have successfully installed plink1.9 and plink2.
"},{"location":"04_Data_QC/#download-genotype-data","title":"Download genotype data","text":"Next, we need to download the sample genotype data. The way to create the sample data is described [here].(https://cloufield.github.io/GWASTutorial/01_Dataset/) This dataset contains 504 EAS individuals from 1000 Genome Project Phase 3v5 with around 1 million variants.
Simply run download_sampledata.sh
in 01_Dataset to download this dataset (from Dropbox). See here
Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.
Download sample data
cd ../01_Dataset\n./download_sampledata.sh\n
And you will get the following three PLINK files:
-rw-r--r-- 1 yunye yunye 149M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n-rw-r--r-- 1 yunye yunye 40M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n-rw-r--r-- 1 yunye yunye 13K Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
Check the bim file:
head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1 1:14930:A:G 0 14930 G A\n1 1:15774:G:A 0 15774 A G\n1 1:15777:A:G 0 15777 G A\n1 1:57292:C:T 0 57292 T C\n1 1:77874:G:A 0 77874 A G\n1 1:87360:C:T 0 87360 T C\n1 1:92917:T:A 0 92917 A T\n1 1:104186:T:C 0 104186 T C\n1 1:125271:C:T 0 125271 C T\n1 1:232449:G:A 0 232449 A G\n
Check the fam file:
head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\nHG00403 HG00403 0 0 0 -9\nHG00404 HG00404 0 0 0 -9\nHG00406 HG00406 0 0 0 -9\nHG00407 HG00407 0 0 0 -9\nHG00409 HG00409 0 0 0 -9\nHG00410 HG00410 0 0 0 -9\nHG00419 HG00419 0 0 0 -9\nHG00421 HG00421 0 0 0 -9\nHG00422 HG00422 0 0 0 -9\nHG00428 HG00428 0 0 0 -9\n
"},{"location":"04_Data_QC/#plink-tutorial","title":"PLINK tutorial","text":"Detailed descriptions can be found on plink's website: PLINK1.9 and PLINK2.
The functions we will learn in this tutorial:
All sample codes and results for this module are available in ./04_data_QC
QC Step Summary
QC step Option in PLINK Commonly used threshold to exclude Sample missing rate--geno
, --missing
missing rate > 0.01 (0.02, or 0.05) SNP missing rate --mind
, --missing
missing rate > 0.01 (0.02, or 0.05) Minor allele frequency --freq
, --maf
maf < 0.01 Sample Relatedness --genome
pi_hat > 0.2 to exclude second-degree relatives Hardy-Weinberg equilibrium --hwe
,--hardy
hwe < 1e-6 Inbreeding F coefficient --het
outside of 3 SD from the mean First, we can calculate some basic statistics of our simulated data:
"},{"location":"04_Data_QC/#missing-rate-call-rate","title":"Missing rate (call rate)","text":"The first thing we want to know is the missing rate of our data. Usually, we need to check the missing rate of samples and SNPs to decide a threshold to exclude low-quality samples and SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#missing)
Missing rate and Call rate
Suppose we have N samples and M SNPs for each sample.
For sample \\(j\\) :
\\[Sample\\ Missing\\ Rate_{j} = {{N_{missing\\ SNPs\\ for\\ j}}\\over{M}} = 1 - Call\\ Rate_{sample, j}\\]For SNP \\(i\\) :
\\[SNP\\ Missing\\ Rate_{i} = {{N_{missing\\ samples\\ at\\ i}}\\over{N}} = 1 - Call\\ Rate_{SNP, i}\\]The input is PLINK bed/bim/fam file. Usually, they have the same prefix, and we just need to pass the prefix to --bfile
option.
PLINK syntax
To calculate the missing rate, we need the flag --missing
, which tells PLINK to calculate the missing rate in the dataset specified by --bfile
.
Calculate missing rate
cd ../04_Data_QC\ngenotypeFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" #!!! Please add your own path here. \"1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" is the prefix of PLINK bed file. \n\nplink \\\n --bfile ${genotypeFile} \\\n --missing \\\n --out plink_results\n
Remeber to set the value for ${genotypeFile}
. This code will generate two files plink_results.imiss
and plink_results.lmiss
, which contain the missing rate information for samples and SNPs respectively.
Take a look at the .imiss
file. The last column shows the missing rate for samples. Since we used part of the 1000 Genome Project data this time, there are no missing SNPs in the original datasets. But for educational purposes, we randomly make some of the genotypes missing.
# missing rate for each sample\nhead plink_results.imiss\n FID IID MISS_PHENO N_MISS N_GENO F_MISS\nHG00403 HG00403 Y 10020 1235116 0.008113\nHG00404 HG00404 Y 9192 1235116 0.007442\nHG00406 HG00406 Y 15751 1235116 0.01275\nHG00407 HG00407 Y 14653 1235116 0.01186\nHG00409 HG00409 Y 5667 1235116 0.004588\nHG00410 HG00410 Y 6066 1235116 0.004911\nHG00419 HG00419 Y 20000 1235116 0.01619\nHG00421 HG00421 Y 17542 1235116 0.0142\nHG00422 HG00422 Y 18608 1235116 0.01507\n
# missing rate for each SNP\nhead plink_results.lmiss\n CHR SNP N_MISS N_GENO F_MISS\n 1 1:14930:A:G 2 504 0.003968\n 1 1:15774:G:A 3 504 0.005952\n 1 1:15777:A:G 3 504 0.005952\n 1 1:57292:C:T 6 504 0.0119\n 1 1:77874:G:A 3 504 0.005952\n 1 1:87360:C:T 1 504 0.001984\n 1 1:92917:T:A 7 504 0.01389\n 1 1:104186:T:C 3 504 0.005952\n 1 1:125271:C:T 2 504 0.003968\n
Distribution of sample missing rate and SNP missing rate
Note: The missing values were simulated based on normal distributions for each individual.
Sample missing rate
SNP missing rate
For the meaning of headers, please refer to PLINK documents.
"},{"location":"04_Data_QC/#allele-frequency","title":"Allele Frequency","text":"One of the most important statistics of SNPs is their frequency in a certain population. Many downstream analyses are based on investigating differences in allele frequencies.
Usually, variants can be categorized into 3 groups based on their Minor Allele Frequency (MAF):
How to calculate Minor Allele Frequency (MAF)
Suppose the reference allele(REF) is A and the alternative allele(ALT) is B for a certain SNP. The posible genotypes are AA, AB and BB. In a population of N samples (2N alleles), \\(N = N_{AA} + 2 \\times N_{AB} + N_{BB}\\) :
So we can calculate the allele frequency:
The MAF for this SNP in this specific population is defined as:
\\(MAF = min( AF_{REF}, AF_{ALT} )\\)
For different downstream analyses, we might use different sets of variants. For example, for PCA, we might use only common variants. For gene-based tests, we might use only rare variants.
Using PLINK1.9 we can easily calculate the MAF of variants in the input data.
Calculate the MAF of variants using PLINK1.9
plink \\\n --bfile ${genotypeFile} \\\n --freq \\\n --out plink_results\n
# results from plink1.9\nhead plink_results.frq\nCHR SNP A1 A2 MAF NCHROBS\n1 1:14930:A:G G A 0.4133 1004\n1 1:15774:G:A A G 0.02794 1002\n1 1:15777:A:G G A 0.07385 1002\n1 1:57292:C:T T C 0.1054 996\n1 1:77874:G:A A G 0.01996 1002\n1 1:87360:C:T T C 0.02286 1006\n1 1:92917:T:A A T 0.003018 994\n1 1:104186:T:C T C 0.499 1002\n1 1:125271:C:T C T 0.03088 1004\n
Next, we use plink2 to run the same options to check the difference between the results.
Calculate the alternative allele frequencies of variants using PLINK2
plink2 \\\n --bfile ${genotypeFile} \\\n --freq \\\n --out plink_results\n
# results from plink2\nhead plink_results.afreq\n#CHROM ID REF ALT PROVISIONAL_REF? ALT_FREQS OBS_CT\n1 1:14930:A:G A G Y 0.413347 1004\n1 1:15774:G:A G A Y 0.0279441 1002\n1 1:15777:A:G A G Y 0.0738523 1002\n1 1:57292:C:T C T Y 0.105422 996\n1 1:77874:G:A G A Y 0.0199601 1002\n1 1:87360:C:T C T Y 0.0228628 1006\n1 1:92917:T:A T A Y 0.00301811 994\n1 1:104186:T:C T C Y 0.500998 1002\n1 1:125271:C:T C T Y 0.969124 1004\n
We need to pay attention to the concepts here.
In PLINK1.9, the concept here is minor (A1) and major(A2) allele, while in PLINK2 it is the reference (REF) allele and the alternative (ALT) allele.
For SNP QC, besides checking the missing rate, we also need to check if the SNP is in Hardy-Weinberg equilibrium:
--hardy
will perform Hardy-Weinberg equilibrium exact test for each variant. Variants with low P value usually suggest genotyping errors, or indicate evolutionary selection for these variants.
The following command can calculate the Hardy-Weinberg equilibrium exact test statistics for all SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#hardy)
Info
Suppose we have N unrelated samples (2N alleles). Under HWE, the exact probability of observing \\(n_{AB}\\) sample with genotype AB in N samples is:
\\[P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}}N!\\over{n_{AA}!n_{AB}!n_{BB}!}} \\times {{n_A!n_B!}\\over{n_A!n_B!}} \\]To compute the Hardy-Weinberg equilibrium exact test statistics, we will sum up the probabilities of all configurations with probability equal to or less than the observed configuration :
\\[P_{HWE} = \\sum_{n^{*}_AB} I[P(N_{AB} = n_{AB} | N, n_A) \\geqq P(N_{AB} = n^{*}_{AB} | N, n_A)] \\times P(N_{AB} = n^{*}_{AB} | N, n_A)\\]\\(I(x)\\) is the indicator function. If x is true, \\(I(x) = 1\\); otherwise, \\(I(x) = 0\\).
Reference : Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link
Calculate the Hardy-Weinberg equilibrium exact test statistics for a single SNP using Python
This code is converted from here (Jeremy McRae) to python. Original citation: Wigginton, JE, Cutler, DJ, and Abecasis, GR (2005) A Note on Exact Tests of Hardy-Weinberg Equilibrium. AJHG 76: 887-893
def snphwe(obs_hets, obs_hom1, obs_hom2):\n obs_homr = min(obs_hom1, obs_hom2)\n obs_homc = max(obs_hom1, obs_hom2)\n\n rare = 2 * obs_homr + obs_hets\n genotypes = obs_hets + obs_homc + obs_homr\n\n probs = [0.0 for i in range(rare +1)]\n\n mid = rare * (2 * genotypes - rare) // (2 * genotypes)\n if mid % 2 != rare%2:\n mid += 1\n\n probs[mid] = 1.0\n sum_p = 1 #probs[mid]\n\n curr_homr = (rare - mid) // 2\n curr_homc = genotypes - mid - curr_homr\n\n for curr_hets in range(mid, 1, -2):\n probs[curr_hets - 2] = probs[curr_hets] * curr_hets * (curr_hets - 1.0)/ (4.0 * (curr_homr + 1.0) * (curr_homc + 1.0))\n sum_p+= probs[curr_hets - 2]\n curr_homr += 1\n curr_homc += 1\n\n curr_homr = (rare - mid) // 2\n curr_homc = genotypes - mid - curr_homr\n\n for curr_hets in range(mid, rare-1, 2):\n probs[curr_hets + 2] = probs[curr_hets] * 4.0 * curr_homr * curr_homc/ ((curr_hets + 2.0) * (curr_hets + 1.0))\n sum_p += probs[curr_hets + 2]\n curr_homr -= 1\n curr_homc -= 1\n\n target = probs[obs_hets]\n p_hwe = 0.0\n for p in probs:\n if p <= target :\n p_hwe += p / sum_p \n\n return min(p_hwe,1)\n
Calculate the Hardy-Weinberg equilibrium exact test statistics using PLINK
plink \\\n --bfile ${genotypeFile} \\\n --hardy \\\n --out plink_results\n
head plink_results.hwe\n CHR SNP TEST A1 A2 GENO O(HET) E(HET) P\n1 1:14930:A:G ALL(NP) G A 4/407/91 0.8108 0.485 4.864e-61\n1 1:15774:G:A ALL(NP) A G 0/28/473 0.05589 0.05433 1\n1 1:15777:A:G ALL(NP) G A 1/72/428 0.1437 0.1368 0.5053\n1 1:57292:C:T ALL(NP) T C 3/99/396 0.1988 0.1886 0.3393\n1 1:77874:G:A ALL(NP) A G 0/20/481 0.03992 0.03912 1\n1 1:87360:C:T ALL(NP) T C 0/23/480 0.04573 0.04468 1\n1 1:92917:T:A ALL(NP) A T 0/3/494 0.006036 0.006018 1\n1 1:104186:T:C ALL(NP) T C 74/352/75 0.7026 0.5 6.418e-20\n1 1:125271:C:T ALL(NP) C T 1/29/472 0.05777 0.05985 0.3798\n
"},{"location":"04_Data_QC/#applying-filters","title":"Applying filters","text":"Previously we calculated the basic statistics using PLINK. But when performing certain analyses, we just want to exclude the bad-quality samples or SNPs instead of calculating the statistics for all samples and SNPs.
In this case we can apply the following filters for example:
--maf 0.01
: exlcude snps with maf<0.01--geno 0.02
:filters out all variants with missing rates exceeding 0.02--mind 0.02
:filters out all samples with missing rates exceeding 0.02--hwe 1e-6
: filters out all variants which have Hardy-Weinberg equilibrium exact test p-value below the provided threshold. NOTE: With case/control data, cases and missing phenotypes are normally ignored. (see https://www.cog-genomics.org/plink/1.9/filter#hwe)We will apply these filters in the following example if LD-pruning.
"},{"location":"04_Data_QC/#ld-pruning","title":"LD Pruning","text":"There is often strong Linkage disequilibrium(LD) among SNPs, for some analysis we don't need all SNPs and we need to remove the redundant SNPs to avoid bias in genetic estimations. For example, for relatedness estimation, we will use only LD-Pruned SNP set.
We can use --indep-pairwise 50 5 0.2
to filter out those in strong LD and keep only the independent SNPs.
Meaning of --indep-pairwise x y z
x
SNPsz
y
SNPs forward and repeat the procedure.Please check https://www.cog-genomics.org/plink/1.9/ld#indep for details.
Combined with the filters we just introduced, we can run:
Example
plink \\\n --bfile ${genotypeFile} \\\n --maf 0.01 \\\n --geno 0.02 \\\n --mind 0.02 \\\n --hwe 1e-6 \\\n --indep-pairwise 50 5 0.2 \\\n --out plink_results\n
This command generates two outputs: plink_results.prune.in
and plink_results.prune.out
plink_results.prune.in
is the independent set of SNPs we will use in the following analysis. You can check the PLINK log for how many variants were removed based on the filters you applied:
Total genotyping rate in remaining samples is 0.993916.\n108837 variants removed due to missing genotype data (--geno).\n--hwe: 9754 variants removed due to Hardy-Weinberg exact test.\n87149 variants removed due to minor allele threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1029376 variants and 501 people pass filters and QC.\n
Let's take a look at the LD-pruned SNP file. Basically, it just contains one SNP id per line.
head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
"},{"location":"04_Data_QC/#inbreeding-f-coefficient","title":"Inbreeding F coefficient","text":"Next, we can check the heterozygosity F of samples (https://www.cog-genomics.org/plink/1.9/basic_stats#ibc) :
-het
option will compute observed and expected autosomal homozygous genotype counts for each sample. Usually, we need to exclude individuals with high or low heterozygosity coefficients, which suggests that the sample might be contaminated.
Inbreeding F coefficient calculation by PLINK
\\[F = {{O(HOM) - E(HOM)}\\over{ M - E(HOM)}}\\]High F may indicate a relatively high level of inbreeding.
Low F may suggest the sample DNA was contaminated.
Performing LD-pruning beforehand since these calculations do not take LD into account.
Calculate inbreeding F coefficient
plink \\\n --bfile ${genotypeFile} \\\n --extract plink_results.prune.in \\\n --het \\\n --out plink_results\n
Check the output:
head plink_results.het\n FID IID O(HOM) E(HOM) N(NM) F\nHG00403 HG00403 180222 1.796e+05 217363 0.01698\nHG00404 HG00404 180127 1.797e+05 217553 0.01023\nHG00406 HG00406 178891 1.789e+05 216533 -0.0001138\nHG00407 HG00407 178992 1.79e+05 216677 -0.0008034\nHG00409 HG00409 179918 1.801e+05 218045 -0.006049\nHG00410 HG00410 179782 1.801e+05 218028 -0.009268\nHG00419 HG00419 178362 1.783e+05 215849 0.001315\nHG00421 HG00421 178222 1.785e+05 216110 -0.008288\nHG00422 HG00422 178316 1.784e+05 215938 -0.0022\n
A commonly used method is to exclude samples with heterozygosity F deviating more than 3 standard deviations (SD) from the mean. Some studies used a fixed value such as +-0.15 or +-0.2.
Usually we will use only LD-pruned SNPs for the calculation of F.
We can plot the distribution of F:
Distribution of \\(F_{het}\\) in sample data
Here we use +-0.1 as the \\(F_{het}\\) threshold for convenience.
Create sample list of individuals with extreme F using awk
# only one sample\nawk 'NR>1 && $6>0.1 || $6<-0.1 {print $1,$2}' plink_results.het > high_het.sample\n
"},{"location":"04_Data_QC/#sample-snp-filtering-extractexcludekeepremove","title":"Sample & SNP filtering (extract/exclude/keep/remove)","text":"Sometimes we will use only a subset of samples or SNPs included the original dataset. In this case, we can use --extract
or --exclude
to select or exclude SNPs from analysis, --keep
or --remove
to select or exclude samples.
For --keep
or --remove
, the input is the filename of a sample FID and IID file. For --extract
or --exclude
, the input is the filename of an SNP list file.
head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
"},{"location":"04_Data_QC/#ibd-pi_hat-kinship-coefficient","title":"IBD / PI_HAT / kinship coefficient","text":"--genome
will estimate IBS/IBD. Usually, for this analysis, we need to prune our data first since the strong LD will cause bias in the results. (This step is computationally intensive)
Combined with the --extract
, we can run:
How PLINK estimates IBD
The prior probability of IBS sharing can be modeled as:
\\[P(I=i) = \\sum^{z=i}_{z=0}P(I=i|Z=z)P(Z=z)\\]So the proportion of alleles shared IBD (\\(\\hat{\\pi}\\)) can be estimated by:
\\[\\hat{\\pi} = {{P(Z=1)}\\over{2}} + P(Z=2)\\]Estimate IBD
plink \\\n --bfile ${genotypeFile} \\\n --extract plink_results.prune.in \\\n --genome \\\n --out plink_results\n
PI_HAT is the IBD estimation. Please check https://www.cog-genomics.org/plink/1.9/ibd for more details.
head plink_results.genome\n FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\nHG00403 HG00403 HG00404 HG00404 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.858562 0.3679 1.9774\nHG00403 HG00403 HG00406 HG00406 UN NA 0.9805 0.0044 0.0151 0.0173 -1 0.858324 0.8183 2.0625\nHG00403 HG00403 HG00407 HG00407 UN NA 0.9790 0.0000 0.0210 0.0210 -1 0.857794 0.8034 2.0587\nHG00403 HG00403 HG00409 HG00409 UN NA 0.9912 0.0000 0.0088 0.0088 -1 0.857024 0.2637 1.9578\nHG00403 HG00403 HG00410 HG00410 UN NA 0.9699 0.0235 0.0066 0.0184 -1 0.858194 0.6889 2.0335\nHG00403 HG00403 HG00419 HG00419 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.857643 0.8597 2.0745\nHG00403 HG00403 HG00421 HG00421 UN NA 0.9773 0.0218 0.0010 0.0118 -1 0.857276 0.2186 1.9484\nHG00403 HG00403 HG00422 HG00422 UN NA 0.9880 0.0000 0.0120 0.0120 -1 0.857224 0.8277 2.0652\nHG00403 HG00403 HG00428 HG00428 UN NA 0.9801 0.0069 0.0130 0.0164 -1 0.858162 0.9812 2.1471\n
KING-robust kinship estimator
PLINK2 uses KING-robust kinship estimator, which is more robust in the presence of population substructure. See here.
Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W. M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.
Since the samples are unrelated, we do not need to remove any samples at this step. But remember to check this for your dataset.
"},{"location":"04_Data_QC/#ld-calculation","title":"LD calculation","text":"We can also use our data to estimate the LD between a pair of SNPs.
Details on LD can be found here
--chr
option in PLINK allows us to include SNPs on a specific chromosome. To calculate LD r2 for SNPs on chr22 , we can run:
Example
plink \\\n --bfile ${genotypeFile} \\\n --chr 22 \\\n --r2 \\\n --out plink_results\n
head plink_results.ld\n CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2\n22 16069141 22:16069141:C:G 22 16071624 22:16071624:A:G 0.771226\n22 16069784 22:16069784:A:T 22 16149743 22:16149743:T:A 0.217197\n22 16069784 22:16069784:A:T 22 16150589 22:16150589:C:A 0.224992\n22 16069784 22:16069784:A:T 22 16159060 22:16159060:G:A 0.2289\n22 16149743 22:16149743:T:A 22 16150589 22:16150589:C:A 0.965109\n22 16149743 22:16149743:T:A 22 16152606 22:16152606:T:C 0.692157\n22 16149743 22:16149743:T:A 22 16159060 22:16159060:G:A 0.721796\n22 16149743 22:16149743:T:A 22 16193549 22:16193549:C:T 0.336477\n22 16149743 22:16149743:T:A 22 16212542 22:16212542:C:T 0.442424\n
"},{"location":"04_Data_QC/#data-management-make-bedrecode","title":"Data management (make-bed/recode)","text":"By far the input data we use is in binary form, but sometimes we may want the text version.
Info
To convert the formats, we can run:
Convert PLINK formats
#extract the pruned SNPs and make a new bed file.\nplink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --make-bed \\\n    --out plink_1000_pruned\n\n#convert the bed/bim/fam to ped/map\nplink \\\n    --bfile plink_1000_pruned \\\n    --recode \\\n    --out plink_1000_pruned\n
"},{"location":"04_Data_QC/#apply-all-the-filters-to-obtain-a-clean-dataset","title":"Apply all the filters to obtain a clean dataset","text":"We can then apply the filters and remove samples with high \\(F_{het}\\) to get a clean dataset for later use.
plink \\\n --bfile ${genotypeFile} \\\n --maf 0.01 \\\n --geno 0.02 \\\n --mind 0.02 \\\n --hwe 1e-6 \\\n --remove high_het.sample \\\n --keep-allele-order \\\n --make-bed \\\n --out sample_data.clean\n
1224104 variants and 500 people pass filters and QC.\n
-rw-r--r-- 1 yunye yunye 146M Dec 26 15:40 sample_data.clean.bed\n-rw-r--r-- 1 yunye yunye 39M Dec 26 15:40 sample_data.clean.bim\n-rw-r--r-- 1 yunye yunye 13K Dec 26 15:40 sample_data.clean.fam\n
"},{"location":"04_Data_QC/#other-common-qc-steps-not-included-in-this-tutorial","title":"Other common QC steps not included in this tutorial","text":"Learn the meaning of each QC step.
Visualize the results of QC (using Python or R)
PCA aims to find the orthogonal directions of maximum variance and project the data onto a new subspace with equal or fewer dimensions than the original one. Simply speaking, GRM (genetic relationship matrix; covariance matrix) is first estimated and then PCA is applied to this matrix to generate eigenvectors and eigenvalues. Finally, the \\(k\\) eigenvectors with the largest eigenvalues are used to transform the genotypes to a new feature subspace.
Genetic relationship matrix (GRM)
Citation: Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.
A simple PCA
Source data:
cov = np.array([[6, -3], [-3, 3.5]])\npts = np.random.multivariate_normal([0, 0], cov, size=800)\n
The red arrow shows the first principal component axis (PC1) and the blue arrow shows the second principal component axis (PC2). The two axes are orthogonal.
Interpretation of PCs
The first principal component of a set of p variables, presumed to be jointly normally distributed, is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p iterations until all the variance is explained.
PCA is by far the most commonly used dimension reduction approach used in population genetics which could identify the difference in ancestry among the sample individuals. The population outliers could be excluded from the main cluster. For GWAS we also need to include top PCs to adjust for the population stratification.
Please read the following paper on how we apply PCA to genetic data: Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904\u2013909 (2006). https://doi.org/10.1038/ng1847 https://www.nature.com/articles/ng1847
So before association analysis, we will learn how to run PCA analysis first.
PCA workflow
"},{"location":"05_PCA/#preparation","title":"Preparation","text":""},{"location":"05_PCA/#exclude-snps-in-high-ld-or-hla-regions","title":"Exclude SNPs in high-LD or HLA regions","text":"For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data.
The reason why we want to exclude such high-LD or HLA regions
You can simply copy the list of high-LD or HLA regions in Genome build version(.bed format) to a text file high-ld.txt
.
High LD regions were obtained from
https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)
High LD regions of hg19
high-ld-hg19.txt1 48000000 52000000 highld\n2 86000000 100500000 highld\n2 134500000 138000000 highld\n2 183000000 190000000 highld\n3 47500000 50000000 highld\n3 83500000 87000000 highld\n3 89000000 97500000 highld\n5 44500000 50500000 highld\n5 98000000 100500000 highld\n5 129000000 132000000 highld\n5 135500000 138500000 highld\n6 25000000 35000000 highld\n6 57000000 64000000 highld\n6 140000000 142500000 highld\n7 55000000 66000000 highld\n8 7000000 13000000 highld\n8 43000000 50000000 highld\n8 112000000 115000000 highld\n10 37000000 43000000 highld\n11 46000000 57000000 highld\n11 87500000 90500000 highld\n12 33000000 40000000 highld\n12 109500000 112000000 highld\n20 32000000 34500000 highld\n
"},{"location":"05_PCA/#create-a-list-of-snps-in-high-ld-or-hla-regions","title":"Create a list of SNPs in high-LD or HLA regions","text":"Next, use high-ld.txt
to extract all SNPs that are located in the regions described in the file using the code as follows:
plink --file ${plinkFile} --make-set high-ld.txt --write-set --out hild\n
Create a list of SNPs in the regions specified in high-ld.txt
plinkFile=\"../04_Data_QC/sample_data.clean\"\n\nplink \\\n --bfile ${plinkFile} \\\n --make-set high-ld-hg19.txt \\\n --write-set \\\n --out hild\n
And all SNPs in the regions will be extracted to hild.set.
$head hild.set\nhighld\n1:48000156:C:G\n1:48002096:C:G\n1:48003081:T:C\n1:48004776:C:T\n1:48006500:A:G\n1:48006546:C:T\n1:48008102:T:G\n1:48009994:C:T\n1:48009997:C:A\n
For downstream analysis, we can exclude these SNPs using --exclude hild.set
.
Steps to perform a typical genomic PCA analysis
MAF filter for LD-pruning and PCA
For LD-pruning and PCA, we usually only use variants with MAF > 0.01 or MAF>0.05 ( --maf 0.01
or --maf 0.05
) for robust estimation.
Sample codes for performing PCA
plinkFile=\"\" #please set this to your own path\noutPrefix=\"plink_results\"\nthreadnum=2\nhildset = hild.set \n\n# LD-pruning, excluding high-LD and HLA regions\nplink2 \\\n --bfile ${plinkFile} \\\n --maf 0.01 \\\n --threads ${threadnum} \\\n --exclude ${hildset} \\ \n --indep-pairwise 500 50 0.2 \\\n --out ${outPrefix}\n\n# Remove related samples using king-cuttoff\nplink2 \\\n --bfile ${plinkFile} \\\n --extract ${outPrefix}.prune.in \\\n --king-cutoff 0.0884 \\\n --threads ${threadnum} \\\n --out ${outPrefix}\n\n# PCA after pruning and removing related samples\nplink2 \\\n --bfile ${plinkFile} \\\n --keep ${outPrefix}.king.cutoff.in.id \\\n --extract ${outPrefix}.prune.in \\\n --freq counts \\\n --threads ${threadnum} \\\n --pca approx allele-wts 10 \\ \n --out ${outPrefix}\n\n# Projection (related and unrelated samples)\nplink2 \\\n --bfile ${plinkFile} \\\n --threads ${threadnum} \\\n --read-freq ${outPrefix}.acount \\\n --score ${outPrefix}.eigenvec.allele 2 5 header-read no-mean-imputation variance-standardize \\\n --score-col-nums 6-15 \\\n --out ${outPrefix}_projected\n
--pca
and --pca approx
For step 3, please note that approx
flag is only recommended for analysis of >5000 samples. (It was applied in the sample code anyway because in real analysis you usually have a much larger sample size, though the sample size of our data is just ~500)
After step 3, the allele-wts 10
modifier requests an additional one-line-per-allele .eigenvec.allele
file with the first 10 PCs
expressed as allele weights instead of sample weights.
We will get the plink_results.eigenvec.allele
file, which will be used to project onto all samples along with an allele count plink_results.acount
file.
In the projection, score ${outPrefix}.eigenvec.allele 2 5
sets the ID
(2nd column) and A1
(5th column), score-col-nums 6-15
sets the first 10 PCs to be projected.
Please check https://www.cog-genomics.org/plink/2.0/score#pca_project for more details on the projection.
Allele weight and count files
plink_results.eigenvec.allele#CHROM ID REF ALT PROVISIONAL_REF? A1 PC1 PC2 PC3 PC4 PC5 PC6 PC7PC8 PC9 PC10\n1 1:15774:G:A G A Y G 0.57834 -1.03002 0.744557 -0.161887 0.389223 -0.0514592 0.133195 -0.0336162 -0.846376 0.0542876\n1 1:15774:G:A G A Y A -0.57834 1.03002 -0.744557 0.161887 -0.389223 0.0514592 -0.133195 0.0336162 0.846376 -0.0542876\n1 1:15777:A:G A G Y A -0.585215 0.401872 -0.393071 -1.79583 0.89579 -0.700882 -0.103729 -0.694495 -0.007313 0.513223\n1 1:15777:A:G A G Y G 0.585215 -0.401872 0.393071 1.79583 -0.89579 0.700882 0.103729 0.694495 0.007313 -0.513223\n1 1:57292:C:T C T Y C -0.123768 0.912046 -0.353606 -0.220148 -0.893017 -0.374505 -0.141002 -0.249335 0.625097 0.206104\n1 1:57292:C:T C T Y T 0.123768 -0.912046 0.353606 0.220148 0.893017 0.374505 0.141002 0.249335 -0.625097 -0.206104\n1 1:77874:G:A G A Y G 1.49202 -1.12567 1.19915 0.0755314 0.401134 -0.015842 0.0452086 0.273072 -0.00716098 0.237545\n1 1:77874:G:A G A Y A -1.49202 1.12567 -1.19915 -0.0755314 -0.401134 0.015842 -0.0452086 -0.273072 0.00716098 -0.237545\n1 1:87360:C:T C T Y C -0.191803 0.600666 -0.513208 -0.0765155 -0.656552 0.0930399 -0.0238774 -0.330449 -0.192037 -0.727729\n
plink_results.acount#CHROM ID REF ALT PROVISIONAL_REF? ALT_CTS OBS_CT\n1 1:15774:G:A G A Y 28 994\n1 1:15777:A:G A G Y 73 994\n1 1:57292:C:T C T Y 104 988\n1 1:77874:G:A G A Y 19 994\n1 1:87360:C:T C T Y 23 998\n1 1:125271:C:T C T Y 967 996\n1 1:232449:G:A G A Y 185 996\n1 1:533113:A:G A G Y 129 992\n1 1:565697:A:G A G Y 334 996\n
Eventually, we will get the PCA results for all samples.
PCA results for all samples
plink_results_projected.sscore#FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG\nHG00403 HG00403 390256 390256 0.00290265 -0.0248649 0.0100408 0.00957591 0.00694349 -0.00222251 0.0082228 -0.00114937 0.00335249 0.00437471\nHG00404 HG00404 390696 390696 -0.000141221 -0.027965 0.025389 -0.00582538 -0.00274707 0.00658501 0.0113803 0.0077766 0.0159976 0.0178927\nHG00406 HG00406 388524 388524 0.00707397 -0.0315445 -0.00437011 -0.0012621 -0.0114932 -0.00539483 -0.00620153 0.00452379 -0.000870627 -0.00227979\nHG00407 HG00407 388808 388808 0.00683977 -0.025073 -0.00652723 0.00679729 -0.0116 -0.0102328 0.0139572 0.00618677 0.0138063 0.00825269\nHG00409 HG00409 391646 391646 0.000398695 -0.0290334 -0.0189352 -0.00135977 0.0290436 0.00942829 -0.0171194 -0.0129637 0.0253596 0.022907\nHG00410 HG00410 391600 391600 0.00277094 -0.0280021 -0.0209991 -0.00799085 0.0318038 -0.00284209 -0.031517 -0.0010026 0.0132541 0.0357565\nHG00419 HG00419 387118 387118 0.00684154 -0.0326244 0.00237159 0.0167284 -0.0119737 -0.0079637 -0.0144339 0.00712756 0.0114292 0.00404426\nHG00421 HG00421 387720 387720 0.00157095 -0.0338115 -0.00690541 0.0121058 0.00111378 0.00530794 -0.0017545 -0.00121793 0.00393407 0.00414204\nHG00422 HG00422 387466 387466 0.00439167 -0.0332386 0.000741526 0.0124843 -0.00362248 -0.00343393 -0.00735112 0.00944759 -0.0107516 0.00376537\n
"},{"location":"05_PCA/#plotting-the-pcs","title":"Plotting the PCs","text":"You can now create scatterplots of the PCs using R or Python.
For plotting using Python: plot_PCA.ipynb
Scatter plot of PC1 and PC2 using 1KG EAS individuals
Note : We only used a small proportion of all available variants. This figure only very roughly shows the population structure in East Asia.
Requirements: - python>3 - numpy,pandas,seaborn,matplotlib
"},{"location":"05_PCA/#pca-umap","title":"PCA-UMAP","text":"(optional) We can also apply another non-linear dimension reduction algorithm called UMAP to the PCs to further identify the local structures. (PCA-UMAP)
For more details, please check: - https://umap-learn.readthedocs.io/en/latest/index.html
An example of PCA and PCA-UMAP for population genetics: - Sakaue, S., Hirata, J., Kanai, M., Suzuki, K., Akiyama, M., Lai Too, C., ... & Okada, Y. (2020). Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nature communications, 11(1), 1-11.
"},{"location":"05_PCA/#references","title":"References","text":"To test the association between a phenotype and genotypes, we need to group the genotypes based on genetic models.
There are three basic genetic models:
Three genetic models
For example, suppose we have a biallelic SNP whose reference allele is A and the alternative allele is G.
There are three possible genotypes for this SNP: AA, AG, and GG.
This table shows how we group different genotypes under each genetic model for association tests using linear or logistic regressions.
Genetic models AA AG GG Additive model 0 1 2 Dominant model 0 1 1 Recessive model 0 0 1Contingency table and non-parametric tests
A simple way to test association is to use the 2x2 or 2x3 contingency table. For dominant and recessive models, Chi-square tests are performed using the 2x2 table. For the additive model, Cochran-Armitage trend tests are performed for the 2x3 table. However, the non-parametric tests do not adjust for the bias caused by other covariates like sex, age and so forth.
"},{"location":"06_Association_tests/#association-testing-basics","title":"Association testing basics","text":"For quantitative traits, we can employ a simple linear regression model to test associations:
\\[ y = G\\beta_G + X\\beta_X + e \\]Interpretation of linear regression
For binary traits, we can utilize the logistic regression model to test associations:
\\[ logit(p) = G\\beta_G + X\\beta_X + e \\]Linear regression and logistic regression
"},{"location":"06_Association_tests/#file-preparation","title":"File Preparation","text":"To perform genome-wide association tests, usually, we need the following files:
Phenotype and covariate files
Phenotype file for a simulated binary trait; B1 is the phenotype name; 1 means the control, 2 means the case.
1kgeas_binary.txtFID IID B1\nHG00403 HG00403 1\nHG00404 HG00404 2\nHG00406 HG00406 1\nHG00407 HG00407 1\nHG00409 HG00409 2\nHG00410 HG00410 2\nHG00419 HG00419 1\nHG00421 HG00421 1\nHG00422 HG00422 1\n\nCovariate file (only top PCs calculated in the previous PCA section)\n\n```txt title=\"plink_results_projected.sscore\"\n#FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVGPC9_AVG PC10_AVG\nHG00403 HG00403 390256 390256 0.00290265 -0.0248649 -0.0100407 0.00957595 0.00694056 0.00222996 0.00823028 0.00116497 -0.00334937 0.00434627\nHG00404 HG00404 390696 390696 -0.000141221 -0.027965 -0.025389 -0.00582553 -0.00274711 -0.00657958 0.0113769 -0.00778919 -0.0159685 0.0180678\nHG00406 HG00406 388524 388524 0.00707397 -0.0315445 0.00437013 -0.00126195 -0.0114938 0.00538932 -0.00619657 -0.00454686 0.000969112 -0.00217617\nHG00407 HG00407 388808 388808 0.00683977 -0.025073 0.00652723 0.00679731 -0.0116001 0.0102403 0.0139674 -0.00621948 -0.013797 0.00827744\nHG00409 HG00409 391646 391646 0.000398695 -0.0290334 0.0189352 -0.00135996 0.0290464 -0.00941851 -0.0171911 0.01293 -0.0252628 0.0230819\nHG00410 HG00410 391600 391600 0.00277094 -0.0280021 0.0209991 -0.00799089 0.0318043 0.00283456 -0.0315157 0.000978664 -0.0133768 0.0356721\nHG00419 HG00419 387118 387118 0.00684154 -0.0326244 -0.00237159 0.0167284 -0.0119684 0.00795149 -0.0144241 -0.00716183 -0.0115059 0.0038652\nHG00421 HG00421 387720 387720 0.00157095 -0.0338115 0.00690542 0.0121058 0.00111448 -0.00531714 -0.00175494 0.00118513 -0.00391494 0.00414682\nHG00422 HG00422 387466 387466 0.00439167 -0.0332386 -0.000741482 0.0124843 -0.00362885 0.00342491 -0.0073205 -0.00939123 0.010718 0.00360906\n
"},{"location":"06_Association_tests/#association-tests-using-plink","title":"Association tests using PLINK","text":"Please check https://www.cog-genomics.org/plink/2.0/assoc for more details.
We will perform logistic regression with firth correction for a simulated binary trait under the additive model using the 1KG East Asian individuals.
Firth correction
Adding a penalty term to the log-likelihood function when fitting the logistic model results in less bias. - Firth, David. \"Bias reduction of maximum likelihood estimates.\" Biometrika 80.1 (1993): 27-38.
Quantitative traits
For quantitative traits, linear regressions will be performed and in this case, we do not need to add firth
(since Firth correction is not appliable).
Sample codes for association test using plink for binary traits
genotypeFile=\"../04_Data_QC/sample_data.clean\" # the clean dataset we generated in previous section\nphenotypeFile=\"../01_Dataset/1kgeas_binary.txt\" # the phenotype file\ncovariateFile=\"../05_PCA/plink_results_projected.sscore\" # the PC score file\n\ncovariateCols=6-10\ncolName=\"B1\"\nthreadnum=2\n\nplink2 \\\n --bfile ${genotypeFile} \\\n --pheno ${phenotypeFile} \\\n --pheno-name ${colName} \\\n --maf 0.01 \\\n --covar ${covariateFile} \\\n --covar-col-nums ${covariateCols} \\\n --glm hide-covar firth firth-residualize single-prec-cc \\\n --threads ${threadnum} \\\n --out 1kgeas\n
Note
Using the latest version of PLINK2, you need to add firth-residualize single-prec-cc
to generate the results. (The algorithm and precision have been changed since 2023 for firth regression)
You will see a similar log like:
Log
1kgeas.logPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to 1kgeas.log.\nOptions in effect:\n--bfile ../04_Data_QC/sample_data.clean\n--covar ../05_PCA/plink_results_projected.sscore\n--covar-col-nums 6-10\n--glm hide-covar firth firth-residualize single-prec-cc\n--maf 0.01\n--out 1kgeas\n--pheno ../01_Dataset/1kgeas_binary.txt\n--pheno-name B1\n--threads 2\n\nStart time: Tue Dec 26 15:52:10 2023\n31934 MiB RAM detected, ~30479 available; reserving 15967 MiB for main\nworkspace.\nUsing up to 2 compute threads.\n500 samples (0 females, 0 males, 500 ambiguous; 500 founders) loaded from\n../04_Data_QC/sample_data.clean.fam.\n1224104 variants loaded from ../04_Data_QC/sample_data.clean.bim.\n1 binary phenotype loaded (248 cases, 250 controls).\n5 covariates loaded from ../05_PCA/plink_results_projected.sscore.\nCalculating allele frequencies... done.\n95372 variants removed due to allele frequency threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1128732 variants remaining after main filters.\n--glm Firth regression on phenotype 'B1': done.\nResults written to 1kgeas.B1.glm.firth .\nEnd time: Tue Dec 26 15:53:49 2023\n
Let's check the first lines of the output:
Association test results
1kgeas.B1.glm.firth #CHROM POS ID REF ALT PROVISIONAL_REF? A1 OMITTED A1_FREQ TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 15774 1:15774:G:A G A Y A G 0.0282828 ADD 495 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 15777 1:15777:A:G A G Y G A 0.0737374 ADD 495 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 57292 1:57292:C:T C T Y T C 0.104675 ADD 492 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 77874 1:77874:G:A G A Y A G 0.0191532 ADD 496 1.12228 0.46275 0.249299 0.80313 .\n1 87360 1:87360:C:T C T Y T C 0.0231388 ADD 497 NA NA NA NA FIRTH_CONVERGE_FAIL\n1 125271 1:125271:C:T C T Y C T 0.0292339 ADD 496 1.53387 0.373358 1.1458 0.25188 .\n1 232449 1:232449:G:A G A Y A G 0.185484 ADD 496 0.884097 0.168961 -0.729096 0.465943 .\n1 533113 1:533113:A:G A G Y G A 0.129555 ADD 494 0.90593 0.196631 -0.50243 0.615365 .\n1 565697 1:565697:A:G A G Y G A 0.334677 ADD 496 1.04653 0.15286 0.297509 0.766078 .\n
Usually, other options are added to enhance the sumstats
cols=
requests the following columns in the sumstats: here are allele1 frequency and (MaCH)Rsq, firth-fallback
will test the common variants without firth correction, which could improve the speed, omit-ref
will force the ALT==A1==effect allele, otherwise the minor allele would be tested (see the above result, which ALT may not equal A1).Genomic control (GC) is a basic method for controlling for confounding factors including population stratification.
We will calculate the genomic control factor (lambda GC) to evaluate the inflation. The genomic control factor is calculated by dividing the median of observed Chi square statistics by the median of Chi square distribution with the degree of freedom being 1 (which is approximately 0.455).
\\[ \\lambda_{GC} = {median(\\chi^{2}_{observed}) \\over median(\\chi^{2}_1)} \\]Then, we can used the genomic control factor to correct observed Chi suqare statistics.
\\[ \\chi^{2}_{corrected} = {\\chi^{2}_{observed} \\over \\lambda_{GC}} \\]Genomic inflation is based on the idea that most of the variants are not associated, thus no deviation between the observed and expected Chi square distribution, except the spikes at the end. However, if the trait is highly polygenic, this assumption may be violated.
Reference: Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.
"},{"location":"06_Association_tests/#significant-loci","title":"Significant loci","text":"Please check Visualization using gwaslab
Loci that reached genome-wide significance threshold (P value < 5e-8) :
SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT\n1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A\n2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T\n7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G\n20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C\n
Warning
This is just to show the analysis pipeline and data format. The trait was simulated under an unreal condition (effect sizes are extremely large) so the result is meaningless here.
Allele frequency and Effect size
"},{"location":"06_Association_tests/#visualization","title":"Visualization","text":"To visualize the sumstats, we will create the Manhattan plot, QQ plot and regional plot.
Please check for codes : Visualization using gwaslab
"},{"location":"06_Association_tests/#manhattan-plot","title":"Manhattan plot","text":"Manhattan plot is the most classic visualization of GWAS summary statistics. It is a form of scatter plot. Each dot represents the test result for a variant. variants are sorted by their genome coordinates and are aligned along the X axis. Y axis shows the -log10(P value) for tests of variants in GWAS.
Note
This kind of plot was named after Manhattan in New York City since it resembles the Manhattan skyline.
A real Manhattan plot
I took this photo in 2020 just before the COVID-19 pandemic. It was a cloudy and misty day. Those birds formed a significance threshold line. And the skyscrapers above that line resembled the significant signals in your GWAS. I believe you could easily get how the GWAS Manhattan plot was named.
Data we need from sumstats to create Manhattan plots:
Steps to create Manhattan plot
Quantile-quantile plot (also known as Q-Q plot), is commonly used to compare an observed distribution with its expected distribution. For a specific point (x,y) on Q-Q plot, its y coordinate corresponds to one of the quantiles of the observed distribution, while its x coordinate corresponds to the same quantile of the expected distribution.
Quantile-quantile plot is used to check if there is any significant inflation in P value distribution, which usually indicates population stratification or cryptic relatedness.
Data we need from sumstats to create the Manhattan plot:
Steps to create Q-Q plot
Suppose we have n
variants in our sumstats,
n
P value to -log10(P).n
numbers from (0,1)
with equal intervals.n
numbers to -log10(P) and sort in ascending order.Note
The expected distribution of P value is a Uniform distribution from 0 to 1.
\\[P_{expected} \\sim U(0,1)\\]"},{"location":"06_Association_tests/#regional-plot","title":"Regional plot","text":"Manhattan plot is very useful to check the overview of our sumstats. But if we want to check a specific genomic locus, we need a plot with finer resolution. This kind of plot is called a regional plot. It is basically the Manhattan plot of only a small region on the genome, with points colored by its LD r2 with the lead variant in this region.
Such a plot is especially helpful to understand the signal and loci, e.g., LD structure, independent signals, and genes.
The regional plot for the loci of 2:55513738:C:T.
Please check Visualization using gwaslab
"},{"location":"06_Association_tests/#gwas-ssf","title":"GWAS-SSF","text":"To standardize the format of GWAS summary statistics for sharing, GWAS-SSF format was proposed in 2022. This format is now used as the standard format for GWAS Catalog.
GWAS-SSF consists of :
Schematic representation of GWAS-SSF data file
GWAS-SSF
Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv, 2022-07.
For details, please check:
ANNOVAR is a simple and efficient command line tool for variant annotation.
In this tutorial, we will use ANNOVAR to annotate the variants in our summary statistics (hg19).
"},{"location":"07_Annotation/#install","title":"Install","text":"Download ANNOVAR from here (registration required; freely available to personal, academic and non-profit use only.)
You will receive an email with the download link after registration. Download it and decompress:
tar -xvzf annovar.latest.tar.gz\n
For refGene annotation for hg19, we do not need to download additional files.
"},{"location":"07_Annotation/#format-input-file","title":"Format input file","text":"The default input file for ANNOVAR is a 1-based coordinate file.
We will only use the first 100000 variants as an example.
annovar_input
awk 'NR>1 && NR<100000 {print $1,$2,$2,$4,$5}' ../06_Association_tests/1kgeas.B1.glm.logistic. hybrid > annovar_input.txt\n
head annovar_input.txt \n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
With the -vcfinput option, ANNOVAR can accept input files in VCF format.
Annotate the variants with gene information.
A minimal example of annotation using refGene
input=annovar_input.txt\nhumandb=/home/he/tools/annovar/annovar/humandb\ntable_annovar.pl ${input} ${humandb} -buildver hg19 -out myannotation -remove -protocol refGene -operation g -nastring . -polish\n
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange. refGene\n1 13273 13273 G C ncRNA_exonic DDX11L1;LOC102725121 . . .\n1 14599 14599 T A ncRNA_exonic WASH7P . . .\n1 14604 14604 A G ncRNA_exonic WASH7P . . .\n1 14930 14930 A G ncRNA_intronic WASH7P . . .\n1 69897 69897 T C exonic OR4F5 . synonymous SNV OR4F5:NM_001005484:exon1:c.T807C:p.S269S\n1 86331 86331 A G intergenic OR4F5;LOC729737 dist=16323;dist=48442 . .\n1 91581 91581 G A intergenic OR4F5;LOC729737 dist=21573;dist=43192 . .\n1 122872 122872 T G intergenic OR4F5;LOC729737 dist=52864;dist=11901 . .\n1 135163 135163 C T ncRNA_exonic LOC729737 . . .\n
"},{"location":"07_Annotation/#additional-databases","title":"Additional databases","text":"ANNOVAR supports a wide range of commonly used databases including dbsnp
, dbnsfp
, clinvar
, gnomad
, 1000g
, cadd
and so forth. For details, please check ANNOVAR's official documents
You can check the Table Name listed in the link above and download the database you need using the following command.
Example: Downloading avsnp150 for hg19 from ANNOVAR
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb/\n
An example of annotation using multiple databases
# input file is in vcf format\ntable_annovar.pl \\\n    ${in_vcf} \\\n    ${humandb} \\\n    -buildver hg19 \\\n    -protocol refGene,avsnp150,clinvar_20200316,gnomad211_exome \\\n    -operation g,f,f,f \\\n    -remove \\\n    -out ${out_prefix} \\\n    -vcfinput\n
"},{"location":"07_Annotation/#vep-under-construction","title":"VEP (under construction)","text":""},{"location":"07_Annotation/#install_1","title":"Install","text":"git clone https://github.com/Ensembl/ensembl-vep.git\ncd ensembl-vep\nperl INSTALL.pl\n
Hello! This installer is configured to install v108 of the Ensembl API for use by the VEP.\nIt will not affect any existing installations of the Ensembl API that you may have.\n\nIt will also download and install cache files from Ensembl's FTP server.\n\nChecking for installed versions of the Ensembl API...done\n\nSetting up directories\nDestination directory ./Bio already exists.\nDo you want to overwrite it (if updating VEP this is probably OK) (y/n)? y\n - fetching BioPerl\n - unpacking ./Bio/tmp/release-1-6-924.zip\n - moving files\n\nDownloading required Ensembl API files\n - fetching ensembl\n - unpacking ./Bio/tmp/ensembl.zip\n - moving files\n - getting version information\n - fetching ensembl-variation\n - unpacking ./Bio/tmp/ensembl-variation.zip\n - moving files\n - getting version information\n - fetching ensembl-funcgen\n - unpacking ./Bio/tmp/ensembl-funcgen.zip\n - moving files\n - getting version information\n - fetching ensembl-io\n - unpacking ./Bio/tmp/ensembl-io.zip\n - moving files\n - getting version information\n\nTesting VEP installation\n - OK!\n\nThe VEP can either connect to remote or local databases, or use local cache files.\nUsing local cache files is the fastest and most efficient way to run the VEP\nCache files will be stored in /home/he/.vep\nDo you want to install any cache files (y/n)? y\n\nThe following species/files are available; which do you want (specify multiple separated by spaces or 0 for all): \n1 : acanthochromis_polyacanthus_vep_108_ASM210954v1.tar.gz (69 MB)\n2 : accipiter_nisus_vep_108_Accipiter_nisus_ver1.0.tar.gz (55 MB)\n...\n466 : homo_sapiens_merged_vep_108_GRCh37.tar.gz (16 GB)\n467 : homo_sapiens_merged_vep_108_GRCh38.tar.gz (26 GB)\n468 : homo_sapiens_refseq_vep_108_GRCh37.tar.gz (13 GB)\n469 : homo_sapiens_refseq_vep_108_GRCh38.tar.gz (22 GB)\n470 : homo_sapiens_vep_108_GRCh37.tar.gz (14 GB)\n471 : homo_sapiens_vep_108_GRCh38.tar.gz (22 GB)\n\n Total: 221 GB for all 471 files\n\n? 470\n - downloading https://ftp.ensembl.org/pub/release-108/variation/indexed_vep_cache/homo_sapiens_vep_108_GRCh37.tar.gz\n
"},{"location":"08_LDSC/","title":"LD score regression","text":""},{"location":"08_LDSC/#table-of-contents","title":"Table of Contents","text":"LDSC is one of the most commonly used command line tool to estimate inflation, hertability, genetic correlation and cell/tissue type specificity from GWAS summary statistics.
"},{"location":"08_LDSC/#ld-linkage-disequilibrium","title":"LD: Linkage disequilibrium","text":"Linkage disequilibrium (LD) : non-random association of alleles at different loci in a given population. (Wiki)
"},{"location":"08_LDSC/#ld-score","title":"LD score","text":"LD score \\(l_j\\) for a SNP \\(j\\) is defined as the sum of \\(r^2\\) for the SNP and other SNPs in a region.
\[ l_j= \Sigma_k{r^2_{j,k}} \]"},{"location":"08_LDSC/#ld-score-regression_1","title":"LD score regression","text":"Key idea: A variant will have a higher test statistic if it is in LD with a causal variant, and the elevation is proportional to its correlation ( \(r^2\) ) with the causal variant.
\\[ E[\\chi^2|l_j] = {{Nh^2l_j}\\over{M}} + Na + 1 \\]For more details of LD score regression, please refer to : - Bulik-Sullivan, Brendan K., et al. \"LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.\" Nature genetics 47.3 (2015): 291-295.
"},{"location":"08_LDSC/#install-ldsc","title":"Install LDSC","text":"LDSC can be downloaded from github (GPL-3.0 license): https://github.com/bulik/ldsc
For ldsc, we need Anaconda to create a virtual environment (ldsc requires Python 2). If you haven't installed Anaconda, please check how to install anaconda.
# change to your directory for tools\ncd ~/tools\n\n# clone the ldsc github repository\ngit clone https://github.com/bulik/ldsc.git\n\n# create a virtual environment for ldsc (python2)\ncd ldsc\nconda env create --file environment.yml \n\n# activate ldsc environment\nconda activate ldsc\n
"},{"location":"08_LDSC/#data-preparation","title":"Data Preparation","text":"In this tutoial, we will use sample summary statistics for HDLC and LDLC from Jenger. - Kanai, Masahiro, et al. \"Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases.\" Nature genetics 50.3 (2018): 390-400.
The Miami plot for the two traits:
"},{"location":"08_LDSC/#download-sample-summary-statistics","title":"Download sample summary statistics","text":"# HDL-c and LDL-c in Biobank Japan\nwget -O BBJ_LDLC.txt.gz http://jenger.riken.jp/61analysisresult_qtl_download/\nwget -O BBJ_HDLC.txt.gz http://jenger.riken.jp/47analysisresult_qtl_download/\n
"},{"location":"08_LDSC/#download-reference-files","title":"Download reference files","text":"# change to your ldsc directory\ncd ~/tools/ldsc\nmkdir resource\ncd ./resource\n\n# snplist\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2\n\n# EAS ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/eas_ldscores.tar.bz2\n\n# EAS weight\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz\n\n# EAS frequency\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_plinkfiles.tgz\n\n# EAS baseline model\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz\n\n# Cell type ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/LDSC_SEG_ldscores/Cahoy_EAS_1000Gv3_ldscores.tar.gz\n
You can then decompress the files and organize them."},{"location":"08_LDSC/#munge-sumstats","title":"Munge sumstats","text":"Before the analysis, we need to format and clean the raw sumstats.
Note
Rsid is used here. If the sumstats only contained id like CHR:POS:REF:ALT, annotate it first.
snplist=~/tools/ldsc/resource/w_hm3.snplist\nmunge_sumstats.py \\\n --sumstats BBJ_HDLC.txt.gz \\\n --merge-alleles $snplist \\\n --a1 ALT \\\n --a2 REF \\\n --chunksize 500000 \\\n --out BBJ_HDLC\nmunge_sumstats.py \\\n --sumstats BBJ_LDLC.txt.gz \\\n --a1 ALT \\\n --a2 REF \\\n --chunksize 500000 \\\n --merge-alleles $snplist \\\n --out BBJ_LDLC\n
After munging, you will get two munged and formatted files:
BBJ_HDLC.sumstats.gz\nBBJ_LDLC.sumstats.gz\n
And these are the files we will use to run LD score regression."},{"location":"08_LDSC/#ld-score-regression_2","title":"LD score regression","text":"Univariate LD score regression is utilized to estimate heritbility and confuding factors (cryptic relateness and population stratification) of a certain trait.
Using the munged sumstats, we can run:
ldsc.py \\\n --h2 BBJ_HDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_HDLC\n\nldsc.py \\\n --h2 BBJ_LDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_LDLC\n
Lest's check the results for HDLC:
cat BBJ_HDLC.log\n*********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--h2 BBJ_HDLC.sumstats.gz \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Sat Dec 24 20:40:34 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nUsing two-step estimator with cutoff at 30.\nTotal Observed scale h2: 0.1583 (0.0281)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.0563 (0.0114)\nRatio: 0.1981 (0.0402)\nAnalysis finished at Sat Dec 24 20:40:41 2022\nTotal time elapsed: 6.57s\n
We can see that from the log:
According to LDSC documents, Ratio measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of ratio should be close to zero, though in practice values of 10-20% are not uncommon.
\\[ Ratio = {{intercept-1}\\over{mean(\\chi^2)-1}} \\]"},{"location":"08_LDSC/#distribution-of-h2-and-intercept-across-traits-in-ukb","title":"Distribution of h2 and intercept across traits in UKB","text":"The Neale Lab estimated SNP heritability using LDSC across more than 4,000 primary GWAS in UKB. You can check the distributions of SNP heritability and intercept estimates using the following link to get the idea of what you can expect from LD score regresion:
https://nealelab.github.io/UKBB_ldsc/viz_h2.html
"},{"location":"08_LDSC/#cross-trait-ld-score-regression","title":"Cross-trait LD score regression","text":"Cross-trait LD score regression is employed to estimate the genetic correlation between a pair of traits.
Key idea: replace \\chi^2
in univariate LD score regression and the relationship (SNPs with high LD ) still holds.
Then we can get the genetic correlation by :
\\[ r_g = {{\\rho_g}\\over{\\sqrt{h_1^2h_2^2}}} \\]ldsc.py \\\n --rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n --out BBJ_HDLC_LDLC\n
Let's check the results: *********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC_LDLC \\\n--rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Thu Dec 29 21:02:37 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nComputing rg for phenotype 2/2\nReading summary statistics from BBJ_LDLC.sumstats.gz ...\nRead summary statistics for 1217311 SNPs.\nAfter merging with summary statistics, 1012040 SNPs remain.\n1012040 SNPs with valid alleles.\n\nHeritability of phenotype 1\n---------------------------\nTotal Observed scale h2: 0.1054 (0.0383)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.1234 (0.0607)\nRatio: 0.4342 (0.2134)\n\nHeritability of phenotype 2/2\n-----------------------------\nTotal Observed scale h2: 0.0543 (0.0211)\nLambda GC: 1.0833\nMean Chi^2: 1.1465\nIntercept: 1.0583 (0.0335)\nRatio: 0.398 (0.2286)\n\nGenetic Covariance\n------------------\nTotal Observed scale gencov: 0.0121 (0.0106)\nMean z1*z2: -0.001\nIntercept: -0.0198 (0.0121)\n\nGenetic Correlation\n-------------------\nGenetic Correlation: 0.1601 (0.1821)\nZ-score: 0.8794\nP: 0.3792\n\n\nSummary of Genetic Correlation Results\np1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se\nBBJ_HDLC.sumstats.gz BBJ_LDLC.sumstats.gz 0.1601 0.1821 0.8794 0.3792 0.0543 0.0211 1.0583 0.0335 -0.0198 0.0121\n\nAnalysis finished at Thu Dec 29 21:02:47 2022\nTotal time elapsed: 10.39s\n
"},{"location":"08_LDSC/#partitioned-ld-regression","title":"Partitioned LD regression","text":"Partitioned LD regression is utilized to evaluate the contribution of each functional group to the total SNP heriatbility.
\\[ E[\\chi^2] = N \\sum\\limits_C \\tau_C l(j,C) + Na + 1 \\]\\(\\tau_C\\) : per-SNP contribution of category C to heritability
Reference: Finucane, Hilary K., et al. \"Partitioning heritability by functional annotation using genome-wide association summary statistics.\" Nature genetics 47.11 (2015): 1228-1235.
ldsc.py \\\n --h2 BBJ_HDLC.sumstats.gz \\\n --overlap-annot \\\n --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n --frqfile-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_plinkfiles/1000G.EAS.QC. \\\n --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n --out BBJ_HDLC_baseline\n
"},{"location":"08_LDSC/#celltype-specificity-ld-regression","title":"Celltype specificity LD regression","text":"LDSC-SEG : LD score regression applied to specifically expressed genes
An extension of Partitioned LD regression. Categories are defined by tissue or cell-type specific genes.
ldsc.py \\\n --h2-cts BBJ_HDLC.sumstats.gz \\\n --ref-ld-chr-cts ~/tools/ldsc/resource/Cahoy_EAS_1000Gv3_ldscores/Cahoy.EAS.ldcts \\\n --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n --out BBJ_HDLC_baseline_cts\n
"},{"location":"08_LDSC/#reference","title":"Reference","text":"MAGMA is one the most commonly used tools for gene-based and gene-set analysis.
Gene-level analysis in MAGMA uses two models:
1.Multiple linear principal components regression
MAGMA employs a multiple linear principal components regression, and F test to obtain P values for genes. The multiple linear principal components regression:
\\[ Y = \\alpha_{0,g} + X_g \\alpha_g + W \\beta_g + \\epsilon_g \\]\\(X_g\\) is obtained by first projecting the variant matrix of a gene onto its PC, and removing PCs with samll eigenvalues.
Note
The linear principal components regression model requires raw genotype data.
2.SNP-wise models
2. SNP-wise models
SNP-wise Mean: performs tests on the mean SNP association
SNP-wise models use summary statistics and reference LD panel
Gene-set analysis
Quote
Competitive gene-set analysis tests whether the genes in a gene-set are more strongly associated with the phenotype of interest than other genes.
P values for each gene were converted to Z scores to perform gene-set level analysis.
\\[ Z = \\beta_{0,S} + S_S \\beta_S + \\epsilon \\]Dowload MAGMA for your operating system from the following url:
MAGMA: https://ctg.cncr.nl/software/magma
For example:
cd ~/tools\nmkdir MAGMA\ncd MAGMA\nwget https://ctg.cncr.nl/software/MAGMA/prog/magma_v1.10.zip\nunzip magma_v1.10.zip\n
Add magma to your environment path. Test if it is successfully installed.
$ magma --version\nMAGMA version: v1.10 (linux)\n
"},{"location":"09_Gene_based_analysis/#download-reference-files","title":"Download reference files","text":"We nedd the following reference files:
The gene location files and LD reference panel can be downloaded from magma website.
-> https://ctg.cncr.nl/software/magma
The third one can be downloaded form MsigDB.
-> https://www.gsea-msigdb.org/gsea/msigdb/
"},{"location":"09_Gene_based_analysis/#format-input-files","title":"Format input files","text":"zcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,$2,$3}' > HDLC_chr3.magma.input.snp.chr.pos.txt\nzcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,10^(-$11)}' > HDLC_chr3.magma.input.p.txt\n
"},{"location":"09_Gene_based_analysis/#annotate-snps","title":"Annotate SNPs","text":"snploc=./HDLC_chr3.magma.input.snp.chr.pos.txt\nncbi37=~/tools/magma/NCBI37/NCBI37.3.gene.loc\nmagma --annotate \\\n --snp-loc ${snploc} \\\n --gene-loc ${ncbi37} \\\n --out HDLC_chr3\n
Tip
Usually to capture the variants in the regulatory regions, we will add windows upstream and downstream of the genes with --annotate window
.
For example, --annotate window=35,10
set a 35 kilobase pair(kb) upstream and 10kb downstream window.
ref=~/tools/magma/g1000_eas/g1000_eas\nmagma \\\n --bfile $ref \\\n --pval ./HDLC_chr3.magma.input.p.txt N=70657 \\\n --gene-annot HDLC_chr3.genes.annot \\\n --out HDLC_chr3\n
"},{"location":"09_Gene_based_analysis/#gene-set-level-analysis","title":"Gene-set level analysis","text":"geneset=/home/he/tools/magma/MSigDB/msigdb_v2022.1.Hs_files_to_download_locally/msigdb_v2022.1.Hs_GMTs/msigdb.v2022.1.Hs.entrez.gmt\nmagma \\\n --gene-results HDLC_chr3.genes.raw \\\n --set-annot ${geneset} \\\n --out HDLC_chr3\n
"},{"location":"09_Gene_based_analysis/#reference","title":"Reference","text":"Polygenic risk score(PRS), as known as polygenic score (PGS) or genetic risk score (GRS), is a score that summarizes the effect sizes of genetic variants on a certain disease or trait (weighted sum of disease/trait-associated alleles).
To calculate the PRS for sample j,
\\[PRS_j = \\sum_{i=0}^{i=M} x_{i,j} \\beta_{i}\\]In this tutorial, we will first briefly introduce how to develop PRS model using the sample data and then demonstrate how we can download PRS models from PGS Catalog and apply to our sample genotype data.
"},{"location":"10_PRS/#ctpt-using-plink","title":"C+T/P+T using PLINK","text":"P+T stands for Pruning + Thresholding, also known as Clumping and Thresholding(C+T), which is a very simple and straightforward approach to constructing PRS models.
Clumping
Clumping: LD-pruning based on P value. It is a approach to select variants when there are multiple significant associations in high LD in the same region.
The three important parameters for clumping in PLINK are:
Clumping using PLINK
#!/bin/bash\n\nplinkFile=../04_Data_QC/sample_data.clean\nsumStats=../06_Association_tests/1kgeas.B1.glm.firth\n\nplink \\\n --bfile ${plinkFile} \\\n --clump-p1 0.0001 \\\n --clump-r2 0.1 \\\n --clump-kb 250 \\\n --clump ${sumStats} \\\n --clump-snp-field ID \\\n --clump-field P \\\n --out 1kg_eas\n
log
--clump: 40 clumps formed from 307 top variants.\n
check only the header and the first \"clump\" of SNPs. head -n 2 1kg_eas.clumped\n CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2\n2 1 2:55513738:C:T 55513738 1.69e-15 52 0 3 1 6 42 2:55305475:A:T(1),2:55338196:T:C(1),2:55347135:G:A(1),2:55351853:A:G(1),2:55363460:G:A(1),2:55395372:A:G(1),2:55395578:G:A(1),2:55395807:C:T(1),2:55405847:C:A(1),2:55408556:C:A(1),2:55410835:C:T(1),2:55413644:C:G(1),2:55435439:C:T(1),2:55449464:T:C(1),2:55469819:A:T(1),2:55492154:G:A(1),2:55500529:A:G(1),2:55502651:A:G(1),2:55508333:G:C(1),2:55563020:A:G(1),2:55572944:T:C(1),2:55585915:A:G(1),2:55599810:C:T(1),2:55605943:A:G(1),2:55611766:T:C(1),2:55612986:G:C(1),2:55619923:C:T(1),2:55622624:G:A(1),2:55624520:C:T(1),2:55628936:G:C(1),2:55638830:T:C(1),2:55639023:A:T(1),2:55639980:C:T(1),2:55640649:G:A(1),2:55641045:G:A(1),2:55642887:C:T(1),2:55647729:A:G(1),2:55650512:G:A(1),2:55659155:A:G(1),2:55665620:A:G(1),2:55667476:G:T(1),2:55670729:A:G(1),2:55676257:C:T(1),2:55685927:C:A(1),2:55689569:A:T(1),2:55689913:T:C(1),2:55693097:C:G(1),2:55707583:T:C(1),2:55720135:C:G(1)\n
"},{"location":"10_PRS/#beta-shrinkage-using-prs-cs","title":"Beta shrinkage using PRS-CS","text":"\\[ \\beta_j | \\Phi_j \\sim N(0,\\phi\\Phi_j) , \\Phi_j \\sim g \\] Reference: Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications, 10(1), 1-10.
"},{"location":"10_PRS/#parameter-tuning","title":"Parameter tuning","text":"Method Description Cross-validation 10-fold cross validation. This method usually requires large-scale genotype dataset. Independent population Perform validation in an independent population of the same ancestry. Pseudo-validation A few methods can estimate a single optimal shrinkage parameter using only the base GWAS summary statistics."},{"location":"10_PRS/#pgs-catalog","title":"PGS Catalog","text":"Just like GWAS Catalog, you can now download published PRS models from PGS catalog.
URL: http://www.pgscatalog.org/
Reference: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.
"},{"location":"10_PRS/#calculate-prs-using-plink","title":"Calculate PRS using PLINK","text":"plink --score <score_filename> [variant ID col.] [allele col.] [score col.] ['header']\n
<score_filename>
: the score file[variant ID col.]
: the column number for variant IDs[allele col.]
: the column number for effect alleles[score col.]
: the column number for betas['header']
: skip the first header linePlease check here for detailed documents on plink --score
.
Example
# genotype data\nplinkFile=../04_Data_QC/sample_data.clean\n# summary statistics for scoring\nsumStats=./t2d_plink_reduced.txt\n# SNPs after clumpping\nawk 'NR!=1{print $3}' 1kgeas.clumped > 1kgeas.valid.snp\n\nplink \\\n --bfile ${plinkFile} \\\n --score ${sumStats} 1 2 3 header \\\n --extract 1kgeas.valid.snp \\\n --out 1kgeas\n
For thresholding using P values, we can create a range file and a p-value file.
The options we use:
--q-score-range <range file> <data file> [variant ID col.] [data col.] ['header']\n
Example
# SNP - P value file for thresholding\nawk '{print $1,$4}' ${sumStats} > SNP.pvalue\n\n# create a range file with 3 columns: range label, p-value lower bound, p-value upper bound\nhead range_list\npT0.001 0 0.001\npT0.05 0 0.05\npT0.1 0 0.1\npT0.2 0 0.2\npT0.3 0 0.3\npT0.4 0 0.4\npT0.5 0 0.5\n
and then calculate the scores using the p-value ranges:
plink2 \\\n--bfile ${plinkFile} \\\n--score ${sumStats} 1 2 3 header cols=nallele,scoreavgs,denom,scoresums\\\n--q-score-range range_list SNP.pvalue \\\n--extract 1kgeas.valid.snp \\\n--out 1kgeas\n
You will get the following files:
1kgeas.pT0.001.sscore\n1kgeas.pT0.05.sscore\n1kgeas.pT0.1.sscore\n1kgeas.pT0.2.sscore\n1kgeas.pT0.3.sscore\n1kgeas.pT0.4.sscore\n1kgeas.pT0.5.sscore\n
Take a look at the files:
head 1kgeas.pT0.1.sscore\n#IID ALLELE_CT DENOM SCORE1_AVG SCORE1_SUM\nHG00403 54554 54976 2.84455e-05 1.56382\nHG00404 54574 54976 5.65172e-05 3.10709\nHG00406 54284 54976 -3.91872e-05 -2.15436\nHG00407 54348 54976 -9.87606e-05 -5.42946\nHG00409 54760 54976 1.67157e-05 0.918963\nHG00410 54656 54976 3.74405e-05 2.05833\nHG00419 54052 54976 -6.4035e-05 -3.52039\nHG00421 54210 54976 -1.55942e-05 -0.857305\nHG00422 54102 54976 5.28824e-05 2.90726\n
"},{"location":"10_PRS/#meta-scoring-methods-for-prs","title":"Meta-scoring methods for PRS","text":"It has been shown recently that the PRS models generated from multiple traits using a meta-scoring method potentially outperforms PRS models generated from a single trait. Inouye et al. first used this approach for generating a PRS model for CAD from multiple PRS models.
Potential advantages of meta-score for PRS generation
Reference: Inouye, M., Abraham, G., Nelson, C. P., Wood, A. M., Sweeting, M. J., Dudbridge, F., ... & UK Biobank CardioMetabolic Consortium CHD Working Group. (2018). Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology, 72(16), 1883-1893.
elastic net
Elastic net is a common approach for variable selection when there are highly correlated variables (for example, PRS of correlated diseases are often highly correlated.). When fitting linear or logistic models, L1 and L2 penalties are added (regularization).
\\[ \\hat{\\beta} \\equiv argmin({\\parallel y- X \\beta \\parallel}^2 + \\lambda_2{\\parallel \\beta \\parallel}^2 + \\lambda_1{\\parallel \\beta \\parallel} ) \\]After validation, PRS can be generated from distinct PRS for other genetically correlated diseases :
\\[PRS_{meta} = {w_1}PRS_{Trait1} + {w_2}PRS_{Trait2} + {w_3}PRS_{Trait3} + ... \\]An example: Abraham, G., Malik, R., Yonova-Doing, E., Salim, A., Wang, T., Danesh, J., ... & Dichgans, M. (2019). Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nature communications, 10(1), 1-10.
"},{"location":"10_PRS/#reference","title":"Reference","text":"Meta-analysis is one of the most commonly used statistical methods to combine the evidence from multiple studies into a single result.
Potential problems for small-scale genome-wide association studies
To address these problems, meta-analysis is a powerful approach to integrate multiple GWAS summary statistics, especially when more and more summary statistics are publicly available. . This method allows us to obtain increases in statistical power as sample size increases.
What we could achieve by conducting meta-analysis
Before performing any type of meta-analysis, we need to make sure our datasets contain sufficient information and the datasets are QCed and harmonized. It is important to perform this step to avoid any unexpected errors and heterogeneity.
Key points for Dataset selection
Key points for Quality control
Key points for Harmonization
Simply speaking, the fixed effects we mentioned here mean that the between-study variance is zero. Under the fixed effect model, we assume a common effect size across studies for a certain SNP.
Fixed effect model
\\[ \\bar{\\beta_{ij}} = {{\\sum_{i=1}^{k} {w_{ij} \\beta_{ij}}}\\over{\\sum_{i=1}^{k} {w_{ij}}}} \\]Cochran's Q test and \\(I^2\\)
\\[ Q = \\sum_{i=1}^{k} {w_i (\\beta_i - \\bar{\\beta})^2} \\] \\[ I_j^2 = {{Q_j - df_j}\\over{Q_j}}\\times 100% = {{Q - (k - 1)}\\over{Q}}\\times 100% \\]"},{"location":"11_meta_analysis/#metal","title":"METAL","text":"METAL is one of the most commonly used tools for GWA meta-analysis. Its official documentation can be found here. METAL supports two models: (1) Sample size based approach and (2) Inverse variance based approach.
A minimal example of meta-analysis using the IVW method
metal_script.txt# classical approach, uses effect size estimates and standard errors\nSCHEME STDERR \n\n# === DESCRIBE AND PROCESS THE FIRST INPUT FILE ===\nMARKER SNP\nALLELE REF_ALLELE OTHER_ALLELE\nEFFECT BETA\nPVALUE PVALUE \nSTDERR SE \nPROCESS inputfile1.txt\n\n# === THE SECOND INPUT FILE HAS THE SAME FORMAT AND CAN BE PROCESSED IMMEDIATELY ===\nPROCESS inputfile2.txt\n\nANALYZE\n
Then, just run the following command to execute the metal script.
metal meta_input.txt\n
"},{"location":"11_meta_analysis/#random-effects-meta-analysis","title":"Random effects meta-analysis","text":"On the other hand, random effects mean that we need to model the between-study variance, which is not zero in this case. Under the random effect model, we assume the true effect size for a certain SNP varies across studies.
If heterogeneity of effects exists across studies, we need to model the between-study variance to correct for the deflation of variance in fixed-effect estimates.
"},{"location":"11_meta_analysis/#gwama","title":"GWAMA","text":"Random effect model
The random effect variance component can be estimated by:
\\[ r_j^2 = max\\left(0, {{Q_j - (N_j -1)}\\over{\\sum_iw_{ij} - ({{\\sum_iw_{ij}^2} \\over {\\sum_iw_ {ij}}})}}\\right)\\]Then the effect size for SNP j can be obtained by:
\\[ \\bar{\\beta_j}^* = {{\\sum_{i=1}^{k} {w_{ij}^* \\beta_i}}\\over{\\sum_{i=1}^{k} {w_{ij}^*}}} \\]The weights are estimated by:
\\[w_{ij}^* = {{1}\\over{r_j^2 + Var(\\beta_{ij})}} \\]The random effect model was implemented in GWAMA, which is another very popular GWA meta-analysis tool. Its official documentation can be found here.
A minimal example of random effect meta-analysis using GWAMA
The input file for GWAMA contains the path to each sumstats. Column names need to be standardized.
GWAMA_script.inPop1.txt\nPop2.txt\nPop3.txt\n
GWAMA \\\n -i GWAMA_script.in \\\n --random \\\n -o myresults\n
"},{"location":"11_meta_analysis/#cross-ancestry-meta-analysis","title":"Cross-ancestry meta-analysis","text":""},{"location":"11_meta_analysis/#mantra","title":"MANTRA","text":"MANTRA (Meta-ANalysis of Transethnic Association studies) is one of the early efforts to address the heterogeneity for cross-ancestry meta-analysis.
MANTRA implements a Bayesian partition model where GWASs were clustered into ancestry clusters based on a prior model of similarity between them. MANTRA then uses Markov chain Monte Carlo (MCMC) algorithms to approximate the posterior distribution of parameters (which might be quite computationally intensive). MANTRA has been shown to increase power and mapping resolution over random-effects meta-analysis over a range of models of heterogeneity situations.
"},{"location":"11_meta_analysis/#mr-mega","title":"MR-MEGA","text":"MR-MEGA employs meta-regression to model the heterogeneity in effect sizes across ancestries. Its official documentation can be found here (The same first author as GWAMA).
Meta-regression implemented in MR-MEGA
It will first construct a matrix \\(D\\) of pairwise Euclidean distances between GWAS across autosomal variants. The elements of D , $d_{k'k} $ for a pair of studies can be expressed as the following. For each variant \\(j\\), \\(p_{kj}\\) is the allele frequency of j in study k, then:
\\[d_{k'k} = {{\\sum_jI_j(p_{kj}-p_{k'j})^2}\\over{\\sum_jI_j}}\\]Then multi-dimensional scaling (MDS) will be performed to derive T axes of genetic variation (\\(x_k\\) for study k)
For each variant j, the effect size of the reference allele can be modeled in a linear regression model as :
\\[E[\\beta_{kj}] = \\beta_j + \\sum_{t=1}^T\\beta_{tj}x_{kj}\\]A minimal example of meta-analysis using MR-MEGA
The input file for MR-MEGA contains the path to each sumstats. Column names need to be standardized like GWAMA.
MRMEGA_script.inPop1.txt.gz\nPop2.txt.gz\nPop3.txt.gz\nPop4.txt.gz\nPop5.txt.gz\nPop6.txt.gz\nPop7.txt.gz\nPop8.txt.gz\n
MR-MEGA \\\n -i MRMEGA_script.in \\\n --pc 4 \\\n -o myresults\n
"},{"location":"11_meta_analysis/#global-biobank-meta-analysis-initiative-gbmi","title":"Global Biobank Meta-analysis Initiative (GBMI)","text":"As a recent success achieved by meta-analysis, GBMI showed an example of the improvement of our understanding of diseases by taking advantage of large-scale meta-analyses.
For more details, you check check here.
"},{"location":"11_meta_analysis/#reference","title":"Reference","text":"Fine-mapping : Fine-mapping aims to identify the causal variant(s) within a locus for a disease, given the evidence of the significant association of the locus (or genomic region) in GWAS of a disease.
Fine-mapping using individual data is usually performed by fitting the multiple linear regression model:
\\[y = Xb + e\\]Fine-mapping (using Bayesian methods) aims to estimate the PIP (posterior inclusion probability), which indicates the evidence for SNP j having a non-zero effect (namely, causal).
PIP(Posterior Inclusion Probability)
PIP is often calculated by the sum of the posterior probabilities over all models that include variant j as causal.
\\[ PIP_j:=Pr(b_j\\neq0|X,y) \\]Bayesian methods and Posterior probability
\\[ Pr(M_m | O) = {{Pr(O | M_m) Pr(M_m)}\\over{\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}} \\]\\(O\\) : Observed data
\\(M\\) : Models (the configurations of causal variants in the context of fine-mapping).
\\(Pr(M_m | O)\\): Posterior Probability of Model m
\\(Pr(O | M_m)\\): Likelihood (the probability of observing your dataset given Model m is true.)
\\(Pr(M_m)\\): Prior distribution of Model m (the probability of Model m being true)
\\({\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}\\): Evidence (the probability of observing your dataset), namely \\(Pr(O)\\)
Credible sets
A credible set refers to the minimum set of variants that contains all causal SNPs with probability \\(\u03b1\\). (Under the single-causal-variant-per-locus assumption, the credible set is calculated by ranking variants based on their posterior probabilities, and then summing these until the cumulative sum is \\(>\u03b1\\)). We usually report 95% credible sets (\u03b1=95%) for fine-mapping analysis.
Commonly used tools for fine-mapping
Methods assuming only one causal variant in the locus
Methods assuming multiple causal variants in the locus
Methods assuming a small number of larger causal effects with a large number of infinitesimal effects
Methods for Cross-ancestry fine-mapping
You can check here for more information.
In this tutorial, we will introduce SuSiE as an example. SuSiE stands for Sum of Single Effects\u201d model.
The key idea behind SuSiE is :
\\[b = \\sum_{l=1}^L b_l \\]where each vector \\(b_l = (b_{l1}, \u2026, b_{lJ})^T\\) is a so-called single effect vector (a vector with only one non-zero element). L is the upper bound of number of causal variants. And this model could be fitted using Iterative Bayesian Stepwise Selection (IBSS).
For fine-mapping with summary statistics using Susie (SuSiE-RSS), IBSS was modified (IBSS-ss) to take sufficient statistics (which can be computed from other combinations of summary statistics) as input. SuSie will then approximate the sufficient statistics to run fine-mapping.
Quote
For details of SuSiE and SuSiE-RSS, please check : Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the \u201cSum of Single Effects\u201d model. PLoS Genetics, 18(7), e1010299. Link
"},{"location":"12_fine_mapping/#file-preparation","title":"File Preparation","text":"Using python to check novel loci and extract the files.
import gwaslab as gl\nimport pandas as pd\nimport numpy as np\n\nsumstats = gl.Sumstats(\"../06_Association_tests/1kgeas.B1.glm.firth\",fmt=\"plink2\")\n...\n\nsumstats.basic_check()\n...\n\nsumstats.get_lead()\n\nFri Jan 13 23:31:43 2023 Start to extract lead variants...\nFri Jan 13 23:31:43 2023 -Processing 1122285 variants...\nFri Jan 13 23:31:43 2023 -Significance threshold : 5e-08\nFri Jan 13 23:31:43 2023 -Sliding window size: 500 kb\nFri Jan 13 23:31:44 2023 -Found 59 significant variants in total...\nFri Jan 13 23:31:44 2023 -Identified 3 lead variants!\nFri Jan 13 23:31:44 2023 Finished extracting lead variants successfully!\n\nSNPID CHR POS EA NEA SE Z P OR N STATUS\n110723 2:55574452:G:C 2 55574452 C G 0.160948 -5.98392 2.178320e-09 0.381707 503 9960099\n424615 6:29919659:T:C 6 29919659 T C 0.155457 -5.89341 3.782970e-09 0.400048 503 9960099\n635128 9:36660672:A:G 9 36660672 G A 0.160275 5.63422 1.758540e-08 2.467060 503 9960099\n
We will perform fine-mapping for the first significant loci whose lead variant is 2:55574452:G:C
. # filter in the variants in the this locus.\n\nlocus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452')\nlocus.fill_data(to_fill=[\"BETA\"])\nlocus.harmonize(basic_check=False, ref_seq=\"/Users/he/mydata/Reference/Genome/human_g1k_v37.fasta\")\nlocus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\n
check in terminal:
head sig_locus.tsv\nSNPID CHR POS EA NEA BETA SE Z P OR N STATUS\n2:54535206:C:T 2 54535206 T C 0.30028978 0.142461 2.10786 0.0350429 1.35025 503 9960099\n2:54536167:C:G 2 54536167 G C 0.14885099 0.246871 0.602952 0.546541 1.1605 503 9960099\n2:54539096:A:G 2 54539096 G A -0.0038474211 0.288489 -0.0133355 0.98936 0.99616 503 9960099\n2:54540264:G:A 2 54540264 A G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540614:G:T 2 54540614 T G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540621:A:G 2 54540621 G A -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n2:54540970:T:C 2 54540970 C T -0.049506452 0.149053 -0.332144 0.739781 0.951699 503 9960099\n2:54544229:T:C 2 54544229 C T -0.14338203 0.151172 -0.948468 0.342891 0.866423 503 9960099\n2:54545593:T:C 2 54545593 C T -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099\n\nhead sig_locus.snplist\n2:54535206:C:T\n2:54536167:C:G\n2:54539096:A:G\n2:54540264:G:A\n2:54540614:G:T\n2:54540621:A:G\n2:54540970:T:C\n2:54544229:T:C\n2:54545593:T:C\n2:54546032:C:G\n
"},{"location":"12_fine_mapping/#ld-matrix-calculation","title":"LD Matrix Calculation","text":"Example
#!/bin/bash\n\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\n\n# LD r matrix\nplink \\\n --bfile ${plinkFile} \\\n --keep-allele-order \\\n --r square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt\n\n# LD r2 matrix\nplink \\\n --bfile ${plinkFile} \\\n --keep-allele-order \\\n --r2 square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt_r2\n
Take a look at the LD matrix (first 5 rows and columns) head -5 sig_locus_mt.ld | cut -f 1-5\n1 -0.145634 0.252616 -0.0876317 -0.0876317\n-0.145634 1 -0.0916734 -0.159635 -0.159635\n0.252616 -0.0916734 1 0.452333 0.452333\n-0.0876317 -0.159635 0.452333 1 1\n-0.0876317 -0.159635 0.452333 1 1\n\nhead -5 sig_locus_mt_r2.ld | cut -f 1-5\n1 0.0212091 0.0638148 0.00767931 0.00767931\n0.0212091 1 0.00840401 0.0254833 0.0254833\n0.0638148 0.00840401 1 0.204605 0.204605\n0.00767931 0.0254833 0.204605 1 1\n0.00767931 0.0254833 0.204605 1 1\n
Heatmap of the LD matrix: "},{"location":"12_fine_mapping/#fine-mapping-with-summary-statistics-using-susier","title":"Fine-mapping with summary statistics using SusieR","text":"Note
install.packages(\"susieR\")\n\n# Fine-mapping with summary statistics\nfitted_rss2 = susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, R = R, n = n, L = 10)\n
R
: a p
x p
LD r matrix. N
: Sample size. bhat
: Alternative summary data giving the estimated effects (a vector of length p
). This, together with shat, may be provided instead of z. shat
: Alternative summary data giving the standard errors of the estimated effects (a vector of length p
). This, together with bhat, may be provided instead of z. L
: Maximum number of non-zero effects in the susie regression model. (defaul : L = 10
)
Quote
For deatils, please check SusieR tutorial - Fine-mapping with susieR using summary statistics
Use susieR in jupyter notebook (with Python):
Please check : https://github.com/Cloufield/GWASTutorial/blob/main/12_fine_mapping/finemapping_susie.ipynb
"},{"location":"12_fine_mapping/#reference","title":"Reference","text":"Heritability is a term used in genetics to describe how much phenotypic variation can be explained by genetic variation.
For any phenotype, its variation \\(Var(P)\\) can be modeled as the combination of genetic effects \\(Var(G)\\) and environmental effects \\(Var(E)\\).
\\[ Var(P) = Var(G) + Var(E) \\]"},{"location":"13_heritability/#broad-sense-heritability","title":"Broad-sense Heritability","text":"The broad-sense heritability \\(H^2_{broad-sense}\\) is mathmatically defined as :
\\[ H^2_{broad-sense} = {Var(G)\\over{Var(P)}} \\]"},{"location":"13_heritability/#narrow-sense-heritability","title":"Narrow-sense Heritability","text":"Genetic effects \\(Var(G)\\) is composed of multiple effects including additive effects \\(Var(A)\\), dominant effects, recessive effects, epistatic effects and so forth.
Narrrow-sense heritability is defined as:
\\[ h^2_{narrow-sense} = {Var(A)\\over{Var(P)}} \\]"},{"location":"13_heritability/#snp-heritability","title":"SNP Heritability","text":"SNP heritability \\(h^2_{SNP}\\) : the proportion of phenotypic variance explained by tested SNPs in a GWAS.
Common methods to estimate SNP heritability includes:
Issue for binary traits :
The scale issue for binary traits
Conversion formula (Equation 23 from Lee. 2011):
\\[ h^2_{liability-scale} = h^2_{observed-scale} * {{K(1-K)}\\over{Z^2}} * {{K(1-K)}\\over{P(1-P)}} \\]scipy.stats.norm.pdf(T, loc=0, scale=1)
.scipy.stats.norm.ppf(1 - K, loc=0, scale=1)
or scipy.stats.norm.isf(K)
.The basic model behind GCTA-GREML is the linear mixed model (LMM):
\\[y = X\\beta + Wu + e\\] \\[ Var(y) = V = WW^{'}\\delta^2_u + I \\delta^2_e\\]GCTA defines \\(A = WW^{'}/N\\) and \\(\\delta^2_g\\) as the variance explained by SNPs.
So the oringinal model can be written as:
\\[y = X\\beta + g + e\\]Quote
For details, please check Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82. link.
"},{"location":"14_gcta_greml/#donwload","title":"Donwload","text":"Download the version of GCTA for your system from : https://yanglab.westlake.edu.cn/software/gcta/#Download
Example
wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip\nunzip gcta-1.94.1-linux-kernel-3-x86_64.zip\ncd gcta-1.94.1-linux-kernel-3-x86_64.zip\n\n./gcta-1.94.1\n*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 12:22:19 JST on Sun Jan 15 2023.\nHostname: Home-Desktop\n\nError: no analysis has been launched by the option(s)\nPlease see online documentation at https://yanglab.westlake.edu.cn/software/gcta/\n
Tip
Add GCTA to your environment
"},{"location":"14_gcta_greml/#make-grm","title":"Make GRM","text":"#!/bin/bash\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\ngcta \\\n --bfile ${plinkFile} \\\n --autosome \\\n --maf 0.01 \\\n --make-grm \\\n --out 1kg_eas\n
*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:21:24 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nOptions:\n\n--bfile ../04_Data_QC/sample_data.clean\n--autosome\n--maf 0.01\n--make-grm\n--out 1kg_eas\n\nNote: GRM is computed using the SNPs on the autosomes.\nReading PLINK FAM file from [../04_Data_QC/sample_data.clean.fam]...\n500 individuals to be included from FAM file.\n500 individuals to be included. 0 males, 0 females, 500 unknown.\nReading PLINK BIM file from [../04_Data_QC/sample_data.clean.bim]...\n1224104 SNPs to be included from BIM file(s).\nThreshold to filter variants: MAF > 0.010000.\nComputing the genetic relationship matrix (GRM) v2 ...\nSubset 1/1, no. subject 1-500\n 500 samples, 1224104 markers, 125250 GRM elements\nIDs for the GRM file have been saved in the file [1kg_eas.grm.id]\nComputing GRM...\n 100% finished in 7.4 sec\n1224104 SNPs have been processed.\n Used 1128732 valid SNPs.\nThe GRM computation is completed.\nSaving GRM...\nGRM has been saved in the file [1kg_eas.grm.bin]\nNumber of SNPs in each pair of individuals has been saved in the file [1kg_eas.grm.N.bin]\n\nAnalysis finished at 17:21:32 JST on Tue Dec 26 2023\nOverall computational time: 8.51 sec.\n
"},{"location":"14_gcta_greml/#estimation","title":"Estimation","text":"#!/bin/bash\n\n#the grm we calculated in step1\nGRM=1kg_eas\n\n# phenotype file\nphenotypeFile=../01_Dataset/1kgeas_binary_gcta.txt\n\n# disease prevalence used for conversion to liability-scale heritability\nprevalence=0.5\n\n# use 5PCs as covariates \nawk '{print $1,$2,$5,$6,$7,$8,$9}' ../05_PCA/plink_results_projected.sscore > 5PCs.txt\n\ngcta \\\n --grm ${GRM} \\\n --pheno ${phenotypeFIile} \\\n --prevalence ${prevalence} \\\n --qcovar 5PCs.txt \\\n --reml \\\n --out 1kg_eas\n
"},{"location":"14_gcta_greml/#results","title":"Results","text":"Warning
This is just to show the analysis pipeline. The trait was simulated under an unreal condition (effect size is extremely large) so the result is meaningless here.
For real analysis, you need a larger sample size to get robust estimation. Please see the GCTA FAQ
*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:36:37 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nAccepted options:\n--grm 1kg_eas\n--pheno ../01_Dataset/1kgeas_binary_gcta.txt\n--prevalence 0.5\n--qcovar 5PCs.txt\n--reml\n--out 1kg_eas\n\nNote: This is a multi-thread program. You could specify the number of threads by the --thread-num option to speed up the computation if there are multiple processors in your machine.\n\nReading IDs of the GRM from [1kg_eas.grm.id].\n500 IDs are read from [1kg_eas.grm.id].\nReading the GRM from [1kg_eas.grm.bin].\nGRM for 500 individuals are included from [1kg_eas.grm.bin].\nReading phenotypes from [../01_Dataset/1kgeas_binary_gcta.txt].\nNon-missing phenotypes of 503 individuals are included from [../01_Dataset/1kgeas_binary_gcta.txt].\nReading quantitative covariate(s) from [5PCs.txt].\n5 quantitative covariate(s) of 501 individuals are included from [5PCs.txt].\nAssuming a disease phenotype for a case-control study: 248 cases and 250 controls\n5 quantitative variable(s) included as covariate(s).\n498 individuals are in common in these files.\n\nPerforming REML analysis ... (Note: may take hours depending on sample size).\n498 observations, 6 fixed effect(s), and 2 variance component(s)(including residual variance).\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.12498 0.124846\nlogL: 95.34\nRunning AI-REML algorithm ...\nIter. logL V(G) V(e)\n1 95.34 0.14264 0.10708\n2 95.37 0.18079 0.06875\n3 95.40 0.18071 0.06888\n4 95.40 0.18071 0.06888\nLog-likelihood ratio converged.\n\nCalculating the logLikelihood for the reduced model ...\n(variance component 1 is dropped from the model)\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.24901\nlogL: 94.78319\nRunning AI-REML algorithm ...\nIter. logL V(e)\n1 94.79 0.24900\n2 94.79 0.24899\nLog-likelihood ratio converged.\n\nSummary result of REML analysis:\nSource Variance SE\nV(G) 0.180708 0.164863\nV(e) 0.068882 0.162848\nVp 0.249590 0.016001\nV(G)/Vp 0.724021 0.654075\nThe estimate of variance explained on the observed scale is transformed to that on the underlying liability scale:\n(Proportion of cases in the sample = 0.497992; User-specified disease prevalence = 0.500000)\nV(G)/Vp_L 1.137308 1.027434\n\nSampling variance/covariance of the estimates of variance components:\n2.717990e-02 -2.672171e-02\n-2.672171e-02 2.651955e-02\n\nSummary result of REML analysis has been saved in the file [1kg_eas.hsq].\n\nAnalysis finished at 17:36:38 JST on Tue Dec 26 2023\nOverall computational time: 0.08 sec.\n
"},{"location":"14_gcta_greml/#reference","title":"Reference","text":"Winner's curse refers to the phenomenon that genetic effects are systematically overestimated by thresholding or selection process in genetic association studies.
Winner's curse in auctions
This term was initially used to describe a phenomenon that occurs in auctions. The winning bid is very likely to overestimate the intrinsic value of an item even if all the bids are unbiased (the auctioned item is of equal value to all bidders). The thresholding process in GWAS resembles auctions, where the lead variants are the winning bids.
Reference:
The asymptotic distribution of \\(\\beta_{Observed}\\) is:
\\[\\beta_{Observed} \\sim N(\\beta_{True},\\sigma^2)\\]An example of distribution of \\(\\beta_{Observed}\\)
It is equivalent to:
\\[{{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}} \\sim N(0,1)\\]An example of distribution of \\({{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}}\\)
We can obtain the asymptotic sampling distribution (which is a truncated normal distribution) for \\(\\beta_{Observed}\\) by:
\\[f(x,\\beta_{True}) ={{1}\\over{\\sigma}} {{\\phi({{{x - \\beta_{True}}\\over{\\sigma}}})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]when
\\[|{{x}\\over{\\sigma}}|\\geq c\\]From the asymptotic sampling distribution, the expectation of effect sizes for the selected variants can then be approximated by:
\\[ E(\\beta_{Observed}; \\beta_{True}) = \\beta_{True} + \\sigma {{\\phi({{{\\beta_{True}}\\over{\\sigma}}-c}) - \\phi({{{-\\beta_{True}}\\over{\\sigma}}-c})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]Derivation of this equation can be found in the Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.
Reference:
Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html
"},{"location":"16_mendelian_randomization/","title":"Mendelian randomization","text":""},{"location":"16_mendelian_randomization/#mendelian-randomization-introduction","title":"Mendelian randomization introduction","text":"Comparison between RCT and MR
"},{"location":"16_mendelian_randomization/#fundamental-assumption-gene-environment-equivalence","title":"Fundamental assumption: gene-environment equivalence","text":"(cited from George Davey Smith Mendelian Randomization - 25th April 2024)
The fundamental assumption of mendelian randomization (MR) is of gene-environment equivalence. MR reflects the phenocopy/ genocopy dialectic (Goldschmidt, Schmalhausen). The idea here is that all environmental effects can be mimicked by one or several mutations. (Zuckerkandl and Villet, PNAS 1988)
Gene-environment equivalence
If we consider BMI as the outcome, let's think about whether genetic variants related to the following exposures meet the gene-environment equivalence assumption:
Instrumental variable (IV) can be defined as a variable that is correlated with the exposure X and uncorrelated with the error \\(\\epsilon\\) in the following regression:
\\[ Y = X\\beta + \\epsilon \\]Key Assumptions
Assumptions Description Relevance Instrumental variables are strongly associated with the exposure.(IVs are not independent of X) Exclusion restriction Instrumental variables do not affect the outcome except through the exposure.(IV is independent of Y, conditional on X and C) Independence There are no confounders of the instrumental variables and the outcome.(IV is independent of C) Monotonicity Variants affect the exposure in the same direction for all individuals No assortative mating Assortative mating might cause bias in MR"},{"location":"16_mendelian_randomization/#two-stage-least-squares-2sls","title":"Two-stage least-squares (2SLS)","text":"\\[ X = \\mu_1 + \\beta_{IV} IV + \\epsilon_1 \\] \\[ Y = \\mu_2 + \\beta_{2SLS} \\hat{X} + \\epsilon_2 \\]"},{"location":"16_mendelian_randomization/#two-sample-mr","title":"Two-sample MR","text":"Two-sample MR refers to the approach that the genetic effects of the instruments on the exposure can be estimated in an independent sample other than that used to estimate effects between instruments on the outcome. As more and more GWAS summary statistics become publicly available, the scope of MR also expands with Two-sample MR methods.
\\[ \\hat{\\beta}_{X,Y} = {{\\hat{\\beta}_{IV,Y}}\\over{\\hat{\\beta}_{IV,X}}} \\]Caveats
For two-sample MR, there is an additional key assumption:
The two samples used for MR are from the same underlying populations. (The effect size of instruments on exposure should be the same in both samples.)
Therefore, for two-sample MR, we usually use datasets from similar non-overlapping populations in terms of not only ancestry but also contextual factors.
"},{"location":"16_mendelian_randomization/#iv-selection","title":"IV selection","text":"One of the first things to do when you plan to perform any type of MR is to check the associations of instrumental variables with the exposure to avoid bias caused by weak IVs.
The most commonly used method here is the F-statistic, which tests the association of instrumental variables with the exposure.
"},{"location":"16_mendelian_randomization/#practice","title":"Practice","text":"In this tutorial, we will walk you through how to perform a minimal TwoSampleMR analysis. We will use the R package TwoSampleMR, which provides easy-to-use functions for formatting, clumping and harmonizing GWAS summary statistics.
This package integrates a variety of commonly used MR methods for analysis, including:
> mr_method_list()\n obj\n1 mr_wald_ratio\n2 mr_two_sample_ml\n3 mr_egger_regression\n4 mr_egger_regression_bootstrap\n5 mr_simple_median\n6 mr_weighted_median\n7 mr_penalised_weighted_median\n8 mr_ivw\n9 mr_ivw_radial\n10 mr_ivw_mre\n11 mr_ivw_fe\n12 mr_simple_mode\n13 mr_weighted_mode\n14 mr_weighted_mode_nome\n15 mr_simple_mode_nome\n16 mr_raps\n17 mr_sign\n18 mr_uwr\n\n name PubmedID\n1 Wald ratio\n2 Maximum likelihood\n3 MR Egger 26050253\n4 MR Egger (bootstrap) 26050253\n5 Simple median\n6 Weighted median\n7 Penalised weighted median\n8 Inverse variance weighted\n9 IVW radial\n10 Inverse variance weighted (multiplicative random effects)\n11 Inverse variance weighted (fixed effects)\n12 Simple mode\n13 Weighted mode\n14 Weighted mode (NOME)\n15 Simple mode (NOME)\n16 Robust adjusted profile score (RAPS)\n17 Sign concordance test\n18 Unweighted regression\n
"},{"location":"16_mendelian_randomization/#inverse-variance-weighted-fixed-effects","title":"Inverse variance weighted (fixed effects)","text":"Assumption: the underlying 'true' effect is fixed across variants
Weight for the effect of ith variant:
\\[W_i = {1 \\over Var(\\beta_i)}\\]Effect size:
\\[\\beta = {{\\sum_{i=1}^N{w_i \\beta_i}}\\over{\\sum_{i=1}^Nw_i}}\\]SE:
\\[SE = {\\sqrt{{1}\\over{\\sum_{i=1}^Nw_i}}}\\]"},{"location":"16_mendelian_randomization/#file-preparation","title":"File Preparation","text":"To perform two-sample MR analysis, we need summary statistics for exposure and outcome generated from independent populations with the same ancestry.
In this tutorial, we will use sumstats from Biobank Japan pheweb and KoGES pheweb.
wget -O bbj_t2d.zip https://pheweb.jp/download/T2D
wget -O koges_bmi.txt.gz https://koges.leelabsg.org/download/KoGES_BMI
First, to use TwoSampleMR, we need R >= 4.1. To install the package, run:
library(remotes)\ninstall_github(\"MRCIEU/TwoSampleMR\")\n
"},{"location":"16_mendelian_randomization/#loading-package","title":"Loading package","text":"library(TwoSampleMR)\n
"},{"location":"16_mendelian_randomization/#reading-exposure-sumstats","title":"Reading exposure sumstats","text":"#format exposures dataset\n\nexp_raw <- fread(\"koges_bmi.txt.gz\")\n
"},{"location":"16_mendelian_randomization/#extracting-instrumental-variables","title":"Extracting instrumental variables","text":"# select only significant variants\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_dat <- format_data( exp_raw,\n type = \"exposure\",\n snp_col = \"rsids\",\n beta_col = \"beta\",\n se_col = \"sebeta\",\n effect_allele_col = \"alt\",\n other_allele_col = \"ref\",\n eaf_col = \"af\",\n pval_col = \"pval\",\n)\n
"},{"location":"16_mendelian_randomization/#clumping-exposure-variables","title":"Clumping exposure variables","text":"clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\") \n
"},{"location":"16_mendelian_randomization/#outcome","title":"outcome","text":"out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\"))\nout_dat <- format_data( out_raw,\n type = \"outcome\",\n snp_col = \"SNPID\",\n beta_col = \"BETA\",\n se_col = \"SE\",\n effect_allele_col = \"Allele2\",\n other_allele_col = \"Allele1\",\n pval_col = \"p.value\",\n)\n
"},{"location":"16_mendelian_randomization/#harmonizing-data","title":"Harmonizing data","text":"harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
"},{"location":"16_mendelian_randomization/#perform-mr-analysis","title":"Perform MR analysis","text":"res <- mr(harmonized_data)\n\nid.exposure id.outcome outcome exposure method nsnp b se pval\n<chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure MR Egger 28 1.3337580 0.69485260 6.596064e-02\n9J8pv4 IyUv6b outcome exposure Weighted median 28 0.6298980 0.09401352 2.083081e-11\n9J8pv4 IyUv6b outcome exposure Inverse variance weighted 28 0.5598956 0.23225806 1.592361e-02\n9J8pv4 IyUv6b outcome exposure Simple mode 28 0.6097842 0.15180476 4.232158e-04\n9J8pv4 IyUv6b outcome exposure Weighted mode 28 0.5946778 0.12820220 8.044488e-05\n
"},{"location":"16_mendelian_randomization/#sensitivity-analysis","title":"Sensitivity analysis","text":""},{"location":"16_mendelian_randomization/#heterogeneity","title":"Heterogeneity","text":"Test if there is heterogeneity among the causal effect of x on y estimated from each variants.
mr_heterogeneity(harmonized_data)\n\nid.exposure id.outcome outcome exposure method Q Q_df Q_pval\n<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure MR Egger 670.7022 26 1.000684e-124\n9J8pv4 IyUv6b outcome exposure Inverse variance weighted 706.6579 27 1.534239e-131\n
"},{"location":"16_mendelian_randomization/#horizontal-pleiotropy","title":"Horizontal Pleiotropy","text":"Intercept in MR-Egger
mr_pleiotropy_test(harmonized_data)\n\nid.exposure id.outcome outcome exposure egger_intercept se pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>\n9J8pv4 IyUv6b outcome exposure -0.03603697 0.0305241 0.2484472\n
"},{"location":"16_mendelian_randomization/#single-snp-mr-and-leave-one-out-mr","title":"Single SNP MR and leave-one-out MR","text":"Single SNP MR
res_single <- mr_singlesnp(harmonized_data)\nres_single\n\nexposure outcome id.exposure id.outcome samplesize SNP b se p\n<chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <dbl> <dbl>\n1 exposure outcome 9J8pv4 IyUv6b NA rs10198356 0.6323140 0.2082837 2.398742e-03\n2 exposure outcome 9J8pv4 IyUv6b NA rs10209994 0.9477808 0.3225814 3.302164e-03\n3 exposure outcome 9J8pv4 IyUv6b NA rs10824329 0.6281765 0.3246214 5.297739e-02\n4 exposure outcome 9J8pv4 IyUv6b NA rs10938397 1.2376316 0.2775854 8.251150e-06\n5 exposure outcome 9J8pv4 IyUv6b NA rs11066132 0.6024303 0.2232401 6.963693e-03\n6 exposure outcome 9J8pv4 IyUv6b NA rs12522139 0.2905201 0.2890240 3.148119e-01\n7 exposure outcome 9J8pv4 IyUv6b NA rs12591730 0.8930490 0.3076687 3.700413e-03\n8 exposure outcome 9J8pv4 IyUv6b NA rs13013021 1.4867889 0.2207777 1.646925e-11\n9 exposure outcome 9J8pv4 IyUv6b NA rs1955337 0.5442640 0.2994146 6.910079e-02\n10 exposure outcome 9J8pv4 IyUv6b NA rs2076308 1.1176226 0.2657969 2.613132e-05\n11 exposure outcome 9J8pv4 IyUv6b NA rs2278557 0.6238587 0.2968184 3.556906e-02\n12 exposure outcome 9J8pv4 IyUv6b NA rs2304608 1.5054682 0.2968905 3.961740e-07\n13 exposure outcome 9J8pv4 IyUv6b NA rs2531995 1.3972908 0.3130157 8.045689e-06\n14 exposure outcome 9J8pv4 IyUv6b NA rs261967 1.5303384 0.2921192 1.616714e-07\n15 exposure outcome 9J8pv4 IyUv6b NA rs35332469 -0.2307314 0.3479219 5.072217e-01\n16 exposure outcome 9J8pv4 IyUv6b NA rs35560038 -1.5730870 0.2018968 6.619637e-15\n17 exposure outcome 9J8pv4 IyUv6b NA rs3755804 0.5314915 0.2325073 2.225933e-02\n18 exposure outcome 9J8pv4 IyUv6b NA rs4470425 0.6948046 0.3079944 2.407689e-02\n19 exposure outcome 9J8pv4 IyUv6b NA rs476828 1.1739083 0.1568550 7.207355e-14\n20 exposure outcome 9J8pv4 IyUv6b NA rs4883723 0.5479721 0.2855004 5.494141e-02\n21 exposure outcome 9J8pv4 IyUv6b NA rs509325 0.5491040 0.1598196 5.908641e-04\n22 exposure outcome 9J8pv4 IyUv6b NA rs55872725 1.3501891 0.1259791 8.419325e-27\n23 exposure outcome 9J8pv4 IyUv6b NA rs6089309 0.5657525 0.3347009 9.096620e-02\n24 exposure outcome 9J8pv4 IyUv6b NA rs6265 0.6457693 0.1901871 6.851804e-04\n25 exposure outcome 9J8pv4 IyUv6b NA rs6736712 0.5606962 0.3448784 1.039966e-01\n26 exposure outcome 9J8pv4 IyUv6b NA rs7560832 0.6032080 0.2904972 3.785077e-02\n27 exposure outcome 9J8pv4 IyUv6b NA rs825486 -0.6152759 0.3500334 7.878772e-02\n28 exposure outcome 9J8pv4 IyUv6b NA rs9348441 -4.9786332 0.2572782 1.992909e-83\n29 exposure outcome 9J8pv4 IyUv6b NA All - Inverse variance weighted 0.5598956 0.2322581 1.592361e-02\n30 exposure outcome 9J8pv4 IyUv6b NA All - MR Egger 1.3337580 0.6948526 6.596064e-02\n
leave-one-out MR
res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n\nexposure outcome id.exposure id.outcome samplesize SNP b se p\n<chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <dbl> <dbl>\n1 exposure outcome 9J8pv4 IyUv6b NA rs10198356 0.5562834 0.2424917 2.178871e-02\n2 exposure outcome 9J8pv4 IyUv6b NA rs10209994 0.5520576 0.2388122 2.079526e-02\n3 exposure outcome 9J8pv4 IyUv6b NA rs10824329 0.5585335 0.2390239 1.945341e-02\n4 exposure outcome 9J8pv4 IyUv6b NA rs10938397 0.5412688 0.2388709 2.345460e-02\n5 exposure outcome 9J8pv4 IyUv6b NA rs11066132 0.5580606 0.2417275 2.096381e-02\n6 exposure outcome 9J8pv4 IyUv6b NA rs12522139 0.5667102 0.2395064 1.797373e-02\n7 exposure outcome 9J8pv4 IyUv6b NA rs12591730 0.5524802 0.2390990 2.085075e-02\n8 exposure outcome 9J8pv4 IyUv6b NA rs13013021 0.5189715 0.2386808 2.968017e-02\n9 exposure outcome 9J8pv4 IyUv6b NA rs1955337 0.5602635 0.2394505 1.929468e-02\n10 exposure outcome 9J8pv4 IyUv6b NA rs2076308 0.5431355 0.2394403 2.330758e-02\n11 exposure outcome 9J8pv4 IyUv6b NA rs2278557 0.5583634 0.2394924 1.972992e-02\n12 exposure outcome 9J8pv4 IyUv6b NA rs2304608 0.5372557 0.2377325 2.382639e-02\n13 exposure outcome 9J8pv4 IyUv6b NA rs2531995 0.5419016 0.2379712 2.277590e-02\n14 exposure outcome 9J8pv4 IyUv6b NA rs261967 0.5358761 0.2376686 2.415093e-02\n15 exposure outcome 9J8pv4 IyUv6b NA rs35332469 0.5735907 0.2378345 1.587739e-02\n16 exposure outcome 9J8pv4 IyUv6b NA rs35560038 0.6734906 0.2217804 2.391474e-03\n17 exposure outcome 9J8pv4 IyUv6b NA rs3755804 0.5610215 0.2413249 2.008503e-02\n18 exposure outcome 9J8pv4 IyUv6b NA rs4470425 0.5568993 0.2392632 1.993549e-02\n19 exposure outcome 9J8pv4 IyUv6b NA rs476828 0.5037555 0.2443224 3.922224e-02\n20 exposure outcome 9J8pv4 IyUv6b NA rs4883723 0.5602050 0.2397325 1.945000e-02\n21 exposure outcome 9J8pv4 IyUv6b NA rs509325 0.5608429 0.2468506 2.308693e-02\n22 exposure outcome 9J8pv4 IyUv6b NA rs55872725 0.4419446 0.2454771 7.180543e-02\n23 exposure outcome 9J8pv4 IyUv6b NA rs6089309 0.5597859 0.2388902 1.911519e-02\n24 exposure outcome 9J8pv4 IyUv6b NA rs6265 0.5547068 0.2436910 2.282978e-02\n25 exposure outcome 9J8pv4 IyUv6b NA rs6736712 0.5598815 0.2387602 1.902944e-02\n26 exposure outcome 9J8pv4 IyUv6b NA rs7560832 0.5588113 0.2396229 1.969836e-02\n27 exposure outcome 9J8pv4 IyUv6b NA rs825486 0.5800026 0.2367545 1.429330e-02\n28 exposure outcome 9J8pv4 IyUv6b NA rs9348441 0.7378967 0.1366838 6.717515e-08\n29 exposure outcome 9J8pv4 IyUv6b NA All 0.5598956 0.2322581 1.592361e-02\n
"},{"location":"16_mendelian_randomization/#visualization","title":"Visualization","text":""},{"location":"16_mendelian_randomization/#scatter-plot","title":"Scatter plot","text":"res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
"},{"location":"16_mendelian_randomization/#single-snp","title":"Single SNP","text":"res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
"},{"location":"16_mendelian_randomization/#leave-one-out","title":"Leave-one-out","text":"res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
"},{"location":"16_mendelian_randomization/#funnel-plot","title":"Funnel plot","text":"res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
"},{"location":"16_mendelian_randomization/#mr-steiger-directionality-test","title":"MR Steiger directionality test","text":"MR Steiger directionality test is a method to test the causal direction.
Steiger test: test whether the SNP-outcome correlation is greater than the SNP-exposure correlation.
harmonized_data$\"r.outcome\" <- get_r_from_lor(\n harmonized_data$\"beta.outcome\",\n harmonized_data$\"eaf.outcome\",\n 45383,\n 132032,\n 0.26,\n model = \"logit\",\n correction = FALSE\n)\n\nout <- directionality_test(harmonized_data)\nout\n\nid.exposure id.outcome exposure outcome snp_r2.exposure snp_r2.outcome correct_causal_direction steiger_pval\n<chr> <chr> <chr> <chr> <dbl> <dbl> <lgl> <dbl>\nrvi6Om ETcv15 BMI T2D 0.02125453 0.005496427 TRUE NA\n
Reference: Hemani, G., Tilling, K., & Davey Smith, G. (2017). Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS genetics, 13(11), e1007081.
"},{"location":"16_mendelian_randomization/#mr-base-web-app","title":"MR-Base (web app)","text":"MR-Base web app
"},{"location":"16_mendelian_randomization/#strobe-mr","title":"STROBE-MR","text":"Before reporting any MR results, please check the STROBE-MR Checklist first, which consists of 20 things that should be addressed when reporting a mendelian randomization study.
Coloc
uses the assumption of 0 or 1 causal variant in each trait, and tests whether the two traits share the same causal variant.
Note
Note that this assumption is different from fine-mapping. In fine-mapping, the aim is to identify the putative causal variants, which are determined at birth. In colocalization, the aim is to find overlapping signals that support a causal inference, such as eQTL --> trait. It is possible that the causal variants differ between the two traits.
Datasets used:
coloc
requires \"beta\", \"varbeta\", and \"snp\". For quantitative traits, the trait standard deviation \"sdY\" is required to estimate the scale of estimated beta.Result interpretation:
Basically, five configurations are calculated,
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf \n## 1.73e-08 7.16e-07 2.61e-05 8.20e-05 1.00e+00 \n## [1] \"PP abf for shared variant: 100%\"\n
\\(H_0\\): neither trait has a genetic association in the region
\\(H_1\\): only trait 1 has a genetic association in the region
\\(H_2\\): only trait 2 has a genetic association in the region
\\(H_3\\): both traits are associated, but with different causal variants
\\(H_4\\): both traits are associated and share a single causal variant
PP.H4.abf
is the posterior probability that two traits share a same causal variant.
Then based on H4
is true, a 95% credible set could be constructed (as a shared causal variant does not necessarily mean a specific variant).
o <- order(my.res$results$SNP.PP.H4,decreasing=TRUE)\ncs <- cumsum(my.res$results$SNP.PP.H4[o])\nw <- which(cs > 0.95)[1]\nmy.res$results[o,][1:w,]$snp\n
References:
Coloc: a package for colocalisation analyses
"},{"location":"17_colocalization/#coloc-assuming-multiple-causal-variants-or-multiple-signals","title":"Coloc assuming multiple causal variants or multiple signals","text":"When the single-causal variant assumption is violeted, several ways could be used to relieve it.
Assuming multiple causal variants in SuSiE-Coloc pipeline. In this pipeline, putative causal variants are fine-mapped, then each signal is passed to the coloc engine.
Conditioning analysis using GCTA-COJO-Coloc pipeline. In this pipeline, signals are segregated, then passed to the coloc engine.
Many other strategies and pipelines are available for colocalization and prioritize the variants/genes/traits. For example: * HyPrColoc * OpenTargets *
"},{"location":"18_Conditioning_analysis/","title":"Conditioning analysis","text":"Multiple association signals could exist in one locus, especially when observing complex LD structures in the regional plot. Conditioning on one signal allows the separation of independent signals.
Several ways to perform the conditioning analysis:
First, extract the individual genotype (dosage) to the text file. Then add it to covariates.
plink2 \\\n --pfile chr1.dose.Rsq0.3 vzs \\\n --extract chr1.list \\\n --threads 1 \\\n --export A \\\n --out genotype/chr1\n
The exported format could be found in Export non-PLINK 2 fileset.
Note
Major allele dosage would be outputted. If adding ref-first
, REF allele would be outputted. It does not matter as a covariate.
Then just paste it to the covariates table and run the association test.
Note
Some association test software will also provide options for condition analysis. For example, in PLINK, you can use --condition <variant ID>
for condition analysis. You can simply provide a list of variant IDs to run the condition analysis.
If raw genotypes and phenotypes are not available, GCTA-COJO performs conditioning analysis using sumstats and external LD reference.
cojo-top-SNPs 10
will perform a step-wise model selection to select 10 independently associated SNPs (including non-significant ones).
gcta \\\n --bfile chr1 \\\n --chr 1 \\\n --maf 0.001 \\\n --cojo-file chr1_cojo.input \\\n --cojo-top-SNPs 10 \\\n --extract-region-bp 1 152383617 5000 \\\n --out chr1_cojo.output\n
Note
bfile
is used to generate LD. A size of > 4000 unrelated samples is suggested. Estimation of LD in GATC is based on the hard-call genotype.
Input file format less chr1_cojo.input
:
ID ALLELE1 ALLELE0 A1FREQ BETA SE P N\nchr1:11171:CCTTG:C C CCTTG 0.0831407 -0.0459889 0.0710074 0.5172 180590\nchr1:13024:G:A A G 1.63957e-05 -3.2714 3.26302 0.3161 180590\n
Here A1
is the effect allele. Then --cojo-cond
could be used to generate new sumstats conditioned on the above-selected variant(s).
Reference:
In meiosis, homologous chromosomes are recombined. Recombination rates at different DNA regions are not equal. The fragments can be detected after tens of generations, causing Linkage disequilibrium, which refers to the non-random association of alleles of different loci.
Factors affecting LD
Suppose we have two SNPs whose alleles are \\(A/a\\) and \\(B/b\\).
The haplotype frequencies are:
Haplotype Frequency AB \\(p_{AB}\\) Ab \\(p_{Ab}\\) aB \\(p_{aB}\\) ab \\(p_{ab}\\)The allele frequencies are:
Allele Frequency A \\(p_A=p_{AB}+p_{Ab}\\) a \\(p_A=p_{aB}+p_{ab}\\) B \\(p_A=p_{AB}+p_{aB}\\) b \\(p_A=p_{Ab}+p_{ab}\\)D : the level of LD between A and B can be estimated using coefficient of linkage disequilibrium (D), which is defined as:
\\[D_{AB} = p_{AB} - p_Ap_B\\]If A and B are in linkage equilibrium, we can get
\\[D_{AB} = p_{AB} - p_Ap_B = 0\\]which means the coefficient of linkage disequilibrium is 0 in this case.
D can be calculated for each pair of alleles and their relationships can be expressed as:
\\[D_{AB} = -D_{Ab} = -D_{aB} = D_{ab} \\]So we can simply denote \\(D = D_{AB}\\), and the relationship between haplotype frequencies and allele frequencies can be summarized in the following table.
Allele A a Total B \\(p_{AB}=p_Ap_B+D\\) \\(p_{aB}=p_ap_B-D\\) \\(p_B\\) b \\(p_{AB}=p_Ap_b-D\\) \\(p_{AB}=p_ap_b+D\\) \\(p_b\\) Total \\(p_A\\) \\(p_a\\) 1The range of possible values of D depends on the allele frequencies, which is not suitable for comparison between different pairs of alleles.
Lewontin suggested a method for the normalization of D :
\\[D_{normalized} = {{D}\\over{D_{max}}}\\]where
\\[ D_{max} = \\begin{cases} max\\{-p_Ap_B, -(1-p_A)(1-p_B)\\} & \\text{when } D \\lt 0 \\\\ min\\{ p_A(1-p_B), p_B(1-p_A) \\} & \\text{when } D \\gt 0 \\\\ \\end{cases} \\]It measures how much proportion of the haplotypes had undergone recombination.
In practice, the most commonly used alternative metric to \\(D_{normalized}\\) is \\(r^2\\), the correlation coefficient, which can be obtained by:
\\[ r^2 = {{D^2}\\over{p_A(1-p_A)p_B(1-p_B)}} \\]Reference: Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485.
"},{"location":"19_ld/#ld-calculation-using-software","title":"LD Calculation using software","text":""},{"location":"19_ld/#ldstore2","title":"LDstore2","text":"LDstore2: http://www.christianbenner.com/#
Reference: Benner, C. et al. Prospects of fine-papping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
"},{"location":"19_ld/#plink-ld","title":"PLINK LD","text":"Please check Calculate LD using PLINK.
"},{"location":"19_ld/#ld-lookup-using-ldlink","title":"LD Lookup using LDlink","text":"LDlink
LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.
https://ldlink.nci.nih.gov/?tab=home
Reference: Machiela, M. J., & Chanock, S. J. (2015). LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31(21), 3555-3557.
LDlink is a very useful tool for quick lookups of any information related to LD.
"},{"location":"19_ld/#ldlink-ldpair","title":"LDlink-LDpair","text":"LDpair
"},{"location":"19_ld/#ldlink-ldproxy","title":"LDlink-LDproxy","text":"LDproxy for rs671
"},{"location":"19_ld/#query-in-batch-using-ldlink-api","title":"Query in batch using LDlink API","text":"LDlink provides API for queries using command line.
You need to register and get a token first.
https://ldlink.nci.nih.gov/?tab=apiaccess
Query LD proxies for variants using LDproxy API
curl -k -X GET 'https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs3&pop=MXL&r2_d=r2&window=500000& genome_build=grch37&token=faketoken123'\n
"},{"location":"19_ld/#ldlinkr","title":"LDlinkR","text":"There is also a related R package for LDlink.
Query LD proxies for variants using LDlinkR
install.packages(\"LDlinkR\")\n\nlibrary(LDlinkR)\n\nmy_proxies <- LDproxy(snp = \"rs671\", \n pop = \"EAS\", \n r2d = \"r2\", \n token = \"YourTokenHere123\",\n genome_build = \"grch38\"\n )\n
Reference: Myers, T. A., Chanock, S. J., & Machiela, M. J. (2020). LDlinkR: an R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Frontiers in genetics, 11, 157.
"},{"location":"19_ld/#ld-pruning","title":"LD-pruning","text":"Please check LD-pruning
"},{"location":"19_ld/#ld-clumping","title":"LD-clumping","text":"Please check LD-clumping
"},{"location":"19_ld/#ld-score","title":"LD score","text":"Definition: https://cloufield.github.io/GWASTutorial/08_LDSC/#ld-score
"},{"location":"19_ld/#ldsc","title":"LDSC","text":"LD score can be estimated with LDSC using PLINK format genotype data as the reference panel.
plinkPrefix=chr22\n\npython ldsc.py \\\n --bfile ${plinkPrefix}\n --l2 \\\n --ld-wind-cm 1\\\n --out ${plinkPrefix}\n
Check here for details.
"},{"location":"19_ld/#gcta","title":"GCTA","text":"GCTA also provides a function to estimate LD scores using PLINK format genotype data.
plinkPrefix=chr22\n\ngcta64 \\\n --bfile ${plinkPrefix} \\\n --ld-score \\\n --ld-wind 1000 \\\n --ld-rsq-cutoff 0.01 \\\n --out ${plinkPrefix}\n
Check here for details.
"},{"location":"19_ld/#ld-score-regression","title":"LD score regression","text":"Please check LD score regression
"},{"location":"19_ld/#reference","title":"Reference","text":"This table shows the relationship between the null hypothesis \\(H_0\\) and the results of a statistical test (whether or not to reject the null hypothesis \\(H_0\\) ).
H0 is True H0 is False Do Not Reject True negative : \\(1 - \\alpha\\) Type II error (false negative) : \\(\\beta\\) Reject Type I error (false positive) : \\(\\alpha\\) True positive : \\(1 - \\beta\\)\\(\\alpha\\) : significance level
By definition, the statistical power of a test refers to the probability that the test will correctly reject the null hypothesis, namely the True positive rate in the table above.
\\(Power = Pr ( Reject\\ | H_0\\ is\\ False) = 1 - \\beta\\)
Power
Factors affecting power
NCP describes the degree of difference between the alternative hypothesis \\(H_1\\) and the null hypothesis \\(H_0\\) values.
Consider a simple linear regression model:
\\[y = \\mu +\\beta x + \\epsilon\\]The variance of the error term:
\\[\\sigma^2 = Var(y) - Var(x)\\beta^2\\]Usually, the phenotypic variance that a single SNP could explain is very limited, so we can approximate \\(\\sigma^2\\) by:
\\[ \\sigma^2 \\thickapprox Var(y)\\]Under Hardy-Weinberg equilibrium, we can get:
\\[Var(x) = 2f(1-f)\\]So the Non-centrality parameter(NCP) \\(\\lambda\\) for \\(\\chi^2\\) distribution with degree of freedom 1:
\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2\\]"},{"location":"20_power_analysis/#power-for-quantitative-traits","title":"Power for quantitative traits","text":"\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2 \\thickapprox N \\times {{Var(x)\\beta^2}\\over{\\sigma^2}} \\thickapprox N \\times {{2f(1-f) \\beta^2 }\\over {Var(y)}} \\]Significance threshold: \\(C = CDF_{\\chi^2}^{-1}(1 - \\alpha,df=1)\\)
Denote :
Null hypothesis : \\(P_{case} = P_{control}\\)
To test whether one proportion \\(P_{case}\\) equals the other proportion \\(P_{control}\\), the test statistic is:
\\[z = {{P_{case} - P_{control}}\\over {\\sqrt{ {{P_{case}(1 - P_{case})}\\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\\over{2N_{control}}} }}}\\]Significance threshold: \\(C = \\Phi^{-1}(1 - \\alpha / 2 )\\)
\\[ Power = Pr(|Z|>C) = 1 - \\Phi(-C-z) + \\Phi(C-z)\\]GAS power calculator
GAS power calculator implemented this method, and you can easily calculate the power using their website
"},{"location":"20_power_analysis/#reference","title":"Reference:","text":"Most variants identified in GWAS are located in regulatory regions, and these genetic variants could potentially affect complex traits through gene expression.
However, due to the limitation of samples and high cost, it is difficult to measure gene expression at a large scale. Consequently, many expression-trait associations have not been detected, especially for those with small effect sizes.
To address these issues, alternative approaches have been proposed and transcriptome-wide association study (TWAS) has become a common and easy-to-perform approach to identify genes whose expression is significantly associated with complex traits in individuals without directly measured expression levels.
GWAS and TWAS
"},{"location":"21_twas/#definition","title":"Definition","text":"TWAS is a method to identify significant expression-trait associations using expression imputation from genetic data or summary statistics.
Individual-level and summary-level TWAS
"},{"location":"21_twas/#fusion","title":"FUSION","text":"In this tutorial, we will introduce FUSION, which is one of the most commonly used tools for performing transcriptome-wide association studies (TWAS) using summary-level data.
url : http://gusevlab.org/projects/fusion/
FUSION trains predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. (http://gusevlab.org/projects/fusion/)
Quote
Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.
"},{"location":"21_twas/#algorithm-for-imputing-expression-into-gwas-summary-statistics","title":"Algorithm for imputing expression into GWAS summary statistics","text":"ImpG-Summary algorithm was extended to impute the Z scores for the cis genetic component of expression.
FUSION statistical model
\\(Z\\) : a vector of standardized effect sizes (z scores) of SNPs for the target trait at a given locus
We impute the Z score of the expression and trait as a linear combination of elements of \\(Z\\) with weights \\(W\\).
\\[ W = \\Sigma_{e,s}\\Sigma_{s,s}^{-1} \\]\\(\\Sigma_{e,s}\\) : covariance matrix between all SNPs and gene expression
\\(\\Sigma_{s,s}\\) : covariance among all SNPs (LD)
Both \\(\\Sigma_{e,s}\\) and \\(\\Sigma_{s,s}\\) are estimated from reference datsets.
\\[ Z \\sim N(0, \\Sigma_{s,s} ) \\]The variance of \\(WZ\\) (imputed z score of expression and trait)
\\[ Var(WZ) = W\\Sigma_{s,s}W^t \\]The imputation Z score can be obtained by:
\\[ {{WZ}\\over{W\\Sigma_{s,s}W^t}^{1/2}} \\]ImpG-Summary algorithm
Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., ... & Price, A. L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.
"},{"location":"21_twas/#installation","title":"Installation","text":"Download FUSION from github and install
wget https://github.com/gusevlab/fusion_twas/archive/master.zip\nunzip master.zip\ncd fusion_twas-master\n
Download and unzip the LD reference data (1000 genome)
wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2\ntar xjvf LDREF.tar.bz2\n
Download and unzip plink2R
wget https://github.com/gabraham/plink2R/archive/master.zip\nunzip master.zip\n
Install R packages
# R >= 4.0\nR\n\ninstall.packages(c('optparse','RColorBrewer'))\ninstall.packages('plink2R-master/plink2R/',repos=NULL)\n
"},{"location":"21_twas/#example","title":"Example","text":"FUSION framework
Input:
Input GWAS sumstats fromat
Example:
SNP A1 A2 N CHISQ Z\nrs6671356 C T 70100.0 0.172612905312 0.415467092935\nrs6604968 G A 70100.0 0.291125788806 0.539560736902\nrs4970405 A G 70100.0 0.102204513891 0.319694407037\nrs12726255 G A 70100.0 0.312418295691 0.558943911042\nrs4970409 G A 70100.0 0.0524226849517 0.228960007319\n
Get sample sumstats and weights
wget https://data.broadinstitute.org/alkesgroup/FUSION/SUM/PGC2.SCZ.sumstats\n\nmkdir WEIGHTS\ncd WEIGHTS\nwget https://data.broadinstitute.org/alkesgroup/FUSION/WGT/GTEx.Whole_Blood.tar.bz2\ntar xjf GTEx.Whole_Blood.tar.bz2\n
WEIGHTS
files in each WEIGHTS folder
RDat weight files for each gene in a tissue type
GTEx.Whole_Blood.ENSG00000002549.8.LAP3.wgt.RDat GTEx.Whole_Blood.ENSG00000166394.10.CYB5R2.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002822.11.MAD1L1.wgt.RDat GTEx.Whole_Blood.ENSG00000166435.11.XRRA1.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002919.10.SNX11.wgt.RDat GTEx.Whole_Blood.ENSG00000166436.11.TRIM66.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002933.3.TMEM176A.wgt.RDat GTEx.Whole_Blood.ENSG00000166444.13.ST5.wgt.RDat\nGTEx.Whole_Blood.ENSG00000003137.4.CYP26B1.wgt.RDat GTEx.Whole_Blood.ENSG00000166471.6.TMEM41B.wgt.RDat\n...\n
Expression imputation
Rscript FUSION.assoc_test.R \\\n--sumstats PGC2.SCZ.sumstats \\\n--weights ./WEIGHTS/GTEx.Whole_Blood.pos \\\n--weights_dir ./WEIGHTS/ \\\n--ref_ld_chr ./LDREF/1000G.EUR. \\\n--chr 22 \\\n--out PGC2.SCZ.22.dat\n
Results
head PGC2.SCZ.22.dat\nPANEL FILE ID CHR P0 P1 HSQ BEST.GWAS.ID BEST.GWAS.Z EQTL.ID EQTL.R2 EQTL.Z EQTL.GWAS.Z NSNP NWGT MODEL MODELCV.R2 MODELCV.PV TWAS.Z TWAS.P\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000273311.1.DGCR11.wgt.RDat DGCR11 22 19033675 19035888 0.0551 rs2238767 -2.98 rs2283641 0.013728 4.33 2.5818 408 1 top1 0.014 0.018 2.5818 9.83e-03\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000100075.5.SLC25A1.wgt.RDat SLC25A1 22 19163095 19166343 0.0740 rs2238767 -2.98 rs762523 0.080367 5.36 -1.8211 406 1 top1 0.08 7.2e-08 -1.8216.86e-02\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000070371.11.CLTCL1.wgt.RDat CLTCL1 22 19166986 19279239 0.1620 rs4819843 3.04 rs809901 0.072193 5.53 -1.9928 456 19 enet 0.085 2.8e-08 -1.8806.00e-02\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000232926.1.AC000078.5.wgt.RDat AC000078.5 22 19874812 19875493 0.2226 rs5748555 -3.15 rs13057784 0.052796 5.60 -0.1652 514 44 enet 0.099 2e-09 0.0524 9.58e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000185252.13.ZNF74.wgt.RDat ZNF74 22 20748405 20762745 0.1120 rs595272 4.09 rs1005640 0.001422 3.44 -1.3677 301 8 enet 0.008 0.054 -0.8550 3.93e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000099940.7.SNAP29.wgt.RDat SNAP29 22 21213771 21245506 0.1286 rs595272 4.09 rs4820575 0.061763 5.94 -1.1978 416 27 enet 0.079 9.4e-08 -1.0354 3.00e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000272600.1.AC007308.7.wgt.RDat AC007308.7 22 21243494 21245502 0.2076 rs595272 4.09 rs165783 0.100625 6.79 -0.8871 408 12 lasso 0.16 5.4e-1-1.2049 2.28e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000183773.11.AIFM3.wgt.RDat AIFM3 22 21319396 21335649 0.0676 rs595272 4.09 rs565979 0.036672 4.50 -0.4474 362 1 top1 0.037 0.00024 -0.4474 6.55e-01\nNA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000230513.1.THAP7-AS1.wgt.RDat THAP7-AS1 22 21356175 21357118 0.2382 rs595272 4.09 rs2239961 0.105307 -7.04 -0.3783 347 5 lasso 0.15 7.6e-1 0.2292 8.19e-01\n
Descriptions of the output (cited from http://gusevlab.org/projects/fusion/ )
Colume number Column header Value Usage 1 FILE \u2026 Full path to the reference weight file used 2 ID FAM109B Feature/gene identifier, taken from --weights file 3 CHR 22 Chromosome 4 P0 42470255 Gene start (from --weights) 5 P1 42475445 Gene end (from --weights) 6 HSQ 0.0447 Heritability of the gene 7 BEST.GWAS.ID rs1023500 rsID of the most significant GWAS SNP in locus 8 BEST.GWAS.Z -5.94 Z-score of the most significant GWAS SNP in locus 9 EQTL.ID rs5758566 rsID of the best eQTL in the locus 10 EQTL.R2 0.058680 cross-validation R2 of the best eQTL in the locus 11 EQTL.Z -5.16 Z-score of the best eQTL in the locus 12 EQTL.GWAS.Z -5.0835 GWAS Z-score for this eQTL 13 NSNP 327 Number of SNPs in the locus 14 MODEL lasso Best performing model 15 MODELCV.R2 0.058870 cross-validation R2 of the best performing model 16 MODELCV.PV 3.94e-06 cross-validation P-value of the best performing model 17 TWAS.Z 5.1100 TWAS Z-score (our primary statistic of interest) 18 TWAS.P 3.22e-07 TWAS P-value"},{"location":"21_twas/#limitations","title":"Limitations","text":"Significant loci identified in TWAS also contain multiple tarit-associated genes. GWAS often identifies multiple variants in LD. Similarly, TWAS frequently identifies multiple genes in a locus.
Co-regulation may cause false positive results. Just like SNPs are correlated due to LD, gene expressions are often correlated due to co-regulation.
Sometimes even when co-regulation is not captured, the shared variants (or variants in strong LD) in different expression prediction models may cause false positive results.
Predicted expression account for only a limited portion of total gene expression. Total expression is affected not only by genetic components like cis-eQTL but also by other factors like environmental and technical components.
Other factors. For example, the window size for selecting variants may affect association results.
TWAS aims to test the relationship of the phenotype with the genetic component of the gene expression. But under current framework, TWAS only test the relationship of the phenotype with the predicted gene expression without accounting for the uncertainty in that prediction. The key point here is that the current framework omits the fact that the gene expression data is also the result of a sampling process from the analysis.
\"Consequently, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level.\"
Quote
de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08.
"},{"location":"21_twas/#reference","title":"Reference","text":"Overview of REGENIE
Reference: https://rgcgithub.github.io/regenie/overview/
"},{"location":"32_whole_genome_regression/#whole-genome-model","title":"Whole genome model","text":""},{"location":"32_whole_genome_regression/#stacked-regressions","title":"Stacked regressions","text":""},{"location":"32_whole_genome_regression/#firth-correction","title":"Firth correction","text":""},{"location":"32_whole_genome_regression/#tutorial","title":"Tutorial","text":""},{"location":"32_whole_genome_regression/#installation","title":"Installation","text":"Please check here
"},{"location":"32_whole_genome_regression/#step1","title":"Step1","text":"Sample codes for running step 1
plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\n# revise the header of covariate file\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n --step 1 \\\n --bed ${plinkFile} \\\n --extract ${extract} \\\n --phenoFile ${phenoFile} \\\n --covarFile ${covarFile} \\\n --covarColList ${covarList} \\\n --bt \\\n --bsize 1000 \\\n --lowmem \\\n --lowmem-prefix tmpdir/regenie_tmp_preds \\\n --out 1kg_eas_step1_BT\n
"},{"location":"32_whole_genome_regression/#step2","title":"Step2","text":"Sample codes for running step 2
plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n --step 2 \\\n --bed ${plinkFile} \\\n --ref-first \\\n --phenoFile ${phenoFile} \\\n --covarFile ${covarFile} \\\n --covarColList ${covarList} \\\n --bt \\\n --bsize 400 \\\n --firth --approx --pThresh 0.01 \\\n --pred 1kg_eas_step1_BT_pred.list \\\n --out 1kg_eas_step1_BT\n
"},{"location":"32_whole_genome_regression/#visualization","title":"Visualization","text":""},{"location":"32_whole_genome_regression/#reference","title":"Reference","text":"Risk: the probability that a subject within a population will develop a given disease, or other health outcome, over a specified follow-up period.
\\[ R = {{E}\\over{E + N}} \\]Odds: the likelihood of a new event occurring rather than not occurring. It is the probability that an event will occur divided by the probability that the event will not occur.
\\[ Odds = {E \\over N } \\]"},{"location":"55_measure_of_effect/#hazard","title":"Hazard","text":"Hazard function \\(h(t)\\): the event rate at time \\(t\\) conditional on survival until time \\(t\\) (namely, \\(T\u2265t\\))
\\[ h(t) = Pr(t<=T<t_{+1} | T>=t ) \\]T\u00a0is a discrete random variable indicating the time of occurrence of the event.
"},{"location":"55_measure_of_effect/#relative-risk-rr-and-odds-ratio-or","title":"Relative risk (RR) and Odds ratio (OR)","text":""},{"location":"55_measure_of_effect/#22-contingency-table","title":"2\u00d72 Contingency Table","text":"Intervention I Control C Events E IE CE Non-events N IN CN"},{"location":"55_measure_of_effect/#relative-risk-rr","title":"Relative risk (RR)","text":"RR: relative risk (risk ratio), usually used in cohort studies.
\\[ RR = {{R_{Intervention}}\\over{R_{ conrol}}}={{IE/(IE+IN)}\\over{CE/(CE+CN)}} \\]"},{"location":"55_measure_of_effect/#odds-ratio-or","title":"Odds ratio (OR)","text":"OR: usually used in case control studies.
\\[ OR = {{Odds_{Intervention}}\\over{Odds_{ conrol}}}={{IE/IN}\\over{CE/CN}} = {{IE * CN}\\over{CE * IN}} \\]When the event occurs in less than 10% of the unexposed population, the OR provides a reasonable approximation of the RR.
"},{"location":"55_measure_of_effect/#hazard-ratios-hr","title":"Hazard ratios (HR)","text":"Hazard ratios (relative hazard) are usually estimated from Cox proportional hazards model:
\\[ h_i(t) = h_0(t) \\times e^{\\beta_0 + \\beta_1X_{i1} + ... + \\beta_nX_{in} } = h_0(t) \\times e^{X_i\\beta } \\]HR: the ratio of the hazard rates corresponding to the conditions characterised by two distinct levels of a treatment variable of interest.
\\[ HR = {{h(t | X_i)}\\over{h(t|X_j)}} = {{h_0(t) \\times e^{X_i\\beta }}\\over{h_0(t) \\times e^{X_j\\beta }}} = e^{(X_i-X_j)\\beta} \\]"},{"location":"60_awk/","title":"AWK","text":""},{"location":"60_awk/#awk-introduction","title":"AWK Introduction","text":"'awk' is one of the most powerful text processing tools for tabular text files.
"},{"location":"60_awk/#awk-syntax","title":"AWK syntax","text":"awk OPTION 'CONDITION {PROCESS}' FILENAME\n
Some special variables in awk:
$0
: all columns$n
: column n. For example, $1 means the first column. $4 means column 4.NR
: Row number.Using the sample sumstats, we will demonstrate some simple but useful one-liners.
# sample sumstats\nhead ../02_Linux_basics/sumstats.txt \n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"60_awk/#example-1","title":"Example 1","text":"Select variants on chromosome 2 (keeping the headers)
awk 'NR==1 || $1==2 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n2 22398 2:22398:C:T C T T ADD 503 1.287540.161017 1.56962 0.116503 .\n2 24839 2:24839:C:T C T T ADD 503 1.318170.179754 1.53679 0.124344 .\n2 26844 2:26844:C:T C T T ADD 503 1.3173 0.161302 1.70851 0.0875413 .\n2 28786 2:28786:T:C T C C ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 30091 2:30091:C:G C G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 30762 2:30762:A:G A G A ADD 503 1.099560.158614 0.598369 0.549594 .\n2 34503 2:34503:G:T G T T ADD 503 1.323720.179789 1.55988 0.118789 .\n2 39340 2:39340:A:G A G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .\n2 55237 2:55237:T:C T C C ADD 503 1.314860.161988 1.68983 0.0910614 .\n
The NR
here means row number. The condition here NR==1 || $1==2
means if it is the first row or the first column is equal to 2, conduct the process print $0
, which mean print all columns.
Select all genome-wide significant variants (p<5e-8)
awk 'NR==1 || $13 <5e-8 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"60_awk/#example-3","title":"Example 3","text":"Create a bed-like format for annotation
awk 'NR>1 {print $1,$2,$2,$4,$5}' ../02_Linux_basics/sumstats.txt | head\n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
"},{"location":"60_awk/#awk-workflow","title":"AWK workflow","text":"The workflow of awk can be summarized in the following figure:
awk workflow
"},{"location":"60_awk/#awk-variables","title":"AWK variables","text":"Frequently used awk variables
Variable Desciption NR The number of input records NF The number of input fields FS The input field separator. The default value is\" \"
OFS The output field separator. The default value is \" \"
RS The input record separator. The default value is \"\\n\"
ORS The output record separator.The default value is \"\\n\"
FILENAME The name of the current input file. FNR The current record number in the current file Handle csv and tsv files
head ../03_Data_formats/sample_data.csv\n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
awk -v FS=',' -v OFS=\"\\t\" '{print $1,$2}' sample_data.csv\n#CHROM POS\n1 13273\n1 14599\n1 14604\n1 14930\n1 69897\n1 86331\n1 91581\n1 122872\n1 135163\n
convert csv to tsv
awk 'BEGIN { FS=\",\"; OFS=\"\\t\" } {$1=$1; print}' sample_data.csv\n
Skip and replace headers
awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"CHR\\tPOS\"} NR>1 {print $1,$2}' sample_data.csv\n\nCHR POS\n1 13273\n1 14599\n1 14604\n1 14930\n1 69897\n1 86331\n1 91581\n1 122872\n1 135163\n
Extract a line
awk 'NR==4' sample_data.csv\n\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n
Print the last two columns
awk -v FS=',' '{print $(NF-1),$(NF)}' sample_data.csv\nP ERRCODE\n0.305961 .\n0.0104299 .\n0.0104299 .\n0.0269602 .\n0.0188466 .\n0.102694 .\n0.522847 .\n0.703856 .\n0.155079 .\n
"},{"location":"60_awk/#awk-operators","title":"AWK operators","text":"Arithmetic Operators
Arithmetic Operators Desciption+
add -
subtract *
multiply /
divide %
modulus division **
x**y : x raised to the y-th power Logical Operators
Logical Operators Desciption\\|\\|
or &&
and !
not"},{"location":"60_awk/#awk-functions","title":"AWK functions","text":"Numeric functions in awk
Convert OR and P to BETA and -log10(P)
awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"SNPID\\tBETA\\tMLOG10P\"}NR>1{print $3,log($10),-log($13)/log(10)}' sample_data.csv\nSNPID BETA MLOG10P\n1:13273:G:C -0.287458 0.514334\n1:14599:T:A 0.593172 1.98172\n1:14604:A:G 0.593172 1.98172\n1:14930:A:G 0.531446 1.56928\n1:69897:T:C 0.457438 1.72477\n1:86331:A:G 0.385303 0.988455\n1:91581:G:A -0.0785866 0.281625\n1:122872:T:G 0.0687142 0.152516\n1:135163:C:T -0.339927 0.809447\n
String manipulating functions in awk
$ awk --help\nUsage: awk [POSIX or GNU style options] -f progfile [--] file ...\nUsage: awk [POSIX or GNU style options] [--] 'program' file ...\nPOSIX options: GNU long options: (standard)\n -f progfile --file=progfile\n -F fs --field-separator=fs\n -v var=val --assign=var=val\nShort options: GNU long options: (extensions)\n -b --characters-as-bytes\n -c --traditional\n -C --copyright\n -d[file] --dump-variables[=file]\n -D[file] --debug[=file]\n -e 'program-text' --source='program-text'\n -E file --exec=file\n -g --gen-pot\n -h --help\n -i includefile --include=includefile\n -l library --load=library\n -L[fatal|invalid] --lint[=fatal|invalid]\n -M --bignum\n -N --use-lc-numeric\n -n --non-decimal-data\n -o[file] --pretty-print[=file]\n -O --optimize\n -p[file] --profile[=file]\n -P --posix\n -r --re-interval\n -S --sandbox\n -t --lint-old\n -V --version\n\nTo report bugs, see node `Bugs' in `gawk.info', which is\nsection `Reporting Problems and Bugs' in the printed version.\n\ngawk is a pattern scanning and processing language.\nBy default it reads standard input and writes standard output.\n\nExamples:\n gawk '{ sum += $1 }; END { print sum }' file\n gawk -F: '{ print $1 }' /etc/passwd\n
"},{"location":"60_awk/#reference","title":"Reference","text":"sed
is also one of the most commonly used test-editing command in Linux, which is short for stream editor. sed
command edits the text from standard input in a line-by-line approach.
sed [OPTIONS] PROCESS [FILENAME]\n
"},{"location":"61_sed/#examples","title":"Examples","text":""},{"location":"61_sed/#sample-input","title":"sample input","text":"head ../02_Linux_basics/sumstats.txt\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"61_sed/#example-1-replacing-strings","title":"Example 1: Replacing strings","text":"s
for substitute g
for global
Replacing strings
\"Replace the separator from :
to _
\"
head 02_Linux_basics/sumstats.txt | sed 's/:/_/g'\n#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE\n1 13273 1_13273_G_C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1_14599_T_A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1_14604_A_G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1_14930_A_G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1_69897_T_C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1_86331_A_G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1_91581_G_A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1_122872_T_G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1_135163_C_T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"61_sed/#example-2-delete-headerthe-first-line","title":"Example 2: Delete header(the first line)","text":"-d
for deletion
Delete header(the first line)
head 02_Linux_basics/sumstats.txt | sed '1d'\n1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .\n1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .\n1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .\n1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .\n1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .\n1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .\n1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .\n1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .\n
"},{"location":"69_resources/","title":"Resources","text":""},{"location":"69_resources/#sandbox","title":"Sandbox","text":"Sandbox provides tutorials for you to learn how to use bioinformatics tools right from your browser. Everything runs in a sandbox, so you can experiment all you want.
explainshell is a tool (with a web interface) capable of parsing man pages, extracting options and explain a given command-line by matching each argument to the relevant help text in the man page.
R can be downloaded from its official website CRAN (The Comprehensive R Archive Network).
CRAN
https://cran.r-project.org/
"},{"location":"75_R_basics/#install-r-using-conda","title":"Install R using conda","text":"It is convenient to use conda to manage your R environment.
conda install -c conda-forge r-base=4.x.x\n
"},{"location":"75_R_basics/#ide-for-r-positrstudio","title":"IDE for R: Posit(Rstudio)","text":"Posit(Rstudio) is one of the most commonly used Integrated development environment(IDE) for R.
https://posit.co/
"},{"location":"75_R_basics/#use-r-in-interactive-mode","title":"Use R in interactive mode","text":"R\n
"},{"location":"75_R_basics/#run-r-script","title":"Run R script","text":"Rscript mycode.R\n
"},{"location":"75_R_basics/#installing-and-using-r-packages","title":"Installing and Using R packages","text":"install.packages(\"package_name\")\n\nlibrary(package_name)\n
"},{"location":"75_R_basics/#basic-syntax","title":"Basic syntax","text":""},{"location":"75_R_basics/#assignment-and-evaluation","title":"Assignment and Evaluation","text":"> x <- 1\n\n> x\n[1] 1\n\n> print(x)\n[1] 1\n
"},{"location":"75_R_basics/#data-types","title":"Data types","text":""},{"location":"75_R_basics/#atomic-data-types","title":"Atomic data types","text":"logical, integer, real, complex, string (or character)
Atomic data types Description Examples logical booleanTRUE
, FALSE
integer integer 1
,2
numeric float number 0.01
complex complex number 1+0i
string string or chracter abc
"},{"location":"75_R_basics/#vectors","title":"Vectors","text":"myvector <- c(1,2,3)\nmyvector < 1:3\n\nmyvector <- c(TRUE,FALSE)\nmyvector <- c(0.01, 0.02)\nmyvector <- c(1+0i, 2+3i)\nmyvector <- c(\"a\",\"bc\")\n
"},{"location":"75_R_basics/#matrices","title":"Matrices","text":"> mymatrix <- matrix(1:6, nrow = 2, ncol = 3)\n> mymatrix\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n\n> ncol(mymatrix)\n[1] 3\n> nrow(mymatrix)\n[1] 2\n> dim(mymatrix)\n[1] 2 3\n> length(mymatrix)\n[1] 6\n
"},{"location":"75_R_basics/#list","title":"List","text":"list()
is a special vector-like data type that can contain different data types.
> mylist <- list(1, 0.02, \"a\", FALSE, c(1,2,3), matrix(1:6,nrow=2,ncol=3))\n> mylist\n[[1]]\n[1] 1\n\n[[2]]\n[1] 0.02\n\n[[3]]\n[1] \"a\"\n\n[[4]]\n[1] FALSE\n\n[[5]]\n[1] 1 2 3\n\n[[6]]\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n
"},{"location":"75_R_basics/#dataframe","title":"Dataframe","text":"> df <- data.frame(score = c(90,80,70,60), rank = c(\"a\", \"b\", \"c\", \"d\"))\n> df\n score rank\n1 90 a\n2 80 b\n3 70 c\n4 60 d\n
"},{"location":"75_R_basics/#subsetting","title":"Subsetting","text":"myvector\n[1] 1 2 3\n> myvector[0]\ninteger(0)\n> myvector[1]\n[1] 1\nmyvector[1:2]\n[1] 1 2\n> myvector[-1]\n[1] 2 3\n> myvector[-1:-2]\n[1] 3\n
> mymatrix\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n> mymatrix[0]\ninteger(0)\n> mymatrix[1]\n[1] 1\n> mymatrix[1,]\n[1] 1 3 5\n> mymatrix[1,2]\n[1] 3\n> mymatrix[1:2,2]\n[1] 3 4\n> mymatrix[,2]\n[1] 3 4\n
> df\n score rank\n1 90 a\n2 80 b\n3 70 c\n4 60 d\n> df[score]\nError in `[.data.frame`(df, score) : object 'score' not found\n> df[[score]]\nError in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :\n object 'score' not found\n> df[[\"score\"]]\n[1] 90 80 70 60\n> df[\"score\"]\n score\n1 90\n2 80\n3 70\n4 60\n> df[1, \"score\"]\n[1] 90\n> df[1:2, \"score\"]\n[1] 90 80\n> df[1:2,2]\n[1] \"a\" \"b\"\n> df[1:2,1]\n[1] 90 80\n> df[,c(\"rank\",\"score\")]\n rank score\n1 a 90\n2 b 80\n3 c 70\n4 d 60\n
"},{"location":"75_R_basics/#data-input-and-output","title":"Data Input and Output","text":"mydata <- read.table(\"data.txt\", header=T)\n\nwrite.table(mydata, \"data.txt\")\n
"},{"location":"75_R_basics/#control-flow","title":"Control flow","text":""},{"location":"75_R_basics/#if","title":"if","text":"if (x > y){\n print (\"x\")\n} else if (x < y){\n print (\"y\")\n} else {\n print(\"tie\")\n}\n
"},{"location":"75_R_basics/#for","title":"for","text":"> for (x in 1:5) {\n print(x)\n}\n\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n
"},{"location":"75_R_basics/#while","title":"while","text":"x<-0\nwhile (x<5)\n{\n x<-x+1\n print(\"Hello world\")\n}\n\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n
"},{"location":"75_R_basics/#functions","title":"Functions","text":"myfunction <- function(x){\n // actual code here\n return(result)\n}\n\n> my_add_function <- function(x,y){\n c = x + y\n return(c)\n}\n> my_add_function(1,3)\n[1] 4\n
"},{"location":"75_R_basics/#statistical-functions","title":"Statistical functions","text":""},{"location":"75_R_basics/#normal-distribution","title":"Normal distribution","text":"Function Description dnorm(x, mean = 0, sd = 1, log = FALSE) probability density function pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) cumulative density function qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) quantile function rnorm(n, mean = 0, sd = 1) generate random values from normal distribution > dnorm(1.96)\n[1] 0.05844094\n\n> pnorm(1.96)\n[1] 0.9750021\n\n> pnorm(1.96, lower.tail=FALSE)\n[1] 0.0249979\n\n> qnorm(0.975)\n[1] 1.959964\n\n> rnorm(10)\n [1] -0.05595019 0.83176199 0.58362601 -0.89434812 0.85722843 0.96199308\n [7] 0.47782706 -0.46322066 0.03525421 -1.00715141\n
"},{"location":"75_R_basics/#chi-square-distribution","title":"Chi-square distribution","text":"Function Description dchisq(x, df, ncp = 0, log = FALSE) probability density function pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) cumulative density function qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) quantile function rchisq(n, df, ncp = 0) generate random values from normal distribution"},{"location":"75_R_basics/#regression","title":"Regression","text":"lm(formula, data, subset, weights, na.action,\n method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n singular.ok = TRUE, contrasts = NULL, offset, \u2026)\n\n# linear regression\nresults <- lm(formula = y ~ x1 + x2)\n\n# logistic regression\nresults <- lm(formula = y ~ x1 + x2, family = \"binomial\")\n
Reference: - https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
"},{"location":"76_R_resources/","title":"R Resources","text":"Conda is an open-source package and environment management system.
It is a very handy tool when you need to manage python packages.
"},{"location":"80_anaconda/#download","title":"Download","text":"https://www.anaconda.com/products/distribution
For example, download the latest linux version:
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh\n
"},{"location":"80_anaconda/#install","title":"Install","text":"# give it permission to execute\nchmod +x Anaconda3-2021.11-Linux-x86_64.sh \n\n# install\nbash ./Anaconda3-2021.11-Linux-x86_64.sh\n
Follow the instructions on : https://docs.anaconda.com/anaconda/install/linux/
If everything goes well, then you can see the (base)
before the prompt, which indicate the base environment:
(base) [heyunye@gc019 ~]$\n
For how to use conda, please check : https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html
Examples:
# install a specific version of python package\nconda install pandas==1.5.2\n\n#create a new python 3.9 virtual environment with the name \"mypython39\"\nconda create -n mypython39 python=3.9\n\n#use environment.yml to create a virtual environment\nconda env create --file environment.yml\n\n# activate a virtual environment called ldsc\nconda activate ldsc\n\n# change back to base environment\nconda deactivate\n\n# list all packages in your current environment \nconda list\n\n# list all your current environments \nconda env list\n
"},{"location":"81_jupyter_notebook/","title":"Jupyter notebook","text":"Usyally, the conda will install the jupyter notebook (and the ipykernel) by default.
If not, using conda to install it:
conda install jupyter\n
"},{"location":"81_jupyter_notebook/#using-jupyter-notebook-on-a-local-or-remote-server","title":"Using Jupyter notebook on a local or remote server","text":""},{"location":"81_jupyter_notebook/#using-the-default-configuration","title":"Using the default configuration","text":""},{"location":"81_jupyter_notebook/#local-machine","title":"Local machine","text":"You could open it in the Anaconda interface or some other IDE.
If using the terminal, just type:
jupyter-lab --port 9000 & \n
Then open the link in the browser.
http://localhost:9000/lab?token=???\nhttp://127.0.0.1:9000/lab?token=???\n
"},{"location":"81_jupyter_notebook/#remote-server","title":"Remote server","text":"Start in the command line of the remote server, adding a port.
jupyter-lab --ip 0.0.0.0 --port 9000 --no-browser &\n
It will generate an address the same as above. Then, on the local machine, using ssh to listen to the port.
ssh -NfL localhost:9000:localhost:9000 user@host\n
Note that the localhost:9000:localhost:9000
is local_machine:local_port:remote_machine:remote_port
and user@host
is the user ID and address of the remote server. When this is done, open the link above in the browser.
"},{"location":"81_jupyter_notebook/#using-customized-configuration","title":"Using customized configuration","text":"Steps:
Create a jupyter notebook configuration file if there is no such file
jupyter notebook --generate-config\n
The file is usually stored at:
~/.jupyter/jupyter_notebook_config.py\n
What the first few lines of Configuration file look like:
head ~/.jupyter/jupyter_notebook_config.py\n# Configuration file for jupyter-notebook.\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
"},{"location":"81_jupyter_notebook/#add-the-port-information","title":"Add the port information","text":"Simply add c.NotebookApp.port =8889
to the configuration file and then save. Note: you can change the port you want to use.
# Configuration file for jupyter-notebook.\n\nc.NotebookApp.port = 8889\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
"},{"location":"81_jupyter_notebook/#run-jupyter-notebook-server-on-remote-host","title":"Run jupyter notebook server on remote host","text":"On host side, set up the jupyter notebook server:
jupyter notebook\n
"},{"location":"81_jupyter_notebook/#use-ssh-tunnel-to-connect-to-the-remote-server-from-your-local-machine","title":"Use ssh tunnel to connect to the remote server from your local machine","text":"On your local machine, use ssh tunnel to connect to the jupyter notebook server:
ssh -N -f -L localhost:8889:localhost:8889 username@your_remote_host_name\n
"},{"location":"81_jupyter_notebook/#use-jupyter-notebook-in-your-browser","title":"Use jupyter notebook in your browser","text":"Then you can access juptyer notebook on your local browser using the link generated by jupyter notebook server. http://127.0.0.1:8889/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In this section, we will briefly demostrate how to install a linux subsystem on windows.
"},{"location":"82_windows_linux_subsystem/#official-documents","title":"Official Documents","text":"\"You must be running Windows 10 version 2004 and higher (Build 19041 and higher) or Windows 11.\"
"},{"location":"82_windows_linux_subsystem/#steps","title":"Steps","text":"Step 3 : Reboot
Step 4 : Run the subsystem
Git is very powerful version control software. Git can track the changes in all the files of your projects and allow collarboration of multiple contributors.
For details, please check: https://git-scm.com/
"},{"location":"83_git_and_github/#github","title":"Github","text":"Github is an online platform, offering a cloud-based Git repository.
https://github.com/
"},{"location":"83_git_and_github/#create-a-new-id","title":"Create a new id","text":"Github signup page:
https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home
"},{"location":"83_git_and_github/#clone-a-repository","title":"Clone a repository","text":"Syntax: git colne <the url you just copied>
Example: git clone https://github.com/Cloufield/GWASTutorial.git
git pull
$ git config --global user.name \"myusername\"\n$ git config --global user.email myusername@myemail.com\n
"},{"location":"83_git_and_github/#create-access-tokens","title":"Create access tokens","text":"Please see github official documents on how to create a personal token:
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
Useful Resources
SSH stands for Secure Shell Protocol, which enables you to connect to remote server safely.
"},{"location":"84_ssh/#login-to-remote-server","title":"Login to remote server","text":"ssh <username>@<host>\n
Before you log in, you need to generate keys for the ssh connection:
"},{"location":"84_ssh/#keys","title":"Keys","text":"ssh-keygen -t rsa -b 4096\n
You will get two keys, a public one and a private one. ~/.ssh/id_rsa.pub
~/.ssh/id_rsa
Warning
Don't share your private key with others.
What you need to do is just add your local public key to ~/.ssh/authorized_keys
on the host server.
Suppose you are using a local machine:
Download files from the remote host to the local machine
scp <username>@<host>:remote_path local_path\n
Upload files from the local machine to the remote host
scp local_path <username>@<host>:remote_path\n
Info
-r
: copy recursively. This option is needed when you want to transfer an entire directory.
Example
Copy the local work directory to the remote home directory
$ scp -r /home/gwaslab/work gwaslab@remote.com:/home/gwaslab \n
"},{"location":"84_ssh/#ssh-tunneling","title":"SSH Tunneling","text":"Quote
In this forwarding type, the SSH client listens on a given port and tunnels any connection to that port to the specified port on the remote SSH server, which then connects to a port on the destination machine. The destination machine can be the remote SSH server or any other machine. https://linuxize.com/post/how-to-setup-ssh-tunneling/
-L
: Local port forwarding
ssh -L [local_IP:]local_PORT:destination:destination_PORT <username>@<host>\n
"},{"location":"84_ssh/#other-ssh-options","title":"other SSH options","text":"-f
: send to background.-p
: port for connection (default: 22).-N
: not to execute any commands on the remote host. (so you will not open a remote shell but just forward ports.)(If needed) Try to use job scheduling system to run a simple script:
Two of the most commonly used job scheduling systems:
In this self-learning module, we would like you to put your hands on the 1000 Genome Project data and apply the skills you have learned to this mini-project.
Aim
Aim:
Here is a brief overview of this mini project.
The ultimate goal of this assignment is simple, which is to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.
Tip
Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.
"},{"location":"95_Assignment/#1-download-the-publicly-available-1000-genome-vcf","title":"1. Download the publicly available 1000 Genome VCF","text":"Download the files we need from 1000 Genomes Project FTP site:
Tip
Note
If it takes too long or if you are using your local laptop, you can just download the files for chr1.
Sample shell script for downloading the files
#!/bin/bash\nfor chr in $(seq 1 22) #Note: If it takes too long, you can download just chr1.\ndo\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi\ndone\n\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai\n\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed\n
"},{"location":"95_Assignment/#2-re-align-normalize-and-remove-duplication","title":"2. Re-align, normalize and remove duplication","text":"We need to use bcftools to process the raw vcf files.
Install bcftools
http://www.htslib.org/download/
Since the variants are not normalized and also have many duplications, we need to clean the vcf files.
Re-align with the reference genome, normalize variants and remove duplications
#!/bin/bash\nfor chr in $(seq 1 22)\ndo\n bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \\\n ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \\\n bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \\\n bcftools norm -Ob --rm-dup both \\\n > ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \n bcftools index ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf\ndone\n
"},{"location":"95_Assignment/#3-convert-vcf-files-to-plink-binary-format","title":"3. Convert VCF files to plink binary format","text":"Example
#!/bin/bash\nfor chr in $(seq 1 22)\ndo\nplink \\\n --bcf ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \\\n --keep-allele-order \\\n --vcf-idspace-to _ \\\n --const-fid \\\n --allow-extra-chr 0 \\\n --split-x b37 no-fail \\\n --make-bed \\\n --out ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes\ndone\n
"},{"location":"95_Assignment/#4-using-snps-only-in-strict-masks","title":"4. Using SNPs only in strict masks","text":"Strict masks are in this directory.
Strict mask
The overlapped region with this mask is \u201ccallable\u201d (or credible variant calls). This mask was developed in the 1KG main paper and it is well explained in https://www.biostars.org/p/219634/
Tip
Use plink --make-set
option with the BED
files to extract SNPs in the strict mask.
Tip
Use PLINK.
QC: only SNPs (exclude indels), MAF>0.1
Pruning: plink --indep-pariwise
Tip
plink --pca
Draw PC1 - PC2 plot and color each individual by ancestry information (from ALL.panel file). Interpret the result.
Tip
You can use R, python, or any other tools you like (even Excel can do the job.)
(If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)
"},{"location":"95_Assignment/#checklist","title":"Checklist","text":"Note
(Just an example, there is no need to strictly follow this.)
Fundamental Exercise II
","text":"This tutorial is provided by the Laboratory of Complex Trait Genomics (Kamatani Lab) in the Deparment of Computational Biology and Medical Sciences at the Univerty of Tokyo. This tutorial is designed for the graduate course Fundamental Exercise II
.
This repository is currently maintained by Yunye He.
If you have any questions or suggestions, please feel free to contact gwaslab@gmail.com.
Enjoy this real \"Manhattan plot\"!
"},{"location":"Imputation/","title":"Imputation","text":"The missing data imputation is not a task specific to genetic studies. By comparing the genotyping array (generally 500k\u20131M markers) to the reference panel (WGSed), missing markers on the array are filled. The tabular data imputation methods could be used to impute the genotype data. However, haplotypes are coalesced from the ancestors, and the recombination events during gametogenesis, each individual's haplotype is a mosaic of all haplotypes in a population. Given these properties, hidden Markov model (HMM) based methods usually outperform tabular data-based ones.
This HMM was first described in Li & Stephens 2003. Here we will not go through tools over the past 20 years. We will introduce the concept and the usage of Minimac.
"},{"location":"Imputation/#figure-illustration","title":"Figure illustration","text":"In the figure, each row in the above panel represents a reference haplotype. The middle panel shows the genotyping array. Genotyped markers are squared and WGS-only markers are circled. The two colors represent the ref and alt alleles. You could also think they represent different haplotype fragments. The red triangles indicate the recombination hot spots, which a crossover between the reference haplotypes is more likely to happen.
Given the genotyped marker, matching probabilities are calculated for all potential paths through reference haplotypes. Then, in this example (the real case is not this simple), we assumed at each recombination hotspot, there is a free recombination. You will see that all paths chained by dark blue match 2 of the 4 genotyped markers. So these paths have equal probability.
Finally, missing markers are filled with the probability-weighted alleles on each path. For the left three circles, two paths are cyan and one path is orange, the imputation result will be 1/3 orange and 2/3 cyan.
"},{"location":"Imputation/#how-to-do-imputation","title":"How to do imputation","text":"The simplest way is to use the Michigan or TOPMed imputation server, if you don't have resources of WGS data. Just make your vcf, submit it to the server, and select the favored reference panel. There are built-in phasing, liftover, and QC on the server, but we would strongly suggest checking the data and doing these steps by yourself. For example:
Another way is to run the job locally. Recent tools are memory and computation efficient, you may run it in a small in-house server or even PC.
A typical workflow of Minimac is:
Parameter estimation (this step will create a m3vcf reference panel file):
Minimac3 \\\n --refHaps ./phased_reference.vcf.gz \\\n --processReference \\\n --prefix ./phased_reference \\\n --log\n
Imputation:
minimac4 \\\n --refHaps ./phased_reference.m3vcf \\\n --haps ./phased_target.vcf.gz \\\n --prefix ./result \\\n --format GT,DS,HDS,GP,SD \\\n --meta \\\n --log \\\n --cpus 10\n
Details of the options.
"},{"location":"Imputation/#after-imputation","title":"After imputation","text":"The output is a vcf file. First, we need to examine the imputation quality. It can be a long long story and I will not explain it in detail. Most of the time, when the following criteria meet,
The standard imputation quality metric, named Rsq
, efficiently discriminates the well-imputed variants at a threshold 0.7 (may loosen it to 0.3 to allow more variants in the GWAS).
Three types of genotypes are widely used in GWAS -- best-guess genotype, allelic dosage, and genotype probability. Using Dosage (DS) keeps the dataset smallest while most association test software only requires this information.
"},{"location":"PRS_evaluation/","title":"Polygenic risk scores evaluation","text":""},{"location":"PRS_evaluation/#regressions-for-evaluation-of-prs","title":"Regressions for evaluation of PRS","text":"\\[Phenotype \\sim PRS_{phenotype} + Covariates\\] \\[logit(P) \\sim PRS_{phenotype} + Covariates\\]Covariates usually include sex, age and top 10 PCs.
"},{"location":"PRS_evaluation/#evaluation","title":"Evaluation","text":""},{"location":"PRS_evaluation/#roc-aic-auc-and-c-index","title":"ROC, AIC, AUC, and C-index","text":"ROC
ROC: receiver operating characteristic curve shows the performance of a classification model at all thresholds.
AUC
AUC: area under the ROC Curve, a common measure for the performance of a classification model.
AIC
Akaike Information Criterion (AIC): a measure for comparison of different statistical models.
\\[AIC = 2k - 2ln(\\hat{L})\\]C-index
C-index: Harrell\u2019s C-index (concordance index), which is a metric to evaluate the predictive performance of models and is commonly used in survival analysis. It is a measure of the probability that the predicted scores \\(M_i\\) and \\(M_j\\) by a model of two randomly selected individuals \\(i\\) and \\(j\\), have the reverse relative order as their true event times \\(T_i, T_j\\).
\\[ C = Pr (M_j > M_i | T_j < T_i) \\]Interpretation: Individuals with higher scores should have higher risks of the disease events
Reference: Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., & Rosati, R. A. (1982). Evaluating the yield of medical tests. Jama, 247(18), 2543-2546.
Reference: Longato, E., Vettoretti, M., & Di Camillo, B. (2020). A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. Journal of Biomedical Informatics, 108, 103496.
Coefficient of determination
\\(R^2\\) : coefficient of determination, which measures the amount of variance explained by the regression model.
In linear regression:
\\[ R^2 = 1 - {{RSS}\\over{TSS}} \\]Pseudo-R2 (Nagelkerke)
In logistic regression,
One of the most commonly used Pseudo-R2 for PRS analysis is Nagelkerke's \\(R^2\\)
\\[R^2_{Nagelkerke} = {{1 - ({{L_0}\\over{L_M}})^{2/n}}\\over{1 - L_0^{2/n}}}\\]R2 on liability scale
\\(R^2\\) on the liability scale for ascertained case-control studies
\\[ R^2_l = {{R_o^2 C}\\over{1 + R_o^2 \\theta C }} \\]\\(\\theta = m {{P-K}\\over{1-K}} ( m{{P-K}\\over{1-K}} - t)\\)
\\(K\\) : population disease prevalence
Reference : Lee, S. H., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2012). A better coefficient of determination for genetic profile analysis. Genetic epidemiology, 36(3), 214-224.
The authors also provided R codes for calculation (removed unrelated codes for simplicity)
# R2 on the liability scale using the transformation\n\nnt = total number of the sample\nncase = number of cases\nncont = number of controls\nthd = the threshold on the normal distribution which truncates the proportion of disease prevalence\nK = population prevalence\nP = proportion of cases in the case-control samples\n\n#threshold\nthd = -qnorm(K,0,1)\n\n#value of standard normal density function at thd\nzv = dnorm(thd) \n\n#mean liability for case\nmv = zv/K \n\n#linear model\nlmv = lm(y\u223cg) \n\n#R20 : R2 on the observed scale\nR2O = var(lmv$fitted.values)/(ncase/nt*ncont/nt)\n\n# calculate correction factors\ntheta = mv*(P-K)/(1-K)*(mv*(P-K)/(1-K)-thd) \ncv = K*(1-K)/zv^2*K*(1-K)/(P*(1-P)) \n\n# convert to R2 on the liability scale\nR2 = R2O*cv/(1+R2O*theta*cv)\n
"},{"location":"PRS_evaluation/#bootstrap-confidence-interval-methods-for-r2","title":"Bootstrap Confidence Interval Methods for R2","text":"Bootstrap is a commonly used resampling method to generate a sampling distribution from the known sample dataset. It repeatedly takes random samples with replacement from the known sample dataset.
Steps:
The percentile bootstrap interval is then defined as the interval between \\(100 \\times \\alpha /2\\) and \\(100 \\times (1 - \\alpha /2)\\) percentiles of the parameters estimated by bootstrapping. We can use this method to estimate the bootstrap interval for \\(R^2\\).
"},{"location":"PRS_evaluation/#reference","title":"Reference","text":"Human genome is diploid. Distribution of variants between homologous chromosomes can affect the interpretation of genotype data, such as allele specific expression, context-informed annotation, loss-of-function compound heterozygous events.
Example
( SHAPEIT5 )
In the above illustration, when LoF variants are on both copies of a gene, the gene is thought knocked out
Trio data and long read sequencing can solve the haplotyping problem. That is not always possible. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are independent, unphased genotype could be constructed by the addition of two haplotypes.
Recent methods had incopoorates long IBD sharing, local haplotypes, etc, to make it tractable for large datasets. You could read the following methods if you are interested.
In most of the cases, phasing is just a pre-step of imputation, and we do not care about how the phasing goes. But there are several considerations, like reference-based or reference-free, large and small sample size, rare variants cutoff. There is no single method that could best fit all cases.
Here I show one example using EAGLE2.
eagle \\\n --vcf=target.vcf.gz \\\n --geneticMapFile=genetic_map_hg19_withX.txt.gz \\\n --chrom=19 \\\n --outPrefix=target.eagle \\\n --numThreads=10\n
"},{"location":"TwoSampleMR/","title":"TwoSampleMR Tutorial","text":"In\u00a0[1]: Copied! library(data.table)\nlibrary(TwoSampleMR)\nlibrary(data.table) library(TwoSampleMR)
TwoSampleMR version 0.5.6 \n[>] New: Option to use non-European LD reference panels for clumping etc\n[>] Some studies temporarily quarantined to verify effect allele\n[>] See news(package='TwoSampleMR') and https://gwas.mrcieu.ac.uk for further details\n\n\nIn\u00a0[2]: Copied!
exp_raw <- fread(\"koges_bmi.txt.gz\")\n\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_raw$phenotype <- \"BMI\"\n\nexp_raw$n <- 72282\n\nexp_dat <- format_data( exp_raw,\n type = \"exposure\",\n snp_col = \"rsids\",\n beta_col = \"beta\",\n se_col = \"sebeta\",\n effect_allele_col = \"alt\",\n other_allele_col = \"ref\",\n eaf_col = \"af\",\n pval_col = \"pval\",\n phenotype_col = \"phenotype\",\n samplesize_col= \"n\"\n)\nclumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")\nexp_raw <- fread(\"koges_bmi.txt.gz\") exp_raw <- subset(exp_raw,exp_raw$pval<5e-8) exp_raw$phenotype <- \"BMI\" exp_raw$n <- 72282 exp_dat <- format_data( exp_raw, type = \"exposure\", snp_col = \"rsids\", beta_col = \"beta\", se_col = \"sebeta\", effect_allele_col = \"alt\", other_allele_col = \"ref\", eaf_col = \"af\", pval_col = \"pval\", phenotype_col = \"phenotype\", samplesize_col= \"n\" ) clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")
Warning message in .fun(piece, ...):\n\u201cDuplicated SNPs present in exposure data for phenotype 'BMI. Just keeping the first instance:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nrs4665740\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nrs7201608\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\u201d\nAPI: public: http://gwas-api.mrcieu.ac.uk/\n\nPlease look at vignettes for options on running this locally if you need to run many instances of this command.\n\nClumping rvi6Om, 2452 variants, using EAS population reference\n\nRemoving 2420 of 2452 variants due to LD with other variants or absence from LD reference panel\n\nIn\u00a0[16]: Copied!
out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\"))\n\nout_raw$phenotype <- \"T2D\"\n\nout_dat <- format_data( out_raw,\n type = \"outcome\",\n snp_col = \"SNPID\",\n beta_col = \"BETA\",\n se_col = \"SE\",\n effect_allele_col = \"Allele2\",\n other_allele_col = \"Allele1\",\n pval_col = \"p.value\",\n phenotype_col = \"phenotype\",\n samplesize_col= \"n\",\n eaf_col=\"AF_Allele2\"\n)\nout_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\", select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\")) out_raw$phenotype <- \"T2D\" out_dat <- format_data( out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", se_col = \"SE\", effect_allele_col = \"Allele2\", other_allele_col = \"Allele1\", pval_col = \"p.value\", phenotype_col = \"phenotype\", samplesize_col= \"n\", eaf_col=\"AF_Allele2\" )
Warning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201ceffect_allele column has some values that are not A/C/T/G or an indel comprising only these characters or D/I. These SNPs will be excluded.\u201d\nWarning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201cThe following SNP(s) are missing required information for the MR tests and will be excluded\n1:1142714:t:<cn0>\n1:4288465:t:<ins:me:alu>\n1:4882232:t:<cn0>\n1:5172414:g:<cn0>\n1:5173809:t:<cn0>\n1:5934301:g:<ins:me:alu>\n1:6814818:a:<ins:me:alu>\n1:7921468:c:<cn2>\n1:8502010:t:<ins:me:alu>\n1:8924066:c:<cn0>\n1:9171841:c:<cn0>\n1:9403667:a:<cn2>\n1:9595360:a:<cn0>\n1:9846036:c:<cn0>\n1:10067190:g:<cn0>\n1:10482499:g:<cn0>\n1:11682873:t:<cn0>\n1:11830220:t:<ins:me:sva>\n1:11988599:c:<cn0>\n1:12475666:t:<ins:me:sva>\n1:12737575:a:<ins:me:alu>\n1:12842004:a:<cn0>\n1:14437074:t:<cn0>\n1:14437868:a:<cn0>\n1:14713511:t:<cn2>\n1:14735732:g:<cn0>\n1:15343948:g:<cn0>\n1:16151682:c:<cn0>\n1:16329336:t:<ins:me:sva>\n1:16358741:g:<cn0>\n1:17676165:a:<cn0>\n1:19486410:c:<ins:me:alu>\n1:19855608:a:<cn2>\n1:20257109:t:<ins:me:alu>\n1:20310746:g:<cn0>\n1:20496899:c:<cn0>\n1:20497183:c:<cn0>\n1:20864015:t:<cn0>\n1:20944751:c:<ins:me:alu>\n1:21346279:a:<cn0>\n1:21492591:c:<ins:me:alu>\n1:21786418:t:<cn0>\n1:22302473:t:<cn0>\n1:22901908:t:<ins:me:alu>\n1:23908383:g:<cn0>\n1:24223580:g:<cn0>\n1:24520350:g:<cn0>\n1:24804603:c:<cn0>\n1:25055152:g:<cn0>\n1:26460095:a:<cn0>\n1:26961278:g:<cn0>\n1:29373390:t:<ins:me:alu>\n1:31090520:t:<ins:me:alu>\n1:31316259:t:<cn0>\n1:31720009:a:<cn0>\n1:32535965:g:<cn0>\n1:32544371:a:<cn0>\n1:33785116:c:<cn0>\n1:35101427:c:<cn0>\n1:35177287:g:<cn0>\n1:35627104:t:<cn0>\n1:36474694:t:<ins:me:alu>\n1:36733282:t:<cn0>\n1:37215810:a:<ins:me:alu>\n1:37816478:a:<cn0>\n1:38132306:t:<cn0>\n1:39084231:a:<cn0>\n1:39677675:t:<ins:me:alu>\n1:40524704:t:<ins:me:alu>\n1:40552356:a:<cn0>\n1:40976681:g:<cn0>\n1:41021684:a:<cn0>\n1:41785500:a:<ins:me:line1>\n1:42390318:c:<ins:me:alu>\n1:43694061:t:<cn0>\n1:44059290:a:<inv>\n1:45021223:t:<cn0>\n1:45708588:a:<cn0>\n1:45822649:t:<cn0>\n1:46333195:a:<ins:me:alu>\n1:46794814:t:<ins:me:alu>\n1:47267517:t:<cn0>\n1:47346571:a:<cn0>\n1:47623401:a:<cn0>\n1:47913001:t:<cn0>\n1:48820285:t:<ins:me:alu>\n1:48972537:g:<ins:me:alu>\n1:49357693:t:<ins:me:alu>\n1:49428756:t:<ins:me:line1>\n1:49861993:g:<ins:me:alu>\n1:50912662:c:<ins:me:alu>\n1:51102445:t:<cn0>\n1:52146313:a:<cn0>\n1:53594175:t:<cn0>\n1:53595112:c:<cn0>\n1:55092043:g:<cn0>\n1:55341923:c:<cn0>\n1:55342224:g:<cn0>\n1:55927718:a:<cn0>\n1:56268665:t:<ins:me:line1>\n1:56405404:t:<ins:me:line1>\n1:56879062:t:<ins:me:alu>\n1:57100960:t:<ins:me:sva>\n1:57208746:a:<cn0>\n1:58722032:t:<cn2>\n1:58743910:a:<cn0>\n1:58795378:a:<cn0>\n1:59205317:t:<ins:me:alu>\n1:59591483:t:<ins:me:alu>\n1:59871876:t:<ins:me:alu>\n1:60046725:a:<cn0>\n1:60048628:c:<cn0>\n1:60470604:t:<ins:me:alu>\n1:60487912:t:<cn0>\n1:60715714:t:<ins:me:line1>\n1:61144594:c:<ins:me:alu>\n1:62082822:a:<cn0>\n1:62113386:c:<cn0>\n1:62479250:t:<cn0>\n1:62622902:g:<cn0>\n1:62654739:c:<cn0>\n1:63841704:c:<ins:me:alu>\n1:64720497:a:<cn0>\n1:64850193:a:<ins:me:sva>\n1:65346960:t:<ins:me:alu>\n1:65412505:a:<cn0>\n1:68375746:a:<cn0>\n1:70061670:g:<ins:me:alu>\n1:70091056:t:<ins:me:alu>\n1:70093557:c:<ins:me:alu>\n1:70412360:t:<ins:me:alu>\n1:70424730:t:<cn2>\n1:70820401:t:<cn0>\n1:70912433:g:<ins:me:alu>\n1:72449620:a:<cn0>\n1:72755694:t:<cn0>\n1:72766343:t:<cn0>\n1:72778537:g:<cn0>\n1:730927
79:c:<cn2>\n1:74312425:a:<cn0>\n1:75148055:t:<ins:me:alu>\n1:75192907:c:<ins:me:line1>\n1:75301685:t:<ins:me:alu>\n1:75557174:c:<ins:me:alu>\n1:76392967:t:<ins:me:alu>\n1:76416074:a:<ins:me:alu>\n1:76900598:c:<cn0>\n1:77577928:t:<ins:me:alu>\n1:77634327:a:<ins:me:alu>\n1:77764994:t:<ins:me:alu>\n1:77830614:t:<cn0>\n1:78446240:c:<ins:me:sva>\n1:78607067:t:<ins:me:alu>\n1:78649157:a:<cn0>\n1:78800902:t:<ins:me:line1>\n1:79108845:t:<ins:me:alu>\n1:79331208:c:<ins:me:alu>\n1:79582082:t:<ins:me:alu>\n1:79855600:c:<cn0>\n1:80221781:t:<cn0>\n1:80299106:t:<ins:me:alu>\n1:80504615:t:<cn0>\n1:80554065:t:<cn0>\n1:80955976:t:<ins:me:line1>\n1:81422415:c:<cn0>\n1:82312054:g:<ins:me:alu>\n1:82850409:g:<ins:me:alu>\n1:83041946:t:<cn0>\n1:84056670:a:<cn0>\n1:84388330:g:<cn0>\n1:84517858:a:<cn0>\n1:84712009:g:<cn0>\n1:84913274:c:<ins:me:alu>\n1:85293152:g:<ins:me:alu>\n1:85620127:t:<ins:me:alu>\n1:85910957:g:<cn0>\n1:86400829:t:<cn0>\n1:86696940:a:<ins:me:alu>\n1:87064962:c:<cn2>\n1:87096974:c:<cn0>\n1:87096990:t:<cn0>\n1:88813625:t:<ins:me:alu>\n1:89209563:t:<ins:me:alu>\n1:89733616:t:<ins:me:line1>\n1:89811425:g:<cn0>\n1:90370569:t:<ins:me:alu>\n1:90914512:g:<ins:me:line1>\n1:91878937:g:<cn0>\n1:92131841:g:<inv>\n1:92232051:t:<cn0>\n1:93291972:c:<cn0>\n1:93498232:t:<ins:me:alu>\n1:94288372:c:<cn0>\n1:95192010:a:<ins:me:line1>\n1:95342701:g:<ins:me:alu>\n1:95522242:t:<cn0>\n1:97458273:t:<inv>\n1:98605297:t:<ins:me:alu>\n1:99610528:a:<ins:me:alu>\n1:99698454:g:<ins:me:alu>\n1:100355940:a:<ins:me:alu>\n1:100645536:g:<ins:me:alu>\n1:100994221:g:<ins:me:alu>\n1:101693230:t:<cn0>\n1:101695346:a:<cn0>\n1:101770067:g:<ins:me:alu>\n1:101978980:t:<ins:me:line1>\n1:102568923:g:<ins:me:line1>\n1:102920544:t:<ins:me:alu>\n1:103054499:t:<ins:me:alu>\n1:104359763:g:<cn0>\n1:104443176:t:<cn0>\n1:104574487:t:<ins:me:alu>\n1:105054083:t:<ins:me:alu>\n1:105070244:c:<ins:me:alu>\n1:105138650:t:<ins:me:alu>\n1:105231111:t:<ins:me:alu>\n1:105832823:g:<cn0>\n1:106015797:t:<cn0>\n1:106978443:t:<cn0>\n1:107896853:g:<cn0>\n1:107949843:t:<ins:me:alu>\n1:108142479:t:<ins:me:alu>\n1:108369370:a:<cn0>\n1:108402972:a:<cn0>\n1:109366972:g:<cn0>\n1:109573240:a:<cn0>\n1:110187159:a:<cn0>\n1:110225019:c:<cn0>\n1:111013750:a:<cn0>\n1:111472607:g:<cn0>\n1:111802597:g:<ins:me:sva>\n1:111827762:a:<cn0>\n1:111896187:c:<ins:me:sva>\n1:112032284:t:<ins:me:alu>\n1:112123691:t:<ins:me:alu>\n1:112691740:a:<cn0>\n1:112736007:a:<ins:me:alu>\n1:112992009:t:<ins:me:alu>\n1:113799625:g:<cn0>\n1:114925678:t:<cn0>\n1:115178042:c:<cn0>\n1:116229468:c:<cn0>\n1:116983571:t:<ins:me:alu>\n1:117593370:a:<cn0>\n1:119526940:a:<cn0>\n1:119553366:c:<ins:me:line1>\n1:120012853:a:<cn0>\n1:152555495:g:<cn0>\n1:152643788:a:<cn0>\n1:152760084:c:<cn0>\n1:153133703:a:<cn0>\n1:154123770:t:<ins:me:alu>\n1:154324167:g:<cn0>\n1:154865017:g:<ins:me:alu>\n1:157173860:t:<cn0>\n1:157363502:t:<ins:me:alu>\n1:157540655:g:<cn0>\n1:157887236:t:<inv>\n1:158371473:a:<ins:me:alu>\n1:158488410:a:<cn0>\n1:158726918:a:<cn0>\n1:160979498:c:<cn0>\n1:162263027:t:<ins:me:alu>\n1:163088865:t:<ins:me:alu>\n1:163314443:g:<ins:me:alu>\n1:163639693:t:<ins:me:alu>\n1:165553149:t:<ins:me:line1>\n1:165861400:t:<ins:me:sva>\n1:166189445:t:<ins:me:alu>\n1:167506110:g:<ins:me:alu>\n1:167712862:g:<ins:me:alu>\n1:168926083:a:<ins:me:sva>\n1:169004356:c:<cn0>\n1:169042039:c:<cn0>\n1:169225213:t:<cn0>\n1:169524859:t:<ins:me:line1>\n1:170603451:a:<ins:me:alu>\n1:170991168:c:<ins:me:alu>\n1:171358314:t:<ins:me:alu>\n1:172177959:g:<cn0>\n1:172825753:g:<cn0>\n1:173811663:a:<cn0>\n1:174654509:g:<cn0>\n1:174796517:
t:<cn0>\n1:174894014:g:<cn0>\n1:175152408:g:<cn0>\n1:177509016:g:<cn0>\n1:177544393:g:<cn0>\n1:177946159:a:<cn0>\n1:178397612:t:<ins:me:alu>\n1:178495321:a:<cn0>\n1:178692798:t:<ins:me:alu>\n1:179491966:t:<ins:me:alu>\n1:179607260:a:<cn0>\n1:180272299:a:<cn0>\n1:180857564:c:<ins:me:alu>\n1:181043348:a:<cn0>\n1:181588360:t:<ins:me:alu>\n1:181601286:t:<ins:me:alu>\n1:181853551:g:<ins:me:alu>\n1:182420857:t:<ins:me:alu>\n1:183308627:a:<cn0>\n1:185009806:t:<cn0>\n1:185504717:c:<ins:me:alu>\n1:185584799:t:<ins:me:alu>\n1:185857064:a:<cn0>\n1:187464747:t:<cn0>\n1:187522081:g:<ins:me:alu>\n1:187609013:t:<cn0>\n1:187716053:g:<cn0>\n1:187932575:t:<cn0>\n1:187955397:c:<ins:me:alu>\n1:188174657:t:<ins:me:alu>\n1:188186464:t:<ins:me:alu>\n1:188438213:t:<ins:me:alu>\n1:188615934:g:<ins:me:alu>\n1:189247039:a:<ins:me:alu>\n1:190052658:t:<cn0>\n1:190309695:t:<cn0>\n1:190773296:t:<ins:me:alu>\n1:190874469:t:<ins:me:alu>\n1:191466954:t:<ins:me:line1>\n1:191580781:a:<ins:me:alu>\n1:191817437:c:<ins:me:alu>\n1:191916438:t:<cn0>\n1:192008678:t:<ins:me:line1>\n1:192262268:a:<ins:me:line1>\n1:193549655:c:<ins:me:line1>\n1:193675125:t:<ins:me:alu>\n1:193999047:t:<cn0>\n1:194067859:t:<ins:me:alu>\n1:194575585:t:<cn0>\n1:194675140:c:<ins:me:alu>\n1:195146820:c:<ins:me:alu>\n1:195746415:a:<ins:me:line1>\n1:195885406:g:<cn0>\n1:195904499:g:<cn0>\n1:196464453:a:<ins:me:line1>\n1:196602664:a:<cn0>\n1:196728877:g:<cn0>\n1:196734744:a:<cn0>\n1:196761370:t:<ins:me:alu>\n1:197756784:c:<inv>\n1:197894025:c:<cn0>\n1:198093872:c:<ins:me:alu>\n1:198243300:t:<ins:me:alu>\n1:198529696:t:<ins:me:line1>\n1:198757296:t:<cn0>\n1:198773749:t:<cn0>\n1:198815313:a:<ins:me:alu>\n1:202961159:t:<ins:me:alu>\n1:203684252:t:<cn0>\n1:204238474:c:<ins:me:alu>\n1:204345055:t:<ins:me:alu>\n1:204381864:c:<cn0>\n1:205178526:t:<inv>\u201d\nIn\u00a0[17]: Copied!
harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\nharmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)
Harmonising BMI (rvi6Om) and T2D (ETcv15)\n\nIn\u00a0[18]: Copied!
harmonized_data\nharmonized_data A data.frame: 28 \u00d7 29 SNPeffect_allele.exposureother_allele.exposureeffect_allele.outcomeother_allele.outcomebeta.exposurebeta.outcomeeaf.exposureeaf.outcomeremove\u22efpval.exposurese.exposuresamplesize.exposureexposuremr_keep.exposurepval_origin.exposureid.exposureactionmr_keepsamplesize.outcome <chr><chr><chr><chr><chr><dbl><dbl><dbl><dbl><lgl>\u22ef<dbl><dbl><dbl><chr><lgl><chr><chr><dbl><lgl><lgl> 1rs10198356GAGA 0.044 0.0278218160.4500.46949841FALSE\u22ef1.5e-170.005172282BMITRUEreportedrvi6Om1TRUENA 2rs10209994CACA 0.030 0.0284334240.6400.65770918FALSE\u22ef2.0e-080.005472282BMITRUEreportedrvi6Om1TRUENA 3rs10824329AGAG 0.029 0.0182171190.5100.56240335FALSE\u22ef1.7e-080.005172282BMITRUEreportedrvi6Om1TRUENA 4rs10938397GAGA 0.036 0.0445547360.2800.29915686FALSE\u22ef1.0e-100.005672282BMITRUEreportedrvi6Om1TRUENA 5rs11066132TCTC-0.053-0.0319288060.1600.24197159FALSE\u22ef1.0e-130.007172282BMITRUEreportedrvi6Om1TRUENA 6rs12522139GTGT-0.037-0.0107492430.2700.24543922FALSE\u22ef1.8e-100.005772282BMITRUEreportedrvi6Om1TRUENA 7rs12591730AGAG 0.037 0.0330428120.2200.25367536FALSE\u22ef1.5e-080.006572282BMITRUEreportedrvi6Om1TRUENA 8rs13013021TCTC 0.070 0.1040752230.9070.90195307FALSE\u22ef1.9e-150.008872282BMITRUEreportedrvi6Om1TRUENA 9rs1955337 TGTG 0.036 0.0195935030.3000.24112816FALSE\u22ef7.4e-110.005672282BMITRUEreportedrvi6Om1TRUENA 10rs2076308 CGCG 0.037 0.0413520380.3100.31562874FALSE\u22ef3.4e-110.005572282BMITRUEreportedrvi6Om1TRUENA 11rs2278557 GCGC 0.034 0.0212111960.3200.29052039FALSE\u22ef7.4e-100.005572282BMITRUEreportedrvi6Om1TRUENA 12rs2304608 ACAC 0.031 0.0466695150.4700.44287320FALSE\u22ef1.1e-090.005172282BMITRUEreportedrvi6Om1TRUENA 13rs2531995 TCTC 0.031 0.0433160150.3700.33584772FALSE\u22ef5.2e-090.005372282BMITRUEreportedrvi6Om1TRUENA 14rs261967 CACA 0.032 0.0489708280.4400.39718313FALSE\u22ef3.5e-100.005172282BMITRUEreportedrvi6Om1TRUENA 15rs35332469CTCT-0.035 0.0080755980.2200.17678428FALSE\u22ef3.6e-080.006372282BMITRUEreportedrvi6Om1TRUENA 16rs35560038TATA-0.047 0.0739350890.5900.61936434FALSE\u22ef1.4e-190.005272282BMITRUEreportedrvi6Om1TRUENA 17rs3755804 TCTC 0.043 0.0228541340.2800.30750660FALSE\u22ef1.5e-140.005672282BMITRUEreportedrvi6Om1TRUENA 18rs4470425 ACAC-0.030-0.0208441370.4500.44152032FALSE\u22ef4.9e-090.005172282BMITRUEreportedrvi6Om1TRUENA 19rs476828 CTCT 0.067 0.0786518590.2700.25309742FALSE\u22ef2.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 20rs4883723 AGAG 0.039 0.0213709100.2800.22189601FALSE\u22ef8.3e-120.005772282BMITRUEreportedrvi6Om1TRUENA 21rs509325 GTGT 0.065 0.0356917590.2800.26816326FALSE\u22ef7.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 22rs55872725TCTC 0.090 0.1215170230.1200.20355108FALSE\u22ef1.8e-310.007772282BMITRUEreportedrvi6Om1TRUENA 23rs6089309 CTCT-0.033-0.0186698330.7000.65803267FALSE\u22ef3.5e-090.005672282BMITRUEreportedrvi6Om1TRUENA 24rs6265 TCTC-0.049-0.0316426960.4600.40541994FALSE\u22ef6.1e-220.005172282BMITRUEreportedrvi6Om1TRUENA 25rs6736712 GCGC-0.053-0.0297168990.9170.93023505FALSE\u22ef2.1e-080.009572282BMITRUEreportedrvi6Om1TRUENA 26rs7560832 CACA-0.150-0.0904811950.0120.01129784FALSE\u22ef2.0e-090.025072282BMITRUEreportedrvi6Om1TRUENA 27rs825486 TCTC-0.031 0.0190735540.6900.75485104FALSE\u22ef3.1e-080.005672282BMITRUEreportedrvi6Om1TRUENA 28rs9348441 ATAT-0.036 0.1792307940.4700.42502848FALSE\u22ef1.3e-120.005172282BMITRUEreportedrvi6Om1TRUENA In\u00a0[6]: Copied!
res <- mr(harmonized_data)\nres <- mr(harmonized_data)
Analysing 'rvi6Om' on 'hff6sO'\n\nIn\u00a0[7]: Copied!
res\nres A data.frame: 5 \u00d7 9 id.exposureid.outcomeoutcomeexposuremethodnsnpbsepval <chr><chr><chr><chr><chr><int><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 281.33375800.694852606.596064e-02 rvi6Omhff6sOT2DBMIWeighted median 280.62989800.085163151.399605e-13 rvi6Omhff6sOT2DBMIInverse variance weighted280.55989560.232258061.592361e-02 rvi6Omhff6sOT2DBMISimple mode 280.60978420.133054299.340189e-05 rvi6Omhff6sOT2DBMIWeighted mode 280.59467780.126803557.011481e-05 In\u00a0[8]: Copied!
mr_heterogeneity(harmonized_data)\nmr_heterogeneity(harmonized_data) A data.frame: 2 \u00d7 8 id.exposureid.outcomeoutcomeexposuremethodQQ_dfQ_pval <chr><chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 670.7022261.000684e-124 rvi6Omhff6sOT2DBMIInverse variance weighted706.6579271.534239e-131 In\u00a0[9]: Copied!
mr_pleiotropy_test(harmonized_data)\nmr_pleiotropy_test(harmonized_data) A data.frame: 1 \u00d7 7 id.exposureid.outcomeoutcomeexposureegger_interceptsepval <chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMI-0.036036970.03052410.2484472 In\u00a0[10]: Copied!
res_single <- mr_singlesnp(harmonized_data)\nres_single <- mr_singlesnp(harmonized_data) In\u00a0[11]: Copied!
res_single\nres_single A data.frame: 30 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs10198356 0.63231400.20828372.398742e-03 2BMIT2Drvi6Omhff6sONArs10209994 0.94778080.32258143.302164e-03 3BMIT2Drvi6Omhff6sONArs10824329 0.62817650.32462145.297739e-02 4BMIT2Drvi6Omhff6sONArs10938397 1.23763160.27758548.251150e-06 5BMIT2Drvi6Omhff6sONArs11066132 0.60243030.22324016.963693e-03 6BMIT2Drvi6Omhff6sONArs12522139 0.29052010.28902403.148119e-01 7BMIT2Drvi6Omhff6sONArs12591730 0.89304900.30766873.700413e-03 8BMIT2Drvi6Omhff6sONArs13013021 1.48678890.22077771.646925e-11 9BMIT2Drvi6Omhff6sONArs1955337 0.54426400.29941466.910079e-02 10BMIT2Drvi6Omhff6sONArs2076308 1.11762260.26579692.613132e-05 11BMIT2Drvi6Omhff6sONArs2278557 0.62385870.29681843.556906e-02 12BMIT2Drvi6Omhff6sONArs2304608 1.50546820.29689053.961740e-07 13BMIT2Drvi6Omhff6sONArs2531995 1.39729080.31301578.045689e-06 14BMIT2Drvi6Omhff6sONArs261967 1.53033840.29211921.616714e-07 15BMIT2Drvi6Omhff6sONArs35332469 -0.23073140.34792195.072217e-01 16BMIT2Drvi6Omhff6sONArs35560038 -1.57308700.20189686.619637e-15 17BMIT2Drvi6Omhff6sONArs3755804 0.53149150.23250732.225933e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.69480460.30799442.407689e-02 19BMIT2Drvi6Omhff6sONArs476828 1.17390830.15685507.207355e-14 20BMIT2Drvi6Omhff6sONArs4883723 0.54797210.28550045.494141e-02 21BMIT2Drvi6Omhff6sONArs509325 0.54910400.15981965.908641e-04 22BMIT2Drvi6Omhff6sONArs55872725 1.35018910.12597918.419325e-27 23BMIT2Drvi6Omhff6sONArs6089309 0.56575250.33470099.096620e-02 24BMIT2Drvi6Omhff6sONArs6265 0.64576930.19018716.851804e-04 25BMIT2Drvi6Omhff6sONArs6736712 0.56069620.34487841.039966e-01 26BMIT2Drvi6Omhff6sONArs7560832 0.60320800.29049723.785077e-02 27BMIT2Drvi6Omhff6sONArs825486 -0.61527590.35003347.878772e-02 28BMIT2Drvi6Omhff6sONArs9348441 -4.97863320.25727821.992909e-83 29BMIT2Drvi6Omhff6sONAAll - Inverse variance weighted 0.55989560.23225811.592361e-02 30BMIT2Drvi6Omhff6sONAAll - MR Egger 1.33375800.69485266.596064e-02 In\u00a0[12]: Copied!
res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\nres_loo <- mr_leaveoneout(harmonized_data) res_loo A data.frame: 29 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs101983560.55628340.24249172.178871e-02 2BMIT2Drvi6Omhff6sONArs102099940.55205760.23881222.079526e-02 3BMIT2Drvi6Omhff6sONArs108243290.55853350.23902391.945341e-02 4BMIT2Drvi6Omhff6sONArs109383970.54126880.23887092.345460e-02 5BMIT2Drvi6Omhff6sONArs110661320.55806060.24172752.096381e-02 6BMIT2Drvi6Omhff6sONArs125221390.56671020.23950641.797373e-02 7BMIT2Drvi6Omhff6sONArs125917300.55248020.23909902.085075e-02 8BMIT2Drvi6Omhff6sONArs130130210.51897150.23868082.968017e-02 9BMIT2Drvi6Omhff6sONArs1955337 0.56026350.23945051.929468e-02 10BMIT2Drvi6Omhff6sONArs2076308 0.54313550.23944032.330758e-02 11BMIT2Drvi6Omhff6sONArs2278557 0.55836340.23949241.972992e-02 12BMIT2Drvi6Omhff6sONArs2304608 0.53725570.23773252.382639e-02 13BMIT2Drvi6Omhff6sONArs2531995 0.54190160.23797122.277590e-02 14BMIT2Drvi6Omhff6sONArs261967 0.53587610.23766862.415093e-02 15BMIT2Drvi6Omhff6sONArs353324690.57359070.23783451.587739e-02 16BMIT2Drvi6Omhff6sONArs355600380.67349060.22178042.391474e-03 17BMIT2Drvi6Omhff6sONArs3755804 0.56102150.24132492.008503e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.55689930.23926321.993549e-02 19BMIT2Drvi6Omhff6sONArs476828 0.50375550.24432243.922224e-02 20BMIT2Drvi6Omhff6sONArs4883723 0.56020500.23973251.945000e-02 21BMIT2Drvi6Omhff6sONArs509325 0.56084290.24685062.308693e-02 22BMIT2Drvi6Omhff6sONArs558727250.44194460.24547717.180543e-02 23BMIT2Drvi6Omhff6sONArs6089309 0.55978590.23889021.911519e-02 24BMIT2Drvi6Omhff6sONArs6265 0.55470680.24369102.282978e-02 25BMIT2Drvi6Omhff6sONArs6736712 0.55988150.23876021.902944e-02 26BMIT2Drvi6Omhff6sONArs7560832 0.55881130.23962291.969836e-02 27BMIT2Drvi6Omhff6sONArs825486 0.58000260.23675451.429330e-02 28BMIT2Drvi6Omhff6sONArs9348441 0.73789670.13668386.717515e-08 29BMIT2Drvi6Omhff6sONAAll 0.55989560.23225811.592361e-02 In\u00a0[29]: Copied!
harmonized_data$\"r.outcome\" <- get_r_from_lor(\n harmonized_data$\"beta.outcome\",\n harmonized_data$\"eaf.outcome\",\n 45383,\n 132032,\n 0.26,\n model = \"logit\",\n correction = FALSE\n)\nharmonized_data$\"r.outcome\" <- get_r_from_lor( harmonized_data$\"beta.outcome\", harmonized_data$\"eaf.outcome\", 45383, 132032, 0.26, model = \"logit\", correction = FALSE ) In\u00a0[34]: Copied!
out <- directionality_test(harmonized_data)\nout\nout <- directionality_test(harmonized_data) out
r.exposure and/or r.outcome not present.\n\nCalculating approximate SNP-exposure and/or SNP-outcome correlations, assuming all are quantitative traits. Please pre-calculate r.exposure and/or r.outcome using get_r_from_lor() for any binary traits\n\nA data.frame: 1 \u00d7 8 id.exposureid.outcomeexposureoutcomesnp_r2.exposuresnp_r2.outcomecorrect_causal_directionsteiger_pval <chr><chr><chr><chr><dbl><dbl><lgl><dbl> rvi6OmETcv15BMIT2D0.021254530.005496427TRUENA In\u00a0[\u00a0]: Copied!
res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\nres <- mr(harmonized_data) p1 <- mr_scatter_plot(res, harmonized_data) p1[[1]] In\u00a0[\u00a0]: Copied!
res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\nres_single <- mr_singlesnp(harmonized_data) p2 <- mr_forest_plot(res_single) p2[[1]] In\u00a0[\u00a0]: Copied!
res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\nres_loo <- mr_leaveoneout(harmonized_data) p3 <- mr_leaveoneout_plot(res_loo) p3[[1]] In\u00a0[\u00a0]: Copied!
res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\nres_single <- mr_singlesnp(harmonized_data) p4 <- mr_funnel_plot(res_single) p4[[1]] In\u00a0[\u00a0]: Copied!
\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"Visualization/","title":"Visualization by gwaslab","text":"In\u00a0[2]: Copied!
import gwaslab as gl\nimport gwaslab as gl In\u00a0[3]: Copied!
sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")\nsumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")
Tue Dec 26 15:56:49 2023 GWASLab v3.4.22 https://cloufield.github.io/gwaslab/\nTue Dec 26 15:56:49 2023 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\nTue Dec 26 15:56:49 2023 Start to load format from formatbook....\nTue Dec 26 15:56:49 2023 -plink2 format meta info:\nTue Dec 26 15:56:49 2023 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear\nTue Dec 26 15:56:49 2023 - format_source : https://www.cog-genomics.org/plink/2.0/formats\nTue Dec 26 15:56:49 2023 - format_version : Alpha 3.3 final (3 Jun)\nTue Dec 26 15:56:49 2023 - last_check_date : 20220806\nTue Dec 26 15:56:49 2023 -plink2 to gwaslab format dictionary:\nTue Dec 26 15:56:49 2023 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\nTue Dec 26 15:56:49 2023 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\nTue Dec 26 15:56:49 2023 Start to initiate from file :1kgeas.B1.glm.firth\nTue Dec 26 15:56:50 2023 -Reading columns : REF,ID,ALT,POS,OR,LOG(OR)_SE,Z_STAT,OBS_CT,A1,#CHROM,P,A1_FREQ\nTue Dec 26 15:56:50 2023 -Renaming columns to : REF,SNPID,ALT,POS,OR,SE,Z,N,EA,CHR,P,EAF\nTue Dec 26 15:56:50 2023 -Current Dataframe shape : 1128732 x 12\nTue Dec 26 15:56:50 2023 -Initiating a status column: STATUS ...\nTue Dec 26 15:56:50 2023 NEA not available: assigning REF to NEA...\nTue Dec 26 15:56:50 2023 -EA,REF and ALT columns are available: assigning NEA...\nTue Dec 26 15:56:50 2023 -For variants with EA == ALT : assigning REF to NEA ...\nTue Dec 26 15:56:50 2023 -For variants with EA != ALT : assigning ALT to NEA ...\nTue Dec 26 15:56:50 2023 Start to reorder the columns...\nTue Dec 26 15:56:50 2023 -Current Dataframe shape : 1128732 x 14\nTue Dec 26 15:56:50 2023 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\nTue Dec 26 15:56:50 2023 Finished sorting columns successfully!\nTue Dec 26 15:56:50 2023 -Column: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \nTue Dec 26 15:56:50 2023 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\nTue Dec 26 15:56:50 2023 Finished loading data successfully!\nIn\u00a0[4]: Copied!
sumstats.data\nsumstats.data Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 0 1:15774:G:A 1 15774 A G 0.028283 NaN NaN NaN NaN 495 9999999 G A 1 1:15777:A:G 1 15777 G A 0.073737 NaN NaN NaN NaN 495 9999999 A G 2 1:57292:C:T 1 57292 T C 0.104675 NaN NaN NaN NaN 492 9999999 C T 3 1:77874:G:A 1 77874 A G 0.019153 0.462750 0.249299 0.803130 1.122280 496 9999999 G A 4 1:87360:C:T 1 87360 T C 0.023139 NaN NaN NaN NaN 497 9999999 C T ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1128727 22:51217954:G:A 22 51217954 A G 0.033199 NaN NaN NaN NaN 497 9999999 G A 1128728 22:51218377:G:C 22 51218377 C G 0.033333 0.362212 -0.994457 0.320000 0.697534 495 9999999 G C 1128729 22:51218615:T:A 22 51218615 A T 0.033266 0.362476 -1.029230 0.303374 0.688618 496 9999999 T A 1128730 22:51222100:G:T 22 51222100 T G 0.039157 NaN NaN NaN NaN 498 9999999 G T 1128731 22:51239678:G:T 22 51239678 T G 0.034137 NaN NaN NaN NaN 498 9999999 G T
1128732 rows \u00d7 14 columns
In\u00a0[5]: Copied!sumstats.get_lead(sig_level=5e-8)\nsumstats.get_lead(sig_level=5e-8)
Tue Dec 26 15:56:51 2023 Start to extract lead variants...\nTue Dec 26 15:56:51 2023 -Processing 1128732 variants...\nTue Dec 26 15:56:51 2023 -Significance threshold : 5e-08\nTue Dec 26 15:56:51 2023 -Sliding window size: 500 kb\nTue Dec 26 15:56:51 2023 -Found 43 significant variants in total...\nTue Dec 26 15:56:51 2023 -Identified 4 lead variants!\nTue Dec 26 15:56:51 2023 Finished extracting lead variants successfully!\nOut[5]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 54904 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A 113179 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T 549726 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G 1088750 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C In\u00a0[9]: Copied!
sumstats.plot_mqq(skip=2,anno=True)\nsumstats.plot_mqq(skip=2,anno=True)
Tue Dec 26 15:59:17 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:59:17 2023 -Genomic coordinates version: 99...\nTue Dec 26 15:59:17 2023 -WARNING!!! Genomic coordinates version is unknown...\nTue Dec 26 15:59:17 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:59:17 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:59:17 2023 -Plot layout mode is : mqq\nTue Dec 26 15:59:17 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:59:17 2023 Start conversion and sanity check:\nTue Dec 26 15:59:17 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:59:17 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:59:17 2023 -Removed 220793 variants with nan in P column ...\nTue Dec 26 15:59:17 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:59:17 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:59:17 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:59:17 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:59:17 2023 Finished data conversion and sanity check.\nTue Dec 26 15:59:17 2023 Start to create manhattan plot with 6866 variants:\nTue Dec 26 15:59:17 2023 -Found 4 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:17 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:17 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:59:17 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:17 2023 Start to create QQ plot with 6866 variants:\nTue Dec 26 15:59:17 2023 Expected range of P: (0,1.0)\nTue Dec 26 15:59:17 2023 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908\nTue Dec 26 15:59:17 2023 Finished creating QQ plot successfully!\nTue Dec 26 15:59:17 2023 -Skip saving figures!\nOut[9]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[6]: Copied!
sumstats.basic_check()\nsumstats.basic_check()
Tue Dec 27 23:08:13 2022 Start to check IDs...\nTue Dec 27 23:08:13 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:13 2022 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)\nTue Dec 27 23:08:14 2022 Finished checking IDs successfully!\nTue Dec 27 23:08:14 2022 Start to fix chromosome notation...\nTue Dec 27 23:08:14 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:17 2022 -Vairants with standardized chromosome notation: 1122299\nTue Dec 27 23:08:19 2022 -All CHR are already fixed...\nTue Dec 27 23:08:21 2022 Finished fixing chromosome notation successfully!\nTue Dec 27 23:08:21 2022 Start to fix basepair positions...\nTue Dec 27 23:08:21 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:21 2022 -Converting to Int64 data type ...\nTue Dec 27 23:08:22 2022 -Position upper_bound is: 250,000,000\nTue Dec 27 23:08:24 2022 -Remove outliers: 0\nTue Dec 27 23:08:24 2022 -Converted all position to datatype Int64.\nTue Dec 27 23:08:24 2022 Finished fixing basepair position successfully!\nTue Dec 27 23:08:24 2022 Start to fix alleles...\nTue Dec 27 23:08:24 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:25 2022 -Detected 0 variants with alleles that contain bases other than A/C/T/G .\nTue Dec 27 23:08:25 2022 -Converted all bases to string datatype and UPPERCASE.\nTue Dec 27 23:08:27 2022 Finished fixing allele successfully!\nTue Dec 27 23:08:27 2022 Start sanity check for statistics ...\nTue Dec 27 23:08:27 2022 -Current Dataframe shape : 1122299 x 11\nTue Dec 27 23:08:27 2022 -Checking if 0 <=N<= inf ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad N.\nTue Dec 27 23:08:27 2022 -Checking if -37.5 <Z< 37.5 ...\nTue Dec 27 23:08:27 2022 -Removed 14 variants with bad Z.\nTue Dec 27 23:08:27 2022 -Checking if 5e-300 <= P <= 1 ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad P.\nTue Dec 27 23:08:27 2022 -Checking if 0 <SE< inf ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad SE.\nTue Dec 27 23:08:27 2022 -Checking if -10 <log(OR)< 10 ...\nTue Dec 27 23:08:27 2022 -Removed 0 variants with bad OR.\nTue Dec 27 23:08:27 2022 -Checking STATUS...\nTue Dec 27 23:08:28 2022 -Coverting STAUTUS to interger.\nTue Dec 27 23:08:28 2022 -Removed 14 variants with bad statistics in total.\nTue Dec 27 23:08:28 2022 Finished sanity check successfully!\nTue Dec 27 23:08:28 2022 Start to normalize variants...\nTue Dec 27 23:08:28 2022 -Current Dataframe shape : 1122285 x 11\nTue Dec 27 23:08:29 2022 -No available variants to normalize..\nTue Dec 27 23:08:29 2022 Finished normalizing variants successfully!\nIn\u00a0[7]: Copied!
sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\")\n#2:55513738\nsumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\") #2:55513738
Tue Dec 26 15:58:10 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:10 2023 -Genomic coordinates version: 19...\nTue Dec 26 15:58:10 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:10 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:58:10 2023 -Plot layout mode is : r\nTue Dec 26 15:58:10 2023 -Region to plot : chr2:54513738-56513738.\nTue Dec 26 15:58:10 2023 -Extract SNPs in region : chr2:54513738-56513738...\nTue Dec 26 15:58:10 2023 -Extract SNPs in specified regions: 865\nTue Dec 26 15:58:10 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:10 2023 Start conversion and sanity check:\nTue Dec 26 15:58:10 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:10 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:10 2023 -Removed 160 variants with nan in P column ...\nTue Dec 26 15:58:10 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:10 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:10 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:11 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:11 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:11 2023 Start to create manhattan plot with 705 variants:\nTue Dec 26 15:58:11 2023 -Extracting lead variant...\nTue Dec 26 15:58:11 2023 -Loading gtf files from:default\n
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
Tue Dec 26 15:58:40 2023 -plotting gene track..\nTue Dec 26 15:58:40 2023 -Finished plotting gene track..\nTue Dec 26 15:58:40 2023 -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:58:40 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:58:40 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:58:40 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:58:40 2023 -Skip saving figures!\nOut[7]:
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[8]: Copied!
gl.download_ref(\"1kg_eas_hg19\")\ngl.download_ref(\"1kg_eas_hg19\")
Tue Dec 27 22:44:52 2022 Start to download 1kg_eas_hg19 ...\nTue Dec 27 22:44:52 2022 -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 27 22:52:33 2022 -Updating record in config file...\nTue Dec 27 22:52:35 2022 -Updating record in config file...\nTue Dec 27 22:52:35 2022 -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi\nTue Dec 27 22:52:35 2022 Downloaded 1kg_eas_hg19 successfully!\nIn\u00a0[8]: Copied!
sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")\nsumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")
Tue Dec 26 15:58:41 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:41 2023 -Genomic coordinates version: 19...\nTue Dec 26 15:58:41 2023 -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:41 2023 -Raw input contains 1128732 variants...\nTue Dec 26 15:58:41 2023 -Plot layout mode is : r\nTue Dec 26 15:58:41 2023 -Region to plot : chr2:54531536-56731536.\nTue Dec 26 15:58:41 2023 -Checking prefix for chromosomes in vcf files...\nTue Dec 26 15:58:41 2023 -No prefix for chromosomes in the VCF files.\nTue Dec 26 15:58:41 2023 -Extract SNPs in region : chr2:54531536-56731536...\nTue Dec 26 15:58:41 2023 -Extract SNPs in specified regions: 967\nTue Dec 26 15:58:41 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:41 2023 Start conversion and sanity check:\nTue Dec 26 15:58:41 2023 -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:41 2023 -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:41 2023 -Removed 172 variants with nan in P column ...\nTue Dec 26 15:58:41 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:41 2023 -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:41 2023 -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:41 2023 -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:41 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:41 2023 Start to load reference genotype...\nTue Dec 26 15:58:41 2023 -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 26 15:58:43 2023 -Retrieving index...\nTue Dec 26 15:58:43 2023 -Ref variants in the region: 71908\nTue Dec 26 15:58:43 2023 -Matching variants using POS, NEA, EA ...\nTue Dec 26 15:58:43 2023 -Calculating Rsq...\nTue Dec 26 15:58:43 2023 Finished loading reference genotype successfully!\nTue Dec 26 15:58:43 2023 Start to create manhattan plot with 795 variants:\nTue Dec 26 15:58:43 2023 -Extracting lead variant...\nTue Dec 26 15:58:44 2023 -Loading gtf files from:default\n
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
Tue Dec 26 15:59:12 2023 -plotting gene track..\nTue Dec 26 15:59:12 2023 -Finished plotting gene track..\nTue Dec 26 15:59:13 2023 -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:13 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:13 2023 -Annotating using column CHR:POS...\nTue Dec 26 15:59:13 2023 -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:13 2023 -Skip saving figures!\nOut[8]:
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)In\u00a0[\u00a0]: Copied!
\n"},{"location":"Visualization/#visualization-by-gwaslab","title":"Visualization by gwaslab\u00b6","text":""},{"location":"Visualization/#import-gwaslab-package","title":"Import gwaslab package\u00b6","text":""},{"location":"Visualization/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"Visualization/#check-the-lead-variants-in-significant-loci","title":"Check the lead variants in significant loci\u00b6","text":""},{"location":"Visualization/#create-mahattan-plot","title":"Create mahattan plot\u00b6","text":""},{"location":"Visualization/#qc-check","title":"QC check\u00b6","text":""},{"location":"Visualization/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"Visualization/#create-regional-plot-with-ld-information","title":"Create regional plot with LD information\u00b6","text":""},{"location":"finemapping_susie/","title":"Finemapping using susieR","text":"In\u00a0[1]: Copied!
import gwaslab as gl\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport gwaslab as gl import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt In\u00a0[2]: Copied!
sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")\nsumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")
2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/\n2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\n2024/04/18 10:40:48 Start to load format from formatbook....\n2024/04/18 10:40:48 -plink2 format meta info:\n2024/04/18 10:40:48 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear\n2024/04/18 10:40:48 - format_source : https://www.cog-genomics.org/plink/2.0/formats\n2024/04/18 10:40:48 - format_version : Alpha 3.3 final (3 Jun)\n2024/04/18 10:40:48 - last_check_date : 20220806\n2024/04/18 10:40:48 -plink2 to gwaslab format dictionary:\n2024/04/18 10:40:48 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\n2024/04/18 10:40:48 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\n2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz\n2024/04/18 10:40:49 -Reading columns : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID\n2024/04/18 10:40:49 -Renaming columns to : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID\n2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 12\n2024/04/18 10:40:49 -Initiating a status column: STATUS ...\n2024/04/18 10:40:49 #WARNING! Version of genomic coordinates is unknown...\n2024/04/18 10:40:49 NEA not available: assigning REF to NEA...\n2024/04/18 10:40:49 -EA,REF and ALT columns are available: assigning NEA...\n2024/04/18 10:40:49 -For variants with EA == ALT : assigning REF to NEA ...\n2024/04/18 10:40:49 -For variants with EA != ALT : assigning ALT to NEA ...\n2024/04/18 10:40:49 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:49 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:49 Finished reordering the columns.\n2024/04/18 10:40:49 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:40:49 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\n2024/04/18 10:40:49 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:40:50 -Current Dataframe memory usage: 106.06 MB\n2024/04/18 10:40:50 Finished loading data successfully!\nIn\u00a0[3]: Copied!
sumstats.basic_check()\nsumstats.basic_check()
2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43\n2024/04/18 10:40:50 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:50 -Checking SNPID data type...\n2024/04/18 10:40:50 -Converting SNPID to pd.string data type...\n2024/04/18 10:40:50 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)\n2024/04/18 10:40:51 Finished checking SNPID/rsID.\n2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43\n2024/04/18 10:40:51 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:51 -Checking CHR data type...\n2024/04/18 10:40:51 -Variants with standardized chromosome notation: 1128732\n2024/04/18 10:40:51 -All CHR are already fixed...\n2024/04/18 10:40:52 Finished fixing chromosome notation (CHR).\n2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43\n2024/04/18 10:40:52 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB\n2024/04/18 10:40:52 -Converting to Int64 data type ...\n2024/04/18 10:40:53 -Position bound:(0 , 250,000,000)\n2024/04/18 10:40:53 -Removed outliers: 0\n2024/04/18 10:40:53 Finished fixing basepair positions (POS).\n2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43\n2024/04/18 10:40:53 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:53 -Converted all bases to string datatype and UPPERCASE.\n2024/04/18 10:40:53 -Variants with bad EA : 0\n2024/04/18 10:40:54 -Variants with bad NEA : 0\n2024/04/18 10:40:54 -Variants with NA for EA or NEA: 0\n2024/04/18 10:40:54 -Variants with same EA and NEA: 0\n2024/04/18 10:40:54 -Detected 0 variants with alleles that contain bases other than A/C/T/G .\n2024/04/18 10:40:55 Finished fixing alleles (EA and NEA).\n2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43\n2024/04/18 10:40:55 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:55 -Comparison tolerance for floats: 1e-07\n2024/04/18 10:40:55 -Checking if 0 <= N <= 2147483647 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na N.\n2024/04/18 10:40:55 -Checking if -1e-07 < EAF < 1.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na EAF.\n2024/04/18 10:40:55 -Checking if -9999.0000001 < Z < 9999.0000001 ...\n2024/04/18 10:40:55 -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ...\n2024/04/18 10:40:55 -Examples of invalid values (Z): NA,NA,NA,NA,NA ...\n2024/04/18 10:40:55 -Removed 220793 variants with bad/na Z.\n2024/04/18 10:40:55 -Checking if -1e-07 < P < 1.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na P.\n2024/04/18 10:40:55 -Checking if -1e-07 < SE < inf ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na SE.\n2024/04/18 10:40:55 -Checking if -100.0000001 < OR < 100.0000001 ...\n2024/04/18 10:40:55 -Removed 0 variants with bad/na OR.\n2024/04/18 10:40:55 -Checking STATUS and converting STATUS to categories....\n2024/04/18 10:40:56 -Removed 220793 variants with bad statistics in total.\n2024/04/18 10:40:56 -Data types for each column:\n2024/04/18 10:40:56 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:40:56 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:40:56 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:40:56 Finished sanity check for statistics.\n2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43\n2024/04/18 10:40:56 -Current 
Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)\n2024/04/18 10:40:56 -No availalbe columns for data consistency checking...Skipping...\n2024/04/18 10:40:56 Finished checking data consistency across columns.\n2024/04/18 10:40:56 Start to normalize indels...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 -No available variants to normalize..\n2024/04/18 10:40:56 Finished normalizing variants successfully!\n2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 Finished sorting coordinates.\n2024/04/18 10:40:56 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:56 Finished reordering the columns.\n
Note: 220793 variants were removed due to NA Z values. This is caused by FIRTH_CONVERGE_FAIL (Firth logistic regression failing to converge for these variants) when performing GWAS using PLINK2.
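If you want to confirm this on the raw file before loading it into gwaslab, a minimal pandas sketch (assuming the PLINK2 output is tab-separated and uses the Z_STAT column, as shown in the loading log above) would be:

import pandas as pd

# count variants whose Z_STAT is NA in the raw PLINK2 .glm.firth output
raw = pd.read_csv("./1kgeas.B1.glm.firth.gz", sep="\t")
print(raw["Z_STAT"].isna().sum())  # should match the 220793 removed in the gwaslab log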
In\u00a0[4]: Copied!sumstats.get_lead()\nsumstats.get_lead()
2024/04/18 10:40:56 Start to extract lead variants...v3.4.43\n2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56 -Processing 907939 variants...\n2024/04/18 10:40:56 -Significance threshold : 5e-08\n2024/04/18 10:40:56 -Sliding window size: 500 kb\n2024/04/18 10:40:56 -Using P for extracting lead variants...\n2024/04/18 10:40:56 -Found 43 significant variants in total...\n2024/04/18 10:40:56 -Identified 4 lead variants!\n2024/04/18 10:40:56 Finished extracting lead variants.\nOut[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 44298 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9960099 G A 91266 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9960099 C T 442239 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9960099 T G 875859 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9960099 T C In\u00a0[5]: Copied!
sumstats.plot_mqq()\nsumstats.plot_mqq()
2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43:\n2024/04/18 10:40:57 -Genomic coordinates version: 99...\n2024/04/18 10:40:57 #WARNING! Genomic coordinates version is unknown.\n2024/04/18 10:40:57 -Genome-wide significance level to plot is set to 5e-08 ...\n2024/04/18 10:40:57 -Raw input contains 907939 variants...\n2024/04/18 10:40:57 -MQQ plot layout mode is : mqq\n2024/04/18 10:40:57 Finished loading specified columns from the sumstats.\n2024/04/18 10:40:57 Start data conversion and sanity check:\n2024/04/18 10:40:57 -Removed 0 variants with nan in CHR or POS column ...\n2024/04/18 10:40:57 -Removed 0 variants with CHR <=0...\n2024/04/18 10:40:57 -Removed 0 variants with nan in P column ...\n2024/04/18 10:40:57 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\n2024/04/18 10:40:57 -Sumstats P values are being converted to -log10(P)...\n2024/04/18 10:40:57 -Sanity check: 0 na/inf/-inf variants will be removed...\n2024/04/18 10:40:57 -Converting data above cut line...\n2024/04/18 10:40:57 -Maximum -log10(P) value is 14.772946706439042 .\n2024/04/18 10:40:57 Finished data conversion and sanity check.\n2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants...\n2024/04/18 10:40:58 -Creating background plot...\n2024/04/18 10:40:59 Finished creating MQQ plot successfully!\n2024/04/18 10:40:59 Start to extract variants for annotation...\n2024/04/18 10:40:59 -Found 4 significant variants with a sliding window size of 500 kb...\n2024/04/18 10:40:59 Finished extracting variants for annotation...\n2024/04/18 10:40:59 Start to process figure arts.\n2024/04/18 10:40:59 -Processing X ticks...\n2024/04/18 10:40:59 -Processing X labels...\n2024/04/18 10:40:59 -Processing Y labels...\n2024/04/18 10:40:59 -Processing Y tick lables...\n2024/04/18 10:40:59 -Processing Y labels...\n2024/04/18 10:40:59 -Processing lines...\n2024/04/18 10:40:59 Finished processing figure arts.\n2024/04/18 10:40:59 Start to annotate variants...\n2024/04/18 10:40:59 -Skip annotating\n2024/04/18 10:40:59 Finished annotating variants.\n2024/04/18 10:40:59 Start to create QQ plot with 907939 variants:\n2024/04/18 10:40:59 -Plotting all variants...\n2024/04/18 10:40:59 -Expected range of P: (0,1.0)\n2024/04/18 10:40:59 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908\n2024/04/18 10:40:59 -Processing Y tick lables...\n2024/04/18 10:40:59 Finished creating QQ plot successfully!\n2024/04/18 10:40:59 Start to save figure...\n2024/04/18 10:40:59 -Skip saving figure!\n2024/04/18 10:40:59 Finished saving figure...\n2024/04/18 10:40:59 Finished creating plot successfully!\nOut[5]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)In\u00a0[6]: Copied!
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')\nlocus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 Finished filtering values.\nIn\u00a0[7]: Copied!
locus.fill_data(to_fill=[\"BETA\"])\nlocus.fill_data(to_fill=[\"BETA\"])
2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43\n2024/04/18 10:41:06 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT \n2024/04/18 10:41:06 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:41:06 -Verified: T T T T T T T T T T T T T T \n2024/04/18 10:41:06 -Overwrite mode: False\n2024/04/18 10:41:06 -Skipping columns: []\n2024/04/18 10:41:06 -Filling columns: ['BETA']\n2024/04/18 10:41:06 - Filling Columns iteratively...\n2024/04/18 10:41:06 - Filling BETA value using OR column...\n2024/04/18 10:41:06 Finished filling data using existing columns.\n2024/04/18 10:41:06 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:06 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:06 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:06 Finished reordering the columns.\nIn\u00a0[8]: Copied!
locus.data\nlocus.data Out[8]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 91067 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960099 A T 91068 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960099 G A 91069 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960099 G A 91070 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960099 A C 91071 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960099 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 91441 2:56004219:G:T 2 56004219 G T 0.171717 0.148489 0.169557 0.875763 0.381159 1.160080 495 9960099 G T 91442 2:56007034:T:C 2 56007034 T C 0.260121 0.073325 0.145565 0.503737 0.614446 1.076080 494 9960099 T C 91443 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960099 C G 91444 2:56009480:A:T 2 56009480 A T 0.157258 0.135667 0.177621 0.763784 0.444996 1.145300 496 9960099 A T 91445 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960099 C T
379 rows \u00d7 15 columns
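fill_data derives BETA from the odds ratio as the natural log of OR (PLINK2's LOG(OR)_SE is already on the log-odds scale, so SE needs no conversion). A quick sketch to verify this by hand:

import numpy as np

# BETA filled by gwaslab should equal ln(OR)
assert np.allclose(locus.data["BETA"], np.log(locus.data["OR"]))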
In\u00a0[9]: Copied!locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")\nlocus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")
2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43\n2024/04/18 10:41:07 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:07 -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\n2024/04/18 10:41:07 -Loading fasta records:2 \n2024/04/18 10:41:19 -Checking records\n2024/04/18 10:41:19 -Building numpy fasta records from dict\n2024/04/18 10:41:20 -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )\n2024/04/18 10:41:20 -Checking records for ( len(NEA) > 4 or len(EA) > 4 )\n2024/04/18 10:41:20 -Finished checking records\n2024/04/18 10:41:20 -Variants allele on given reference sequence : 264\n2024/04/18 10:41:20 -Variants flipped : 115\n2024/04/18 10:41:20 -Raw Matching rate : 100.00%\n2024/04/18 10:41:20 -Variants inferred reverse_complement : 0\n2024/04/18 10:41:20 -Variants inferred reverse_complement_flipped : 0\n2024/04/18 10:41:20 -Both allele on genome + unable to distinguish : 0\n2024/04/18 10:41:20 -Variants not on given reference sequence : 0\n2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence.\n2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43\n2024/04/18 10:41:20 -Flipping 115 variants...\n2024/04/18 10:41:20 -Swapping column: NEA <=> EA...\n2024/04/18 10:41:20 -Flipping column: BETA = - BETA...\n2024/04/18 10:41:20 -Flipping column: Z = - Z...\n2024/04/18 10:41:20 -Flipping column: EAF = 1 - EAF...\n2024/04/18 10:41:20 -Flipping column: OR = 1 / OR...\n2024/04/18 10:41:20 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x\n2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code.\n2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Finished sorting coordinates.\n2024/04/18 10:41:20 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB\n2024/04/18 10:41:20 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:20 Finished reordering the columns.\nOut[9]:
<gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>In\u00a0[10]: Copied!
locus.data\nlocus.data Out[10]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T
379 rows \u00d7 15 columns
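For the 115 flipped variants, harmonize() swaps EA/NEA and flips the allele-specific statistics, as recorded in the log above. A minimal sketch of the rule (not gwaslab's actual implementation):

# flip applied per variant whose NEA did not match the reference base (a sketch)
def flip_stats(row):
    row["EA"], row["NEA"] = row["NEA"], row["EA"]
    row["BETA"] = -row["BETA"]
    row["Z"] = -row["Z"]
    row["EAF"] = 1 - row["EAF"]
    row["OR"] = 1 / row["OR"]
    return row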
In\u00a0[11]: Copied!locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\nlocus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None) locus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None) In\u00a0[12]: Copied!
!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract sig_locus.snplist \\\n --out sig_locus_mt_r2\n!plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r square \\ --extract sig_locus.snplist \\ --out sig_locus_mt !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract sig_locus.snplist \\ --out sig_locus_mt_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to sig_locus_mt.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract sig_locus.snplist\n --keep-allele-order\n --out sig_locus_mt\n --r square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r square to sig_locus_mt.ld ... 0% [processingwriting] done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to sig_locus_mt_r2.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract sig_locus.snplist\n --keep-allele-order\n --out sig_locus_mt_r2\n --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt_r2.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting] done.\nIn\u00a0[13]: Copied!
import rpy2\nimport rpy2.robjects as ro\nfrom rpy2.robjects.packages import importr\nimport rpy2.robjects.numpy2ri as numpy2ri\nnumpy2ri.activate()\nimport rpy2 import rpy2.robjects as ro from rpy2.robjects.packages import importr import rpy2.robjects.numpy2ri as numpy2ri numpy2ri.activate()
INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY\nINFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R\nINFO:rpy2.situation:R library path: \nINFO:rpy2.situation:LD_LIBRARY_PATH: \nINFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save\nINFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.\nIn\u00a0[14]: Copied!
df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\")\ndf\ndf = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\") df Out[14]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T
379 rows \u00d7 15 columns
In\u00a0[15]: Copied!# import susieR as object\nsusieR = importr('susieR')\n# import susieR as object susieR = importr('susieR') In\u00a0[16]: Copied!
# convert pd.DataFrame to numpy\nld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None)\nR_df = ld.values\nld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None)\nR_df2 = ld2.values\n# convert pd.DataFrame to numpy ld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None) R_df = ld.values ld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None) R_df2 = ld2.values In\u00a0[17]: Copied!
R_df\nR_df Out[17]:
array([[ 1.00000e+00, 9.58562e-01, -3.08678e-01, ..., 1.96204e-02,\n -3.54602e-04, -7.14868e-03],\n [ 9.58562e-01, 1.00000e+00, -2.97617e-01, ..., 2.47755e-02,\n -1.49234e-02, -7.00509e-03],\n [-3.08678e-01, -2.97617e-01, 1.00000e+00, ..., -3.49335e-02,\n -1.37163e-02, -2.12828e-02],\n ...,\n [ 1.96204e-02, 2.47755e-02, -3.49335e-02, ..., 1.00000e+00,\n 5.26193e-02, -3.09069e-02],\n [-3.54602e-04, -1.49234e-02, -1.37163e-02, ..., 5.26193e-02,\n 1.00000e+00, -3.01142e-01],\n [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,\n -3.01142e-01, 1.00000e+00]])In\u00a0[18]: Copied!
fig, ax = plt.subplots(ncols=2,figsize=(20,10),dpi=200)\nsns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0])\nsns.heatmap(data=R_df2,ax=ax[1])\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\nfig, ax = plt.subplots(ncols=2,figsize=(20,10),dpi=200) sns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0]) sns.heatmap(data=R_df2,ax=ax[1]) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[18]:
Text(0.5, 1.0, 'LD r2 matrix')
https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#fine-mapping-with-susier-using-summary-statistics
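susie_rss performs fine-mapping from summary statistics using the z-scores (z = bhat/shat) together with the signed LD correlation matrix R (note: r, not r2, since the sign carries the direction of allele coupling) and the sample size n. An equivalent call to the one below, using z-scores directly (a sketch; see the vignette above for details):

z = (df["BETA"] / df["SE"]).values.reshape((len(R_df), 1))
fit_z = susieR.susie_rss(z=z, R=R_df, n=503, L=10)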
In\u00a0[19]: Copied!ro.r('set.seed(123)')\nfit = susieR.susie_rss(\n bhat = df[\"BETA\"].values.reshape((len(R_df),1)),\n shat = df[\"SE\"].values.reshape((len(R_df),1)),\n R = R_df,\n L = 10,\n n = 503\n)\nro.r('set.seed(123)') fit = susieR.susie_rss( bhat = df[\"BETA\"].values.reshape((len(R_df),1)), shat = df[\"SE\"].values.reshape((len(R_df),1)), R = R_df, L = 10, n = 503 ) In\u00a0[20]: Copied!
# show the results of susie_get_cs\nprint(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\n# show the results of susie_get_cs print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
$L1\n[1] 200 218 221 224\n\n\n
We found 1 credible set (L1) containing 4 variants here.
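As a sanity check, the PIPs of the variants in a 95% credible set should sum to roughly 0.95 or more. A sketch (note that R indices are 1-based):

pip = np.array(susieR.susie_get_pip(fit))
cs1 = np.array(susieR.susie_get_cs(fit, coverage=0.95, min_abs_corr=0.5, Xcorr=R_df)[0][0])
print(pip[cs1 - 1].sum())  # coverage of credible set L1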
In\u00a0[21]: Copied!# add the information to dataframe for plotting\ndf[\"cs\"] = 0\nn_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\nfor i in range(n_cs):\n cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]\n df.loc[np.array(cs_index)-1,\"cs\"] = i + 1\ndf[\"pip\"] = np.array(susieR.susie_get_pip(fit))\n# add the information to dataframe for plotting df[\"cs\"] = 0 n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0]) for i in range(n_cs): cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i] df.loc[np.array(cs_index)-1,\"cs\"] = i + 1 df[\"pip\"] = np.array(susieR.susie_get_pip(fit)) In\u00a0[22]: Copied!
fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))\ndf[\"MLOG10P\"] = -np.log10(df[\"P\"])\ncol_to_plot = \"MLOG10P\"\np=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot],\n marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\naxes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\n marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[0].set_xlabel(\"position\")\naxes[0].set_xlim((55400000, 55800000))\naxes[0].set_ylabel(col_to_plot)\naxes[0].legend()\n\np=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"],\n marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[1].set_xlabel(\"position\")\naxes[1].set_xlim((55400000, 55800000))\naxes[1].set_ylabel(\"PIP\")\naxes[1].legend()\nfig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1)) df[\"MLOG10P\"] = -np.log10(df[\"P\"]) col_to_plot = \"MLOG10P\" p=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2) axes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot], marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[0].set_xlabel(\"position\") axes[0].set_xlim((55400000, 55800000)) axes[0].set_ylabel(col_to_plot) axes[0].legend() p=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2) axes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[1].set_xlabel(\"position\") axes[1].set_xlim((55400000, 55800000)) axes[1].set_ylabel(\"PIP\") axes[1].legend()
/tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x'). Matplotlib is ignoring the edgecolor in favor of the facecolor. This behavior may change in the future.\n axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\nOut[22]:
<matplotlib.legend.Legend at 0x7fa6a330d5e0>
The causal variant used in the simulation is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. So the credible set we identified does not include the bona fide causal variant.
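You can quickly confirm that the causal variant is absent from the filtered sumstats:

# 2:55620927:G:A was dropped earlier because its Z was NA (FIRTH_CONVERGE_FAIL)
print((df["SNPID"] == "2:55620927:G:A").any())  # expected: False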
Let's then check the variants in the credible set.
In\u00a0[23]: Copied!df.loc[np.array(cs_index)-1,:]\ndf.loc[np.array(cs_index)-1,:] Out[23]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT cs pip MLOG10P 199 2:55513738:C:T 2 55513738 T C 0.623992 1.219516 0.153159 7.96244 1.686760e-15 3.385550 496 9960019 C T 1 0.325435 14.772947 217 2:55605943:A:G 2 55605943 G A 0.685484 1.321987 0.166688 7.93089 2.175840e-15 3.750867 496 9960019 A G 1 0.267953 14.662373 220 2:55612986:G:C 2 55612986 C G 0.685223 1.302133 0.166154 7.83691 4.617840e-15 3.677133 494 9960019 G C 1 0.150449 14.335561 223 2:55622624:G:A 2 55622624 A G 0.688508 1.324109 0.167119 7.92315 2.315640e-15 3.758833 496 9960019 G A 1 0.255449 14.635329 In\u00a0[24]: Copied!
!echo \"2:55513738:C:T\" > credible.snplist\n!echo \"2:55605943:A:G\" >> credible.snplist\n!echo \"2:55612986:G:C\" >> credible.snplist\n!echo \"2:55620927:G:A\" >> credible.snplist\n!echo \"2:55622624:G:A\" >> credible.snplist\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract credible.snplist \\\n --out credible_r\n\n!plink \\\n --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n --keep-allele-order \\\n --r2 square \\\n --extract credible.snplist \\\n --out credible_r2\n!echo \"2:55513738:C:T\" > credible.snplist !echo \"2:55605943:A:G\" >> credible.snplist !echo \"2:55612986:G:C\" >> credible.snplist !echo \"2:55620927:G:A\" >> credible.snplist !echo \"2:55622624:G:A\" >> credible.snplist !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to credible_r.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract credible.snplist\n --keep-allele-order\n --out credible_r\n --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to credible_r.ld ... 0% [processingwriting] done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3\nLogging to credible_r2.log.\nOptions in effect:\n --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n --extract credible.snplist\n --keep-allele-order\n --out credible_r2\n --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r2.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to credible_r2.ld ... 0% [processingwriting] done.\nIn\u00a0[25]: Copied!
credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"]\nld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None)\nld.columns=credible_snplist\nld.index=credible_snplist\nld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None)\nld2.columns=credible_snplist\nld2.index=credible_snplist\ncredible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"] ld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None) ld.columns=credible_snplist ld.index=credible_snplist ld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None) ld2.columns=credible_snplist ld2.index=credible_snplist In\u00a0[26]: Copied!
fig, ax = plt.subplots(ncols=2,figsize=(20,10),dpi=200)\nsns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0)\nsns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1)\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\nfig, ax = plt.subplots(ncols=2,figsize=(20,10),dpi=200) sns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0) sns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[26]:
Text(0.5, 1.0, 'LD r2 matrix')
Variants in the credible set are in strong LD with the bona fide causal variant.
This could also happen in real-world analyses: when the causal variant is missing from the summary statistics, fine-mapping will assign posterior probability to correlated variants instead. Please always be cautious when interpreting fine-mapping results.
"},{"location":"finemapping_susie/#finemapping-using-susier","title":"Finemapping using susieR\u00b6","text":""},{"location":"finemapping_susie/#data-preparation","title":"Data preparation\u00b6","text":""},{"location":"finemapping_susie/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"finemapping_susie/#data-standardization-and-sanity-check","title":"Data standardization and sanity check\u00b6","text":""},{"location":"finemapping_susie/#extract-lead-variants","title":"Extract lead variants\u00b6","text":""},{"location":"finemapping_susie/#create-manhattan-plot-for-checking","title":"Create manhattan plot for checking\u00b6","text":""},{"location":"finemapping_susie/#extract-the-variants-around-255513738ct-for-finemapping","title":"Extract the variants around 2:55513738:C:T for finemapping\u00b6","text":""},{"location":"finemapping_susie/#convert-or-to-beta","title":"Convert OR to BETA\u00b6","text":""},{"location":"finemapping_susie/#align-nea-with-reference-sequence","title":"Align NEA with reference sequence\u00b6","text":""},{"location":"finemapping_susie/#output-the-sumstats-of-this-locus","title":"Output the sumstats of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-plink-to-get-ld-matrix-for-this-locus","title":"Run PLINK to get LD matrix for this locus\u00b6","text":""},{"location":"finemapping_susie/#finemapping","title":"Finemapping\u00b6","text":""},{"location":"finemapping_susie/#load-locus-sumstats","title":"Load locus sumstats\u00b6","text":""},{"location":"finemapping_susie/#import-sumsier","title":"Import sumsieR\u00b6","text":""},{"location":"finemapping_susie/#load-ld-matrix","title":"Load LD matrix\u00b6","text":""},{"location":"finemapping_susie/#visualize-the-ld-structure-of-this-locus","title":"Visualize the LD structure of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-finemapping-use-susier","title":"Run finemapping use susieR\u00b6","text":""},{"location":"finemapping_susie/#extract-credible-sets-and-pip","title":"Extract credible sets and PIP\u00b6","text":""},{"location":"finemapping_susie/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"finemapping_susie/#pitfalls","title":"Pitfalls\u00b6","text":""},{"location":"finemapping_susie/#check-ld-of-the-causal-variant-and-variants-in-the-credible-set","title":"Check LD of the causal variant and variants in the credible set\u00b6","text":""},{"location":"finemapping_susie/#load-ld-and-plot","title":"Load LD and plot\u00b6","text":""},{"location":"plot_PCA/","title":"Plotting PCA","text":"In\u00a0[1]: Copied!import pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd import matplotlib.pyplot as plt import seaborn as sns In\u00a0[2]: Copied!
pca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\")\npca\npca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\") pca Out[2]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752
500 rows \u00d7 14 columns
In\u00a0[6]: Copied!ped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\")\nped\nped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\") ped Out[6]: sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00096 GBR EUR male NaN NaN 1 HG00097 GBR EUR female NaN NaN 2 HG00099 GBR EUR female NaN NaN 3 HG00100 GBR EUR female NaN NaN 4 HG00101 GBR EUR male NaN NaN ... ... ... ... ... ... ... 2499 NA21137 GIH SAS female NaN NaN 2500 NA21141 GIH SAS female NaN NaN 2501 NA21142 GIH SAS female NaN NaN 2502 NA21143 GIH SAS female NaN NaN 2503 NA21144 GIH SAS female NaN NaN
2504 rows \u00d7 6 columns
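The Unnamed: 4 and Unnamed: 5 columns are artifacts of trailing tab characters in the panel file. If needed, a one-line sketch to drop them:

ped = ped.dropna(axis=1, how="all")  # drop all-NaN columns created by trailing tabs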
In\u00a0[7]: Copied!pcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\")\npcaped\npcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\") pcaped Out[7]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 HG00403 CHS EAS male NaN NaN 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 HG00404 CHS EAS female NaN NaN 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 HG00406 CHS EAS male NaN NaN 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 HG00407 CHS EAS female NaN NaN 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 HG00409 CHS EAS male NaN NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 NA19087 JPT EAS female NaN NaN 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 NA19088 JPT EAS male NaN NaN 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 NA19089 JPT EAS male NaN NaN 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 NA19090 JPT EAS female NaN NaN 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752 NA19091 JPT EAS male NaN NaN
500 rows \u00d7 20 columns
In\u00a0[8]: Copied!plt.figure(figsize=(10,10))\nsns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50)\nplt.figure(figsize=(10,10)) sns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50) Out[8]:
<Axes: xlabel='PC1_AVG', ylabel='PC2_AVG'>"},{"location":"plot_PCA/#plotting-pca","title":"Plotting PCA\u00b6","text":""},{"location":"plot_PCA/#loading-files","title":"Loading files\u00b6","text":""},{"location":"plot_PCA/#merge-pca-and-population-information","title":"Merge PCA and population information\u00b6","text":""},{"location":"plot_PCA/#plotting","title":"Plotting\u00b6","text":""},{"location":"prs_tutorial/","title":"PRS Tutorial","text":"In\u00a0[1]: Copied!
import sys\nsys.path.insert(0,\"/Users/he/work/PRSlink/src\")\nimport prslink as pl\nimport sys sys.path.insert(0,\"/Users/he/work/PRSlink/src\") import prslink as pl In\u00a0[2]: Copied!
a= pl.PRS()\na= pl.PRS() In\u00a0[3]: Copied!
a.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")
- Dataset shape before loading : (0, 1)\n- Loading score data from file: ./1kgeas.0.1.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.1\n - Overlapping IDs:0\n- Loading finished successfully!\n- Dataset shape after loading : (504, 2)\n- Dataset shape before loading : (504, 2)\n- Loading score data from file: ./1kgeas.0.05.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.05\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 3)\n- Dataset shape before loading : (504, 3)\n- Loading score data from file: ./1kgeas.0.2.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.2\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 4)\n- Dataset shape before loading : (504, 4)\n- Loading score data from file: ./1kgeas.0.3.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.3\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 5)\n- Dataset shape before loading : (504, 5)\n- Loading score data from file: ./1kgeas.0.4.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.4\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 6)\n- Dataset shape before loading : (504, 6)\n- Loading score data from file: ./1kgeas.0.5.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.5\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 7)\n- Dataset shape before loading : (504, 7)\n- Loading score data from file: ./1kgeas.0.001.profile\n - Setting ID:IID\n - Loading score:SCORE\n - Loaded columns: 0.01\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 8)\nIn\u00a0[4]: Copied!
a.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")\na.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")
- Dataset shape before loading : (504, 8)\n- Loading pheno data from file: ../01_Dataset/t2d/1kgeas_t2d.txt\n - Setting ID:IID\n - Loading pheno:T2D\n - Loaded columns: T2D\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 9)\nIn\u00a0[5]: Copied!
a.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")\na.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")
- Dataset shape before loading : (504, 9)\n- Loading covar data from file: ./1kgeas.eigenvec\n - Setting ID:IID\n - Loading covar:PC1 PC2 PC3 PC4 PC5\n - Loaded columns: PC1 PC2 PC3 PC4 PC5\n - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 14)\nIn\u00a0[6]: Copied!
a.data[\"T2D\"] = a.data[\"T2D\"]-1\na.data[\"T2D\"] = a.data[\"T2D\"]-1 In\u00a0[7]: Copied!
a.data\na.data Out[7]: IID 0.1 0.05 0.2 0.3 0.4 0.5 0.01 T2D PC1 PC2 PC3 PC4 PC5 0 HG00403 -0.000061 -2.812450e-05 -0.000019 -2.131690e-05 -0.000024 -0.000022 0.000073 0 0.000107 0.039080 0.021048 0.016633 0.063373 1 HG00404 0.000025 4.460810e-07 0.000041 4.370760e-05 0.000024 0.000018 0.000156 1 -0.001216 0.045148 0.009013 0.028122 0.041474 2 HG00406 0.000011 2.369040e-05 -0.000009 2.928090e-07 -0.000010 -0.000008 -0.000188 0 0.005020 0.044668 0.016583 0.020077 -0.031782 3 HG00407 -0.000133 -1.326670e-04 -0.000069 -5.677710e-05 -0.000062 -0.000057 -0.000744 1 0.005408 0.034132 0.014955 0.003872 0.009794 4 HG00409 0.000010 -3.120730e-07 -0.000012 -1.873660e-05 -0.000025 -0.000023 -0.000367 1 -0.002121 0.031752 -0.048352 -0.043185 0.064674 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 499 NA19087 -0.000042 -6.215880e-05 -0.000038 -1.116230e-05 -0.000019 -0.000018 -0.000397 0 -0.067583 -0.040340 0.015038 0.039039 -0.010774 500 NA19088 0.000085 9.058670e-05 0.000047 2.666260e-05 0.000016 0.000014 0.000723 0 -0.069752 -0.047710 0.028578 0.036714 -0.000906 501 NA19089 -0.000067 -4.767610e-05 -0.000011 -1.393760e-05 -0.000019 -0.000016 -0.000126 0 -0.073989 -0.046706 0.040089 -0.034719 -0.062692 502 NA19090 0.000064 3.989030e-05 0.000022 7.445850e-06 0.000010 0.000003 -0.000149 0 -0.061156 -0.034606 0.032674 -0.016363 -0.065390 503 NA19091 0.000051 4.469220e-05 0.000043 3.089720e-05 0.000019 0.000016 0.000028 0 -0.067749 -0.052950 0.036908 -0.023856 -0.058515
504 rows \u00d7 14 columns
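set_k below specifies the assumed population prevalence K of the binary trait, which evaluate() with r2_lia=True uses to convert the observed-scale R2 to the liability scale. A sketch of the commonly used Lee et al. (2012) linear transformation, assuming prevalence K and in-sample case proportion P (PRSlink's exact implementation may differ):

from scipy import stats

def r2_observed_to_liability(r2_obs, K, P):
    t = stats.norm.ppf(1 - K)   # liability threshold for prevalence K
    z = stats.norm.pdf(t)       # standard normal density at the threshold
    C = (K * (1 - K)) ** 2 / (z ** 2 * P * (1 - P))
    return r2_obs * C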
In\u00a0[13]: Copied!a.set_k({\"T2D\":0.2})\na.set_k({\"T2D\":0.2}) In\u00a0[14]: Copied!
a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)\na.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)
- Binary trait: fitting logistic regression...\n - Binary trait: using records with phenotype being 0 or 1...\nOptimization terminated successfully.\n Current function value: 0.668348\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.653338\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.657903\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.654492\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.654413\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.653085\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.654681\n Iterations 5\nOptimization terminated successfully.\n Current function value: 0.661290\n Iterations 5\nOut[14]: PHENO TYPE PRS N_CASE N BETA CI_L CI_U P R2_null R2_full Delta_R2 AUC_null AUC_full Delta_AUC R2_lia_null R2_lia_full Delta_R2_lia SE 0 T2D B 0.01 200 502 0.250643 0.064512 0.436773 0.008308 0.010809 0.029616 0.018808 0.536921 0.586821 0.049901 0.010729 0.029826 0.019096 NaN 1 T2D B 0.05 200 502 0.310895 0.119814 0.501976 0.001428 0.010809 0.038545 0.027736 0.536921 0.601987 0.065066 0.010729 0.038925 0.028196 NaN 2 T2D B 0.5 200 502 0.367803 0.169184 0.566421 0.000284 0.010809 0.046985 0.036176 0.536921 0.605397 0.068477 0.010729 0.047553 0.036824 NaN 3 T2D B 0.2 200 502 0.365641 0.169678 0.561604 0.000255 0.010809 0.047479 0.036670 0.536921 0.607318 0.070397 0.010729 0.048079 0.037349 NaN 4 T2D B 0.3 200 502 0.367788 0.171062 0.564515 0.000248 0.010809 0.047686 0.036877 0.536921 0.608493 0.071573 0.010729 0.048315 0.037585 NaN 5 T2D B 0.1 200 502 0.374750 0.181520 0.567979 0.000144 0.010809 0.050488 0.039679 0.536921 0.613957 0.077036 0.010729 0.051270 0.040540 NaN 6 T2D B 0.4 200 502 0.389232 0.189866 0.588597 0.000130 0.010809 0.051145 0.040336 0.536921 0.609238 0.072318 0.010729 0.051845 0.041116 NaN In\u00a0[15]: Copied!
a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)\na.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)
Optimization terminated successfully.\n Current function value: 0.668348\n Iterations 5\nIn\u00a0[16]: Copied!
a.plot_prs(a.score_cols)\na.plot_prs(a.score_cols) In\u00a0[\u00a0]: Copied!
\n"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 849c93ba..e31759a0 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ