diff --git a/05_PCA/index.html b/05_PCA/index.html index ecad9a63..9a8bb819 100644 --- a/05_PCA/index.html +++ b/05_PCA/index.html @@ -2331,6 +2331,10 @@

Principal component analysis (PCA)

PCA-UMAP
  • References
  • +
    +

    PCA workflow

    +

(image: PCA workflow diagram)

    +

    Preparation

    Exclude SNPs in high-LD or HLA regions

    For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data.

    @@ -2375,7 +2379,7 @@

Download BED-like file

    Create a list of SNPs in high-LD or HLA regions

    -

    Next, use high-ld.txt to extract all SNPs which are located in the regions described in the file using the code as follows:

    +

    Next, use high-ld.txt to extract all SNPs that are located in the regions described in the file using the code as follows:

    plink --file ${plinkFile} --make-set high-ld.txt --write-set --out hild
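Once hild.set has been written, these SNPs can be excluded before LD pruning and PCA. A minimal sketch, assuming the same ${plinkFile} prefix as above (the MAF and pruning thresholds and the output prefix are illustrative, not prescriptive):

# keep common variants, exclude the high-LD/HLA SNPs listed in hild.set, and perform LD pruning\nplink \\\n    --file ${plinkFile} \\\n    --maf 0.01 \\\n    --exclude hild.set \\\n    --indep-pairwise 500 50 0.2 \\\n    --out sample_data.clean\n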
     
    diff --git a/search/search_index.json b/search/search_index.json index c3079149..f1851072 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"GWASTutorial","text":"

    Note: this tutorial is being updated to Version 2024

This GitHub page aims to provide a hands-on tutorial on common analyses in Complex Trait Genomics. This tutorial is designed for the course Fundamental Exercise II provided by The Laboratory of Complex Trait Genomics at the University of Tokyo. For more information, please see About.

This tutorial covers the minimum skills and knowledge required to perform a typical genome-wide association study (GWAS). The contents are categorized into the following groups. Additionally, for absolute beginners, we have also prepared a section on Linux command lines.

    If you have any questions or suggestions, please feel free to let us know in the Issue section of this repository.

    "},{"location":"#contents","title":"Contents","text":""},{"location":"#command-lines","title":"Command lines","text":""},{"location":"#pre-gwas","title":"Pre-GWAS","text":""},{"location":"#gwas","title":"GWAS","text":""},{"location":"#post-gwas","title":"Post-GWAS","text":"

In these sections, we will briefly introduce Post-GWAS analyses, which dig deeper into the GWAS summary statistics.

    "},{"location":"#topics","title":"Topics","text":"

    Introductions on GWAS-related issues

    "},{"location":"#others","title":"Others","text":""},{"location":"01_Dataset/","title":"Sample Dataset","text":"

    504 EAS individuals from 1000 Genomes Project Phase 3 version 5

URL: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

    Genome build: human_g1k_v37.fasta (hg19)

    "},{"location":"01_Dataset/#genotype-data-processing","title":"Genotype Data Processing","text":""},{"location":"01_Dataset/#download","title":"Download","text":"

    Note

The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip is included in 01_Dataset when you clone the repository, so there is no need to download it again.

    You can also simply run download_sampledata.sh in 01_Dataset and the dataset will be downloaded and decompressed.

    ./download_sampledata.sh\n

The sample dataset is currently hosted on Dropbox, which may not be accessible for users in certain regions.

Alternatively, you can manually download it from this link.

    Unzip the dataset unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip, and you will get the following files:

    1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
    "},{"location":"01_Dataset/#phenotype-simulation","title":"Phenotype Simulation","text":"

    Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.

    gcta  \\\n  --bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \\\n  --simu-cc 250 254  \\\n  --simu-causal-loci causal.snplist  \\\n  --simu-hsq 0.8  \\\n  --simu-k 0.5  \\\n  --simu-rep 1  \\\n  --out 1kgeas_binary\n
    $ cat causal.snplist\n2:55620927:G:A 3\n8:97094292:C:T 3\n20:42758834:T:C 3\n7:134326056:G:T 3\n1:167562605:G:A 3\n
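GCTA writes the simulated case-control phenotypes to a .phen file (FID, IID and phenotype on each line). A quick sketch to inspect it, assuming the default output naming from the --out prefix above:

# check the simulated phenotype file\nhead 1kgeas_binary.phen\n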

    Warning

This simulation is only meant to illustrate the analysis pipeline and data formats. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the results themselves are meaningless.

    Allele frequency and Effect size

    "},{"location":"01_Dataset/#reference","title":"Reference","text":""},{"location":"02_Linux_basics/","title":"Introduction","text":"

This section is intended to provide a minimal introduction to the Linux command line for handling genomic data. (If you are already familiar with Linux commands, it is completely OK to skip this section.)

If you are a beginner with no programming background, it will be helpful to learn some basic commands before starting any analysis. In this section, we will introduce the most basic commands that enable you to handle genomic files in the terminal on a Linux system.

    For Mac users

This tutorial should work with few problems. Simply open your terminal and follow along. (Note: a few commands might be different on macOS.)

    For Windows users

You can simply install WSL to get a Linux environment. Please check here for how to install WSL.

    "},{"location":"02_Linux_basics/#table-of-contents","title":"Table of Contents","text":""},{"location":"02_Linux_basics/#linux-system-introduction","title":"Linux System Introduction","text":""},{"location":"02_Linux_basics/#what-is-linux","title":"What is Linux?","text":"Term Description Linux refers to a family of open-source Unix-like operating systems based on the Linux kernel. Linux kernel a free and open-source Unix-like operating system kernel, which controls the software and hardware of the computer. Linux distributions refer to\u00a0operating systems\u00a0made from a software collection that is based upon the\u00a0Linux kernel.

    Main functions of the Linux kernel

    Some of the most common linux distributions

    Linux and Linus

Linux is named after Linus Benedict Torvalds, a legendary Finnish software engineer who led the development of the Linux kernel. He also developed the popular version control software Git.

    Reference: https://en.wikipedia.org/wiki/Linux

    "},{"location":"02_Linux_basics/#how-do-we-interact-with-computers","title":"How do we interact with computers?","text":"

    GUI and CUI

    Shell

    "},{"location":"02_Linux_basics/#a-general-comparison-between-cui-and-gui","title":"A general comparison between CUI and GUI","text":"GUI CUI Interaction Graphics Command line Precision LOW HIGH Speed LOW HIGH Memory required HIGH LOW Ease of operation Easier DIFFICULT Flexibility MORE flexible LESS flexible

    Tip

The reason why we want to use CUI for large-scale data analysis is that CUI is better in terms of precision, memory usage and processing speed.

    "},{"location":"02_Linux_basics/#overview-of-the-basic-commands-in-linux","title":"Overview of the basic commands in Linux","text":"

Unlike clicking and dragging files in Windows or macOS, in Linux we usually handle files by typing commands in the terminal.

    Here is a list of the basic commands we are going to cover in this brief tutorial:

    Basic Linux commands

| Function group | Commands | Description |
|---|---|---|
| Directories | pwd, ls, mkdir, rmdir | Commands for checking, creating and removing directories |
| Files | touch, cp, mv, rm | Commands for creating, copying, moving and removing files |
| Checking files | cat, zcat, head, tail, less, more, wc | Commands for inspecting files |
| Archiving and compression | tar, gzip, gunzip, zip, unzip | Commands for archiving and compressing files |
| Manipulating text | sort, uniq, cut, join, tr | Commands for manipulating text files |
| Modifying permission | chmod, chown, chgrp | Commands for changing the permissions of files and directories |
| Links | ln | Commands for creating symbolic and hard links |
| Pipe, redirect and others | pipe, >, >>, *, ., .. | A group of miscellaneous commands |
| Advanced text editing | awk, sed | Commands for more complicated text manipulation and editing |

"},{"location":"02_Linux_basics/#how-to-check-the-usage-of-a-command-using-man","title":"How to check the usage of a command using man:","text":"

    The first command we might want to learn is man, which shows the manual for a certain command. When you forget how to use a command, you can always use man to check.

man: check the manual of a command (e.g., man chmod), or use the --help option (e.g., chmod --help)

    For example, we want to check the usage of pwd:

    Use man to get the manual for commands

    $ man pwd\n
    Then you will see the manual of pwd in your terminal.
    PWD(1)                                              User     Commands                                              PWD(1)\n\nNAME\n       pwd - print name of current/working directory\n\nSYNOPSIS\n       pwd [OPTION]...\n\nDESCRIPTION\n       Print the full filename of the current working directory.\n....\n

    Explain shell

    Or you can use this wonderful website to get explanations for your commands.

    URL : https://explainshell.com/

    "},{"location":"02_Linux_basics/#commands","title":"Commands","text":""},{"location":"02_Linux_basics/#directories","title":"Directories","text":"

The first set of commands is: pwd, cd, ls, mkdir and rmdir, which are related to directories (like folders in a Windows system).

    "},{"location":"02_Linux_basics/#pwd","title":"pwd","text":"

    pwd : Print working directory, which means printing the path of the current directory (working directory)

    Use pwd to print the current directory you are in

    $ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n

    This command prints the absolute path.

    An example of Linux file system and file paths

| Type | Description | Example |
|---|---|---|
| Absolute path | path starting from root (the orange path) | /home/User3/GWASTutorial/02_Linux_basics/README.md |
| Relative path | path starting from the current directory (the blue path) | ./GWASTutorial/02_Linux_basics/README.md |

    Tip: use readlink to obtain the absolute path of a file

    To get the absolute path of a file, you can use readlink -f [filename].

    $ readlink -f README.md \n/home/he/work/GWASTutorial/02_Linux_basics/README.md\n
    "},{"location":"02_Linux_basics/#cd","title":"cd","text":"

    cd: Change the current working directory.

    Use cd to change directory to 02_Linux_basics and then print the current directory

    $ cd 02_Linux_basics\n$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
    "},{"location":"02_Linux_basics/#ls","title":"ls","text":"

    ls : List the contents in the working directory

    Some frequently used options for ls :

    Simply list the files and directories in the current directory

    $ ls\nREADME.md  sumstats.txt\n

    List the files and directories with options -lha

    $ ls -lha\ndrwxr-xr-x   4 he  staff   128B Dec 23 14:07 .\ndrwxr-xr-x  17 he  staff   544B Dec 23 12:13 ..\n-rw-r--r--   1 he  staff     0B Oct 17 11:24 README.md\n-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt\n

    Tip: use tree to visualize the structure of a directory

    You can use tree command to visualize the structure of a directory.

    $ tree ./02_Linux_basics/\n./02_Linux_basics/\n\u251c\u2500\u2500 README.md\n\u2514\u2500\u2500 sumstats.txt\n\n0 directories, 2 files\n
    "},{"location":"02_Linux_basics/#mkdir-rmdir","title":"mkdir & rmdir","text":"

    Make a directory and delete it

    $ mkdir new_directory\n$ ls\nnew_directory  README.md  sumstats.txt\n$ rmdir new_directory/\n$ ls\nREADME.md  sumstats.txt\n
    "},{"location":"02_Linux_basics/#manipulating-files","title":"Manipulating files","text":"

    This set of commands includes: touch, mv , rm and cp

    "},{"location":"02_Linux_basics/#touch","title":"touch","text":"

The touch command is used to create a new empty file.

    Create an empty text file called newfile.txt in this directory

$ ls -l\ntotal 64048\n-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md\n-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt\n\n$ touch newfile.txt\n$ ls -l\ntotal 64048\n-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md\n-rw-r--r--  1 he  staff         0 Dec 23 14:14 newfile.txt\n-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt\n
    "},{"location":"02_Linux_basics/#mv","title":"mv","text":"

    mv has two functions:

The following command will create a new directory called new_directory and move sumstats.txt into that directory, just like dragging a file into a folder in a Windows system.

    Move a file to a different directory

    # make a new directory\n$ mkdir new_directory\n\n#move sumstats to the new directory\n$ mv sumstats.txt new_directory/\n\n# list the item in new_directory\n$ ls new_directory/\nsumstats.txt\n

    Now, let's move it back to the current directory and rename it to sumstats_new.txt.

    Rename a file using mv

    $ mv ./new_directory/sumstats.txt ./\n
Note: ./ means the current directory. You can also use mv to rename a file:
# rename\n$ mv sumstats.txt sumstats_new.txt\n

    "},{"location":"02_Linux_basics/#rm","title":"rm","text":"

rm: Remove files or directories

    Remove a file and a directory

# remove a file\n$ rm file\n\n# remove files in a directory (recursive mode)\n$ rm -r directory/\n

    There is no trash can in Linux command-line interface

If you delete a file with rm, it will be very difficult to restore it. Please be careful when using rm.

    "},{"location":"02_Linux_basics/#cp","title":"cp","text":"

The cp command is used to copy files or directories.

    Copy a file and a directory

# copy files\n$ cp file1 file2\n\n# copy directory\n$ cp -r directory1/ directory2/\n
    "},{"location":"02_Linux_basics/#links","title":"Links","text":"

A symbolic link is like a shortcut in a Windows system: a special type of file that points to another file.

    It is very useful when you want to organize your tool box or working space.

    You can use ln -s pathA pathB to create such a link.

    Create a symbolic link for plink

Let's create a symbolic link for plink first.

# /home/he/tools/plink/plink is the original file\n# /home/he/tools/bin is the path for the symbolic link \nln -s /home/he/tools/plink/plink /home/he/tools/bin\n

    And then check the link.

    cd /home/he/tools/bin\nls -lha\nlrwxr-xr-x  1 he  staff    27B Aug 30 11:30 plink -> /home/he/tools/plink/plink\n
    "},{"location":"02_Linux_basics/#archiving-and-compression","title":"Archiving and Compression","text":"

Results for millions of variants are usually very large, sometimes >10GB, or consist of multiple files.

    To save space and make it easier to transfer, we need to archive and compress these files.

    Archiving and Compression

Commonly used commands for archiving and compression:

| Extensions | Create | Extract | Functions |
|---|---|---|---|
| file.gz | gzip | gunzip | compress |
| files.tar | tar -cvf | tar -xvf | archive |
| files.tar.gz or files.tgz | tar -czvf | tar -xvzf | archive and compress |
| file.zip | zip | unzip | archive and compress |

    Compress and decompress a file using gzip and gunzip

    $ ls -lh\n-rw-r--r--  1 he  staff    31M Dec 23 14:07 sumstats.txt\n\n$ gzip sumstats.txt\n$ ls -lh\n-rw-r--r--  1 he  staff   9.9M Dec 23 14:07 sumstats.txt.gz\n\n$ gunzip sumstats.txt.gz\n$ ls -lh\n-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt\n
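The gzip example above compresses a single file. For a directory or multiple files, tar from the table above can archive and compress in one step. A small sketch (the directory and archive names are illustrative):

# archive and compress a directory into results.tar.gz\ntar -czvf results.tar.gz ./new_directory/\n\n# extract it again\ntar -xvzf results.tar.gz\n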
    "},{"location":"02_Linux_basics/#read-and-check-files","title":"Read and check files","text":"

    We have a group of handy commands to check part of or the entire file, including cat, zcat, less, head, tail, wc

    "},{"location":"02_Linux_basics/#cat","title":"cat","text":"

The cat command can print the contents of files or concatenate files.

    Create and then cat the file a_text_file.txt

    $ ls -lha > a_text_file.txt\n$ cat a_text_file.txt \ntotal 32M\ndrwxr-x---  2 he staff 4.0K Apr  2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..\n-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt\n-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md\n-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt\n

    Warning

Be careful not to cat a text file with a huge number of lines. You can try to cat sumstats.txt and see what happens.

By the way, > a_text_file.txt here means redirecting the output to the file a_text_file.txt.

    "},{"location":"02_Linux_basics/#zcat","title":"zcat","text":"

zcat is similar to cat, but can only be applied to compressed files.

    cat and zcat a gzipped text file

    $ gzip a_text_file.txt \n$ cat a_text_file.txt.gz                                                         TGba_text_file.    txt\u044f\n@\u0231\u00bbO\ud8ac\udc19v\u0602\ud85e\udca9\u00bc\ud9c3\udce0bq}\udb06\udca4\\\ueee0\u00a4n\u0662\u00aa\uda40\udc2cn\u00bb\u06a1\u01ed\n                          w5J_\u00bd\ud88d\ude27P\u07c9=\u00ffK\n(\u05a3\u0530\u00a7\u04a4\u0176a\u0786                              \u00acM\u00adR\udbb5\udc8am\u00b3\u00fee\u00b8\u00a4\u00bc\u05cdSd\ufff1\u07f2\ub4e4\u00aa\u00adv\n       \u5a41                                                                                                               resize: unknown character, exiting.\n\n$ zcat a_text_file.txt.gz \ntotal 32M\ndrwxr-x---  2 he staff 4.0K Apr  2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..\n-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt\n-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md\n-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt\n

    gzcat

Use gzcat instead of zcat if your device is running macOS.

    "},{"location":"02_Linux_basics/#head","title":"head","text":"

    head: Print the first 10 lines.

    -n: option to change the number of lines.

    Check the first 10 lines and only the first line of the file sumstats.txt

    $ head sumstats.txt \nCHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   319 17  2   1   1   ADD 10000   1.04326 0.0495816   0.854176    0.393008    .\n1   319 22  1   2   2   ADD 10000   1.03347 0.0493972   0.666451    0.505123    .\n1   418 23  1   2   2   ADD 10000   1.02668 0.0498185   0.528492    0.597158    .\n1   537 30  1   2   2   ADD 10000   1.01341 0.0498496   0.267238    0.789286    .\n1   546 31  2   1   1   ADD 10000   1.02051 0.0336786   0.60284 0.546615    .\n1   575 33  2   1   1   ADD 10000   1.09795 0.0818305   1.14199 0.25346 .\n1   752 44  2   1   1   ADD 10000   1.02038 0.0494069   0.408395    0.682984    .\n1   913 50  2   1   1   ADD 10000   1.07852 0.0493585   1.53144 0.12566 .\n1   1356    77  2   1   1   ADD 10000   0.947521    0.0339805   -1.5864 0.112649    .\n\n$ head -n 1 sumstats.txt \nCHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n
    "},{"location":"02_Linux_basics/#tail","title":"tail","text":"

Similar to head, you can use tail to check the last 10 lines. -n works in the same way.

    Check the last 10 lines of the file sumstats.txt

    $ tail sumstats.txt \n22  99996057    9959945 2   1   1   ADD 10000   1.03234 0.0335547   0.948413    0.342919.\n22  99996465    9959971 2   1   1   ADD 10000   1.04755 0.0337187   1.37769 0.1683  .\n22  99997041    9960013 2   1   1   ADD 10000   1.01942 0.0937548   0.205195    0.837419.\n22  99997608    9960051 2   1   1   ADD 10000   0.969928    0.0397711   -0.767722   0.    442652    .\n22  99997629    9960055 2   1   1   ADD 10000   0.986949    0.0395305   -0.332315   0.    739652    .\n22  99997742    9960061 2   1   1   ADD 10000   0.990829    0.0396614   -0.232298   0.    816307    .\n22  99998121    9960086 2   1   1   ADD 10000   1.04448 0.0335879   1.29555 0.19513 .\n22  99998455    9960106 2   1   1   ADD 10000   0.880953    0.152754    -0.829771   0.    406668    .\n22  99999208    9960146 2   1   1   ADD 10000   0.944604    0.065187    -0.874248   0.    381983    .\n22  99999382    9960164 2   1   1   ADD 10000   0.970509    0.033978    -0.881014   0.37831 .\n
    "},{"location":"02_Linux_basics/#wc","title":"wc","text":"

wc: short for word count, which counts the lines, words, and characters in a file.

    For example,

    Count the lines, words, and characters in sumstats.txt

    $ wc sumstats.txt \n  445933  5797129 32790417 sumstats.txt\n
    This means that sumstats.txt has 445933 lines, 5797129 words, and 32790417 characters.
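If you only need the number of lines, the -l option prints just that:

$ wc -l sumstats.txt\n445933 sumstats.txt\n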

    "},{"location":"02_Linux_basics/#edit-files","title":"Edit files","text":"

Vim is a handy text editor for the command line.

    Vim - text editor

    vim README.md\n

    Simple workflow using Vim

    1. vim file_to_edit.txt
    2. Press i to enter the INSERT mode.
    3. Edit the file.
    4. When finished, just press Esc key to escape the INSERT mode.
    5. Then enter :wq to quit and also save the file.

Vim is a little bit hard to learn for beginners, but once you get familiar with it, it will be a powerful and convenient tool. For more detailed tutorials on Vim, you can check: https://github.com/iggredible/Learn-Vim

    Other common command line text editors

    "},{"location":"02_Linux_basics/#permission","title":"Permission","text":"

The permissions of a file or directory are represented as a 10-character string (1+3+3+3):

For example, this represents a directory (the initial d) that is readable, writable and executable for the owner (the first rwx), users in the same group (the middle rwx) and others (the last rwx).

    drwxrwxrwx

    -> d (directory or file) rwx (permissions for owner) rwx (permissions for users in the same group) rwx (permissions for other users)

| Notation | Description |
|---|---|
| r | readable |
| w | writable |
| x | executable |
| d | directory |
| - | file |

    Command for checking the permissions of files in the current directory: ls -l

    Command for changing permissions: chmod, chown, chgrp

    Syntax:

chmod [3-digit numeric notation, e.g. 660] [path]\n

| Number notation | Permission | 3-digit binary notation |
|---|---|---|
| 7 | rwx | 111 |
| 6 | rw- | 110 |
| 5 | r-x | 101 |
| 4 | r-- | 100 |
| 3 | -wx | 011 |
| 2 | -w- | 010 |
| 1 | --x | 001 |
| 0 | --- | 000 |

    Change the permissions of the file README.md to 660

# there is a readme file in the directory, and its permissions are -rw-r----- \n$ ls -lh\ntotal 4.0K\n-rw-r----- 1 he staff 2.1K Feb 24 01:16 README.md\n\n# let's change the permissions to 660, which is the numeric notation of -rw-rw---- based on the table above\n$ chmod 660 README.md \n\n# check again, and it was changed.\n$ ls -lh\ntotal 4.0K\n-rw-rw---- 1 he staff 2.1K Feb 24 01:16 README.md\n

    Note

These commands are very important because we use genome data, which could raise severe ethical and privacy issues if there is a data leak.

    Warning

    Please always be cautious when handling human genomic data.

    "},{"location":"02_Linux_basics/#others","title":"Others","text":"

There is a group of very handy and flexible commands that will greatly improve your efficiency. These include |, >, >>, *, ., .., ~, and -.

    "},{"location":"02_Linux_basics/#pipe","title":"| (pipe)","text":"

A pipe is used to pass the output of the previous command to the next command as input, instead of printing it in the terminal. Using pipes, you can perform very complicated manipulations of files.

    An example of Pipe

    cat sumstats.txt | sort | uniq | wc\n
    This means (1) print sumstats, (2) sort the output, (3) then keep the unique lines and finally (4) count the lines and words.

    "},{"location":"02_Linux_basics/#_1","title":">","text":"

    > redirects output to a new file (if the file already exist, it will be overwritten)

    Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt

    cat sumstats.txt | sort | uniq | wc > count.txt\n
    "},{"location":"02_Linux_basics/#_2","title":">>","text":"

    >> redirects output to a file by appending to the end of the file (if the file already exist, it will not be overwritten)

    Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt by appending

    cat sumstats.txt | sort | uniq | wc >> count.txt\n

    Other useful commands include :

| Command | Description | Example code | Example code meaning |
|---|---|---|---|
| * | represents zero or more characters | - | - |
| ? | represents a single character | - | - |
| . | the current directory | - | - |
| .. | the parent directory of the current directory | cd .. | change to the parent directory of the current directory |
| ~ | the home directory | cd ~ | change to the current user's home directory |
| - | the last directory you were working in | cd - | change to the last directory you were working in |

    Wildcards

The asterisk * and the question mark ? are called wildcard characters or wildcards in Linux, which are special symbols that can represent other normal characters. Wildcards are especially useful when handling multiple files with similar patterns in their names, as in the sketch below.
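A small sketch of wildcard usage (the file names here are hypothetical):

# list all files ending with .txt\nls *.txt\n\n# ? matches exactly one character: chr1.vcf ... chr9.vcf, but not chr10.vcf\nls chr?.vcf\n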

    Warning

Be extremely careful when you use rm and * together. It can be disastrous if you mistakenly type rm *.

    "},{"location":"02_Linux_basics/#bash-scripts","title":"Bash scripts","text":"

    If you have a lot of commands to run, or if you want to automate some complex manipulations, bash scripts are a good way to address this issue.

    We can use vim to create a bash script called hello.sh

    A simple example of bash scripts:

    Example

    hello.sh
    #!/bin/bash\necho \"Hello, world1\"\necho \"Hello, world2\"\n

#! is called a shebang, which tells the system which interpreter to use to execute the shell script.

    Then use chmod to give it permission to execute.

    chmod +x hello.sh \n

Now we can run the script by ./hello.sh:

./hello.sh\nHello, world1\nHello, world2\n
    "},{"location":"02_Linux_basics/#advanced-text-editing","title":"Advanced text editing","text":"

    (optional: awk, sed, cut, sort, join, uniq)

    Advanced commands:
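As a minimal sketch of what these commands can do, here are a few one-liners applied to the sumstats.txt file used above (assuming the column order shown in the head example, where column 1 is CHROM, column 3 is ID and column 12 is P; the file is whitespace-delimited):

# print CHROM, POS and P (columns 1, 2 and 12) for variants on chromosome 22\nawk '$1 == 22 {print $1, $2, $12}' sumstats.txt | head\n\n# edit text in a stream: replace ADD with ADDITIVE\nsed 's/ADD/ADDITIVE/' sumstats.txt | head\n\n# extract the ID column, sort it and count the unique IDs\nawk '{print $3}' sumstats.txt | sort | uniq | wc -l\n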

    "},{"location":"02_Linux_basics/#git-and-github","title":"Git and Github","text":"

Git is a powerful version control software, and GitHub is a platform where you can share your code.

    Currently you just need to learn git clone, which simply downloads an existing repository.

    git clone https://github.com/Cloufield/GWASTutorial.git

    You can also check here for more information.

    Quote

    "},{"location":"02_Linux_basics/#download","title":"Download","text":"

We can use the wget [option] [url] command to download files to a local machine.

The -O option specifies a new file name for the downloaded file.

    Use wget to download the hg19 reference genome from UCSC

    # Download hg19 reference genome from UCSC\nwget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n\n# Download hg19 reference genome from UCSC and rename it to  my_refgenome.fa.gz\nwget -O my_refgenome.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n
    "},{"location":"02_Linux_basics/#exercise","title":"Exercise","text":"

    The questions are generated by Microsoft Bing!

    What is the command to list all files and directories in your current working directory?

    What is the command to create a new directory named \u201ctest\u201d?

    What is the command to copy a file named \u201cdata.txt\u201d from your current working directory to another directory named \u201cbackup\u201d?

    What is the command to display the first 10 lines of a file named \u201cresults.csv\u201d?

    What is the command to count the number of lines, words, and characters in a file named \u201creport.txt\u201d?

    What is the command to search for a pattern in a file named \u201clog.txt\u201d and print only the matching lines?

    What is the command to sort the contents of a file named \u201cnames.txt\u201d in alphabetical order and save the output to a new file named \u201csorted_names.txt\u201d?

    What is the command to display the difference between two files named \u201cold_version.py\u201d and \u201cnew_version.py\u201d?

    What is the command to change the permissions of a file named \u201cscript.sh\u201d to make it executable by everyone?

    What is the command to run a program named \u201cprogram.exe\u201d in the background and redirect its output to a file named \u201coutput.log\u201d?

    "},{"location":"03_Data_formats/","title":"Data format","text":"

    This section lists some of the most commonly used formats in complex trait genomic analysis.

    "},{"location":"03_Data_formats/#table-of-contents","title":"Table of Contents","text":""},{"location":"03_Data_formats/#data-formats-for-general-purposes","title":"Data formats for general purposes","text":""},{"location":"03_Data_formats/#txt","title":"txt","text":"

    Simple text file

    .txt

    cat sample_text.txt \nLorem ipsum dolor sit amet, consectetur adipiscing elit. In ut sem congue, tristique tortor et, ullamcorper elit. Nulla elementum, erat ac fringilla mattis, nisi tellus euismod dui, interdum laoreet orci velit vel leo. Vestibulum neque mi, pharetra in tempor id, malesuada at ipsum. Duis tellus enim, suscipit sit amet vestibulum in, ultricies vitae erat. Proin consequat id quam sed sodales. Ut a magna non tellus dictum aliquet vitae nec mi. Suspendisse potenti. Vestibulum mauris sem, viverra ac metus sed, scelerisque ornare arcu. Vivamus consequat, libero vitae aliquet tempor, lorem leo mattis arcu, et viverra erat ligula sit amet tortor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Praesent ut massa ac tortor lobortis placerat. Pellentesque aliquam tortor augue, at rutrum magna molestie et. Etiam congue nulla in venenatis congue. Nunc ac felis pharetra, cursus leo et, finibus eros.\n
    Random texts are generated using - https://www.lipsum.com/

    "},{"location":"03_Data_formats/#tsv","title":"tsv","text":"

Tab-separated values; a tabular data format.

    .tsv

    head sample_data.tsv\n#CHROM  POS ID  REF ALT A1  FIRTH?  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   N   ADD 503 0.750168    0.280794    -1.02373    0.305961    .\n1   14599   1:14599:T:A T   A   A   N   ADD 503 1.80972 0.231595    2.56124 0.0104299   .\n1   14604   1:14604:A:G A   G   G   N   ADD 503 1.80972 0.231595    2.56124 0.0104299   .\n1   14930   1:14930:A:G A   G   G   N   ADD 503 1.70139 0.240245    2.21209 0.0269602   .\n1   69897   1:69897:T:C T   C   T   N   ADD 503 1.58002 0.194774    2.34855 0.0188466   .\n1   86331   1:86331:A:G A   G   G   N   ADD 503 1.47006 0.236102    1.63193 0.102694    .\n1   91581   1:91581:G:A G   A   A   N   ADD 503 0.924422    0.122991    -0.638963   0.522847    .\n1   122872  1:122872:T:G    T   G   G   N   ADD 503 1.07113 0.180776    0.380121    0.703856    .\n1   135163  1:135163:C:T    C   T   T   N   ADD 503 0.711822    0.23908 -1.42182    0.155079    .\n
    "},{"location":"03_Data_formats/#csv","title":"csv","text":"

Comma-separated values; a tabular data format.

    .csv

    head sample_data.csv \n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
    "},{"location":"03_Data_formats/#data-formats-in-bioinformatics","title":"Data formats in bioinformatics","text":"

    A typical workflow for generating genotype data for genome-wide association analysis.

    "},{"location":"03_Data_formats/#sequence","title":"Sequence","text":""},{"location":"03_Data_formats/#fasta","title":"fasta","text":"

    text-based format for representing either nucleotide sequences or amino acid (protein) sequences

    .fa or .fasta

    >SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n
    "},{"location":"03_Data_formats/#fastq","title":"fastq","text":"

    text-based format for storing both a nucleotide sequence and its corresponding quality scores

    .fastq

    @SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n+\n!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n
    Reference: https://en.wikipedia.org/wiki/FASTQ_format

    "},{"location":"03_Data_formats/#alingment","title":"Alingment","text":""},{"location":"03_Data_formats/#sambam","title":"SAM/BAM","text":"

    Sequence Alignment/Map Format is a TAB-delimited text file format consisting of a header section and an alignment section.

    .sam

    @HD VN:1.6 SO:coordinate\n@SQ SN:ref LN:45\nr001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *\nr002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *\nr003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;\nr004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *\nr003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;\nr001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1\n
    Reference : https://samtools.github.io/hts-specs/SAMv1.pdf

    "},{"location":"03_Data_formats/#variant-and-genotype","title":"Variant and genotype","text":""},{"location":"03_Data_formats/#vcf-vcfgz-vcfgztbi","title":"vcf / vcf.gz / vcf.gz.tbi","text":"

    VCF is a text file format consisting of meta-information lines, a header line, and then data lines. Each data line contains information about a variant in the genome (and the genotype information on samples for each variant).

    .vcf

    ##fileformat=VCFv4.2\n##fileDate=20090805\n##source=myImputationProgramV3.1\n##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta\n##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>\n##phasing=partial\n##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">\n##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">\n##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">\n##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">\n##FILTER=<ID=q10,Description=\"Quality below 10\">\n##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">\n##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">\n##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">\n#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003\n20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.\n20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3\n20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4\n20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2\n20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3\n
    Reference : https://samtools.github.io/hts-specs/VCFv4.2.pdf
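The .vcf.gz and .vcf.gz.tbi files mentioned in the heading are the bgzip-compressed VCF and its tabix index, which enable fast random access by genomic region. A sketch using the bgzip and tabix tools from htslib (the file name is illustrative):

# compress with bgzip (block gzip) and build a tabix index\nbgzip sample.vcf\ntabix -p vcf sample.vcf.gz\n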

    "},{"location":"03_Data_formats/#plink-format","title":"PLINK format","text":"

    The figure shows how genotypes are stored in files.

    We have 3 parts of information:

    1. Individual information
    2. Variant information
    3. Genotype matrix

    And there are different ways (format sets) to represent this information in PLINK1.9 and PLINK2:

    1. ped / map
    2. fam / bim / bed
    3. psam / pvar / pgen

    "},{"location":"03_Data_formats/#ped-map","title":"ped / map","text":"

    .ped (PLINK/MERLIN/Haploview text pedigree + genotype table)

Original standard text format for sample pedigree information and genotype calls. Contains no header line, and one line per sample with 2V+6 fields, where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

    .ped

    # check the first 16 rows and 16 columns of the ped file\ncut -d \" \" -f 1-16 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped | head\n0 HG00403 0 0 0 -9 G G T T A A G A C C\n0 HG00404 0 0 0 -9 G G T T A A G A T C\n0 HG00406 0 0 0 -9 G G T T A A G A T C\n0 HG00407 0 0 0 -9 G G T T A A A A C C\n0 HG00409 0 0 0 -9 G G T T A A G A C C\n0 HG00410 0 0 0 -9 G G T T A A G A C C\n0 HG00419 0 0 0 -9 G G T T A A A A T C\n0 HG00421 0 0 0 -9 G G T T A A G A C C\n0 HG00422 0 0 0 -9 G G T T A A G A C C\n0 HG00428 0 0 0 -9 G G T T A A G A C C\n0 HG00436 0 0 0 -9 G G A T G A A A C C\n0 HG00437 0 0 0 -9 C G T T A A G A C C\n0 HG00442 0 0 0 -9 G G T T A A G A C C\n0 HG00443 0 0 0 -9 G G T T A A G A C C\n0 HG00445 0 0 0 -9 G G T T A A G A C C\n0 HG00446 0 0 0 -9 C G T T A A G A T C\n

    .map (PLINK text fileset variant information file)

Variant information file accompanying a .ped text pedigree + genotype table. A text file with no header line, and one line per variant with the following 3-4 fields:

1. Chromosome code
2. Variant identifier
3. Position in morgans or centimorgans (optional)
4. Base-pair coordinate

    .map

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n1       1:13273:G:C     0       13273\n1       1:14599:T:A     0       14599\n1       1:14604:A:G     0       14604\n1       1:14930:A:G     0       14930\n1       1:69897:T:C     0       69897\n1       1:86331:A:G     0       86331\n1       1:91581:G:A     0       91581\n1       1:122872:T:G    0       122872\n1       1:135163:C:T    0       135163\n1       1:233473:C:G    0       233473\n

    Reference: https://www.cog-genomics.org/plink/1.9/formats

    "},{"location":"03_Data_formats/#bed-fam-bim","title":"bed / fam /bim","text":"

    bed/fam/bim formats are the binary implementation of ped/map formats. bed/bim/fam files contain the same information as ped/map but are much smaller in size.

    -rw-r----- 1 yunye yunye 135M Dec 23 11:45 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed\n-rw-r----- 1 yunye yunye  36M Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n-rw-r----- 1 yunye yunye 9.4K Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n-rw-r--r-- 1 yunye yunye  32M Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n-rw-r--r-- 1 yunye yunye 2.2G Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped\n

    .fam

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n0 HG00403 0 0 0 -9\n0 HG00404 0 0 0 -9\n0 HG00406 0 0 0 -9\n0 HG00407 0 0 0 -9\n0 HG00409 0 0 0 -9\n0 HG00410 0 0 0 -9\n0 HG00419 0 0 0 -9\n0 HG00421 0 0 0 -9\n0 HG00422 0 0 0 -9\n0 HG00428 0 0 0 -9\n

    .bim

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n1       1:13273:G:C     0       13273   C       G\n1       1:14599:T:A     0       14599   A       T\n1       1:14604:A:G     0       14604   G       A\n1       1:14930:A:G     0       14930   G       A\n1       1:69897:T:C     0       69897   C       T\n1       1:86331:A:G     0       86331   G       A\n1       1:91581:G:A     0       91581   A       G\n1       1:122872:T:G    0       122872  G       T\n1       1:135163:C:T    0       135163  T       C\n1       1:233473:C:G    0       233473  G       C\n

    .bed

    \"Primary representation of genotype calls at biallelic variants The first three bytes should be 0x6c, 0x1b, and 0x01 in that order. The rest of the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.\"

    hexdump -C 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed | head\n00000000  6c 1b 01 ff ff bf bf ff  ff ff ef fb ff ff ff fe  |l...............|\n00000010  ff ff ff ff fb ff bb ff  ff fb af ff ff fe fb ff  |................|\n00000020  ff ff ff fe ff ff ff ff  ff bf ff ff ef ff ff ef  |................|\n00000030  bb ff ff ff ff ff ff ff  fa ff ff ff ff ff ff ff  |................|\n00000040  ff ff ff fb ff ff ff ff  ff ff ff ff ff ff ff ef  |................|\n00000050  ff ff ff fb fe ef fe ff  ff ff ff eb ff ff fe fe  |................|\n00000060  ff ff fe ff bf ff fa fb  fb eb be ff ff 3b ff be  |.............;..|\n00000070  fe be bf ef fe ff ef ee  ff ff bf ea fe bf fe ff  |................|\n00000080  bf ff ff ef ff ff ff ff  ff fa ff ff eb ff ff ff  |................|\n00000090  ff ff fb fe af ff bf ff  ff ff ff ff ff ff ff ff  |................|\n

    Reference: https://www.cog-genomics.org/plink/1.9/formats

    "},{"location":"03_Data_formats/#imputation-dosage","title":"Imputation dosage","text":""},{"location":"03_Data_formats/#bgen-bgi","title":"bgen / bgi","text":"

    Reference: https://www.well.ox.ac.uk/~gav/bgen_format/

    "},{"location":"03_Data_formats/#pgenpsampvar","title":"pgen,psam,pvar","text":"

    Reference: https://www.cog-genomics.org/plink/2.0/formats#pgen

NOTE: pgen only saves the dosage for each individual (a scalar ranging from 0 to 2). It cannot be converted back to the genotype probabilities (a vector of length 3) or allele probabilities (a 2 x 2 matrix) saved in bgen.
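For reference, a minimal sketch of creating a pgen fileset from a bgen file with PLINK2 (the file names are illustrative; whether the REF allele comes first depends on how the bgen file was generated):

# convert a bgen file (with its sample file) into a pgen fileset\nplink2 \\\n    --bgen chr1.dosage.bgen ref-first \\\n    --sample chr1.sample \\\n    --make-pgen \\\n    --out chr1.dosage\n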

    "},{"location":"03_Data_formats/#summary","title":"Summary","text":""},{"location":"04_Data_QC/","title":"PLINK basics","text":"

In this module, we will learn the basics of genotype data QC using PLINK, which is one of the most commonly used software packages in complex trait genomics. (Huge thanks to the developers: PLINK1.9 and PLINK2)

    "},{"location":"04_Data_QC/#table-of-contents","title":"Table of Contents","text":""},{"location":"04_Data_QC/#preparation","title":"Preparation","text":""},{"location":"04_Data_QC/#plink-192-installation","title":"PLINK 1.9&2 installation","text":"

    To get prepared for genotype QC, we will need to make directories, download software and add the software to your environment path.

    First, we will simply create some directories to keep the tools we need to use.

    Create directories

    cd ~\nmkdir tools\ncd tools\nmkdir bin\nmkdir plink\nmkdir plink2\n

You can download each tool into its corresponding directory.

    The bin directory here is for keeping all the symbolic links to the executable files of each tool.

    In this way, it is much easier to manage and organize the paths and tools. We will only add the bin directory here to the environment path.

    "},{"location":"04_Data_QC/#download-plink19-and-plink2-and-then-unzip","title":"Download PLINK1.9 and PLINK2 and then unzip","text":"

Next, go to the PLINK webpage to download the software. We will need both PLINK1.9 and PLINK2.

    Download PLINK1.9 and PLINK2 from the following webpage to the corresponding directories:

    Info

    If you are using Mac or Windows, then please download the Mac or Windows version. In this tutorial, we will use a Linux system and the Linux version of PLINK.

    Find the suitable version on the PLINK website, right-click and copy the link address.

    Download PLINK2 (Linux AVX2 AMD)

    cd ~/tools/plink2\nwget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_amd_avx2_20231212.zip\nunzip plink2_linux_amd_avx2_20231212.zip\n

    Then do the same for PLINK1.9

    Download PLINK1.9 (Linux 64-bit)

    cd ~/tools/plink\nwget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip\nunzip plink_linux_x86_64_20231211.zip\n
    "},{"location":"04_Data_QC/#create-symbolic-links","title":"Create symbolic links","text":"

After downloading and unzipping, we will create symbolic links for the plink binaries and place the links in ~/tools/bin/.

    Create symbolic links

    cd ~\nln -s ~/tools/plink2/plink2 ~/tools/bin/plink2\nln -s ~/tools/plink/plink ~/tools/bin/plink\n
    "},{"location":"04_Data_QC/#add-paths-to-the-environment-path","title":"Add paths to the environment path","text":"

    Then add ~/tools/bin/ to the environment path.

    Example

    export PATH=$PATH:~/tools/bin/\n
    This command will add the path to your current shell.

If you restart the terminal, it will be lost, so you may need to add it to the Bash configuration file. Then run

    echo \"export PATH=$PATH:~/tools/bin/\" >> ~/.bashrc\n

    This will add a new line at the end of .bashrc, which will be run every time you open a new bash shell.
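To apply the change to your current shell without restarting the terminal, you can source the configuration file:

source ~/.bashrc\n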

    All done. Let's test if we installed PLINK successfully or not.

    Check if PLINK is installed successfully.

    ./plink\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\n\nplink <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink --help [flag name(s)...]\n\nCommands include --make-bed, --recode, --flip-scan, --merge-list,\n--write-snplist, --list-duplicate-vars, --freqx, --missing, --test-mishap,\n--hardy, --mendel, --ibc, --impute-sex, --indep-pairphase, --r2, --show-tags,\n--blocks, --distance, --genome, --homozyg, --make-rel, --make-grm-gz,\n--rel-cutoff, --cluster, --pca, --neighbour, --ibs-test, --regress-distance,\n--model, --bd, --gxe, --logistic, --dosage, --lasso, --test-missing,\n--make-perm-pheno, --tdt, --qfam, --annotate, --clump, --gene-report,\n--meta-analysis, --epistasis, --fast-epistasis, and --score.\n\n\"plink --help | more\" describes all functions (warning: long).\n
    ./plink2\nPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023)       www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\n\nplink2 <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink2 --help [flag name(s)...]\n\nCommands include --rm-dup list, --make-bpgen, --export, --freq, --geno-counts,\n--sample-counts, --missing, --hardy, --het, --fst, --indep-pairwise, --ld,\n--sample-diff, --make-king, --king-cutoff, --pmerge, --pgen-diff,\n--write-samples, --write-snplist, --make-grm-list, --pca, --glm, --adjust-file,\n--gwas-ssf, --clump, --score, --variant-score, --genotyping-rate, --pgen-info,\n--validate, and --zst-decompress.\n\n\"plink2 --help | more\" describes all functions.\n

    Well done. We have successfully installed plink1.9 and plink2.

    "},{"location":"04_Data_QC/#download-genotype-data","title":"Download genotype data","text":"

Next, we need to download the sample genotype data. The way to create the sample data is described [here](https://cloufield.github.io/GWASTutorial/01_Dataset/). This dataset contains 504 EAS individuals from the 1000 Genomes Project Phase 3 v5 with around 1 million variants.

    Simply run download_sampledata.sh in 01_Dataset to download this dataset (from Dropbox). See here

The sample dataset is currently hosted on Dropbox, which may not be accessible for users in certain regions.

    Download sample data

    cd ../01_Dataset\n./download_sampledata.sh\n

    And you will get the following three PLINK files:

    -rw-r--r-- 1 yunye yunye 149M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n-rw-r--r-- 1 yunye yunye  40M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n-rw-r--r-- 1 yunye yunye  13K Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n

    Check the bim file:

    head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1       1:14930:A:G     0       14930   G       A\n1       1:15774:G:A     0       15774   A       G\n1       1:15777:A:G     0       15777   G       A\n1       1:57292:C:T     0       57292   T       C\n1       1:77874:G:A     0       77874   A       G\n1       1:87360:C:T     0       87360   T       C\n1       1:92917:T:A     0       92917   A       T\n1       1:104186:T:C    0       104186  T       C\n1       1:125271:C:T    0       125271  C       T\n1       1:232449:G:A    0       232449  A       G\n

    Check the fam file:

    head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\nHG00403 HG00403 0 0 0 -9\nHG00404 HG00404 0 0 0 -9\nHG00406 HG00406 0 0 0 -9\nHG00407 HG00407 0 0 0 -9\nHG00409 HG00409 0 0 0 -9\nHG00410 HG00410 0 0 0 -9\nHG00419 HG00419 0 0 0 -9\nHG00421 HG00421 0 0 0 -9\nHG00422 HG00422 0 0 0 -9\nHG00428 HG00428 0 0 0 -9\n

    "},{"location":"04_Data_QC/#plink-tutorial","title":"PLINK tutorial","text":"

    Detailed descriptions can be found on plink's website: PLINK1.9 and PLINK2.

    The functions we will learn in this tutorial:

    1. Calculating missing rate (call rate)
    2. Calculating allele Frequency
    3. Conducting Hardy-Weinberg equilibrium exact test
    4. Applying filters
    5. Conducting LD-Pruning
    6. Calculating inbreeding F coefficient
    7. Conducting sample & SNP filtering (extract/exclude/keep/remove)
    8. Estimating IBD / PI_HAT
    9. Calculating LD
    10. Data management (make-bed/recode)

All sample code and results for this module are available in ./04_data_QC

    "},{"location":"04_Data_QC/#qc-step-summary","title":"QC Step Summary","text":"

    QC Step Summary

| QC step | Option in PLINK | Commonly used threshold to exclude |
|---|---|---|
| Sample missing rate | --mind, --missing | missing rate > 0.01 (0.02, or 0.05) |
| SNP missing rate | --geno, --missing | missing rate > 0.01 (0.02, or 0.05) |
| Minor allele frequency | --freq, --maf | MAF < 0.01 |
| Sample relatedness | --genome | pi_hat > 0.2 to exclude second-degree relatives |
| Hardy-Weinberg equilibrium | --hwe, --hardy | HWE test p-value < 1e-6 |
| Inbreeding F coefficient | --het | outside of 3 SD from the mean |

    First, we can calculate some basic statistics of our simulated data:

    "},{"location":"04_Data_QC/#missing-rate-call-rate","title":"Missing rate (call rate)","text":"

    The first thing we want to know is the missing rate of our data. Usually, we need to check the missing rate of samples and SNPs to decide a threshold to exclude low-quality samples and SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#missing)

    Missing rate and Call rate

    Suppose we have N samples and M SNPs for each sample.

    For sample \\(j\\) :

    \\[Sample\\ Missing\\ Rate_{j} = {{N_{missing\\ SNPs\\ for\\ j}}\\over{M}} = 1 - Call\\ Rate_{sample, j}\\]

    For SNP \\(i\\) :

    \\[SNP\\ Missing\\ Rate_{i} = {{N_{missing\\ samples\\ at\\ i}}\\over{N}} = 1 - Call\\ Rate_{SNP, i}\\]

The input is a PLINK bed/bim/fam fileset. Usually, the three files share the same prefix, and we just need to pass the prefix to the --bfile option.

    "},{"location":"04_Data_QC/#plink-syntax","title":"PLINK syntax","text":"

    PLINK syntax

    To calculate the missing rate, we need the flag --missing, which tells PLINK to calculate the missing rate in the dataset specified by --bfile.

    Calculate missing rate

    cd ../04_Data_QC\ngenotypeFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" #!!! Please add your own path here.  \"1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" is the prefix of PLINK bed file. \n\nplink \\\n    --bfile ${genotypeFile} \\\n    --missing \\\n    --out plink_results\n
Remember to set the value for ${genotypeFile}.

    This code will generate two files plink_results.imiss and plink_results.lmiss, which contain the missing rate information for samples and SNPs respectively.

Take a look at the .imiss file. The last column shows the missing rate for samples. Since we used part of the 1000 Genomes Project data this time, there were no missing SNPs in the original dataset. But for educational purposes, we randomly made some of the genotypes missing.

    # missing rate for each sample\nhead plink_results.imiss\n    FID       IID MISS_PHENO   N_MISS   N_GENO   F_MISS\nHG00403   HG00403          Y    10020  1235116 0.008113\nHG00404   HG00404          Y     9192  1235116 0.007442\nHG00406   HG00406          Y    15751  1235116  0.01275\nHG00407   HG00407          Y    14653  1235116  0.01186\nHG00409   HG00409          Y     5667  1235116 0.004588\nHG00410   HG00410          Y     6066  1235116 0.004911\nHG00419   HG00419          Y    20000  1235116  0.01619\nHG00421   HG00421          Y    17542  1235116   0.0142\nHG00422   HG00422          Y    18608  1235116  0.01507\n
    # missing rate for each SNP\nhead plink_results.lmiss\n CHR              SNP   N_MISS   N_GENO   F_MISS\n   1      1:14930:A:G        2      504 0.003968\n   1      1:15774:G:A        3      504 0.005952\n   1      1:15777:A:G        3      504 0.005952\n   1      1:57292:C:T        6      504   0.0119\n   1      1:77874:G:A        3      504 0.005952\n   1      1:87360:C:T        1      504 0.001984\n   1      1:92917:T:A        7      504  0.01389\n   1     1:104186:T:C        3      504 0.005952\n   1     1:125271:C:T        2      504 0.003968\n

    Distribution of sample missing rate and SNP missing rate

    Note: The missing values were simulated based on normal distributions for each individual.

    Sample missing rate

    SNP missing rate

    For the meaning of headers, please refer to PLINK documents.

    "},{"location":"04_Data_QC/#allele-frequency","title":"Allele Frequency","text":"

    One of the most important statistics of SNPs is their frequency in a certain population. Many downstream analyses are based on investigating differences in allele frequencies.

    Usually, variants can be categorized into 3 groups based on their Minor Allele Frequency (MAF):

    1. Common variants : MAF>=0.05
    2. Low-frequency variants : 0.01<=MAF<0.05
    3. Rare variants : MAF<0.01

    How to calculate Minor Allele Frequency (MAF)

Suppose the reference allele (REF) is A and the alternative allele (ALT) is B for a certain SNP. The possible genotypes are AA, AB and BB. In a population of N samples (2N alleles), \(N = N_{AA} + N_{AB} + N_{BB}\).

So we can calculate the allele frequencies:

\[AF_{A} = {{2 \times N_{AA} + N_{AB}}\over{2N}}\]

\[AF_{B} = {{2 \times N_{BB} + N_{AB}}\over{2N}} = 1 - AF_{A}\]

    The MAF for this SNP in this specific population is defined as:

    \\(MAF = min( AF_{REF}, AF_{ALT} )\\)

    For different downstream analyses, we might use different sets of variants. For example, for PCA, we might use only common variants. For gene-based tests, we might use only rare variants.
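For example, to keep only common variants (MAF >= 0.05) for PCA, a sketch of the filtering step (the output prefix is illustrative):

plink \\\n    --bfile ${genotypeFile} \\\n    --maf 0.05 \\\n    --make-bed \\\n    --out plink_common\n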

    Using PLINK1.9 we can easily calculate the MAF of variants in the input data.

    Calculate the MAF of variants using PLINK1.9

    plink \\\n    --bfile ${genotypeFile} \\\n    --freq \\\n    --out plink_results\n
    # results from plink1.9\nhead plink_results.frq\nCHR              SNP   A1   A2          MAF  NCHROBS\n1      1:14930:A:G    G    A       0.4133     1004\n1      1:15774:G:A    A    G      0.02794     1002\n1      1:15777:A:G    G    A      0.07385     1002\n1      1:57292:C:T    T    C       0.1054      996\n1      1:77874:G:A    A    G      0.01996     1002\n1      1:87360:C:T    T    C      0.02286     1006\n1      1:92917:T:A    A    T     0.003018      994\n1     1:104186:T:C    T    C        0.499     1002\n1     1:125271:C:T    C    T      0.03088     1004\n

    Next, we use plink2 to run the same options to check the difference between the results.

    Calculate the alternative allele frequencies of variants using PLINK2

    plink2 \\\n        --bfile ${genotypeFile} \\\n        --freq \\\n        --out plink_results\n
    # results from plink2\nhead plink_results.afreq\n#CHROM  ID      REF     ALT     PROVISIONAL_REF?        ALT_FREQS       OBS_CT\n1       1:14930:A:G     A       G       Y       0.413347        1004\n1       1:15774:G:A     G       A       Y       0.0279441       1002\n1       1:15777:A:G     A       G       Y       0.0738523       1002\n1       1:57292:C:T     C       T       Y       0.105422        996\n1       1:77874:G:A     G       A       Y       0.0199601       1002\n1       1:87360:C:T     C       T       Y       0.0228628       1006\n1       1:92917:T:A     T       A       Y       0.00301811      994\n1       1:104186:T:C    T       C       Y       0.500998        1002\n1       1:125271:C:T    C       T       Y       0.969124        1004\n

    We need to pay attention to the concepts here.

In PLINK1.9, the concept here is the minor (A1) and major (A2) allele, while in PLINK2 it is the reference (REF) allele and the alternative (ALT) allele.

    "},{"location":"04_Data_QC/#hardy-weinberg-equilibrium-exact-test","title":"Hardy-Weinberg equilibrium exact test","text":"

    For SNP QC, besides checking the missing rate, we also need to check if the SNP is in Hardy-Weinberg equilibrium:

    --hardy will perform the Hardy-Weinberg equilibrium exact test for each variant. Variants with low P values usually suggest genotyping errors, or may indicate selection at these loci.

    The following command can calculate the Hardy-Weinberg equilibrium exact test statistics for all SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#hardy)

    Info

    Suppose we have N unrelated samples (2N alleles). Under HWE, the exact probability of observing \(n_{AB}\) samples with genotype AB in N samples is:

    \[P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}}N!\over{n_{AA}!n_{AB}!n_{BB}!}} \times {{n_A!n_B!}\over{(2N)!}} \]

    To compute the Hardy-Weinberg equilibrium exact test statistics, we will sum up the probabilities of all configurations with probability equal to or less than the observed configuration :

    \[P_{HWE} = \sum_{n^{*}_{AB}} I[P(N_{AB} = n_{AB} | N, n_A) \geq P(N_{AB} = n^{*}_{AB} | N, n_A)] \times P(N_{AB} = n^{*}_{AB} | N, n_A)\]

    \\(I(x)\\) is the indicator function. If x is true, \\(I(x) = 1\\); otherwise, \\(I(x) = 0\\).

    Reference : Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link

    Calculate the Hardy-Weinberg equilibrium exact test statistics for a single SNP using Python

    This code is converted from here (Jeremy McRae) to Python. Original citation: Wigginton, JE, Cutler, DJ, and Abecasis, GR (2005) A Note on Exact Tests of Hardy-Weinberg Equilibrium. AJHG 76: 887-893

    def snphwe(obs_hets, obs_hom1, obs_hom2):\n    obs_homr = min(obs_hom1, obs_hom2)\n    obs_homc = max(obs_hom1, obs_hom2)\n\n    rare = 2 * obs_homr + obs_hets\n    genotypes = obs_hets + obs_homc + obs_homr\n\n    probs = [0.0 for i in range(rare +1)]\n\n    mid = rare * (2 * genotypes - rare) // (2 * genotypes)\n    if mid % 2 != rare%2:\n        mid += 1\n\n    probs[mid] = 1.0\n    sum_p = 1 #probs[mid]\n\n    curr_homr = (rare - mid) // 2\n    curr_homc = genotypes - mid - curr_homr\n\n    for curr_hets in range(mid, 1, -2):\n        probs[curr_hets - 2] = probs[curr_hets] * curr_hets * (curr_hets - 1.0)/ (4.0 * (curr_homr + 1.0) * (curr_homc + 1.0))\n        sum_p+= probs[curr_hets - 2]\n        curr_homr += 1\n        curr_homc += 1\n\n    curr_homr = (rare - mid) // 2\n    curr_homc = genotypes - mid - curr_homr\n\n    for curr_hets in range(mid, rare-1, 2):\n        probs[curr_hets + 2] = probs[curr_hets] * 4.0 * curr_homr * curr_homc/ ((curr_hets + 2.0) * (curr_hets + 1.0))\n        sum_p += probs[curr_hets + 2]\n        curr_homr -= 1\n        curr_homc -= 1\n\n    target = probs[obs_hets]\n    p_hwe = 0.0\n    for p in probs:\n        if p <= target :\n            p_hwe += p / sum_p  \n\n    return min(p_hwe,1)\n
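
    As a sanity check (our own addition, not part of the original tutorial), calling this function with the genotype counts of the first SNP in the PLINK output below (GENO = 4/407/91) should closely reproduce PLINK's P value of 4.864e-61:

    # Hypothetical check against the PLINK output shown below for 1:14930:A:G\n# GENO = 4/407/91 means 4 A1 homozygotes, 407 heterozygotes, 91 A2 homozygotes\nprint(snphwe(obs_hets=407, obs_hom1=4, obs_hom2=91))  # expected ~4.86e-61\n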

    Calculate the Hardy-Weinberg equilibrium exact test statistics using PLINK

    plink \\\n    --bfile ${genotypeFile} \\\n    --hardy \\\n    --out plink_results\n
    head plink_results.hwe\n    CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P\n1      1:14930:A:G  ALL(NP)    G    A             4/407/91   0.8108    0.485    4.864e-61\n1      1:15774:G:A  ALL(NP)    A    G             0/28/473  0.05589  0.05433            1\n1      1:15777:A:G  ALL(NP)    G    A             1/72/428   0.1437   0.1368       0.5053\n1      1:57292:C:T  ALL(NP)    T    C             3/99/396   0.1988   0.1886       0.3393\n1      1:77874:G:A  ALL(NP)    A    G             0/20/481  0.03992  0.03912            1\n1      1:87360:C:T  ALL(NP)    T    C             0/23/480  0.04573  0.04468            1\n1      1:92917:T:A  ALL(NP)    A    T              0/3/494 0.006036 0.006018            1\n1     1:104186:T:C  ALL(NP)    T    C            74/352/75   0.7026      0.5    6.418e-20\n1     1:125271:C:T  ALL(NP)    C    T             1/29/472  0.05777  0.05985       0.3798\n

    "},{"location":"04_Data_QC/#applying-filters","title":"Applying filters","text":"

    Previously we calculated the basic statistics using PLINK. But when performing certain analyses, we just want to exclude the bad-quality samples or SNPs instead of calculating the statistics for all samples and SNPs.

    In this case we can apply, for example, the following filters: --maf (minor allele frequency), --geno (SNP missing rate), --mind (sample missing rate), and --hwe (HWE exact test P value).

    We will apply these filters in the following example of LD-pruning.

    "},{"location":"04_Data_QC/#ld-pruning","title":"LD Pruning","text":"

    There is often strong linkage disequilibrium (LD) among SNPs. For some analyses we do not need all SNPs, and the redundant SNPs should be removed to avoid bias in genetic estimations. For example, for relatedness estimation, we will use only the LD-pruned SNP set.

    We can use --indep-pairwise 50 5 0.2 to filter out those in strong LD and keep only the independent SNPs.

    Meaning of --indep-pairwise x y z : x is the window size (in variant count, or in kb with the 'kb' modifier), y is the step size (number of variants to shift the window at each step), and z is the pairwise \(r^2\) threshold above which one variant of each pair is pruned.

    Please check https://www.cog-genomics.org/plink/1.9/ld#indep for details.

    Combined with the filters we just introduced, we can run:

    Example

    plink \\\n    --bfile ${genotypeFile} \\\n    --maf 0.01 \\\n    --geno 0.02 \\\n    --mind 0.02 \\\n    --hwe 1e-6 \\\n    --indep-pairwise 50 5 0.2 \\\n    --out plink_results\n
    This command generates two outputs: plink_results.prune.in and plink_results.prune.out. plink_results.prune.in contains the independent set of SNPs we will use in the following analyses.

    You can check the PLINK log for how many variants were removed based on the filters you applied:

    Total genotyping rate in remaining samples is 0.993916.\n108837 variants removed due to missing genotype data (--geno).\n--hwe: 9754 variants removed due to Hardy-Weinberg exact test.\n87149 variants removed due to minor allele threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1029376 variants and 501 people pass filters and QC.\n

    Let's take a look at the LD-pruned SNP file. Basically, it just contains one SNP id per line.

    head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
    "},{"location":"04_Data_QC/#inbreeding-f-coefficient","title":"Inbreeding F coefficient","text":"

    Next, we can check the heterozygosity F of samples (https://www.cog-genomics.org/plink/1.9/basic_stats#ibc) :

    The --het option computes observed and expected autosomal homozygous genotype counts for each sample. Usually, we need to exclude individuals with extreme heterozygosity coefficients, which may indicate sample contamination or inbreeding.

    Inbreeding F coefficient calculation by PLINK

    \\[F = {{O(HOM) - E(HOM)}\\over{ M - E(HOM)}}\\]

    High F may indicate a relatively high level of inbreeding.

    Low F may suggest the sample DNA was contaminated.

    Perform LD-pruning beforehand, since these calculations do not take LD into account.

    Calculate inbreeding F coefficient

    plink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --het \\\n    --out plink_results\n

    Check the output:

    head plink_results.het\n    FID       IID       O(HOM)       E(HOM)        N(NM)            F\nHG00403   HG00403       180222    1.796e+05       217363      0.01698\nHG00404   HG00404       180127    1.797e+05       217553      0.01023\nHG00406   HG00406       178891    1.789e+05       216533   -0.0001138\nHG00407   HG00407       178992     1.79e+05       216677   -0.0008034\nHG00409   HG00409       179918    1.801e+05       218045    -0.006049\nHG00410   HG00410       179782    1.801e+05       218028    -0.009268\nHG00419   HG00419       178362    1.783e+05       215849     0.001315\nHG00421   HG00421       178222    1.785e+05       216110    -0.008288\nHG00422   HG00422       178316    1.784e+05       215938      -0.0022\n

    A commonly used method is to exclude samples whose heterozygosity F deviates more than 3 standard deviations (SD) from the mean. Some studies instead use a fixed threshold such as +-0.15 or +-0.2.
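
    A minimal pandas sketch of the mean +- 3SD approach, assuming the plink_results.het file generated above (the output filename high_het.3sd.sample is our own choice):

    import pandas as pd\n\n# A sketch: flag samples whose F deviates more than 3 SD from the mean\nhet = pd.read_csv(\"plink_results.het\", delim_whitespace=True)\nupper = het[\"F\"].mean() + 3 * het[\"F\"].std()\nlower = het[\"F\"].mean() - 3 * het[\"F\"].std()\noutliers = het.loc[(het[\"F\"] > upper) | (het[\"F\"] < lower), [\"FID\", \"IID\"]]\noutliers.to_csv(\"high_het.3sd.sample\", sep=\" \", header=False, index=False)\n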

    Usually we will use only LD-pruned SNPs for the calculation of F.

    We can plot the distribution of F:

    Distribution of \\(F_{het}\\) in sample data

    Here we use +-0.1 as the \\(F_{het}\\) threshold for convenience.

    Create sample list of individuals with extreme F using awk

    # only one sample\nawk 'NR>1 && ($6>0.1 || $6<-0.1) {print $1,$2}' plink_results.het > high_het.sample\n
    "},{"location":"04_Data_QC/#sample-snp-filtering-extractexcludekeepremove","title":"Sample & SNP filtering (extract/exclude/keep/remove)","text":"

    Sometimes we will use only a subset of the samples or SNPs included in the original dataset. In this case, we can use --extract or --exclude to select or exclude SNPs from the analysis, and --keep or --remove to select or exclude samples.

    For --keep or --remove, the input is a file containing sample FIDs and IIDs. For --extract or --exclude, the input is a file listing SNP IDs (one per line).

    head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
    "},{"location":"04_Data_QC/#ibd-pi_hat-kinship-coefficient","title":"IBD / PI_HAT / kinship coefficient","text":"

    --genome will estimate IBS/IBD. Usually, for this analysis, we need to prune our data first since the strong LD will cause bias in the results. (This step is computationally intensive)

    Combined with the --extract, we can run:

    How PLINK estimates IBD

    The prior probability of IBS sharing can be modeled as:

    \[P(I=i) = \sum_{z=0}^{i}P(I=i|Z=z)P(Z=z)\]

    So the proportion of alleles shared IBD (\\(\\hat{\\pi}\\)) can be estimated by:

    \\[\\hat{\\pi} = {{P(Z=1)}\\over{2}} + P(Z=2)\\]

    Estimate IBD

    plink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --genome \\\n    --out plink_results\n

    PI_HAT is the IBD estimation. Please check https://www.cog-genomics.org/plink/1.9/ibd for more details.

    head plink_results.genome\n    FID1     IID1     FID2     IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO\nHG00403  HG00403  HG00404  HG00404 UN    NA  1.0000  0.0000  0.0000  0.0000  -1  0.858562  0.3679  1.9774\nHG00403  HG00403  HG00406  HG00406 UN    NA  0.9805  0.0044  0.0151  0.0173  -1  0.858324  0.8183  2.0625\nHG00403  HG00403  HG00407  HG00407 UN    NA  0.9790  0.0000  0.0210  0.0210  -1  0.857794  0.8034  2.0587\nHG00403  HG00403  HG00409  HG00409 UN    NA  0.9912  0.0000  0.0088  0.0088  -1  0.857024  0.2637  1.9578\nHG00403  HG00403  HG00410  HG00410 UN    NA  0.9699  0.0235  0.0066  0.0184  -1  0.858194  0.6889  2.0335\nHG00403  HG00403  HG00419  HG00419 UN    NA  1.0000  0.0000  0.0000  0.0000  -1  0.857643  0.8597  2.0745\nHG00403  HG00403  HG00421  HG00421 UN    NA  0.9773  0.0218  0.0010  0.0118  -1  0.857276  0.2186  1.9484\nHG00403  HG00403  HG00422  HG00422 UN    NA  0.9880  0.0000  0.0120  0.0120  -1  0.857224  0.8277  2.0652\nHG00403  HG00403  HG00428  HG00428 UN    NA  0.9801  0.0069  0.0130  0.0164  -1  0.858162  0.9812  2.1471\n
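
    As a sketch (not part of the original pipeline), we could flag putatively related pairs from this output with pandas; PI_HAT > 0.2 is a commonly used, though arbitrary, threshold:

    import pandas as pd\n\n# Sketch: list sample pairs with PI_HAT above a heuristic threshold of 0.2\nibd = pd.read_csv(\"plink_results.genome\", delim_whitespace=True)\nrelated = ibd.loc[ibd[\"PI_HAT\"] > 0.2, [\"FID1\", \"IID1\", \"FID2\", \"IID2\", \"PI_HAT\"]]\nprint(related)\n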

    KING-robust kinship estimator

    PLINK2 uses the KING-robust kinship estimator, which is more robust in the presence of population substructure. See here.

    Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W. M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

    Since the samples are unrelated, we do not need to remove any samples at this step. But remember to check this for your dataset.

    "},{"location":"04_Data_QC/#ld-calculation","title":"LD calculation","text":"

    We can also use our data to estimate the LD between a pair of SNPs.

    Details on LD can be found here

    The --chr option in PLINK allows us to include only SNPs on a specific chromosome. To calculate LD \(r^2\) for SNPs on chr22, we can run:

    Example

    plink \\\n        --bfile ${genotypeFile} \\\n        --chr 22 \\\n        --r2 \\\n        --out plink_results\n
    head plink_results.ld\n CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2\n22     16069141   22:16069141:C:G     22     16071624   22:16071624:A:G     0.771226\n22     16069784   22:16069784:A:T     22     16149743   22:16149743:T:A     0.217197\n22     16069784   22:16069784:A:T     22     16150589   22:16150589:C:A     0.224992\n22     16069784   22:16069784:A:T     22     16159060   22:16159060:G:A       0.2289\n22     16149743   22:16149743:T:A     22     16150589   22:16150589:C:A     0.965109\n22     16149743   22:16149743:T:A     22     16152606   22:16152606:T:C     0.692157\n22     16149743   22:16149743:T:A     22     16159060   22:16159060:G:A     0.721796\n22     16149743   22:16149743:T:A     22     16193549   22:16193549:C:T     0.336477\n22     16149743   22:16149743:T:A     22     16212542   22:16212542:C:T     0.442424\n
    "},{"location":"04_Data_QC/#data-management-make-bedrecode","title":"Data management (make-bed/recode)","text":"

    So far, the input data we have used is in binary format (bed/bim/fam), but sometimes we may want the text version (ped/map).

    Info

    To convert the formats, we can run:

    Convert PLINK formats

    #extract the 1KG samples with the pruned SNPs, and make a bed file.\nplink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --make-bed \\\n    --out plink_1000_pruned\n\n#convert the bed/bim/fam to ped/map\nplink \\\n        --bfile plink_1000_pruned \\\n        --recode \\\n        --out plink_1000_pruned\n
    "},{"location":"04_Data_QC/#apply-all-the-filters-to-obtain-a-clean-dataset","title":"Apply all the filters to obtain a clean dataset","text":"

    We can then apply the filters and remove samples with high \\(F_{het}\\) to get a clean dataset for later use.

    plink \\\n        --bfile ${genotypeFile} \\\n        --maf 0.01 \\\n        --geno 0.02 \\\n        --mind 0.02 \\\n        --hwe 1e-6 \\\n        --remove high_het.sample \\\n        --keep-allele-order \\\n        --make-bed \\\n        --out sample_data.clean\n
    1224104 variants and 500 people pass filters and QC.\n
    -rw-r--r--  1 yunye yunye 146M Dec 26 15:40 sample_data.clean.bed\n-rw-r--r--  1 yunye yunye  39M Dec 26 15:40 sample_data.clean.bim\n-rw-r--r--  1 yunye yunye  13K Dec 26 15:40 sample_data.clean.fam\n
    "},{"location":"04_Data_QC/#other-common-qc-steps-not-included-in-this-tutorial","title":"Other common QC steps not included in this tutorial","text":""},{"location":"04_Data_QC/#exercise","title":"Exercise","text":""},{"location":"04_Data_QC/#additional-resources","title":"Additional resources","text":""},{"location":"04_Data_QC/#reference","title":"Reference","text":""},{"location":"05_PCA/","title":"Principle component analysis (PCA)","text":"

    PCA aims to find the orthogonal directions of maximum variance and project the data onto a new subspace with equal or fewer dimensions than the original one. Simply speaking, the GRM (genetic relationship matrix; covariance matrix) is first estimated, and then PCA is applied to this matrix to generate eigenvectors and eigenvalues. Finally, the \(k\) eigenvectors with the largest eigenvalues are used to transform the genotypes into a new feature subspace.

    Genetic relationship matrix (GRM)

    Citation: Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.

    A simple PCA

    Source data:

    import numpy as np\n\ncov = np.array([[6, -3], [-3, 3.5]])\npts = np.random.multivariate_normal([0, 0], cov, size=800)\n

    The red arrow shows the first principal component axis (PC1) and the blue arrow shows the second principal component axis (PC2). The two axes are orthogonal.
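
    A minimal sketch of how such PCs can be computed from the simulated points (using scikit-learn, which is our own assumption; the figure itself is not reproduced here):

    import numpy as np\nfrom sklearn.decomposition import PCA\n\n# Simulate the same kind of 2-D data and fit a PCA\ncov = np.array([[6, -3], [-3, 3.5]])\npts = np.random.multivariate_normal([0, 0], cov, size=800)\n\npca = PCA(n_components=2).fit(pts)\nprint(pca.components_)          # directions of PC1 and PC2 (the red and blue arrows)\nprint(pca.explained_variance_)  # eigenvalues: variance explained by each PC\n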

    Interpretation of PCs

    The first principal component of a set of p variables, presumed to be jointly normally distributed, is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p iterations until all the variance is explained.

    PCA is by far the most commonly used dimension reduction approach in population genetics, and it can identify differences in ancestry among the sampled individuals. Population outliers can then be excluded from the main cluster. For GWAS, we also need to include the top PCs as covariates to adjust for population stratification.

    Please read the following paper on how we apply PCA to genetic data: Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904\u2013909 (2006). https://doi.org/10.1038/ng1847 https://www.nature.com/articles/ng1847

    So before association analysis, we will learn how to run PCA analysis first.

    "},{"location":"05_PCA/#preparation","title":"Preparation","text":""},{"location":"05_PCA/#exclude-snps-in-high-ld-or-hla-regions","title":"Exclude SNPs in high-LD or HLA regions","text":"

    For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data.

    The reason why we want to exclude such high-LD or HLA regions

    "},{"location":"05_PCA/#download-bed-like-files-for-high-ld-or-hla-regions","title":"Download BED-like files for high-LD or HLA regions","text":"

    You can simply copy the list of high-LD or HLA regions for the matching genome build (in BED-like format) into a text file named high-ld.txt.

    High LD regions were obtained from

    https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)

    High LD regions of hg19

    high-ld-hg19.txt
    1   48000000    52000000    highld\n2   86000000    100500000   highld\n2   134500000   138000000   highld\n2   183000000   190000000   highld\n3   47500000    50000000    highld\n3   83500000    87000000    highld\n3   89000000    97500000    highld\n5   44500000    50500000    highld\n5   98000000    100500000   highld\n5   129000000   132000000   highld\n5   135500000   138500000   highld\n6   25000000    35000000    highld\n6   57000000    64000000    highld\n6   140000000   142500000   highld\n7   55000000    66000000    highld\n8   7000000 13000000    highld\n8   43000000    50000000    highld\n8   112000000   115000000   highld\n10  37000000    43000000    highld\n11  46000000    57000000    highld\n11  87500000    90500000    highld\n12  33000000    40000000    highld\n12  109500000   112000000   highld\n20  32000000    34500000    highld\n
    "},{"location":"05_PCA/#create-a-list-of-snps-in-high-ld-or-hla-regions","title":"Create a list of SNPs in high-LD or HLA regions","text":"

    Next, use high-ld.txt to extract all SNPs that are located in the regions described in the file, using the code as follows:

    plink --bfile ${plinkFile} --make-set high-ld.txt --write-set --out hild\n

    Create a list of SNPs in the regions specified in high-ld.txt

    plinkFile=\"../04_Data_QC/sample_data.clean\"\n\nplink \\\n    --bfile ${plinkFile} \\\n    --make-set high-ld-hg19.txt \\\n    --write-set \\\n    --out hild\n

    And all SNPs in the regions will be extracted to hild.set.

    $head hild.set\nhighld\n1:48000156:C:G\n1:48002096:C:G\n1:48003081:T:C\n1:48004776:C:T\n1:48006500:A:G\n1:48006546:C:T\n1:48008102:T:G\n1:48009994:C:T\n1:48009997:C:A\n

    For downstream analysis, we can exclude these SNPs using --exclude hild.set.

    "},{"location":"05_PCA/#pca-steps","title":"PCA steps","text":"

    Steps to perform a typical genomic PCA analysis

    MAF filter for LD-pruning and PCA

    For LD-pruning and PCA, we usually only use variants with MAF > 0.01 or MAF>0.05 ( --maf 0.01 or --maf 0.05) for robust estimation.

    "},{"location":"05_PCA/#sample-codes","title":"Sample codes","text":"

    Sample codes for performing PCA

    plinkFile=\"\" #please set this to your own path\noutPrefix=\"plink_results\"\nthreadnum=2\nhildset = hild.set \n\n# LD-pruning, excluding high-LD and HLA regions\nplink2 \\\n        --bfile ${plinkFile} \\\n        --maf 0.01 \\\n        --threads ${threadnum} \\\n        --exclude ${hildset} \\ \n        --indep-pairwise 500 50 0.2 \\\n        --out ${outPrefix}\n\n# Remove related samples using king-cuttoff\nplink2 \\\n        --bfile ${plinkFile} \\\n        --extract ${outPrefix}.prune.in \\\n        --king-cutoff 0.0884 \\\n        --threads ${threadnum} \\\n        --out ${outPrefix}\n\n# PCA after pruning and removing related samples\nplink2 \\\n        --bfile ${plinkFile} \\\n        --keep ${outPrefix}.king.cutoff.in.id \\\n        --extract ${outPrefix}.prune.in \\\n        --freq counts \\\n        --threads ${threadnum} \\\n        --pca approx allele-wts 10 \\     \n        --out ${outPrefix}\n\n# Projection (related and unrelated samples)\nplink2 \\\n        --bfile ${plinkFile} \\\n        --threads ${threadnum} \\\n        --read-freq ${outPrefix}.acount \\\n        --score ${outPrefix}.eigenvec.allele 2 5 header-read no-mean-imputation variance-standardize \\\n        --score-col-nums 6-15 \\\n        --out ${outPrefix}_projected\n

    --pca and --pca approx

    For step 3, please note that the approx flag is recommended only for analyses of >5000 samples. (It is applied in the sample code anyway because real analyses usually involve much larger sample sizes, although our dataset contains only ~500 samples.)

    In step 3, the allele-wts 10 modifier requests an additional one-line-per-allele .eigenvec.allele file with the first 10 PCs expressed as allele weights instead of sample weights.

    We will get the plink_results.eigenvec.allele file, which will be used to project onto all samples along with an allele count plink_results.acount file.

    In the projection, --score ${outPrefix}.eigenvec.allele 2 5 sets the variant ID (2nd column) and A1 allele (5th column), and --score-col-nums 6-15 selects the first 10 PCs to be projected.

    Please check https://www.cog-genomics.org/plink/2.0/score#pca_project for more details on the projection.

    Allele weight and count files

    plink_results.eigenvec.allele
    #CHROM  ID      REF     ALT     PROVISIONAL_REF?        A1      PC1     PC2     PC3     PC4     PC5     PC6     PC7PC8      PC9     PC10\n1       1:15774:G:A     G       A       Y       G       0.57834 -1.03002        0.744557        -0.161887       0.389223    -0.0514592      0.133195        -0.0336162      -0.846376       0.0542876\n1       1:15774:G:A     G       A       Y       A       -0.57834        1.03002 -0.744557       0.161887        -0.389223   0.0514592       -0.133195       0.0336162       0.846376        -0.0542876\n1       1:15777:A:G     A       G       Y       A       -0.585215       0.401872        -0.393071       -1.79583   0.89579  -0.700882       -0.103729       -0.694495       -0.007313       0.513223\n1       1:15777:A:G     A       G       Y       G       0.585215        -0.401872       0.393071        1.79583 -0.89579    0.700882        0.103729        0.694495        0.007313        -0.513223\n1       1:57292:C:T     C       T       Y       C       -0.123768       0.912046        -0.353606       -0.220148  -0.893017        -0.374505       -0.141002       -0.249335       0.625097        0.206104\n1       1:57292:C:T     C       T       Y       T       0.123768        -0.912046       0.353606        0.220148   0.893017 0.374505        0.141002        0.249335        -0.625097       -0.206104\n1       1:77874:G:A     G       A       Y       G       1.49202 -1.12567        1.19915 0.0755314       0.401134   -0.015842        0.0452086       0.273072        -0.00716098     0.237545\n1       1:77874:G:A     G       A       Y       A       -1.49202        1.12567 -1.19915        -0.0755314      -0.401134   0.015842        -0.0452086      -0.273072       0.00716098      -0.237545\n1       1:87360:C:T     C       T       Y       C       -0.191803       0.600666        -0.513208       -0.0765155 -0.656552        0.0930399       -0.0238774      -0.330449       -0.192037       -0.727729\n
    plink_results.acount
    #CHROM  ID      REF     ALT     PROVISIONAL_REF?        ALT_CTS OBS_CT\n1       1:15774:G:A     G       A       Y       28      994\n1       1:15777:A:G     A       G       Y       73      994\n1       1:57292:C:T     C       T       Y       104     988\n1       1:77874:G:A     G       A       Y       19      994\n1       1:87360:C:T     C       T       Y       23      998\n1       1:125271:C:T    C       T       Y       967     996\n1       1:232449:G:A    G       A       Y       185     996\n1       1:533113:A:G    A       G       Y       129     992\n1       1:565697:A:G    A       G       Y       334     996\n

    Eventually, we will get the PCA results for all samples.

    PCA results for all samples

    plink_results_projected.sscore
    #FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG     PC9_AVG PC10_AVG\nHG00403 HG00403 390256  390256  0.00290265      -0.0248649      0.0100408       0.00957591      0.00694349      -0.00222251 0.0082228       -0.00114937     0.00335249      0.00437471\nHG00404 HG00404 390696  390696  -0.000141221    -0.027965       0.025389        -0.00582538     -0.00274707     0.00658501  0.0113803       0.0077766       0.0159976       0.0178927\nHG00406 HG00406 388524  388524  0.00707397      -0.0315445      -0.00437011     -0.0012621      -0.0114932      -0.00539483 -0.00620153     0.00452379      -0.000870627    -0.00227979\nHG00407 HG00407 388808  388808  0.00683977      -0.025073       -0.00652723     0.00679729      -0.0116 -0.0102328 0.0139572        0.00618677      0.0138063       0.00825269\nHG00409 HG00409 391646  391646  0.000398695     -0.0290334      -0.0189352      -0.00135977     0.0290436       0.00942829  -0.0171194      -0.0129637      0.0253596       0.022907\nHG00410 HG00410 391600  391600  0.00277094      -0.0280021      -0.0209991      -0.00799085     0.0318038       -0.00284209 -0.031517       -0.0010026      0.0132541       0.0357565\nHG00419 HG00419 387118  387118  0.00684154      -0.0326244      0.00237159      0.0167284       -0.0119737      -0.0079637  -0.0144339      0.00712756      0.0114292       0.00404426\nHG00421 HG00421 387720  387720  0.00157095      -0.0338115      -0.00690541     0.0121058       0.00111378      0.00530794  -0.0017545      -0.00121793     0.00393407      0.00414204\nHG00422 HG00422 387466  387466  0.00439167      -0.0332386      0.000741526     0.0124843       -0.00362248     -0.00343393 -0.00735112     0.00944759      -0.0107516      0.00376537\n
    "},{"location":"05_PCA/#plotting-the-pcs","title":"Plotting the PCs","text":"

    You can now create scatterplots of the PCs using R or Python.

    For plotting using Python: plot_PCA.ipynb

    Scatter plot of PC1 and PC2 using 1KG EAS individuals

    Note : We only used a small proportion of all available variants. This figure only very roughly shows the population structure in East Asia.

    Requirements: Python >= 3 with numpy, pandas, seaborn, and matplotlib.
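
    A minimal plotting sketch, assuming the plink_results_projected.sscore file generated above (column names as shown in the previous section):

    import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Sketch: scatter plot of the first two projected PCs\npcs = pd.read_csv(\"plink_results_projected.sscore\", delim_whitespace=True)\nax = sns.scatterplot(data=pcs, x=\"PC1_AVG\", y=\"PC2_AVG\", s=10)\nax.set_xlabel(\"PC1\")\nax.set_ylabel(\"PC2\")\nplt.savefig(\"pca_plot.png\", dpi=300)\n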

    "},{"location":"05_PCA/#pca-umap","title":"PCA-UMAP","text":"

    (Optional) We can also apply a non-linear dimension reduction algorithm called UMAP to the PCs to further identify local structures. (PCA-UMAP)

    For more details, please check: - https://umap-learn.readthedocs.io/en/latest/index.html

    An example of PCA and PCA-UMAP for population genetics: - Sakaue, S., Hirata, J., Kanai, M., Suzuki, K., Akiyama, M., Lai Too, C., ... & Okada, Y. (2020). Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nature communications, 11(1), 1-11.
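
    A minimal PCA-UMAP sketch, assuming umap-learn is installed and the projected PC file from the previous section is available:

    import pandas as pd\nimport umap\n\n# Sketch: apply UMAP to the top 10 PCs\npcs = pd.read_csv(\"plink_results_projected.sscore\", delim_whitespace=True)\nX = pcs[[\"PC{}_AVG\".format(i) for i in range(1, 11)]]\nembedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)\nprint(embedding.shape)  # (n_samples, 2)\n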

    "},{"location":"05_PCA/#references","title":"References","text":""},{"location":"06_Association_tests/","title":"Association test","text":""},{"location":"06_Association_tests/#overview","title":"Overview","text":""},{"location":"06_Association_tests/#genetic-models","title":"Genetic models","text":"

    To test the association between a phenotype and genotypes, we need to group the genotypes based on genetic models.

    There are three basic genetic models:

    Three genetic models

    For example, suppose we have a biallelic SNP whose reference allele is A and the alternative allele is G.

    There are three possible genotypes for this SNP: AA, AG, and GG.

    This table shows how we group different genotypes under each genetic model for association tests using linear or logistic regressions.

    | Genetic model | AA | AG | GG |
    |---|---|---|---|
    | Additive model | 0 | 1 | 2 |
    | Dominant model | 0 | 1 | 1 |
    | Recessive model | 0 | 0 | 1 |

    Contingency table and non-parametric tests

    A simple way to test association is to use a 2x2 or 2x3 contingency table. For dominant and recessive models, chi-square tests are performed on the 2x2 table. For the additive model, the Cochran-Armitage trend test is performed on the 2x3 table. However, these non-parametric tests do not adjust for bias caused by other covariates such as sex and age.
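
    A minimal scipy sketch of the 2x2 chi-square test under a dominant model; the counts are hypothetical, purely for illustration:

    from scipy.stats import chi2_contingency\n\n# Hypothetical 2x2 table under a dominant model\n# rows: cases, controls; columns: carriers (AG + GG), non-carriers (AA)\ntable = [[120, 130],\n         [90, 160]]\nchi2, p, dof, expected = chi2_contingency(table)\nprint(chi2, p)\n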

    "},{"location":"06_Association_tests/#association-testing-basics","title":"Association testing basics","text":"

    For quantitative traits, we can employ a simple linear regression model to test associations:

    \\[ y = G\\beta_G + X\\beta_X + e \\]

    Interpretation of linear regression

    For binary traits, we can utilize the logistic regression model to test associations:

    \[ logit(p) = G\beta_G + X\beta_X \]

    Linear regression and logistic regression

    "},{"location":"06_Association_tests/#file-preparation","title":"File Preparation","text":"

    To perform genome-wide association tests, usually, we need the following files:

    Phenotype and covariate files

    Phenotype file for a simulated binary trait; B1 is the phenotype name; 1 means the control, 2 means the case.

    1kgeas_binary.txt
    FID IID B1\nHG00403 HG00403 1\nHG00404 HG00404 2\nHG00406 HG00406 1\nHG00407 HG00407 1\nHG00409 HG00409 2\nHG00410 HG00410 2\nHG00419 HG00419 1\nHG00421 HG00421 1\nHG00422 HG00422 1\n

    Covariate file (only top PCs calculated in the previous PCA section)

    plink_results_projected.sscore
    #FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG\nHG00403 HG00403 390256  390256  0.00290265      -0.0248649      -0.0100407      0.00957595      0.00694056      0.00222996      0.00823028      0.00116497      -0.00334937     0.00434627\nHG00404 HG00404 390696  390696  -0.000141221    -0.027965       -0.025389       -0.00582553     -0.00274711     -0.00657958     0.0113769       -0.00778919     -0.0159685      0.0180678\nHG00406 HG00406 388524  388524  0.00707397      -0.0315445      0.00437013      -0.00126195     -0.0114938      0.00538932      -0.00619657     -0.00454686     0.000969112     -0.00217617\nHG00407 HG00407 388808  388808  0.00683977      -0.025073       0.00652723      0.00679731      -0.0116001      0.0102403       0.0139674       -0.00621948     -0.013797       0.00827744\nHG00409 HG00409 391646  391646  0.000398695     -0.0290334      0.0189352       -0.00135996     0.0290464       -0.00941851     -0.0171911      0.01293 -0.0252628      0.0230819\nHG00410 HG00410 391600  391600  0.00277094      -0.0280021      0.0209991       -0.00799089     0.0318043       0.00283456      -0.0315157      0.000978664     -0.0133768      0.0356721\nHG00419 HG00419 387118  387118  0.00684154      -0.0326244      -0.00237159     0.0167284       -0.0119684      0.00795149      -0.0144241      -0.00716183     -0.0115059      0.0038652\nHG00421 HG00421 387720  387720  0.00157095      -0.0338115      0.00690542      0.0121058       0.00111448      -0.00531714     -0.00175494     0.00118513      -0.00391494     0.00414682\nHG00422 HG00422 387466  387466  0.00439167      -0.0332386      -0.000741482    0.0124843       -0.00362885     0.00342491      -0.0073205      -0.00939123     0.010718        0.00360906\n
    "},{"location":"06_Association_tests/#association-tests-using-plink","title":"Association tests using PLINK","text":"

    Please check https://www.cog-genomics.org/plink/2.0/assoc for more details.

    We will perform logistic regression with firth correction for a simulated binary trait under the additive model using the 1KG East Asian individuals.

    Firth correction

    Adding a penalty term to the log-likelihood function when fitting the logistic model results in less bias. - Firth, David. \"Bias reduction of maximum likelihood estimates.\" Biometrika 80.1 (1993): 27-38.

    Quantitative traits

    For quantitative traits, linear regressions will be performed, and in this case we do not need to add firth (since Firth correction is not applicable to linear regression).

    Sample codes for association test using plink for binary traits

    genotypeFile=\"../04_Data_QC/sample_data.clean\" # the clean dataset we generated in previous section\nphenotypeFile=\"../01_Dataset/1kgeas_binary.txt\" # the phenotype file\ncovariateFile=\"../05_PCA/plink_results_projected.sscore\" # the PC score file\n\ncovariateCols=6-10\ncolName=\"B1\"\nthreadnum=2\n\nplink2 \\\n    --bfile ${genotypeFile} \\\n    --pheno ${phenotypeFile} \\\n    --pheno-name ${colName} \\\n    --maf 0.01 \\\n    --covar ${covariateFile} \\\n    --covar-col-nums ${covariateCols} \\\n    --glm hide-covar firth  firth-residualize single-prec-cc \\\n    --threads ${threadnum} \\\n    --out 1kgeas\n

    Note

    Using the latest version of PLINK2, you need to add firth-residualize single-prec-cc to reproduce these results. (The default algorithm and precision for Firth regression changed in 2023.)

    You will see a similar log like:

    Log

    1kgeas.log
    PLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023)       www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to 1kgeas.log.\nOptions in effect:\n--bfile ../04_Data_QC/sample_data.clean\n--covar ../05_PCA/plink_results_projected.sscore\n--covar-col-nums 6-10\n--glm hide-covar firth firth-residualize single-prec-cc\n--maf 0.01\n--out 1kgeas\n--pheno ../01_Dataset/1kgeas_binary.txt\n--pheno-name B1\n--threads 2\n\nStart time: Tue Dec 26 15:52:10 2023\n31934 MiB RAM detected, ~30479 available; reserving 15967 MiB for main\nworkspace.\nUsing up to 2 compute threads.\n500 samples (0 females, 0 males, 500 ambiguous; 500 founders) loaded from\n../04_Data_QC/sample_data.clean.fam.\n1224104 variants loaded from ../04_Data_QC/sample_data.clean.bim.\n1 binary phenotype loaded (248 cases, 250 controls).\n5 covariates loaded from ../05_PCA/plink_results_projected.sscore.\nCalculating allele frequencies... done.\n95372 variants removed due to allele frequency threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1128732 variants remaining after main filters.\n--glm Firth regression on phenotype 'B1': done.\nResults written to 1kgeas.B1.glm.firth .\nEnd time: Tue Dec 26 15:53:49 2023\n

    Let's check the first lines of the output:

    Association test results

    1kgeas.B1.glm.firth
        #CHROM  POS     ID      REF     ALT     PROVISIONAL_REF?        A1      OMITTED A1_FREQ TEST    OBS_CT  OR      LOG(OR)_SE  Z_STAT  P       ERRCODE\n1       15774   1:15774:G:A     G       A       Y       A       G       0.0282828       ADD     495     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       15777   1:15777:A:G     A       G       Y       G       A       0.0737374       ADD     495     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       57292   1:57292:C:T     C       T       Y       T       C       0.104675        ADD     492     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       77874   1:77874:G:A     G       A       Y       A       G       0.0191532       ADD     496     1.12228 0.46275     0.249299        0.80313 .\n1       87360   1:87360:C:T     C       T       Y       T       C       0.0231388       ADD     497     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       125271  1:125271:C:T    C       T       Y       C       T       0.0292339       ADD     496     1.53387 0.373358    1.1458  0.25188 .\n1       232449  1:232449:G:A    G       A       Y       A       G       0.185484        ADD     496     0.884097   0.168961 -0.729096       0.465943        .\n1       533113  1:533113:A:G    A       G       Y       G       A       0.129555        ADD     494     0.90593 0.196631    -0.50243        0.615365        .\n1       565697  1:565697:A:G    A       G       Y       G       A       0.334677        ADD     496     1.04653 0.15286     0.297509        0.766078        .\n

    Usually, other options are added to enhance the sumstats

    "},{"location":"06_Association_tests/#genomic-control","title":"Genomic control","text":"

    Genomic control (GC) is a basic method for controlling for confounding factors including population stratification.

    We will calculate the genomic control factor (lambda GC) to evaluate the inflation. The genomic control factor is calculated by dividing the median of the observed chi-square statistics by the median of the chi-square distribution with 1 degree of freedom (approximately 0.455).

    \\[ \\lambda_{GC} = {median(\\chi^{2}_{observed}) \\over median(\\chi^{2}_1)} \\]

    Then, we can use the genomic control factor to correct the observed chi-square statistics.

    \\[ \\chi^{2}_{corrected} = {\\chi^{2}_{observed} \\over \\lambda_{GC}} \\]
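
    A minimal sketch of the lambda GC calculation from an array of P values (the uniform P values here are placeholders for real sumstats):

    import numpy as np\nfrom scipy.stats import chi2\n\n# Sketch: lambda GC from P values; replace p with the P column of your sumstats\np = np.random.uniform(size=100000)                 # placeholder P values\nchisq_obs = chi2.isf(p, df=1)                      # convert P values to chi-square statistics\nlambda_gc = np.median(chisq_obs) / chi2.ppf(0.5, df=1)   # chi2.ppf(0.5, 1) ~ 0.455\nprint(lambda_gc)                                   # ~1 under the null\n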

    Genomic control is based on the idea that most variants are not associated with the trait, so the observed chi-square distribution should not deviate from the expected one, except for a spike at the tail. However, if the trait is highly polygenic, this assumption may be violated.

    Reference: Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.

    "},{"location":"06_Association_tests/#significant-loci","title":"Significant loci","text":"

    Please check Visualization using gwaslab

    Loci that reached genome-wide significance threshold (P value < 5e-8) :

    SNPID   CHR POS EA  NEA EAF SE  Z   P   OR  N   STATUS  REF ALT\n1:167562605:G:A 1   167562605   A   G   0.391481    0.159645    7.69462 1.419150e-14    3.415780    493 9999999 G   A\n2:55513738:C:T  2   55513738    C   T   0.376008    0.153159    -7.96244    1.686760e-15    0.295373    496 9999999 C   T\n7:134368632:T:G 7   134368632   G   T   0.138105    0.225526    6.89025 5.569440e-12    4.730010    496 9999999 T   G\n20:42758834:T:C 20  42758834    T   C   0.227273    0.184323    -7.76902    7.909780e-15    0.238829    495 9999999 T   C\n

    Warning

    This is just to show the analysis pipeline and data format. The trait was simulated under an unreal condition (effect sizes are extremely large) so the result is meaningless here.

    Allele frequency and Effect size

    "},{"location":"06_Association_tests/#visualization","title":"Visualization","text":"

    To visualize the sumstats, we will create the Manhattan plot, QQ plot and regional plot.

    Please check for codes : Visualization using gwaslab

    "},{"location":"06_Association_tests/#manhattan-plot","title":"Manhattan plot","text":"

    The Manhattan plot is the most classic visualization of GWAS summary statistics. It is a form of scatter plot in which each dot represents the test result for a variant. Variants are sorted by their genomic coordinates and aligned along the X axis, and the Y axis shows -log10(P value) for each variant's test in the GWAS.

    Note

    This kind of plot was named after Manhattan in New York City since it resembles the Manhattan skyline.

    A real Manhattan plot

    I took this photo in 2020 just before the COVID-19 pandemic. It was a cloudy and misty day. The birds formed a significance-threshold line, and the skyscrapers above that line resembled the significant signals in a GWAS. I believe you can easily see how the GWAS Manhattan plot got its name.

    Data we need from sumstats to create Manhattan plots: the chromosome, base-pair position, and P value of each variant.

    Steps to create Manhattan plot

    1. sort the variants by genome coordinates.
    2. map the genome coordinates of variants to the x axis.
    3. convert P value to -log10(P).
    4. create the scatter plot.
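
    A minimal sketch following these steps (column names CHR, POS, and P are assumptions; adjust them to your sumstats):

    import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Sketch of a basic Manhattan plot; assumes numeric CHR/POS and a P column\ndf = pd.read_csv(\"sumstats.txt\", delim_whitespace=True)\ndf = df.sort_values([\"CHR\", \"POS\"])\ndf[\"x\"] = range(len(df))                    # x coordinate after sorting by position\ndf[\"logp\"] = -np.log10(df[\"P\"])\nplt.scatter(df[\"x\"], df[\"logp\"], c=df[\"CHR\"] % 2, s=2, cmap=\"tab10\")   # alternate colors by chromosome\nplt.axhline(-np.log10(5e-8), color=\"red\", lw=0.5)   # genome-wide significance line\nplt.xlabel(\"Genomic position\")\nplt.ylabel(\"-log10(P)\")\nplt.savefig(\"manhattan.png\", dpi=300)\n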
    "},{"location":"06_Association_tests/#quantile-quantile-plot","title":"Quantile-quantile plot","text":"

    Quantile-quantile plot (also known as Q-Q plot), is commonly used to compare an observed distribution with its expected distribution. For a specific point (x,y) on Q-Q plot, its y coordinate corresponds to one of the quantiles of the observed distribution, while its x coordinate corresponds to the same quantile of the expected distribution.

    Quantile-quantile plot is used to check if there is any significant inflation in P value distribution, which usually indicates population stratification or cryptic relatedness.

    Data we need from sumstats to create the Q-Q plot: the P value of each variant.

    Steps to create Q-Q plot

    Suppose we have n variants in our sumstats,

    1. convert the n P value to -log10(P).
    2. sort the -log10(P) values in ascending order.
    3. get n numbers from (0,1) with equal intervals.
    4. take -log10 of the n numbers and sort them in ascending order.
    5. create scatter plot using the sorted -log10(P) of sumstats as Y and sorted -log10(P) we generated as X.
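
    A minimal sketch following these steps (p is assumed to be the array of P values from your sumstats; uniform values are used here as placeholders):

    import numpy as np\nimport matplotlib.pyplot as plt\n\n# Sketch of a basic Q-Q plot\np = np.random.uniform(size=100000)     # placeholder for real P values\nn = len(p)\nobserved = np.sort(-np.log10(p))\nexpected = np.sort(-np.log10((np.arange(1, n + 1) - 0.5) / n))   # equally spaced quantiles of U(0,1)\nplt.scatter(expected, observed, s=2)\nplt.plot([0, expected.max()], [0, expected.max()], color=\"red\", lw=0.5)   # y = x reference line\nplt.xlabel(\"Expected -log10(P)\")\nplt.ylabel(\"Observed -log10(P)\")\nplt.savefig(\"qq.png\", dpi=300)\n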

    Note

    The expected distribution of P value is a Uniform distribution from 0 to 1.

    \\[P_{expected} \\sim U(0,1)\\]"},{"location":"06_Association_tests/#regional-plot","title":"Regional plot","text":"

    The Manhattan plot is very useful for checking the overview of our sumstats. But if we want to examine a specific genomic locus, we need a plot with finer resolution. This kind of plot is called a regional plot. It is basically a Manhattan plot of only a small region of the genome, with points colored by their LD \(r^2\) with the lead variant in the region.

    Such a plot is especially helpful to understand the signal and loci, e.g., LD structure, independent signals, and genes.

    The regional plot for the loci of 2:55513738:C:T.

    Please check Visualization using gwaslab

    "},{"location":"06_Association_tests/#gwas-ssf","title":"GWAS-SSF","text":"

    To standardize the format of GWAS summary statistics for sharing, the GWAS-SSF format was proposed in 2022. This format is now used as the standard format for the GWAS Catalog.

    GWAS-SSF consists of :

    1. a tab-separated data file with well-defined fields (shown in the following figure)
    2. an accompanying metadata file describing the study (such as sample ancestry, genotyping method, md5sum, and so forth)

    Schematic representation of GWAS-SSF data file

    GWAS-SSF

    Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv, 2022-07.

    For details, please check:

    "},{"location":"07_Annotation/","title":"Variant Annotation","text":""},{"location":"07_Annotation/#table-of-contents","title":"Table of Contents","text":""},{"location":"07_Annotation/#annovar","title":"ANNOVAR","text":"

    ANNOVAR is a simple and efficient command line tool for variant annotation.

    In this tutorial, we will use ANNOVAR to annotate the variants in our summary statistics (hg19).

    "},{"location":"07_Annotation/#install","title":"Install","text":"

    Download ANNOVAR from here (registration required; freely available to personal, academic and non-profit use only.)

    You will receive an email with the download link after registration. Download it and decompress:

    tar -xvzf annovar.latest.tar.gz\n

    For refGene annotation for hg19, we do not need to download additional files.

    "},{"location":"07_Annotation/#format-input-file","title":"Format input file","text":"

    The default input file for ANNOVAR is a 1-based coordinate file.

    We will only use the first 100000 variants as an example.

    annovar_input

    awk 'NR>1 && NR<100000 {print $1,$2,$2,$4,$5}' ../06_Association_tests/1kgeas.B1.glm.logistic.hybrid > annovar_input.txt\n
    head annovar_input.txt \n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n

    With -vcfinput option, ANNOVAR can accept input files in VCF format.

    "},{"location":"07_Annotation/#annotation","title":"Annotation","text":"

    Annotate the variants with gene information.

    A minimal example of annotation using refGene

    input=annovar_input.txt\nhumandb=/home/he/tools/annovar/annovar/humandb\ntable_annovar.pl ${input} ${humandb} -buildver hg19 -out myannotation -remove -protocol refGene -operation g -nastring . -polish\n
    Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene\n1   13273   13273   G   C   ncRNA_exonic    DDX11L1;LOC102725121    .   .   .\n1   14599   14599   T   A   ncRNA_exonic    WASH7P  .   .   .\n1   14604   14604   A   G   ncRNA_exonic    WASH7P  .   .   .\n1   14930   14930   A   G   ncRNA_intronic  WASH7P  .   .   .\n1   69897   69897   T   C   exonic  OR4F5   .   synonymous SNV  OR4F5:NM_001005484:exon1:c.T807C:p.S269S\n1   86331   86331   A   G   intergenic  OR4F5;LOC729737 dist=16323;dist=48442   .   .\n1   91581   91581   G   A   intergenic  OR4F5;LOC729737 dist=21573;dist=43192   .   .\n1   122872  122872  T   G   intergenic  OR4F5;LOC729737 dist=52864;dist=11901   .   .\n1   135163  135163  C   T   ncRNA_exonic    LOC729737   .   .   .\n
    "},{"location":"07_Annotation/#additional-databases","title":"Additional databases","text":"

    ANNOVAR supports a wide range of commonly used databases including dbsnp , dbnsfp, clinvar, gnomad, 1000g, cadd and so forth. For details, please check ANNOVAR's official documents

    You can check the Table Name listed in the link above and download the database you need using the following command.

    Example: Downloading avsnp150 for hg19 from ANNOVAR

    annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb/\n

    An example of annotation using multiple databases

    # input file is in vcf format\ntable_annovar.pl \\\n  ${in_vcf} \\\n  ${humandb} \\\n  -buildver hg19 \\\n  -protocol refGene,avsnp150,clinvar_20200316,gnomad211_exome \\\n  -operation g,f,f,f \\\n  -remove \\\n  -out ${out_prefix} \\\n  -vcfinput\n
    "},{"location":"07_Annotation/#vep-under-construction","title":"VEP (under construction)","text":""},{"location":"07_Annotation/#install_1","title":"Install","text":"
    git clone https://github.com/Ensembl/ensembl-vep.git\ncd ensembl-vep\nperl INSTALL.pl\n
    Hello! This installer is configured to install v108 of the Ensembl API for use by the VEP.\nIt will not affect any existing installations of the Ensembl API that you may have.\n\nIt will also download and install cache files from Ensembl's FTP server.\n\nChecking for installed versions of the Ensembl API...done\n\nSetting up directories\nDestination directory ./Bio already exists.\nDo you want to overwrite it (if updating VEP this is probably OK) (y/n)? y\n - fetching BioPerl\n - unpacking ./Bio/tmp/release-1-6-924.zip\n - moving files\n\nDownloading required Ensembl API files\n - fetching ensembl\n - unpacking ./Bio/tmp/ensembl.zip\n - moving files\n - getting version information\n - fetching ensembl-variation\n - unpacking ./Bio/tmp/ensembl-variation.zip\n - moving files\n - getting version information\n - fetching ensembl-funcgen\n - unpacking ./Bio/tmp/ensembl-funcgen.zip\n - moving files\n - getting version information\n - fetching ensembl-io\n - unpacking ./Bio/tmp/ensembl-io.zip\n - moving files\n - getting version information\n\nTesting VEP installation\n - OK!\n\nThe VEP can either connect to remote or local databases, or use local cache files.\nUsing local cache files is the fastest and most efficient way to run the VEP\nCache files will be stored in /home/he/.vep\nDo you want to install any cache files (y/n)? y\n\nThe following species/files are available; which do you want (specify multiple separated by spaces or 0 for all): \n1 : acanthochromis_polyacanthus_vep_108_ASM210954v1.tar.gz (69 MB)\n2 : accipiter_nisus_vep_108_Accipiter_nisus_ver1.0.tar.gz (55 MB)\n...\n466 : homo_sapiens_merged_vep_108_GRCh37.tar.gz (16 GB)\n467 : homo_sapiens_merged_vep_108_GRCh38.tar.gz (26 GB)\n468 : homo_sapiens_refseq_vep_108_GRCh37.tar.gz (13 GB)\n469 : homo_sapiens_refseq_vep_108_GRCh38.tar.gz (22 GB)\n470 : homo_sapiens_vep_108_GRCh37.tar.gz (14 GB)\n471 : homo_sapiens_vep_108_GRCh38.tar.gz (22 GB)\n\n  Total: 221 GB for all 471 files\n\n? 470\n - downloading https://ftp.ensembl.org/pub/release-108/variation/indexed_vep_cache/homo_sapiens_vep_108_GRCh37.tar.gz\n
    "},{"location":"08_LDSC/","title":"LD score regression","text":""},{"location":"08_LDSC/#table-of-contents","title":"Table of Contents","text":""},{"location":"08_LDSC/#introduction","title":"Introduction","text":"

    LDSC is one of the most commonly used command line tools to estimate inflation, heritability, genetic correlation, and cell/tissue type specificity from GWAS summary statistics.

    "},{"location":"08_LDSC/#ld-linkage-disequilibrium","title":"LD: Linkage disequilibrium","text":"

    Linkage disequilibrium (LD) : non-random association of alleles at different loci in a given population. (Wiki)

    "},{"location":"08_LDSC/#ld-score","title":"LD score","text":"

    LD score \\(l_j\\) for a SNP \\(j\\) is defined as the sum of \\(r^2\\) for the SNP and other SNPs in a region.

    \\[ l_j= \\Sigma_k{r^2_{j,k}} \\]"},{"location":"08_LDSC/#ld-score-regression_1","title":"LD score regression","text":"

    Key idea: a variant will have a higher test statistic if it is in LD with a causal variant, and the elevation is proportional to its correlation ( \(r^2\) ) with the causal variant.

    \\[ E[\\chi^2|l_j] = {{Nh^2l_j}\\over{M}} + Na + 1 \\]

    For more details of LD score regression, please refer to : - Bulik-Sullivan, Brendan K., et al. \"LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.\" Nature genetics 47.3 (2015): 291-295.

    "},{"location":"08_LDSC/#install-ldsc","title":"Install LDSC","text":"

    LDSC can be downloaded from github (GPL-3.0 license): https://github.com/bulik/ldsc

    For ldsc, we need anaconda to create virtual environment (for python2). If you haven't installed Anaconda, please check how to install anaconda.

    # change to your directory for tools\ncd ~/tools\n\n# clone the ldsc github repository\ngit clone https://github.com/bulik/ldsc.git\n\n# create a virtual environment for ldsc (python2)\ncd ldsc\nconda env create --file environment.yml  \n\n# activate ldsc environment\nconda activate ldsc\n
    "},{"location":"08_LDSC/#data-preparation","title":"Data Preparation","text":"

    In this tutorial, we will use sample summary statistics for HDLC and LDLC from Jenger. - Kanai, Masahiro, et al. \"Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases.\" Nature genetics 50.3 (2018): 390-400.

    The Miami plot for the two traits:

    "},{"location":"08_LDSC/#download-sample-summary-statistics","title":"Download sample summary statistics","text":"
    # HDL-c and LDL-c in Biobank Japan\nwget -O BBJ_LDLC.txt.gz http://jenger.riken.jp/61analysisresult_qtl_download/\nwget -O BBJ_HDLC.txt.gz http://jenger.riken.jp/47analysisresult_qtl_download/\n
    "},{"location":"08_LDSC/#download-reference-files","title":"Download reference files","text":"

    # change to your ldsc directory\ncd ~/tools/ldsc\nmkdir resource\ncd ./resource\n\n# snplist\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2\n\n# EAS ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/eas_ldscores.tar.bz2\n\n# EAS weight\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz\n\n# EAS frequency\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_plinkfiles.tgz\n\n# EAS baseline model\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz\n\n# Cell type ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/LDSC_SEG_ldscores/Cahoy_EAS_1000Gv3_ldscores.tar.gz\n
    You can then decompress the files and organize them.

    "},{"location":"08_LDSC/#munge-sumstats","title":"Munge sumstats","text":"

    Before the analysis, we need to format and clean the raw sumstats.

    Note

    rsIDs are used here. If the sumstats contain only IDs like CHR:POS:REF:ALT, annotate them with rsIDs first.

    snplist=~/tools/ldsc/resource/w_hm3.snplist\nmunge_sumstats.py \\\n    --sumstats BBJ_HDLC.txt.gz \\\n    --merge-alleles $snplist \\\n    --a1 ALT \\\n    --a2 REF \\\n    --chunksize 500000 \\\n    --out BBJ_HDLC\nmunge_sumstats.py \\\n    --sumstats BBJ_LDLC.txt.gz \\\n    --a1 ALT \\\n    --a2 REF \\\n    --chunksize 500000 \\\n    --merge-alleles $snplist \\\n    --out BBJ_LDLC\n

    After munging, you will get two munged and formatted files:

    BBJ_HDLC.sumstats.gz\nBBJ_LDLC.sumstats.gz\n
    And these are the files we will use to run LD score regression.

    "},{"location":"08_LDSC/#ld-score-regression_2","title":"LD score regression","text":"

    Univariate LD score regression is utilized to estimate the heritability of a trait and the contribution of confounding factors (cryptic relatedness and population stratification).

    Using the munged sumstats, we can run:

    ldsc.py \\\n  --h2 BBJ_HDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_HDLC\n\nldsc.py \\\n  --h2 BBJ_LDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_LDLC\n

    Let's check the results for HDLC:

    cat BBJ_HDLC.log\n*********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--h2 BBJ_HDLC.sumstats.gz \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Sat Dec 24 20:40:34 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nUsing two-step estimator with cutoff at 30.\nTotal Observed scale h2: 0.1583 (0.0281)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.0563 (0.0114)\nRatio: 0.1981 (0.0402)\nAnalysis finished at Sat Dec 24 20:40:41 2022\nTotal time elapsed: 6.57s\n

    From the log, we can see the estimated SNP heritability (Total Observed scale h2: 0.1583), lambda GC (1.1523), the intercept (1.0563), and the ratio (0.1981).

    According to LDSC documents, Ratio measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of ratio should be close to zero, though in practice values of 10-20% are not uncommon.
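
    As a quick sanity check using the HDLC log above and the formula below: \(Ratio = (1.0563 - 1) / (1.2843 - 1) \approx 0.198\), which matches the reported value of 0.1981.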

    \\[ Ratio = {{intercept-1}\\over{mean(\\chi^2)-1}} \\]"},{"location":"08_LDSC/#distribution-of-h2-and-intercept-across-traits-in-ukb","title":"Distribution of h2 and intercept across traits in UKB","text":"

    The Neale Lab estimated SNP heritability using LDSC across more than 4,000 primary GWAS in UKB. You can check the distributions of the SNP heritability and intercept estimates via the following link to get an idea of what to expect from LD score regression:

    https://nealelab.github.io/UKBB_ldsc/viz_h2.html

    "},{"location":"08_LDSC/#cross-trait-ld-score-regression","title":"Cross-trait LD score regression","text":"

    Cross-trait LD score regression is employed to estimate the genetic correlation between a pair of traits.

    Key idea: replace \(\chi^2\) in univariate LD score regression with the product of the two traits' z scores, \(z_{1j}z_{2j}\); the relationship with LD scores (SNPs in high LD contribute more) still holds.

    \\[ E[z_{1j}z_{2j}] = {{\\sqrt{N_1N_2}\\rho_g}\\over{M}}l_j + {{\\rho N_s}\\over{\\sqrt{N_1N_2}}} \\]

    Then we can get the genetic correlation by:

    \\[ r_g = {{\\rho_g}\\over{\\sqrt{h_1^2h_2^2}}} \\]

    ldsc.py \\\n  --rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_HDLC_LDLC\n
    Let's check the results:

    *********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC_LDLC \\\n--rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Thu Dec 29 21:02:37 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nComputing rg for phenotype 2/2\nReading summary statistics from BBJ_LDLC.sumstats.gz ...\nRead summary statistics for 1217311 SNPs.\nAfter merging with summary statistics, 1012040 SNPs remain.\n1012040 SNPs with valid alleles.\n\nHeritability of phenotype 1\n---------------------------\nTotal Observed scale h2: 0.1054 (0.0383)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.1234 (0.0607)\nRatio: 0.4342 (0.2134)\n\nHeritability of phenotype 2/2\n-----------------------------\nTotal Observed scale h2: 0.0543 (0.0211)\nLambda GC: 1.0833\nMean Chi^2: 1.1465\nIntercept: 1.0583 (0.0335)\nRatio: 0.398 (0.2286)\n\nGenetic Covariance\n------------------\nTotal Observed scale gencov: 0.0121 (0.0106)\nMean z1*z2: -0.001\nIntercept: -0.0198 (0.0121)\n\nGenetic Correlation\n-------------------\nGenetic Correlation: 0.1601 (0.1821)\nZ-score: 0.8794\nP: 0.3792\n\n\nSummary of Genetic Correlation Results\np1                    p2      rg      se       z       p  h2_obs  h2_obs_se  h2_int  h2_int_se  gcov_int  gcov_int_se\nBBJ_HDLC.sumstats.gz  BBJ_LDLC.sumstats.gz  0.1601  0.1821  0.8794  0.3792  0.0543     0.0211  1.0583     0.0335   -0.0198       0.0121\n\nAnalysis finished at Thu Dec 29 21:02:47 2022\nTotal time elapsed: 10.39s\n
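    As a quick sanity check, the reported genetic correlation can be recomputed from the genetic covariance and the two heritability estimates in the log (the small difference from 0.1601 comes from rounding of the reported values):

    import math\n\n# rg = gencov / sqrt(h2_1 * h2_2), using the values reported in the log\ngencov, h2_hdlc, h2_ldlc = 0.0121, 0.1054, 0.0543\nprint(round(gencov / math.sqrt(h2_hdlc * h2_ldlc), 4))  # ~0.16\n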
    "},{"location":"08_LDSC/#partitioned-ld-regression","title":"Partitioned LD regression","text":"

    Partitioned LD regression is utilized to evaluate the contribution of each functional group to the total SNP heritability.

    \\[ E[\\chi^2] = N \\sum\\limits_C \\tau_C l(j,C) + Na + 1 \\]
    ldsc.py \\\n  --h2 BBJ_HDLC.sumstats.gz \\\n  --overlap-annot \\\n  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n  --frqfile-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_plinkfiles/1000G.EAS.QC. \\\n  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n  --out BBJ_HDLC_baseline\n
    "},{"location":"08_LDSC/#celltype-specificity-ld-regression","title":"Celltype specificity LD regression","text":"

    LDSC-SEG : LD score regression applied to specifically expressed genes

    An extension of Partitioned LD regression. Categories are defined by tissue or cell-type specific genes.

    ldsc.py \\\n  --h2-cts BBJ_HDLC.sumstats.gz \\\n  --ref-ld-chr-cts ~/tools/ldsc/resource/Cahoy_EAS_1000Gv3_ldscores/Cahoy.EAS.ldcts \\\n  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n  --out BBJ_HDLC_baseline_cts\n
    "},{"location":"08_LDSC/#reference","title":"Reference","text":""},{"location":"09_Gene_based_analysis/","title":"Gene and gene-set analysis","text":""},{"location":"09_Gene_based_analysis/#table-of-contents","title":"Table of Contents","text":""},{"location":"09_Gene_based_analysis/#magma-introduction","title":"MAGMA Introduction","text":"

    MAGMA is one of the most commonly used tools for gene-based and gene-set analysis.

    Gene-level analysis in MAGMA uses two models:

    1. Multiple linear principal components regression

    MAGMA employs a multiple linear principal components regression and an F-test to obtain P values for genes. The multiple linear principal components regression model is:

    \\[ Y = \\alpha_{0,g} + X_g \\alpha_g + W \\beta_g + \\epsilon_g \\]

    \\(X_g\\) is obtained by first projecting the variant matrix of a gene onto its PC, and removing PCs with samll eigenvalues.

    Note

    The linear principal components regression model requires raw genotype data.

    2. SNP-wise models

    SNP-wise Mean: performs a test on the mean SNP association within the gene.

    Note

    SNP-wise models use summary statistics and a reference LD panel.

    Gene-set analysis

    Quote

    Competitive gene-set analysis tests whether the genes in a gene-set are more strongly associated with the phenotype of interest than other genes.

    P values for each gene are converted to Z scores to perform gene-set level analysis.

    \\[ Z = \\beta_{0,S} + S_S \\beta_S + \\epsilon \\] "},{"location":"09_Gene_based_analysis/#install-magma","title":"Install MAGMA","text":"

    Download MAGMA for your operating system from the following url:

    MAGMA: https://ctg.cncr.nl/software/magma

    For example:

    cd ~/tools\nmkdir MAGMA\ncd MAGMA\nwget https://ctg.cncr.nl/software/MAGMA/prog/magma_v1.10.zip\nunzip magma_v1.10.zip\n
    Add magma to your environment path.

    Test if it is successfully installed.

    $ magma --version\nMAGMA version: v1.10 (linux)\n

    "},{"location":"09_Gene_based_analysis/#download-reference-files","title":"Download reference files","text":"

    We need the following reference files:

    The gene location files and LD reference panel can be downloaded from magma website.

    -> https://ctg.cncr.nl/software/magma

    The third one can be downloaded from MSigDB.

    -> https://www.gsea-msigdb.org/gsea/msigdb/

    "},{"location":"09_Gene_based_analysis/#format-input-files","title":"Format input files","text":"
    zcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,$2,$3}' > HDLC_chr3.magma.input.snp.chr.pos.txt\nzcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,10^(-$11)}' >  HDLC_chr3.magma.input.p.txt\n
    "},{"location":"09_Gene_based_analysis/#annotate-snps","title":"Annotate SNPs","text":"
    snploc=./HDLC_chr3.magma.input.snp.chr.pos.txt\nncbi37=~/tools/magma/NCBI37/NCBI37.3.gene.loc\nmagma --annotate \\\n      --snp-loc ${snploc} \\\n      --gene-loc ${ncbi37} \\\n      --out HDLC_chr3\n

    Tip

    Usually, to capture the variants in regulatory regions, we add windows upstream and downstream of the genes with --annotate window.

    For example, --annotate window=35,10 sets a 35 kilobase (kb) upstream and 10 kb downstream window.

    "},{"location":"09_Gene_based_analysis/#gene-based-analysis","title":"Gene-based analysis","text":"
    ref=~/tools/magma/g1000_eas/g1000_eas\nmagma \\\n    --bfile $ref \\\n    --pval ./HDLC_chr3.magma.input.p.txt N=70657 \\\n    --gene-annot HDLC_chr3.genes.annot \\\n    --out HDLC_chr3\n
    "},{"location":"09_Gene_based_analysis/#gene-set-level-analysis","title":"Gene-set level analysis","text":"
    geneset=/home/he/tools/magma/MSigDB/msigdb_v2022.1.Hs_files_to_download_locally/msigdb_v2022.1.Hs_GMTs/msigdb.v2022.1.Hs.entrez.gmt\nmagma \\\n    --gene-results HDLC_chr3.genes.raw \\\n    --set-annot ${geneset} \\\n    --out HDLC_chr3\n
    "},{"location":"09_Gene_based_analysis/#reference","title":"Reference","text":""},{"location":"10_PRS/","title":"Polygenic risk scores","text":""},{"location":"10_PRS/#definition","title":"Definition","text":"

    Polygenic risk score (PRS), also known as polygenic score (PGS) or genetic risk score (GRS), is a score that summarizes the effect sizes of genetic variants on a certain disease or trait (a weighted sum of disease/trait-associated alleles).

    To calculate the PRS for sample j,

    \\[PRS_j = \\sum_{i=0}^{i=M} x_{i,j} \\beta_{i}\\] "},{"location":"10_PRS/#prs-analysis-workflow","title":"PRS Analysis Workflow","text":"
    1. Developing PRS model using base data
    2. Performing validation to obtain best-fit parameters
    3. Evaluation in an independent population
    "},{"location":"10_PRS/#methods","title":"Methods","text":"Category Description Representative Methods P value thresholding P + T C+T, PRSice Beta shrinkage genome-wide PRS model LDpred, PRS-CS

    In this tutorial, we will first briefly introduce how to develop a PRS model using the sample data and then demonstrate how we can download PRS models from PGS Catalog and apply them to our sample genotype data.

    "},{"location":"10_PRS/#ctpt-using-plink","title":"C+T/P+T using PLINK","text":"

    P+T stands for Pruning + Thresholding, also known as Clumping and Thresholding (C+T), which is a simple and straightforward approach to constructing PRS models.

    Clumping

    Clumping: LD-pruning based on P value. It is an approach to selecting variants when there are multiple significant associations in high LD in the same region.

    The three important parameters for clumping in PLINK are --clump-p1 (the significance threshold for index variants), --clump-r2 (the LD r2 threshold for clumping), and --clump-kb (the physical distance threshold).

    Clumping using PLINK

    #!/bin/bash\n\nplinkFile=../04_Data_QC/sample_data.clean\nsumStats=../06_Association_tests/1kgeas.B1.glm.firth\n\nplink \\\n    --bfile ${plinkFile} \\\n    --clump-p1 0.0001 \\\n    --clump-r2 0.1 \\\n    --clump-kb 250 \\\n    --clump ${sumStats} \\\n    --clump-snp-field ID \\\n    --clump-field P \\\n    --out 1kg_eas\n

    log

    --clump: 40 clumps formed from 307 top variants.\n
    Check only the header and the first \"clump\" of SNPs.

    head -n 2 1kg_eas.clumped\n  CHR    F              SNP         BP        P    TOTAL   NSIG    S05    S01   S001  S0001    SP2\n2    1   2:55513738:C:T   55513738   1.69e-15       52      0      3      1      6     42 2:55305475:A:T(1),2:55338196:T:C(1),2:55347135:G:A(1),2:55351853:A:G(1),2:55363460:G:A(1),2:55395372:A:G(1),2:55395578:G:A(1),2:55395807:C:T(1),2:55405847:C:A(1),2:55408556:C:A(1),2:55410835:C:T(1),2:55413644:C:G(1),2:55435439:C:T(1),2:55449464:T:C(1),2:55469819:A:T(1),2:55492154:G:A(1),2:55500529:A:G(1),2:55502651:A:G(1),2:55508333:G:C(1),2:55563020:A:G(1),2:55572944:T:C(1),2:55585915:A:G(1),2:55599810:C:T(1),2:55605943:A:G(1),2:55611766:T:C(1),2:55612986:G:C(1),2:55619923:C:T(1),2:55622624:G:A(1),2:55624520:C:T(1),2:55628936:G:C(1),2:55638830:T:C(1),2:55639023:A:T(1),2:55639980:C:T(1),2:55640649:G:A(1),2:55641045:G:A(1),2:55642887:C:T(1),2:55647729:A:G(1),2:55650512:G:A(1),2:55659155:A:G(1),2:55665620:A:G(1),2:55667476:G:T(1),2:55670729:A:G(1),2:55676257:C:T(1),2:55685927:C:A(1),2:55689569:A:T(1),2:55689913:T:C(1),2:55693097:C:G(1),2:55707583:T:C(1),2:55720135:C:G(1)\n
    "},{"location":"10_PRS/#beta-shrinkage-using-prs-cs","title":"Beta shrinkage using PRS-CS","text":"\\[ \\beta_j | \\Phi_j \\sim N(0,\\phi\\Phi_j) , \\Phi_j \\sim g \\]

    Reference: Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications, 10(1), 1-10.

    "},{"location":"10_PRS/#parameter-tuning","title":"Parameter tuning","text":"Method Description Cross-validation 10-fold cross validation. This method usually requires large-scale genotype dataset. Independent population Perform validation in an independent population of the same ancestry. Pseudo-validation A few methods can estimate a single optimal shrinkage parameter using only the base GWAS summary statistics."},{"location":"10_PRS/#pgs-catalog","title":"PGS Catalog","text":"

    Just like GWAS Catalog, you can now download published PRS models from PGS Catalog.

    URL: http://www.pgscatalog.org/

    Reference: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.

    "},{"location":"10_PRS/#calculate-prs-using-plink","title":"Calculate PRS using PLINK","text":"
    plink --score <score_filename> [variant ID col.] [allele col.] [score col.] ['header']\n

    Please check here for detailed documentation on plink --score.

    Example

    # genotype data\nplinkFile=../04_Data_QC/sample_data.clean\n# summary statistics for scoring\nsumStats=./t2d_plink_reduced.txt\n# SNPs after clumping\nawk 'NR!=1{print $3}' 1kg_eas.clumped > 1kgeas.valid.snp\n\nplink \\\n    --bfile ${plinkFile} \\\n    --score ${sumStats} 1 2 3 header \\\n    --extract 1kgeas.valid.snp \\\n    --out 1kgeas\n

    For thresholding using P values, we can create a range file and a p-value file.

    The options we use:

    --q-score-range <range file> <data file> [variant ID col.] [data col.] ['header']\n

    Example

    # SNP - P value file for thresholding\nawk '{print $1,$4}' ${sumStats} > SNP.pvalue\n\n# create a range file with 3 columns: range label, p-value lower bound, p-value upper bound\nhead range_list\npT0.001 0 0.001\npT0.05 0 0.05\npT0.1 0 0.1\npT0.2 0 0.2\npT0.3 0 0.3\npT0.4 0 0.4\npT0.5 0 0.5\n

    and then calculate the scores using the p-value ranges:

    plink2 \\\n    --bfile ${plinkFile} \\\n    --score ${sumStats} 1 2 3 header cols=nallele,scoreavgs,denom,scoresums \\\n    --q-score-range range_list SNP.pvalue \\\n    --extract 1kgeas.valid.snp \\\n    --out 1kgeas\n

    You will get the following files:

    1kgeas.pT0.001.sscore\n1kgeas.pT0.05.sscore\n1kgeas.pT0.1.sscore\n1kgeas.pT0.2.sscore\n1kgeas.pT0.3.sscore\n1kgeas.pT0.4.sscore\n1kgeas.pT0.5.sscore\n

    Take a look at the files:

    head 1kgeas.pT0.1.sscore\n#IID    ALLELE_CT       DENOM   SCORE1_AVG      SCORE1_SUM\nHG00403 54554   54976   2.84455e-05     1.56382\nHG00404 54574   54976   5.65172e-05     3.10709\nHG00406 54284   54976   -3.91872e-05    -2.15436\nHG00407 54348   54976   -9.87606e-05    -5.42946\nHG00409 54760   54976   1.67157e-05     0.918963\nHG00410 54656   54976   3.74405e-05     2.05833\nHG00419 54052   54976   -6.4035e-05     -3.52039\nHG00421 54210   54976   -1.55942e-05    -0.857305\nHG00422 54102   54976   5.28824e-05     2.90726\n
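    Note that SCORE1_AVG is simply SCORE1_SUM divided by DENOM. For example, for HG00403:

    # SCORE1_AVG = SCORE1_SUM / DENOM, using the first sample in 1kgeas.pT0.1.sscore\nscore_sum, denom = 1.56382, 54976\nprint(score_sum / denom)  # ~2.84455e-05, matching SCORE1_AVG\n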
    "},{"location":"10_PRS/#meta-scoring-methods-for-prs","title":"Meta-scoring methods for PRS","text":"

    It has been shown recently that PRS models generated from multiple traits using a meta-scoring method potentially outperform PRS models generated from a single trait. Inouye et al. first used this approach to generate a PRS model for CAD from multiple PRS models.

    Potential advantages of meta-score for PRS generation

    Reference: Inouye, M., Abraham, G., Nelson, C. P., Wood, A. M., Sweeting, M. J., Dudbridge, F., ... & UK Biobank CardioMetabolic Consortium CHD Working Group. (2018). Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology, 72(16), 1883-1893.

    elastic net

    Elastic net is a common approach for variable selection when there are highly correlated variables (for example, PRS of correlated diseases are often highly correlated). When fitting linear or logistic models, L1 and L2 penalties are added (regularization).

    \\[ \\hat{\\beta} \\equiv argmin({\\parallel y- X \\beta \\parallel}^2 + \\lambda_2{\\parallel \\beta \\parallel}^2 + \\lambda_1{\\parallel \\beta \\parallel} ) \\]

    After validation, a meta-PRS can be generated from the distinct PRS of other genetically correlated diseases:

    \\[PRS_{meta} = {w_1}PRS_{Trait1} + {w_2}PRS_{Trait2} + {w_3}PRS_{Trait3} + ... \\]
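
    A minimal sketch of how the weights could be estimated with an elastic net in Python (the data here are simulated for illustration; this is not the exact procedure used in the papers cited below):

    import numpy as np\nfrom sklearn.linear_model import ElasticNetCV\n\n# simulated validation data: rows = individuals, columns = PRS for correlated traits\nrng = np.random.default_rng(0)\nprs = rng.normal(size=(1000, 3))\ny = 0.6 * prs[:, 0] + 0.3 * prs[:, 1] + rng.normal(size=1000)\n\n# elastic net: L1 and L2 penalties, with the mixing ratio chosen by cross-validation\nmodel = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(prs, y)\nw = model.coef_            # w1, w2, w3 in the formula above\nmeta_prs = prs @ w         # the meta-PRS for each individual\nprint(w)\n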

    An example: Abraham, G., Malik, R., Yonova-Doing, E., Salim, A., Wang, T., Danesh, J., ... & Dichgans, M. (2019). Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nature communications, 10(1), 1-10.

    "},{"location":"10_PRS/#reference","title":"Reference","text":""},{"location":"11_meta_analysis/","title":"Meta-analysis","text":""},{"location":"11_meta_analysis/#aims","title":"Aims","text":"

    Meta-analysis is one of the most commonly used statistical methods to combine the evidence from multiple studies into a single result.

    Potential problems for small-scale genome-wide association studies

    To address these problems, meta-analysis is a powerful approach to integrate multiple GWAS summary statistics, especially when more and more summary statistics are publicly available. This method allows us to gain statistical power as the sample size increases.

    What we could achieve by conducting meta-analysis

    "},{"location":"11_meta_analysis/#a-typical-workflow-of-meta-analysis","title":"A typical workflow of meta-analysis","text":""},{"location":"11_meta_analysis/#harmonization-and-qc-for-gwa-meta-analysis","title":"Harmonization and QC for GWA meta-analysis","text":"

    Before performing any type of meta-analysis, we need to make sure our datasets contain sufficient information and the datasets are QCed and harmonized. It is important to perform this step to avoid any unexpected errors and heterogeneity.

    Key points for Dataset selection

    Key points for Quality control

    Key points for Harmonization

    "},{"location":"11_meta_analysis/#fixed-effects-meta-analysis","title":"Fixed effects meta-analysis","text":"

    Simply put, the fixed effects mentioned here mean that the between-study variance is zero. Under the fixed-effect model, we assume a common effect size across studies for a certain SNP.

    Fixed effect model

    \\[ \\bar{\\beta_{ij}} = {{\\sum_{i=1}^{k} {w_{ij} \\beta_{ij}}}\\over{\\sum_{i=1}^{k} {w_{ij}}}} \\] "},{"location":"11_meta_analysis/#heterogeneity-test","title":"Heterogeneity test","text":"

    Cochran's Q test and \\(I^2\\)

    \\[ Q = \\sum_{i=1}^{k} {w_i (\\beta_i - \\bar{\\beta})^2} \\] \\[ I_j^2 = {{Q_j - df_j}\\over{Q_j}}\\times 100% = {{Q - (k - 1)}\\over{Q}}\\times 100% \\]"},{"location":"11_meta_analysis/#metal","title":"METAL","text":"

    METAL is one of the most commonly used tools for GWA meta-analysis. Its official documentation can be found here. METAL supports two models: (1) Sample size based approach and (2) Inverse variance based approach.

    A minimal example of meta-analysis using the IVW method

    metal_script.txt
    # classical approach, uses effect size estimates and standard errors\nSCHEME STDERR  \n\n# === DESCRIBE AND PROCESS THE FIRST INPUT FILE ===\nMARKER SNP\nALLELE REF_ALLELE OTHER_ALLELE\nEFFECT BETA\nPVALUE PVALUE \nSTDERR SE \nPROCESS inputfile1.txt\n\n# === THE SECOND INPUT FILE HAS THE SAME FORMAT AND CAN BE PROCESSED IMMEDIATELY ===\nPROCESS inputfile2.txt\n\nANALYZE\n

    Then, just run the following command to execute the metal script.

    metal metal_script.txt\n
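    For intuition, the fixed-effect (IVW) estimate and the heterogeneity statistics defined above can be reproduced in a few lines of Python (a toy example with made-up per-study estimates for a single SNP):

    import numpy as np\n\n# made-up beta and SE for one SNP from k = 3 studies\nbeta = np.array([0.10, 0.14, 0.08])\nse = np.array([0.05, 0.04, 0.06])\n\nw = 1 / se**2                                  # inverse-variance weights\nbeta_meta = np.sum(w * beta) / np.sum(w)       # fixed-effect estimate\nse_meta = np.sqrt(1 / np.sum(w))\nq = np.sum(w * (beta - beta_meta)**2)          # Cochran's Q (df = k - 1)\ni2 = max(0, (q - (len(beta) - 1)) / q) * 100   # I^2 as a percentage\nprint(beta_meta, se_meta, q, i2)\n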
    "},{"location":"11_meta_analysis/#random-effects-meta-analysis","title":"Random effects meta-analysis","text":"

    On the other hand, random effects mean that we need to model the between-study variance, which is not zero in this case. Under the random effect model, we assume the true effect size for a certain SNP varies across studies.

    If heterogeneity of effects exists across studies, we need to model the between-study variance to correct for the deflation of variance in fixed-effect estimates.

    "},{"location":"11_meta_analysis/#gwama","title":"GWAMA","text":"

    Random effect model

    The random effect variance component can be estimated by:

    \\[ r_j^2 = max\\left(0, {{Q_j - (N_j -1)}\\over{\\sum_iw_{ij} - ({{\\sum_iw_{ij}^2} \\over {\\sum_iw_ {ij}}})}}\\right)\\]

    Then the effect size for SNP j can be obtained by:

    \\[ \\bar{\\beta_j}^* = {{\\sum_{i=1}^{k} {w_{ij}^* \\beta_i}}\\over{\\sum_{i=1}^{k} {w_{ij}^*}}} \\]

    The weights are estimated by:

    \\[w_{ij}^* = {{1}\\over{r_j^2 + Var(\\beta_{ij})}} \\]

    The random effect model was implemented in GWAMA, which is another very popular GWA meta-analysis tool. Its official documentation can be found here.

    A minimal example of random effect meta-analysis using GWAMA

    The input file for GWAMA contains the path to each sumstats file. Column names need to be standardized.

    GWAMA_script.in
    Pop1.txt\nPop2.txt\nPop3.txt\n
    GWAMA \\\n    -i GWAMA_script.in \\\n    --random \\\n    -o myresults\n
    "},{"location":"11_meta_analysis/#cross-ancestry-meta-analysis","title":"Cross-ancestry meta-analysis","text":""},{"location":"11_meta_analysis/#mantra","title":"MANTRA","text":"

    MANTRA (Meta-ANalysis of Transethnic Association studies) is one of the early efforts to address the heterogeneity for cross-ancestry meta-analysis.

    MANTRA implements a Bayesian partition model where GWASs are clustered into ancestry clusters based on a prior model of similarity between them. MANTRA then uses Markov chain Monte Carlo (MCMC) algorithms to approximate the posterior distribution of parameters (which might be quite computationally intensive). MANTRA has been shown to increase power and mapping resolution over random-effects meta-analysis across a range of heterogeneity scenarios.

    "},{"location":"11_meta_analysis/#mr-mega","title":"MR-MEGA","text":"

    MR-MEGA employs meta-regression to model the heterogeneity in effect sizes across ancestries. Its official documentation can be found here (The same first author as GWAMA).

    Meta-regression implemented in MR-MEGA

    It will first construct a matrix \(D\) of pairwise Euclidean distances between GWAS across autosomal variants. The elements of \(D\), \(d_{k'k}\), for a pair of studies can be expressed as follows. For each variant \(j\), let \(p_{kj}\) be the allele frequency of j in study k; then:

    \\[d_{k'k} = {{\\sum_jI_j(p_{kj}-p_{k'j})^2}\\over{\\sum_jI_j}}\\]

    Then multi-dimensional scaling (MDS) will be performed to derive T axes of genetic variation (\(x_k\) for study k).

    For each variant j, the effect size of the reference allele can be modeled in a linear regression model as :

    \\[E[\\beta_{kj}] = \\beta_j + \\sum_{t=1}^T\\beta_{tj}x_{kj}\\]

    A minimal example of meta-analysis using MR-MEGA

    The input file for MR-MEGA contains the path to each sumstats file. Column names need to be standardized, as for GWAMA.

    MRMEGA_script.in
    Pop1.txt.gz\nPop2.txt.gz\nPop3.txt.gz\nPop4.txt.gz\nPop5.txt.gz\nPop6.txt.gz\nPop7.txt.gz\nPop8.txt.gz\n
    MR-MEGA \\\n    -i MRMEGA_script.in \\\n    --pc 4 \\\n    -o myresults\n
    "},{"location":"11_meta_analysis/#global-biobank-meta-analysis-initiative-gbmi","title":"Global Biobank Meta-analysis Initiative (GBMI)","text":"

    As a recent success achieved by meta-analysis, GBMI is an example of how large-scale meta-analyses can improve our understanding of diseases.

    For more details, you can check here.

    "},{"location":"11_meta_analysis/#reference","title":"Reference","text":""},{"location":"12_fine_mapping/","title":"Fine-mapping","text":""},{"location":"12_fine_mapping/#introduction","title":"Introduction","text":"

    Fine-mapping: Fine-mapping aims to identify the causal variant(s) within a locus for a disease, given the evidence of significant association of the locus (or genomic region) in the GWAS of that disease.

    Fine-mapping using individual data is usually performed by fitting the multiple linear regression model:

    \\[y = Xb + e\\]

    Fine-mapping (using Bayesian methods) aims to estimate the PIP (posterior inclusion probability), which indicates the evidence for SNP j having a non-zero effect (namely, causal).

    PIP(Posterior Inclusion Probability)

    PIP is often calculated as the sum of the posterior probabilities over all models that include variant j as causal.

    \\[ PIP_j:=Pr(b_j\\neq0|X,y) \\]

    Bayesian methods and Posterior probability

    \\[ Pr(M_m | O) = {{Pr(O | M_m) Pr(M_m)}\\over{\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}} \\]

    \\(O\\) : Observed data

    \\(M\\) : Models (the configurations of causal variants in the context of fine-mapping).

    \\(Pr(M_m | O)\\): Posterior Probability of Model m

    \\(Pr(O | M_m)\\): Likelihood (the probability of observing your dataset given Model m is true.)

    \\(Pr(M_m)\\): Prior distribution of Model m (the probability of Model m being true)

    \\({\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}\\): Evidence (the probability of observing your dataset), namely \\(Pr(O)\\)

    Credible sets

    A credible set refers to the minimum set of variants that contains all causal SNPs with probability \\(\u03b1\\). (Under the single-causal-variant-per-locus assumption, the credible set is calculated by ranking variants based on their posterior probabilities, and then summing these until the cumulative sum is \\(>\u03b1\\)). We usually report 95% credible sets (\u03b1=95%) for fine-mapping analysis.
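
    A sketch of constructing a 95% credible set from per-variant posterior probabilities (toy values, under the single-causal-variant assumption):

    import numpy as np\n\n# toy posterior probabilities for the variants in a locus (sum to 1)\npip = np.array([0.62, 0.20, 0.10, 0.04, 0.03, 0.01])\norder = np.argsort(pip)[::-1]              # rank variants by posterior probability\ncum = np.cumsum(pip[order])\nn = np.searchsorted(cum, 0.95) + 1         # smallest set with cumulative sum > 0.95\nprint(order[:n], cum[n - 1])               # variants in the credible set and their total probability\n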

    Commonly used tools for fine-mapping

    Methods assuming only one causal variant in the locus

    Methods assuming multiple causal variants in the locus

    Methods assuming a small number of larger causal effects with a large number of infinitesimal effects

    Methods for Cross-ancestry fine-mapping

    You can check here for more information.

    In this tutorial, we will introduce SuSiE as an example. SuSiE stands for the \u201cSum of Single Effects\u201d model.

    The key idea behind SuSiE is :

    \\[b = \\sum_{l=1}^L b_l \\]

    where each vector \(b_l = (b_{l1}, \u2026, b_{lJ})^T\) is a so-called single-effect vector (a vector with only one non-zero element). L is the upper bound on the number of causal variants. This model can be fitted using Iterative Bayesian Stepwise Selection (IBSS).

    For fine-mapping with summary statistics using SuSiE (SuSiE-RSS), IBSS was modified (IBSS-ss) to take sufficient statistics (which can be computed from other combinations of summary statistics) as input. SuSiE will then approximate the sufficient statistics to run fine-mapping.

    Quote

    For details of SuSiE and SuSiE-RSS, please check : Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the \u201cSum of Single Effects\u201d model. PLoS Genetics, 18(7), e1010299. Link

    "},{"location":"12_fine_mapping/#file-preparation","title":"File Preparation","text":"

    Use Python to check the novel loci and extract the files.

    import gwaslab as gl\nimport pandas as pd\nimport numpy as np\n\nsumstats = gl.Sumstats(\"../06_Association_tests/1kgeas.B1.glm.firth\",fmt=\"plink2\")\n...\n\nsumstats.basic_check()\n...\n\nsumstats.get_lead()\n\nFri Jan 13 23:31:43 2023 Start to extract lead variants...\nFri Jan 13 23:31:43 2023  -Processing 1122285 variants...\nFri Jan 13 23:31:43 2023  -Significance threshold : 5e-08\nFri Jan 13 23:31:43 2023  -Sliding window size: 500  kb\nFri Jan 13 23:31:44 2023  -Found 59 significant variants in total...\nFri Jan 13 23:31:44 2023  -Identified 3 lead variants!\nFri Jan 13 23:31:44 2023 Finished extracting lead variants successfully!\n\nSNPID CHR POS EA  NEA SE  Z P OR  N STATUS\n110723  2:55574452:G:C  2 55574452  C G 0.160948  -5.98392  2.178320e-09  0.381707  503 9960099\n424615  6:29919659:T:C  6 29919659  T C 0.155457  -5.89341  3.782970e-09  0.400048  503 9960099\n635128  9:36660672:A:G  9 36660672  G A 0.160275  5.63422 1.758540e-08  2.467060  503 9960099\n
    We will perform fine-mapping for the first significant locus, whose lead variant is 2:55574452:G:C.

    # filter the variants in this locus\n\nlocus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452')\nlocus.fill_data(to_fill=[\"BETA\"])\nlocus.harmonize(basic_check=False, ref_seq=\"/Users/he/mydata/Reference/Genome/human_g1k_v37.fasta\")\nlocus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\n

    check in terminal:

    head sig_locus.tsv\nSNPID   CHR     POS     EA      NEA     BETA    SE      Z       P       OR      N       STATUS\n2:54535206:C:T  2       54535206        T       C       0.30028978      0.142461        2.10786 0.0350429       1.35025 503     9960099\n2:54536167:C:G  2       54536167        G       C       0.14885099      0.246871        0.602952        0.546541        1.1605  503     9960099\n2:54539096:A:G  2       54539096        G       A       -0.0038474211   0.288489        -0.0133355      0.98936 0.99616 503     9960099\n2:54540264:G:A  2       54540264        A       G       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540614:G:T  2       54540614        T       G       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540621:A:G  2       54540621        G       A       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540970:T:C  2       54540970        C       T       -0.049506452    0.149053        -0.332144       0.739781        0.951699        503     9960099\n2:54544229:T:C  2       54544229        C       T       -0.14338203     0.151172        -0.948468       0.342891        0.866423        503     9960099\n2:54545593:T:C  2       54545593        C       T       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n\nhead  sig_locus.snplist\n2:54535206:C:T\n2:54536167:C:G\n2:54539096:A:G\n2:54540264:G:A\n2:54540614:G:T\n2:54540621:A:G\n2:54540970:T:C\n2:54544229:T:C\n2:54545593:T:C\n2:54546032:C:G\n

    "},{"location":"12_fine_mapping/#ld-matrix-calculation","title":"LD Matrix Calculation","text":"

    Example

    #!/bin/bash\n\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\n\n# LD r matrix\nplink \\\n  --bfile ${plinkFile} \\\n  --keep-allele-order \\\n  --r square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt\n\n# LD r2 matrix\nplink \\\n  --bfile ${plinkFile} \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt_r2\n
    Take a look at the LD matrix (first 5 rows and columns)

    head -5 sig_locus_mt.ld | cut -f 1-5\n1       -0.145634       0.252616        -0.0876317      -0.0876317\n-0.145634       1       -0.0916734      -0.159635       -0.159635\n0.252616        -0.0916734      1       0.452333        0.452333\n-0.0876317      -0.159635       0.452333        1       1\n-0.0876317      -0.159635       0.452333        1       1\n\nhead -5 sig_locus_mt_r2.ld | cut -f 1-5\n1       0.0212091       0.0638148       0.00767931      0.00767931\n0.0212091       1       0.00840401      0.0254833       0.0254833\n0.0638148       0.00840401      1       0.204605        0.204605\n0.00767931      0.0254833       0.204605        1       1\n0.00767931      0.0254833       0.204605        1       1\n
    Heatmap of the LD matrix:

    "},{"location":"12_fine_mapping/#fine-mapping-with-summary-statistics-using-susier","title":"Fine-mapping with summary statistics using SusieR","text":"

    Note

    install.packages(\"susieR\")\n\n# Fine-mapping with summary statistics\nfitted_rss2 = susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, R = R, n = n, L = 10)\n

    R: a p x p LD r matrix. n: sample size. bhat: alternative summary data giving the estimated effects (a vector of length p); this, together with shat, may be provided instead of z. shat: alternative summary data giving the standard errors of the estimated effects (a vector of length p); this, together with bhat, may be provided instead of z. L: maximum number of non-zero effects in the susie regression model (default: L = 10).

    Quote

    For details, please check SusieR tutorial - Fine-mapping with susieR using summary statistics

    Use susieR in jupyter notebook (with Python):

    Please check : https://github.com/Cloufield/GWASTutorial/blob/main/12_fine_mapping/finemapping_susie.ipynb

    "},{"location":"12_fine_mapping/#reference","title":"Reference","text":""},{"location":"13_heritability/","title":"Heritability","text":"

    Heritability is a term used in genetics to describe how much phenotypic variation can be explained by genetic variation.

    For any phenotype, its variation \\(Var(P)\\) can be modeled as the combination of genetic effects \\(Var(G)\\) and environmental effects \\(Var(E)\\).

    \\[ Var(P) = Var(G) + Var(E) \\]"},{"location":"13_heritability/#broad-sense-heritability","title":"Broad-sense Heritability","text":"

    The broad-sense heritability \(H^2_{broad-sense}\) is mathematically defined as:

    \\[ H^2_{broad-sense} = {Var(G)\\over{Var(P)}} \\]"},{"location":"13_heritability/#narrow-sense-heritability","title":"Narrow-sense Heritability","text":"

    Genetic effects \\(Var(G)\\) is composed of multiple effects including additive effects \\(Var(A)\\), dominant effects, recessive effects, epistatic effects and so forth.

    Narrow-sense heritability is defined as:

    \\[ h^2_{narrow-sense} = {Var(A)\\over{Var(P)}} \\]"},{"location":"13_heritability/#snp-heritability","title":"SNP Heritability","text":"

    SNP heritability \\(h^2_{SNP}\\) : the proportion of phenotypic variance explained by tested SNPs in a GWAS.

    Common methods to estimate SNP heritability include:

    "},{"location":"13_heritability/#liability-and-threshold-model","title":"Liability and Threshold model","text":""},{"location":"13_heritability/#observed-scale-heritability-and-liability-scaled-heritability","title":"Observed-scale heritability and liability-scaled heritability","text":"

    Issue for binary traits :

    The scale issue for binary traits

    Conversion formula (Equation 23 from Lee et al. 2011):

    \\[ h^2_{liability-scale} = h^2_{observed-scale} * {{K(1-K)}\\over{Z^2}} * {{K(1-K)}\\over{P(1-P)}} \\] "},{"location":"13_heritability/#further-reading","title":"Further Reading","text":""},{"location":"14_gcta_greml/","title":"SNP-Heritability estimation by GCTA-GREML","text":""},{"location":"14_gcta_greml/#introduction","title":"Introduction","text":"

    The basic model behind GCTA-GREML is the linear mixed model (LMM):

    \\[y = X\\beta + Wu + e\\] \\[ Var(y) = V = WW^{'}\\delta^2_u + I \\delta^2_e\\]

    GCTA defines \\(A = WW^{'}/N\\) and \\(\\delta^2_g\\) as the variance explained by SNPs.

    So the original model can be written as:

    \\[y = X\\beta + g + e\\] \\[ Var(y) = V = A\\delta^2_g + I \\delta^2_e\\]

    Quote

    For details, please check Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82. link.

    "},{"location":"14_gcta_greml/#donwload","title":"Donwload","text":"

    Download the version of GCTA for your system from : https://yanglab.westlake.edu.cn/software/gcta/#Download

    Example

    wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip\nunzip gcta-1.94.1-linux-kernel-3-x86_64.zip\ncd gcta-1.94.1-linux-kernel-3-x86_64\n\n./gcta-1.94.1\n*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 12:22:19 JST on Sun Jan 15 2023.\nHostname: Home-Desktop\n\nError: no analysis has been launched by the option(s)\nPlease see online documentation at https://yanglab.westlake.edu.cn/software/gcta/\n

    Tip

    Add GCTA to your environment

    "},{"location":"14_gcta_greml/#make-grm","title":"Make GRM","text":"
    #!/bin/bash\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\ngcta \\\n  --bfile ${plinkFile} \\\n  --autosome \\\n  --maf 0.01 \\\n  --make-grm \\\n  --out 1kg_eas\n
    *******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:21:24 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nOptions:\n\n--bfile ../04_Data_QC/sample_data.clean\n--autosome\n--maf 0.01\n--make-grm\n--out 1kg_eas\n\nNote: GRM is computed using the SNPs on the autosomes.\nReading PLINK FAM file from [../04_Data_QC/sample_data.clean.fam]...\n500 individuals to be included from FAM file.\n500 individuals to be included. 0 males, 0 females, 500 unknown.\nReading PLINK BIM file from [../04_Data_QC/sample_data.clean.bim]...\n1224104 SNPs to be included from BIM file(s).\nThreshold to filter variants: MAF > 0.010000.\nComputing the genetic relationship matrix (GRM) v2 ...\nSubset 1/1, no. subject 1-500\n  500 samples, 1224104 markers, 125250 GRM elements\nIDs for the GRM file have been saved in the file [1kg_eas.grm.id]\nComputing GRM...\n  100% finished in 7.4 sec\n1224104 SNPs have been processed.\n  Used 1128732 valid SNPs.\nThe GRM computation is completed.\nSaving GRM...\nGRM has been saved in the file [1kg_eas.grm.bin]\nNumber of SNPs in each pair of individuals has been saved in the file [1kg_eas.grm.N.bin]\n\nAnalysis finished at 17:21:32 JST on Tue Dec 26 2023\nOverall computational time: 8.51 sec.\n
    "},{"location":"14_gcta_greml/#estimation","title":"Estimation","text":"
    #!/bin/bash\n\n# the GRM we calculated in step 1\nGRM=1kg_eas\n\n# phenotype file\nphenotypeFile=../01_Dataset/1kgeas_binary_gcta.txt\n\n# disease prevalence used for conversion to liability-scale heritability\nprevalence=0.5\n\n# use 5 PCs as covariates\nawk '{print $1,$2,$5,$6,$7,$8,$9}' ../05_PCA/plink_results_projected.sscore > 5PCs.txt\n\ngcta \\\n  --grm ${GRM} \\\n  --pheno ${phenotypeFile} \\\n  --prevalence ${prevalence} \\\n  --qcovar 5PCs.txt \\\n  --reml \\\n  --out 1kg_eas\n
    "},{"location":"14_gcta_greml/#results","title":"Results","text":"

    Warning

    This is just to show the analysis pipeline. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the result is meaningless here.

    For real analysis, you need a larger sample size to get robust estimation. Please see the GCTA FAQ

    *******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:36:37 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nAccepted options:\n--grm 1kg_eas\n--pheno ../01_Dataset/1kgeas_binary_gcta.txt\n--prevalence 0.5\n--qcovar 5PCs.txt\n--reml\n--out 1kg_eas\n\nNote: This is a multi-thread program. You could specify the number of threads by the --thread-num option to speed up the computation if there are multiple processors in your machine.\n\nReading IDs of the GRM from [1kg_eas.grm.id].\n500 IDs are read from [1kg_eas.grm.id].\nReading the GRM from [1kg_eas.grm.bin].\nGRM for 500 individuals are included from [1kg_eas.grm.bin].\nReading phenotypes from [../01_Dataset/1kgeas_binary_gcta.txt].\nNon-missing phenotypes of 503 individuals are included from [../01_Dataset/1kgeas_binary_gcta.txt].\nReading quantitative covariate(s) from [5PCs.txt].\n5 quantitative covariate(s) of 501 individuals are included from [5PCs.txt].\nAssuming a disease phenotype for a case-control study: 248 cases and 250 controls\n5 quantitative variable(s) included as covariate(s).\n498 individuals are in common in these files.\n\nPerforming  REML analysis ... (Note: may take hours depending on sample size).\n498 observations, 6 fixed effect(s), and 2 variance component(s)(including residual variance).\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values:  0.12498 0.124846\nlogL: 95.34\nRunning AI-REML algorithm ...\nIter.   logL    V(G)    V(e)\n1       95.34   0.14264 0.10708\n2       95.37   0.18079 0.06875\n3       95.40   0.18071 0.06888\n4       95.40   0.18071 0.06888\nLog-likelihood ratio converged.\n\nCalculating the logLikelihood for the reduced model ...\n(variance component 1 is dropped from the model)\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.24901\nlogL: 94.78319\nRunning AI-REML algorithm ...\nIter.   logL    V(e)\n1       94.79   0.24900\n2       94.79   0.24899\nLog-likelihood ratio converged.\n\nSummary result of REML analysis:\nSource  Variance        SE\nV(G)    0.180708        0.164863\nV(e)    0.068882        0.162848\nVp      0.249590        0.016001\nV(G)/Vp 0.724021        0.654075\nThe estimate of variance explained on the observed scale is transformed to that on the underlying liability scale:\n(Proportion of cases in the sample = 0.497992; User-specified disease prevalence = 0.500000)\nV(G)/Vp_L       1.137308        1.027434\n\nSampling variance/covariance of the estimates of variance components:\n2.717990e-02    -2.672171e-02\n-2.672171e-02   2.651955e-02\n\nSummary result of REML analysis has been saved in the file [1kg_eas.hsq].\n\nAnalysis finished at 17:36:38 JST on Tue Dec 26 2023\nOverall computational time: 0.08 sec.\n
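    As a sanity check, the liability-scale estimate in the log can be reproduced from the observed-scale estimate using the conversion formula (Equation 23 from Lee et al. 2011) introduced in the Heritability section:

    from scipy.stats import norm\n\nh2_obs = 0.724021          # V(G)/Vp on the observed scale (from the log)\nK = 0.5                    # user-specified disease prevalence\nP = 0.497992               # proportion of cases in the sample\nz = norm.pdf(norm.isf(K))  # standard normal density at the liability threshold\nh2_liab = h2_obs * K * (1 - K) / z**2 * K * (1 - K) / (P * (1 - P))\nprint(round(h2_liab, 4))   # ~1.1373, matching V(G)/Vp_L up to rounding\n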
    "},{"location":"14_gcta_greml/#reference","title":"Reference","text":""},{"location":"15_winners_curse/","title":"Winner's curse","text":""},{"location":"15_winners_curse/#winners-curse-definition","title":"Winner's curse definition","text":"

    Winner's curse refers to the phenomenon that genetic effects are systematically overestimated by the thresholding or selection process in genetic association studies.

    Winner's curse in auctions

    This term was initially used to describe a phenomenon that occurs in auctions. The winning bid is very likely to overestimate the intrinsic value of an item even if all the bids are unbiased (the auctioned item is of equal value to all bidders). The thresholding process in GWAS resembles auctions, where the lead variants are the winning bids.

    Reference:

    "},{"location":"15_winners_curse/#wc-correction","title":"WC correction","text":"

    The asymptotic distribution of \\(\\beta_{Observed}\\) is:

    \\[\\beta_{Observed} \\sim N(\\beta_{True},\\sigma^2)\\]

    An example of distribution of \\(\\beta_{Observed}\\)

    It is equivalent to:

    \\[{{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}} \\sim N(0,1)\\]

    An example of distribution of \\({{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}}\\)

    We can obtain the asymptotic sampling distribution (which is a truncated normal distribution) for \\(\\beta_{Observed}\\) by:

    \\[f(x,\\beta_{True}) ={{1}\\over{\\sigma}} {{\\phi({{{x - \\beta_{True}}\\over{\\sigma}}})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]

    when

    \\[|{{x}\\over{\\sigma}}|\\geq c\\]

    From the asymptotic sampling distribution, the expectation of effect sizes for the selected variants can then be approximated by:

    \\[ E(\\beta_{Observed}; \\beta_{True}) = \\beta_{True} + \\sigma {{\\phi({{{\\beta_{True}}\\over{\\sigma}}-c}) - \\phi({{{-\\beta_{True}}\\over{\\sigma}}-c})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]

    Derivation of this equation can be found in the Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.
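
    A numerical illustration of this formula (a minimal sketch; the true effect and standard error here are arbitrary):

    from scipy.stats import norm\n\nbeta_true, sigma = 0.05, 0.02       # hypothetical true effect and its standard error\nc = norm.isf(5e-8 / 2)              # z threshold corresponding to P < 5e-8 (~5.45)\n\nnum = norm.pdf(beta_true/sigma - c) - norm.pdf(-beta_true/sigma - c)\nden = norm.cdf(beta_true/sigma - c) + norm.cdf(-beta_true/sigma - c)\nprint(beta_true + sigma * num / den)  # ~0.115: much larger than the true 0.05\n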

    Reference:

    Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

    "},{"location":"16_mendelian_randomization/","title":"Mendelian randomization","text":""},{"location":"16_mendelian_randomization/#mendelian-randomization-introduction","title":"Mendelian randomization introduction","text":"

    Comparison between RCT and MR

    "},{"location":"16_mendelian_randomization/#fundamental-assumption-gene-environment-equivalence","title":"Fundamental assumption: gene-environment equivalence","text":"

    (cited from George Davey Smith Mendelian Randomization - 25th April 2024)

    The fundamental assumption of Mendelian randomization (MR) is gene-environment equivalence. MR reflects the phenocopy/genocopy dialectic (Goldschmidt, Schmalhausen). The idea here is that all environmental effects can be mimicked by one or several mutations (Zuckerkandl and Villet, PNAS 1988).

    Gene-environment equivalence

    If we consider BMI as the outcome, let's think about whether genetic variants related to the following exposures meet the gene-environment equivalence assumption:

    "},{"location":"16_mendelian_randomization/#methods-instrumental-variables-iv","title":"Methods: Instrumental Variables (IV)","text":"

    An instrumental variable (IV) can be defined as a variable that is correlated with the exposure X and uncorrelated with the error \(\epsilon\) in the following regression:

    \\[ Y = X\\beta + \\epsilon \\]

    "},{"location":"16_mendelian_randomization/#iv-assumptions","title":"IV Assumptions","text":"

    Key Assumptions

    | Assumption | Description |
    |---|---|
    | Relevance | Instrumental variables are strongly associated with the exposure. (IVs are not independent of X) |
    | Exclusion restriction | Instrumental variables do not affect the outcome except through the exposure. (IV is independent of Y, conditional on X and C) |
    | Independence | There are no confounders of the instrumental variables and the outcome. (IV is independent of C) |
    | Monotonicity | Variants affect the exposure in the same direction for all individuals. |
    | No assortative mating | Assortative mating might cause bias in MR. |
    "},{"location":"16_mendelian_randomization/#two-stage-least-squares-2sls","title":"Two-stage least-squares (2SLS)","text":"\[ X = \mu_1 + \beta_{IV} IV + \epsilon_1 \] \[ Y = \mu_2 + \beta_{2SLS} \hat{X} + \epsilon_2 \]"},{"location":"16_mendelian_randomization/#two-sample-mr","title":"Two-sample MR","text":"

    Two-sample MR refers to the approach in which the genetic effects of the instruments on the exposure are estimated in an independent sample other than the one used to estimate the effects of the instruments on the outcome. As more and more GWAS summary statistics become publicly available, the scope of MR also expands with two-sample MR methods.

    \\[ \\hat{\\beta}_{X,Y} = {{\\hat{\\beta}_{IV,Y}}\\over{\\hat{\\beta}_{IV,X}}} \\]

    Caveats

    For two-sample MR, there is an additional key assumption:

    The two samples used for MR are from the same underlying populations. (The effect size of instruments on exposure should be the same in both samples.)

    Therefore, for two-sample MR, we usually use datasets from similar non-overlapping populations in terms of not only ancestry but also contextual factors.

    "},{"location":"16_mendelian_randomization/#iv-selection","title":"IV selection","text":"

    One of the first things to do when you plan to perform any type of MR is to check the associations of instrumental variables with the exposure to avoid bias caused by weak IVs.

    The most commonly used method here is the F-statistic, which tests the association of instrumental variables with the exposure.
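
    A common approximation (a sketch; the inputs below are illustrative) computes F from the variance in the exposure explained by the instruments; F > 10 is a widely used rule of thumb for sufficiently strong instruments:

    # approximate F-statistic for instrument strength\n# r2: variance in the exposure explained by the k instruments; n: sample size\ndef f_statistic(r2, n, k):\n    return (n - k - 1) / k * r2 / (1 - r2)\n\nprint(f_statistic(r2=0.02, n=70000, k=28))  # ~51, well above the rule-of-thumb threshold of 10\n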

    "},{"location":"16_mendelian_randomization/#practice","title":"Practice","text":"

    In this tutorial, we will walk you through how to perform a minimal TwoSampleMR analysis. We will use the R package TwoSampleMR, which provides easy-to-use functions for formatting, clumping and harmonizing GWAS summary statistics.

    This package integrates a variety of commonly used MR methods for analysis, including:

    > mr_method_list()\n                             obj\n1                  mr_wald_ratio\n2               mr_two_sample_ml\n3            mr_egger_regression\n4  mr_egger_regression_bootstrap\n5               mr_simple_median\n6             mr_weighted_median\n7   mr_penalised_weighted_median\n8                         mr_ivw\n9                  mr_ivw_radial\n10                    mr_ivw_mre\n11                     mr_ivw_fe\n12                mr_simple_mode\n13              mr_weighted_mode\n14         mr_weighted_mode_nome\n15           mr_simple_mode_nome\n16                       mr_raps\n17                       mr_sign\n18                        mr_uwr\n\n                                                        name PubmedID\n1                                                 Wald ratio\n2                                         Maximum likelihood\n3                                                   MR Egger 26050253\n4                                       MR Egger (bootstrap) 26050253\n5                                              Simple median\n6                                            Weighted median\n7                                  Penalised weighted median\n8                                  Inverse variance weighted\n9                                                 IVW radial\n10 Inverse variance weighted (multiplicative random effects)\n11                 Inverse variance weighted (fixed effects)\n12                                               Simple mode\n13                                             Weighted mode\n14                                      Weighted mode (NOME)\n15                                        Simple mode (NOME)\n16                      Robust adjusted profile score (RAPS)\n17                                     Sign concordance test\n18                                     Unweighted regression\n

    "},{"location":"16_mendelian_randomization/#inverse-variance-weighted-fixed-effects","title":"Inverse variance weighted (fixed effects)","text":"

    Assumption: the underlying 'true' effect is fixed across variants

    Weight for the effect of the i-th variant:

    \\[W_i = {1 \\over Var(\\beta_i)}\\]

    Effect size:

    \\[\\beta = {{\\sum_{i=1}^N{w_i \\beta_i}}\\over{\\sum_{i=1}^Nw_i}}\\]

    SE:

    \\[SE = {\\sqrt{{1}\\over{\\sum_{i=1}^Nw_i}}}\\]"},{"location":"16_mendelian_randomization/#file-preparation","title":"File Preparation","text":"

    To perform two-sample MR analysis, we need summary statistics for exposure and outcome generated from independent populations with the same ancestry.

    In this tutorial, we will use sumstats from Biobank Japan pheweb and KoGES pheweb.

    "},{"location":"16_mendelian_randomization/#r-package-twosamplemr","title":"R package TwoSampleMR","text":"

    First, to use TwoSampleMR, we need R >= 4.1. To install the package, run:

    library(remotes)\ninstall_github(\"MRCIEU/TwoSampleMR\")\n
    "},{"location":"16_mendelian_randomization/#loading-package","title":"Loading package","text":"
    library(TwoSampleMR)\n
    "},{"location":"16_mendelian_randomization/#reading-exposure-sumstats","title":"Reading exposure sumstats","text":"
    # format exposure dataset\n\nlibrary(data.table)  # provides fread\nexp_raw <- fread(\"koges_bmi.txt.gz\")\n
    "},{"location":"16_mendelian_randomization/#extracting-instrumental-variables","title":"Extracting instrumental variables","text":"
    # select only significant variants\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_dat <- format_data( exp_raw,\n    type = \"exposure\",\n    snp_col = \"rsids\",\n    beta_col = \"beta\",\n    se_col = \"sebeta\",\n    effect_allele_col = \"alt\",\n    other_allele_col = \"ref\",\n    eaf_col = \"af\",\n    pval_col = \"pval\"\n)\n
    "},{"location":"16_mendelian_randomization/#clumping-exposure-variables","title":"Clumping exposure variables","text":"
    clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\") \n
    "},{"location":"16_mendelian_randomization/#outcome","title":"outcome","text":"
    out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n                    select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\"))\nout_dat <- format_data( out_raw,\n    type = \"outcome\",\n    snp_col = \"SNPID\",\n    beta_col = \"BETA\",\n    se_col = \"SE\",\n    effect_allele_col = \"Allele2\",\n    other_allele_col = \"Allele1\",\n    pval_col = \"p.value\",\n)\n
    "},{"location":"16_mendelian_randomization/#harmonizing-data","title":"Harmonizing data","text":"
    harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
    "},{"location":"16_mendelian_randomization/#perform-mr-analysis","title":"Perform MR analysis","text":"
    res <- mr(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    method  nsnp    b   se  pval\n<chr>   <chr>   <chr>   <chr>   <chr>   <int>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    MR Egger    28  1.3337580   0.69485260  6.596064e-02\n9J8pv4  IyUv6b  outcome exposure    Weighted median 28  0.6298980   0.09401352  2.083081e-11\n9J8pv4  IyUv6b  outcome exposure    Inverse variance weighted   28  0.5598956   0.23225806  1.592361e-02\n9J8pv4  IyUv6b  outcome exposure    Simple mode 28  0.6097842   0.15180476  4.232158e-04\n9J8pv4  IyUv6b  outcome exposure    Weighted mode   28  0.5946778   0.12820220  8.044488e-05\n
    "},{"location":"16_mendelian_randomization/#sensitivity-analysis","title":"Sensitivity analysis","text":""},{"location":"16_mendelian_randomization/#heterogeneity","title":"Heterogeneity","text":"

    Test if there is heterogeneity among the causal effects of x on y estimated from each variant.

    mr_heterogeneity(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    method  Q   Q_df    Q_pval\n<chr>   <chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    MR Egger    670.7022    26  1.000684e-124\n9J8pv4  IyUv6b  outcome exposure    Inverse variance weighted   706.6579    27  1.534239e-131\n
    "},{"location":"16_mendelian_randomization/#horizontal-pleiotropy","title":"Horizontal Pleiotropy","text":"

    Intercept in MR-Egger

    mr_pleiotropy_test(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    egger_intercept se  pval\n<chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    -0.03603697 0.0305241   0.2484472\n
    "},{"location":"16_mendelian_randomization/#single-snp-mr-and-leave-one-out-mr","title":"Single SNP MR and leave-one-out MR","text":"

    Single SNP MR

    res_single <- mr_singlesnp(harmonized_data)\nres_single\n\nexposure    outcome id.exposure id.outcome  samplesize  SNP b   se  p\n<chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <dbl>   <dbl>   <dbl>\n1   exposure    outcome 9J8pv4  IyUv6b  NA  rs10198356  0.6323140   0.2082837   2.398742e-03\n2   exposure    outcome 9J8pv4  IyUv6b  NA  rs10209994  0.9477808   0.3225814   3.302164e-03\n3   exposure    outcome 9J8pv4  IyUv6b  NA  rs10824329  0.6281765   0.3246214   5.297739e-02\n4   exposure    outcome 9J8pv4  IyUv6b  NA  rs10938397  1.2376316   0.2775854   8.251150e-06\n5   exposure    outcome 9J8pv4  IyUv6b  NA  rs11066132  0.6024303   0.2232401   6.963693e-03\n6   exposure    outcome 9J8pv4  IyUv6b  NA  rs12522139  0.2905201   0.2890240   3.148119e-01\n7   exposure    outcome 9J8pv4  IyUv6b  NA  rs12591730  0.8930490   0.3076687   3.700413e-03\n8   exposure    outcome 9J8pv4  IyUv6b  NA  rs13013021  1.4867889   0.2207777   1.646925e-11\n9   exposure    outcome 9J8pv4  IyUv6b  NA  rs1955337   0.5442640   0.2994146   6.910079e-02\n10  exposure    outcome 9J8pv4  IyUv6b  NA  rs2076308   1.1176226   0.2657969   2.613132e-05\n11  exposure    outcome 9J8pv4  IyUv6b  NA  rs2278557   0.6238587   0.2968184   3.556906e-02\n12  exposure    outcome 9J8pv4  IyUv6b  NA  rs2304608   1.5054682   0.2968905   3.961740e-07\n13  exposure    outcome 9J8pv4  IyUv6b  NA  rs2531995   1.3972908   0.3130157   8.045689e-06\n14  exposure    outcome 9J8pv4  IyUv6b  NA  rs261967    1.5303384   0.2921192   1.616714e-07\n15  exposure    outcome 9J8pv4  IyUv6b  NA  rs35332469  -0.2307314  0.3479219   5.072217e-01\n16  exposure    outcome 9J8pv4  IyUv6b  NA  rs35560038  -1.5730870  0.2018968   6.619637e-15\n17  exposure    outcome 9J8pv4  IyUv6b  NA  rs3755804   0.5314915   0.2325073   2.225933e-02\n18  exposure    outcome 9J8pv4  IyUv6b  NA  rs4470425   0.6948046   0.3079944   2.407689e-02\n19  exposure    outcome 9J8pv4  IyUv6b  NA  rs476828    1.1739083   0.1568550   7.207355e-14\n20  exposure    outcome 9J8pv4  IyUv6b  NA  rs4883723   0.5479721   0.2855004   5.494141e-02\n21  exposure    outcome 9J8pv4  IyUv6b  NA  rs509325    0.5491040   0.1598196   5.908641e-04\n22  exposure    outcome 9J8pv4  IyUv6b  NA  rs55872725  1.3501891   0.1259791   8.419325e-27\n23  exposure    outcome 9J8pv4  IyUv6b  NA  rs6089309   0.5657525   0.3347009   9.096620e-02\n24  exposure    outcome 9J8pv4  IyUv6b  NA  rs6265  0.6457693   0.1901871   6.851804e-04\n25  exposure    outcome 9J8pv4  IyUv6b  NA  rs6736712   0.5606962   0.3448784   1.039966e-01\n26  exposure    outcome 9J8pv4  IyUv6b  NA  rs7560832   0.6032080   0.2904972   3.785077e-02\n27  exposure    outcome 9J8pv4  IyUv6b  NA  rs825486    -0.6152759  0.3500334   7.878772e-02\n28  exposure    outcome 9J8pv4  IyUv6b  NA  rs9348441   -4.9786332  0.2572782   1.992909e-83\n29  exposure    outcome 9J8pv4  IyUv6b  NA  All - Inverse variance weighted 0.5598956   0.2322581   1.592361e-02\n30  exposure    outcome 9J8pv4  IyUv6b  NA  All - MR Egger  1.3337580   0.6948526   6.596064e-02\n

Leave-one-out MR: the IVW estimate is re-computed after excluding each SNP in turn. A large shift flags an influential variant; here, removing rs9348441 moves the estimate from 0.56 to 0.74.

    res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n\nexposure    outcome id.exposure id.outcome  samplesize  SNP b   se  p\n<chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <dbl>   <dbl>   <dbl>\n1   exposure    outcome 9J8pv4  IyUv6b  NA  rs10198356  0.5562834   0.2424917   2.178871e-02\n2   exposure    outcome 9J8pv4  IyUv6b  NA  rs10209994  0.5520576   0.2388122   2.079526e-02\n3   exposure    outcome 9J8pv4  IyUv6b  NA  rs10824329  0.5585335   0.2390239   1.945341e-02\n4   exposure    outcome 9J8pv4  IyUv6b  NA  rs10938397  0.5412688   0.2388709   2.345460e-02\n5   exposure    outcome 9J8pv4  IyUv6b  NA  rs11066132  0.5580606   0.2417275   2.096381e-02\n6   exposure    outcome 9J8pv4  IyUv6b  NA  rs12522139  0.5667102   0.2395064   1.797373e-02\n7   exposure    outcome 9J8pv4  IyUv6b  NA  rs12591730  0.5524802   0.2390990   2.085075e-02\n8   exposure    outcome 9J8pv4  IyUv6b  NA  rs13013021  0.5189715   0.2386808   2.968017e-02\n9   exposure    outcome 9J8pv4  IyUv6b  NA  rs1955337   0.5602635   0.2394505   1.929468e-02\n10  exposure    outcome 9J8pv4  IyUv6b  NA  rs2076308   0.5431355   0.2394403   2.330758e-02\n11  exposure    outcome 9J8pv4  IyUv6b  NA  rs2278557   0.5583634   0.2394924   1.972992e-02\n12  exposure    outcome 9J8pv4  IyUv6b  NA  rs2304608   0.5372557   0.2377325   2.382639e-02\n13  exposure    outcome 9J8pv4  IyUv6b  NA  rs2531995   0.5419016   0.2379712   2.277590e-02\n14  exposure    outcome 9J8pv4  IyUv6b  NA  rs261967    0.5358761   0.2376686   2.415093e-02\n15  exposure    outcome 9J8pv4  IyUv6b  NA  rs35332469  0.5735907   0.2378345   1.587739e-02\n16  exposure    outcome 9J8pv4  IyUv6b  NA  rs35560038  0.6734906   0.2217804   2.391474e-03\n17  exposure    outcome 9J8pv4  IyUv6b  NA  rs3755804   0.5610215   0.2413249   2.008503e-02\n18  exposure    outcome 9J8pv4  IyUv6b  NA  rs4470425   0.5568993   0.2392632   1.993549e-02\n19  exposure    outcome 9J8pv4  IyUv6b  NA  rs476828    0.5037555   0.2443224   3.922224e-02\n20  exposure    outcome 9J8pv4  IyUv6b  NA  rs4883723   0.5602050   0.2397325   1.945000e-02\n21  exposure    outcome 9J8pv4  IyUv6b  NA  rs509325    0.5608429   0.2468506   2.308693e-02\n22  exposure    outcome 9J8pv4  IyUv6b  NA  rs55872725  0.4419446   0.2454771   7.180543e-02\n23  exposure    outcome 9J8pv4  IyUv6b  NA  rs6089309   0.5597859   0.2388902   1.911519e-02\n24  exposure    outcome 9J8pv4  IyUv6b  NA  rs6265  0.5547068   0.2436910   2.282978e-02\n25  exposure    outcome 9J8pv4  IyUv6b  NA  rs6736712   0.5598815   0.2387602   1.902944e-02\n26  exposure    outcome 9J8pv4  IyUv6b  NA  rs7560832   0.5588113   0.2396229   1.969836e-02\n27  exposure    outcome 9J8pv4  IyUv6b  NA  rs825486    0.5800026   0.2367545   1.429330e-02\n28  exposure    outcome 9J8pv4  IyUv6b  NA  rs9348441   0.7378967   0.1366838   6.717515e-08\n29  exposure    outcome 9J8pv4  IyUv6b  NA  All 0.5598956   0.2322581   1.592361e-02\n
    "},{"location":"16_mendelian_randomization/#visualization","title":"Visualization","text":""},{"location":"16_mendelian_randomization/#scatter-plot","title":"Scatter plot","text":"
    res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
    "},{"location":"16_mendelian_randomization/#single-snp","title":"Single SNP","text":"
    res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
    "},{"location":"16_mendelian_randomization/#leave-one-out","title":"Leave-one-out","text":"
    res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
    "},{"location":"16_mendelian_randomization/#funnel-plot","title":"Funnel plot","text":"
    res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
    "},{"location":"16_mendelian_randomization/#mr-steiger-directionality-test","title":"MR Steiger directionality test","text":"

The MR Steiger directionality test is a method for testing the causal direction between the exposure and the outcome.

Steiger test: tests whether the SNP-exposure correlation is greater than the SNP-outcome correlation; if it is, the assumed direction (exposure -> outcome) is supported.

# estimate r for the binary outcome from log(OR)\n# get_r_from_lor arguments: log(OR), effect allele frequency, N cases, N controls, prevalence\nharmonized_data$\"r.outcome\" <- get_r_from_lor(\n  harmonized_data$\"beta.outcome\",  # log(OR) per SNP\n  harmonized_data$\"eaf.outcome\",   # effect allele frequency\n  45383,                           # number of cases\n  132032,                          # number of controls\n  0.26,                            # assumed population prevalence\n  model = \"logit\",\n  correction = FALSE\n)\n\nout <- directionality_test(harmonized_data)\nout\n\nid.exposure id.outcome  exposure    outcome snp_r2.exposure snp_r2.outcome  correct_causal_direction    steiger_pval\n<chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <lgl>   <dbl>\nrvi6Om  ETcv15  BMI T2D 0.02125453  0.005496427 TRUE    NA\n

    Reference: Hemani, G., Tilling, K., & Davey Smith, G. (2017). Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS genetics, 13(11), e1007081.

    "},{"location":"16_mendelian_randomization/#mr-base-web-app","title":"MR-Base (web app)","text":"

    MR-Base web app

    "},{"location":"16_mendelian_randomization/#strobe-mr","title":"STROBE-MR","text":"

Before reporting any MR results, please check the STROBE-MR Checklist first, which consists of 20 items that should be addressed when reporting a Mendelian randomization study.

    "},{"location":"16_mendelian_randomization/#references","title":"References","text":""},{"location":"17_colocalization/","title":"Colocalization","text":""},{"location":"17_colocalization/#co-localization","title":"Co-localization","text":""},{"location":"17_colocalization/#coloc-assuming-a-single-causal-variant","title":"Coloc assuming a single causal variant","text":"

Coloc assumes at most one causal variant for each trait in the tested region, and tests whether the two traits share the same causal variant.

    Note

Note that this assumption is different from fine-mapping. In fine-mapping, the aim is to find the putative causal variants, which are determined at birth. In colocalization, the aim is to find overlapping signals that support a causal inference, such as eQTL --> trait. It is possible that the causal variants are different in the two traits.

    Datasets used:

    Result interpretation:

Basically, posterior probabilities for five configurations (H0 to H4, listed below) are calculated. A sketch of the call that produces them is shown below, followed by example output.
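A minimal R sketch of such a call, assuming the coloc package and a region-level data frame df whose column names are placeholders for your own input (they are not part of the coloc API):

library(coloc)\n\n# minimal sketch: dataset lists passed to coloc.abf()\n# (df and its columns are assumed names for your own input)\nd1 <- list(snp = df$snp, beta = df$beta1, varbeta = df$se1^2,\n           type = \"quant\", N = 10000, MAF = df$maf)   # trait 1: quantitative\nd2 <- list(snp = df$snp, beta = df$beta2, varbeta = df$se2^2,\n           type = \"cc\", s = 0.3, N = 8000)            # trait 2: case-control (30% cases)\n\nmy.res <- coloc.abf(dataset1 = d1, dataset2 = d2)\nmy.res$summary   # prints PP.H0.abf ... PP.H4.abf\n

Example output: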

    ## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf \n##  1.73e-08  7.16e-07  2.61e-05  8.20e-05  1.00e+00 \n## [1] \"PP abf for shared variant: 100%\"\n

    \\(H_0\\): neither trait has a genetic association in the region

    \\(H_1\\): only trait 1 has a genetic association in the region

    \\(H_2\\): only trait 2 has a genetic association in the region

    \\(H_3\\): both traits are associated, but with different causal variants

    \\(H_4\\): both traits are associated and share a single causal variant

PP.H4.abf is the posterior probability that the two traits share the same causal variant.

Then, conditional on H4 being true, a 95% credible set can be constructed (since a shared causal signal does not by itself pin down a specific variant).

# sort SNPs by posterior probability of being the shared causal variant\no <- order(my.res$results$SNP.PP.H4,decreasing=TRUE)\n# cumulative posterior probability over the sorted SNPs\ncs <- cumsum(my.res$results$SNP.PP.H4[o])\n# smallest set of SNPs whose cumulative probability exceeds 95%\nw <- which(cs > 0.95)[1]\nmy.res$results[o,][1:w,]$snp\n

    References:

    Coloc: a package for colocalisation analyses

    "},{"location":"17_colocalization/#coloc-assuming-multiple-causal-variants-or-multiple-signals","title":"Coloc assuming multiple causal variants or multiple signals","text":"

When the single causal variant assumption is violated, several approaches can be used to relax it.

1. Assuming multiple causal variants: the SuSiE-Coloc pipeline. In this pipeline, putative causal variants are first fine-mapped, and each signal is then passed to the coloc engine.

2. Conditioning analysis: the GCTA-COJO-Coloc pipeline. In this pipeline, independent signals are first separated by conditional analysis and then passed to the coloc engine.

    "},{"location":"17_colocalization/#other-pipelines","title":"Other pipelines","text":"

Many other strategies and pipelines are available for colocalization and for prioritizing variants/genes/traits. For example: * HyPrColoc * OpenTargets

    "},{"location":"18_Conditioning_analysis/","title":"Conditioning analysis","text":"

Multiple association signals can exist in one locus, especially when complex LD structures are observed in the regional plot. Conditioning on one signal allows independent signals to be separated.

There are several ways to perform conditioning analysis:

    "},{"location":"18_Conditioning_analysis/#adding-the-lead-variant-to-the-covariates","title":"Adding the lead variant to the covariates","text":"

First, export the individual genotypes (dosages) to a text file. Then add them to the covariates.

    plink2 \\\n  --pfile chr1.dose.Rsq0.3 vzs \\\n  --extract chr1.list \\\n  --threads 1 \\\n  --export A \\\n  --out genotype/chr1\n

The exported format is described in Export non-PLINK 2 fileset.

    Note

By default, the major allele dosage is output. If ref-first is added, the REF allele dosage is output instead. Either works when the dosage is used as a covariate.

Then simply paste the dosage column(s) into the covariate table and run the association test; a minimal merging sketch in R is shown below.
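A sketch of this step in R (the covariate file name covar.txt is an assumption, while genotype/chr1.raw is the file written by the plink2 --export A command above):

# read the exported dosages (.raw: FID IID PAT MAT SEX PHENOTYPE + one column per variant)\ngeno  <- read.table(\"genotype/chr1.raw\", header = TRUE)\n# your existing covariate table (assumed name)\ncovar <- read.table(\"covar.txt\", header = TRUE)\n\n# drop PAT/MAT/SEX/PHENOTYPE and join on sample IDs\nmerged <- merge(covar, geno[, -(3:6)], by = c(\"FID\", \"IID\"))\nwrite.table(merged, \"covar_with_dosage.txt\", row.names = FALSE, quote = FALSE)\n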

    Note

Some association test software also provides options for conditional analysis. For example, in PLINK you can use --condition <variant ID> to condition on a single variant, or supply a list of variant IDs to condition on several variants at once.

    "},{"location":"18_Conditioning_analysis/#gcta-cojo","title":"GCTA-COJO","text":"

If raw genotypes and phenotypes are not available, GCTA-COJO performs conditioning analysis using summary statistics and an external LD reference.

--cojo-top-SNPs 10 performs a stepwise model selection to select the top 10 independently associated SNPs (including non-significant ones).

    gcta \\\n  --bfile chr1 \\\n  --chr 1 \\\n  --maf 0.001 \\\n  --cojo-file chr1_cojo.input \\\n  --cojo-top-SNPs 10 \\\n  --extract-region-bp 1 152383617 5000 \\\n  --out chr1_cojo.output\n

    Note

bfile is used to estimate LD. A sample size of >4,000 unrelated individuals is suggested. Estimation of LD in GCTA is based on hard-call genotypes.

Input file format (shown with less chr1_cojo.input):

    ID      ALLELE1 ALLELE0 A1FREQ  BETA    SE      P       N\nchr1:11171:CCTTG:C      C       CCTTG   0.0831407       -0.0459889      0.0710074       0.5172  180590\nchr1:13024:G:A  A       G       1.63957e-05     -3.2714 3.26302 0.3161  180590\n
Here ALLELE1 is the effect allele. A minimal sketch for preparing this file is shown below.
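A minimal R sketch for building such an input file (the source file name and its column names are assumptions about your own sumstats):

ss <- read.table(\"chr1_sumstats.txt\", header = TRUE)   # assumed file name\n\n# keep the eight columns COJO expects, in this order\ncojo <- ss[, c(\"ID\", \"ALLELE1\", \"ALLELE0\", \"A1FREQ\", \"BETA\", \"SE\", \"P\", \"N\")]\n\nwrite.table(cojo, \"chr1_cojo.input\", row.names = FALSE, quote = FALSE)\n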

Then --cojo-cond can be used to generate new summary statistics conditioned on the variant(s) selected above.

    Reference:

    "},{"location":"19_ld/","title":"Linkage disequilibrium(LD)","text":""},{"location":"19_ld/#ld-definition","title":"LD Definition","text":"

During meiosis, homologous chromosomes are recombined. Recombination rates differ across DNA regions, so ancestral chromosome fragments persist and can be detected after tens of generations. This causes linkage disequilibrium (LD), which refers to the non-random association of alleles at different loci.

    Factors affecting LD

    "},{"location":"19_ld/#ld-estimation","title":"LD Estimation","text":"

    Suppose we have two SNPs whose alleles are \\(A/a\\) and \\(B/b\\).

    The haplotype frequencies are:

    Haplotype Frequency AB \\(p_{AB}\\) Ab \\(p_{Ab}\\) aB \\(p_{aB}\\) ab \\(p_{ab}\\)

    The allele frequencies are:

Allele Frequency A \\(p_A=p_{AB}+p_{Ab}\\) a \\(p_a=p_{aB}+p_{ab}\\) B \\(p_B=p_{AB}+p_{aB}\\) b \\(p_b=p_{Ab}+p_{ab}\\)

D : the level of LD between A and B can be estimated using the coefficient of linkage disequilibrium (D), which is defined as:

    \\[D_{AB} = p_{AB} - p_Ap_B\\]

    If A and B are in linkage equilibrium, we can get

    \\[D_{AB} = p_{AB} - p_Ap_B = 0\\]

    which means the coefficient of linkage disequilibrium is 0 in this case.

    D can be calculated for each pair of alleles and their relationships can be expressed as:

    \\[D_{AB} = -D_{Ab} = -D_{aB} = D_{ab} \\]

    So we can simply denote \\(D = D_{AB}\\), and the relationship between haplotype frequencies and allele frequencies can be summarized in the following table.

Allele A a Total B \\(p_{AB}=p_Ap_B+D\\) \\(p_{aB}=p_ap_B-D\\) \\(p_B\\) b \\(p_{Ab}=p_Ap_b-D\\) \\(p_{ab}=p_ap_b+D\\) \\(p_b\\) Total \\(p_A\\) \\(p_a\\) 1

The range of possible values of D depends on the allele frequencies, which makes D unsuitable for comparisons between different pairs of alleles.

    Lewontin suggested a method for the normalization of D :

    \\[D_{normalized} = {{D}\\over{D_{max}}}\\]

    where

    \\[ D_{max} = \\begin{cases} max\\{-p_Ap_B, -(1-p_A)(1-p_B)\\} & \\text{when } D \\lt 0 \\\\ min\\{ p_A(1-p_B), p_B(1-p_A) \\} & \\text{when } D \\gt 0 \\\\ \\end{cases} \\]

It measures what proportion of the haplotypes have undergone recombination.

    In practice, the most commonly used alternative metric to \\(D_{normalized}\\) is \\(r^2\\), the correlation coefficient, which can be obtained by:

    \\[ r^2 = {{D^2}\\over{p_A(1-p_A)p_B(1-p_B)}} \\]
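As a worked example, the following R sketch computes \\(D\\), \\(D_{normalized}\\), and \\(r^2\\) from made-up haplotype frequencies:

# toy haplotype frequencies (assumed values, for illustration only)\np_AB <- 0.4; p_Ab <- 0.1; p_aB <- 0.1; p_ab <- 0.4\n\n# allele frequencies\np_A <- p_AB + p_Ab   # 0.5\np_B <- p_AB + p_aB   # 0.5\n\n# coefficient of linkage disequilibrium\nD <- p_AB - p_A * p_B   # 0.15\n\n# Lewontin's normalized D\nD_max <- if (D < 0) max(-p_A * p_B, -(1 - p_A) * (1 - p_B)) else min(p_A * (1 - p_B), p_B * (1 - p_A))\nD_norm <- D / D_max   # 0.6\n\n# correlation coefficient\nr2 <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))   # 0.36\n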

    Reference: Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485.

    "},{"location":"19_ld/#ld-calculation-using-software","title":"LD Calculation using software","text":""},{"location":"19_ld/#ldstore2","title":"LDstore2","text":"

    LDstore2: http://www.christianbenner.com/#

Reference: Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).

    "},{"location":"19_ld/#plink-ld","title":"PLINK LD","text":"

    Please check Calculate LD using PLINK.

    "},{"location":"19_ld/#ld-lookup-using-ldlink","title":"LD Lookup using LDlink","text":"

    LDlink

    LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.

    https://ldlink.nci.nih.gov/?tab=home

    Reference: Machiela, M. J., & Chanock, S. J. (2015). LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31(21), 3555-3557.

    LDlink is a very useful tool for quick lookups of any information related to LD.

    "},{"location":"19_ld/#ldlink-ldpair","title":"LDlink-LDpair","text":"

    LDpair

    "},{"location":"19_ld/#ldlink-ldproxy","title":"LDlink-LDproxy","text":"

    LDproxy for rs671

    "},{"location":"19_ld/#query-in-batch-using-ldlink-api","title":"Query in batch using LDlink API","text":"

    LDlink provides API for queries using command line.

    You need to register and get a token first.

    https://ldlink.nci.nih.gov/?tab=apiaccess

    Query LD proxies for variants using LDproxy API

curl -k -X GET 'https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs3&pop=MXL&r2_d=r2&window=500000&genome_build=grch37&token=faketoken123'\n
    "},{"location":"19_ld/#ldlinkr","title":"LDlinkR","text":"

    There is also a related R package for LDlink.

    Query LD proxies for variants using LDlinkR

    install.packages(\"LDlinkR\")\n\nlibrary(LDlinkR)\n\nmy_proxies <- LDproxy(snp = \"rs671\", \n                      pop = \"EAS\", \n                      r2d = \"r2\", \n                      token = \"YourTokenHere123\",\n                      genome_build = \"grch38\"\n                     )\n

    Reference: Myers, T. A., Chanock, S. J., & Machiela, M. J. (2020). LDlinkR: an R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Frontiers in genetics, 11, 157.

    "},{"location":"19_ld/#ld-pruning","title":"LD-pruning","text":"

    Please check LD-pruning

    "},{"location":"19_ld/#ld-clumping","title":"LD-clumping","text":"

    Please check LD-clumping

    "},{"location":"19_ld/#ld-score","title":"LD score","text":"

    Definition: https://cloufield.github.io/GWASTutorial/08_LDSC/#ld-score

    "},{"location":"19_ld/#ldsc","title":"LDSC","text":"

    LD score can be estimated with LDSC using PLINK format genotype data as the reference panel.

plinkPrefix=chr22\n\npython ldsc.py \\\n    --bfile ${plinkPrefix} \\\n    --l2 \\\n    --ld-wind-cm 1 \\\n    --out ${plinkPrefix}\n

    Check here for details.

    "},{"location":"19_ld/#gcta","title":"GCTA","text":"

    GCTA also provides a function to estimate LD scores using PLINK format genotype data.

    plinkPrefix=chr22\n\ngcta64 \\\n    --bfile  ${plinkPrefix} \\\n    --ld-score \\\n    --ld-wind 1000 \\\n    --ld-rsq-cutoff 0.01 \\\n    --out  ${plinkPrefix}\n

    Check here for details.

    "},{"location":"19_ld/#ld-score-regression","title":"LD score regression","text":"

    Please check LD score regression

    "},{"location":"19_ld/#reference","title":"Reference","text":""},{"location":"20_power_analysis/","title":"Power analysis for GWAS","text":""},{"location":"20_power_analysis/#type-i-type-ii-errors-and-statistical-power","title":"Type I, type II errors and Statistical power","text":"

    This table shows the relationship between the null hypothesis \\(H_0\\) and the results of a statistical test (whether or not to reject the null hypothesis \\(H_0\\) ).

    H0 is True H0 is False Do Not Reject True negative : \\(1 - \\alpha\\) Type II error (false negative) : \\(\\beta\\) Reject Type I error (false positive) : \\(\\alpha\\) True positive : \\(1 - \\beta\\)

    \\(\\alpha\\) : significance level

    By definition, the statistical power of a test refers to the probability that the test will correctly reject the null hypothesis, namely the True positive rate in the table above.

\\(Power = Pr(Reject\\ H_0\\ |\\ H_0\\ is\\ False) = 1 - \\beta\\)
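To make this concrete, a small R simulation (a sketch with assumed group means and sizes) estimates power empirically as the fraction of simulated datasets in which \\(H_0\\) is rejected:

# empirical power of a two-sample t-test (toy example)\nset.seed(42)\nalpha <- 0.05\nreject <- replicate(10000, {\n  x <- rnorm(50, mean = 0)     # group 1\n  y <- rnorm(50, mean = 0.5)   # group 2: a true effect exists, so H0 is false\n  t.test(x, y)$p.value < alpha\n})\nmean(reject)   # ~0.70 = empirical power\n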

    Power

    Factors affecting power

    "},{"location":"20_power_analysis/#non-centrality-parameter","title":"Non-centrality parameter","text":"

The non-centrality parameter (NCP) describes the degree of difference between the alternative hypothesis \\(H_1\\) and the null hypothesis \\(H_0\\).

    Consider a simple linear regression model:

    \\[y = \\mu +\\beta x + \\epsilon\\]

    The variance of the error term:

    \\[\\sigma^2 = Var(y) - Var(x)\\beta^2\\]

    Usually, the phenotypic variance that a single SNP could explain is very limited, so we can approximate \\(\\sigma^2\\) by:

    \\[ \\sigma^2 \\thickapprox Var(y)\\]

    Under Hardy-Weinberg equilibrium, we can get:

    \\[Var(x) = 2f(1-f)\\]

    So the Non-centrality parameter(NCP) \\(\\lambda\\) for \\(\\chi^2\\) distribution with degree of freedom 1:

    \\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2\\]"},{"location":"20_power_analysis/#power-for-quantitative-traits","title":"Power for quantitative traits","text":"\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2 \\thickapprox N \\times {{Var(x)\\beta^2}\\over{\\sigma^2}} \\thickapprox N \\times {{2f(1-f) \\beta^2 }\\over {Var(y)}} \\]

    Significance threshold: \\(C = CDF_{\\chi^2}^{-1}(1 - \\alpha,df=1)\\)
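Combining the NCP and this threshold, a minimal R sketch (all input values are assumed examples) computes power exactly as in the formula just below:

# power for a quantitative trait, chi-square test with df = 1\nN <- 20000      # sample size (assumed)\nf <- 0.3        # effect allele frequency (assumed)\nbeta <- 0.05    # per-allele effect (assumed)\nvary <- 1       # phenotypic variance (assumed)\nalpha <- 5e-8   # genome-wide significance level\n\nncp <- N * 2 * f * (1 - f) * beta^2 / vary\nC <- qchisq(1 - alpha, df = 1)\npower <- 1 - pchisq(C, df = 1, ncp = ncp)\npower\n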

    \\[ Power = Pr(\\lambda > C ) = 1 - CDF_{\\chi^2}(C, ncp = \\lambda,df=1) \\] "},{"location":"20_power_analysis/#power-for-large-scale-case-control-genome-wide-association-studies","title":"Power for large-scale case-control genome-wide association studies","text":"

Denote \\(P_{case}\\) and \\(P_{control}\\) as the effect allele frequencies in cases and controls, and \\(N_{case}\\) and \\(N_{control}\\) as the numbers of cases and controls.

    Null hypothesis : \\(P_{case} = P_{control}\\)

    To test whether one proportion \\(P_{case}\\) equals the other proportion \\(P_{control}\\), the test statistic is:

    \\[z = {{P_{case} - P_{control}}\\over {\\sqrt{ {{P_{case}(1 - P_{case})}\\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\\over{2N_{control}}} }}}\\]

    Significance threshold: \\(C = \\Phi^{-1}(1 - \\alpha / 2 )\\)

\\[ Power = Pr(|Z|>C) = \\Phi(-C-z) + 1 - \\Phi(C-z)\\]
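The same calculation in R (a sketch; allele frequencies and sample sizes are assumed example values):

# power for comparing allele frequencies between cases and controls\np_case <- 0.45;  p_control <- 0.40   # effect allele frequencies (assumed)\nn_case <- 5000;  n_control <- 5000   # numbers of cases and controls (assumed)\nalpha <- 5e-8\n\nz <- (p_case - p_control) /\n  sqrt(p_case * (1 - p_case) / (2 * n_case) + p_control * (1 - p_control) / (2 * n_control))\nC <- qnorm(1 - alpha / 2)\npower <- pnorm(-C - z) + 1 - pnorm(C - z)\npower   # ~0.96\n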

    GAS power calculator

The GAS power calculator implements this method, and you can easily calculate power using their website.

    "},{"location":"20_power_analysis/#reference","title":"Reference:","text":""},{"location":"21_twas/","title":"TWAS","text":""},{"location":"21_twas/#background","title":"Background","text":"

    Most variants identified in GWAS are located in regulatory regions, and these genetic variants could potentially affect complex traits through gene expression.

However, due to limited sample availability and high cost, it is difficult to measure gene expression at a large scale. Consequently, many expression-trait associations have not been detected, especially those with small effect sizes.

To address these issues, alternative approaches have been proposed. The transcriptome-wide association study (TWAS) has become a common and easy-to-perform approach for identifying genes whose expression is significantly associated with complex traits in individuals whose expression levels were not directly measured.

    GWAS and TWAS

    "},{"location":"21_twas/#definition","title":"Definition","text":"

    TWAS is a method to identify significant expression-trait associations using expression imputation from genetic data or summary statistics.

    Individual-level and summary-level TWAS

    "},{"location":"21_twas/#fusion","title":"FUSION","text":"

    In this tutorial, we will introduce FUSION, which is one of the most commonly used tools for performing transcriptome-wide association studies (TWAS) using summary-level data.

    url : http://gusevlab.org/projects/fusion/

    FUSION trains predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. (http://gusevlab.org/projects/fusion/)

    Quote

    Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.

    "},{"location":"21_twas/#algorithm-for-imputing-expression-into-gwas-summary-statistics","title":"Algorithm for imputing expression into GWAS summary statistics","text":"

The ImpG-Summary algorithm was extended to impute the Z scores for the cis genetic component of expression.

    FUSION statistical model

    \\(Z\\) : a vector of standardized effect sizes (z scores) of SNPs for the target trait at a given locus

We impute the Z score of the expression-trait association as a linear combination of the elements of \\(Z\\) with weights \\(W\\).

    \\[ W = \\Sigma_{e,s}\\Sigma_{s,s}^{-1} \\]

Both \\(\\Sigma_{e,s}\\) (the covariance between expression and SNPs) and \\(\\Sigma_{s,s}\\) (the LD matrix among SNPs) are estimated from reference datasets.

    \\[ Z \\sim N(0, \\Sigma_{s,s} ) \\]

    The variance of \\(WZ\\) (imputed z score of expression and trait)

    \\[ Var(WZ) = W\\Sigma_{s,s}W^t \\]

    The imputation Z score can be obtained by:

\\[ {{WZ}\\over{(W\\Sigma_{s,s}W^t)^{1/2}}} \\]
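A minimal R sketch of this calculation with toy numbers (the z-scores, weights, and LD matrix below are made up for illustration):

Z <- c(2.1, 1.8, 0.5)                     # GWAS z-scores of the locus SNPs\nW <- c(0.5, 0.3, 0.1)                     # expression weights for the same SNPs\nS <- diag(3); S[1, 2] <- S[2, 1] <- 0.4   # SNP LD matrix (Sigma_{s,s})\n\n# imputed expression-trait z-score: WZ / sqrt(W Sigma W^t)\ntwas_z <- as.numeric(W %*% Z) / sqrt(as.numeric(W %*% S %*% W))\ntwas_z   # ~2.39\n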

    ImpG-Summary algorithm

    Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., ... & Price, A. L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.

    "},{"location":"21_twas/#installation","title":"Installation","text":"

    Download FUSION from github and install

    wget https://github.com/gusevlab/fusion_twas/archive/master.zip\nunzip master.zip\ncd fusion_twas-master\n

    Download and unzip the LD reference data (1000 genome)

    wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2\ntar xjvf LDREF.tar.bz2\n

    Download and unzip plink2R

    wget https://github.com/gabraham/plink2R/archive/master.zip\nunzip master.zip\n

    Install R packages

    # R >= 4.0\nR\n\ninstall.packages(c('optparse','RColorBrewer'))\ninstall.packages('plink2R-master/plink2R/',repos=NULL)\n

    "},{"location":"21_twas/#example","title":"Example","text":"

    FUSION framework

    Input:

    1. GWAS summary statistics (in LDSC format)
    2. pre-computed gene expression weights (from http://gusevlab.org/projects/fusion/)

Input GWAS sumstats format

    1. SNP (rsID)
    2. A1 (effect allele)
    3. A2 (non-effect allele)
    4. Z (Z score)

    Example:

    SNP A1  A2  N   CHISQ   Z\nrs6671356   C   T   70100.0 0.172612905312  0.415467092935\nrs6604968   G   A   70100.0 0.291125788806  0.539560736902\nrs4970405   A   G   70100.0 0.102204513891  0.319694407037\nrs12726255  G   A   70100.0 0.312418295691  0.558943911042\nrs4970409   G   A   70100.0 0.0524226849517 0.228960007319\n

    Get sample sumstats and weights

    wget https://data.broadinstitute.org/alkesgroup/FUSION/SUM/PGC2.SCZ.sumstats\n\nmkdir WEIGHTS\ncd WEIGHTS\nwget https://data.broadinstitute.org/alkesgroup/FUSION/WGT/GTEx.Whole_Blood.tar.bz2\ntar xjf GTEx.Whole_Blood.tar.bz2\n

    WEIGHTS

    files in each WEIGHTS folder

    RDat weight files for each gene in a tissue type

    GTEx.Whole_Blood.ENSG00000002549.8.LAP3.wgt.RDat         GTEx.Whole_Blood.ENSG00000166394.10.CYB5R2.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002822.11.MAD1L1.wgt.RDat      GTEx.Whole_Blood.ENSG00000166435.11.XRRA1.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002919.10.SNX11.wgt.RDat       GTEx.Whole_Blood.ENSG00000166436.11.TRIM66.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002933.3.TMEM176A.wgt.RDat     GTEx.Whole_Blood.ENSG00000166444.13.ST5.wgt.RDat\nGTEx.Whole_Blood.ENSG00000003137.4.CYP26B1.wgt.RDat      GTEx.Whole_Blood.ENSG00000166471.6.TMEM41B.wgt.RDat\n...\n

    Expression imputation

    Rscript FUSION.assoc_test.R \\\n--sumstats PGC2.SCZ.sumstats \\\n--weights ./WEIGHTS/GTEx.Whole_Blood.pos \\\n--weights_dir ./WEIGHTS/ \\\n--ref_ld_chr ./LDREF/1000G.EUR. \\\n--chr 22 \\\n--out PGC2.SCZ.22.dat\n

    Results

    head PGC2.SCZ.22.dat\nPANEL   FILE    ID  CHR P0  P1  HSQ BEST.GWAS.ID    BEST.GWAS.Z EQTL.ID EQTL.R2 EQTL.Z  EQTL.GWAS.Z NSNP    NWGT    MODEL   MODELCV.R2  MODELCV.PV  TWAS.Z  TWAS.P\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000273311.1.DGCR11.wgt.RDat DGCR11  22  19033675    19035888    0.0551  rs2238767   -2.98   rs2283641    0.013728     4.33   2.5818 408  1  top1    0.014   0.018    2.5818 9.83e-03\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000100075.5.SLC25A1.wgt.RDat    SLC25A1 22  19163095    19166343    0.0740  rs2238767   -2.98   rs762523     0.080367     5.36  -1.8211 406  1  top1    0.08    7.2e-08 -1.8216.86e-02\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000070371.11.CLTCL1.wgt.RDat    CLTCL1  22  19166986    19279239    0.1620  rs4819843    3.04   rs809901     0.072193     5.53  -1.9928 456 19  enet    0.085   2.8e-08 -1.8806.00e-02\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000232926.1.AC000078.5.wgt.RDat AC000078.5  22  19874812    19875493    0.2226  rs5748555   -3.15   rs13057784   0.052796     5.60  -0.1652 514 44  enet    0.099   2e-09  0.0524   9.58e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000185252.13.ZNF74.wgt.RDat ZNF74   22  20748405    20762745    0.1120  rs595272     4.09   rs1005640    0.001422     3.44  -1.3677 301  8  enet    0.008   0.054   -0.8550 3.93e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000099940.7.SNAP29.wgt.RDat SNAP29  22  21213771    21245506    0.1286  rs595272     4.09   rs4820575    0.061763     5.94  -1.1978 416 27  enet    0.079   9.4e-08 -1.0354 3.00e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000272600.1.AC007308.7.wgt.RDat AC007308.7  22  21243494    21245502    0.2076  rs595272     4.09   rs165783     0.100625     6.79  -0.8871 408 12  lasso   0.16    5.4e-1-1.2049   2.28e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000183773.11.AIFM3.wgt.RDat AIFM3   22  21319396    21335649    0.0676  rs595272     4.09   rs565979     0.036672     4.50  -0.4474 362  1  top1    0.037   0.00024 -0.4474 6.55e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000230513.1.THAP7-AS1.wgt.RDat  THAP7-AS1   22  21356175    21357118    0.2382  rs595272     4.09   rs2239961    0.105307    -7.04  -0.3783 347  5  lasso   0.15    7.6e-1 0.2292   8.19e-01\n

    Descriptions of the output (cited from http://gusevlab.org/projects/fusion/ )

Column number Column header Value Usage 1 FILE \u2026 Full path to the reference weight file used 2 ID FAM109B Feature/gene identifier, taken from --weights file 3 CHR 22 Chromosome 4 P0 42470255 Gene start (from --weights) 5 P1 42475445 Gene end (from --weights) 6 HSQ 0.0447 Heritability of the gene 7 BEST.GWAS.ID rs1023500 rsID of the most significant GWAS SNP in locus 8 BEST.GWAS.Z -5.94 Z-score of the most significant GWAS SNP in locus 9 EQTL.ID rs5758566 rsID of the best eQTL in the locus 10 EQTL.R2 0.058680 cross-validation R2 of the best eQTL in the locus 11 EQTL.Z -5.16 Z-score of the best eQTL in the locus 12 EQTL.GWAS.Z -5.0835 GWAS Z-score for this eQTL 13 NSNP 327 Number of SNPs in the locus 14 MODEL lasso Best performing model 15 MODELCV.R2 0.058870 cross-validation R2 of the best performing model 16 MODELCV.PV 3.94e-06 cross-validation P-value of the best performing model 17 TWAS.Z 5.1100 TWAS Z-score (our primary statistic of interest) 18 TWAS.P 3.22e-07 TWAS P-value"},{"location":"21_twas/#limitations","title":"Limitations","text":"
1. Significant loci identified in TWAS also contain multiple trait-associated genes. GWAS often identifies multiple variants in LD. Similarly, TWAS frequently identifies multiple genes in a locus.

    2. Co-regulation may cause false positive results. Just like SNPs are correlated due to LD, gene expressions are often correlated due to co-regulation.

    3. Sometimes even when co-regulation is not captured, the shared variants (or variants in strong LD) in different expression prediction models may cause false positive results.

4. Predicted expression accounts for only a limited portion of total gene expression. Total expression is affected not only by genetic components such as cis-eQTLs but also by other factors such as environmental and technical components.

    5. Other factors. For example, the window size for selecting variants may affect association results.

    "},{"location":"21_twas/#criticism","title":"Criticism","text":"

TWAS aims to test the relationship of the phenotype with the genetic component of gene expression. But under the current framework, TWAS only tests the relationship of the phenotype with the predicted gene expression, without accounting for the uncertainty in that prediction. The key point is that the current framework omits from the analysis the fact that the gene expression data are themselves the result of a sampling process.

    \"Consequently, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level.\"

    Quote

    de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08.

    "},{"location":"21_twas/#reference","title":"Reference","text":""},{"location":"32_whole_genome_regression/","title":"Whole-genome regression : REGENIE","text":""},{"location":"32_whole_genome_regression/#concepts","title":"Concepts","text":""},{"location":"32_whole_genome_regression/#overview","title":"Overview","text":"

    Overview of REGENIE

    Reference: https://rgcgithub.github.io/regenie/overview/

    "},{"location":"32_whole_genome_regression/#whole-genome-model","title":"Whole genome model","text":""},{"location":"32_whole_genome_regression/#stacked-regressions","title":"Stacked regressions","text":""},{"location":"32_whole_genome_regression/#firth-correction","title":"Firth correction","text":""},{"location":"32_whole_genome_regression/#tutorial","title":"Tutorial","text":""},{"location":"32_whole_genome_regression/#installation","title":"Installation","text":"

    Please check here

    "},{"location":"32_whole_genome_regression/#step1","title":"Step1","text":"

    Sample codes for running step 1

    plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\n# revise the header of covariate file\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n  --step 1 \\\n  --bed ${plinkFile} \\\n  --extract ${extract} \\\n  --phenoFile ${phenoFile} \\\n  --covarFile ${covarFile} \\\n  --covarColList ${covarList} \\\n  --bt \\\n  --bsize 1000 \\\n  --lowmem \\\n  --lowmem-prefix tmpdir/regenie_tmp_preds \\\n  --out 1kg_eas_step1_BT\n
    "},{"location":"32_whole_genome_regression/#step2","title":"Step2","text":"

    Sample codes for running step 2

    plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n  --step 2 \\\n  --bed ${plinkFile} \\\n  --ref-first \\\n  --phenoFile ${phenoFile} \\\n  --covarFile ${covarFile} \\\n  --covarColList ${covarList} \\\n  --bt \\\n  --bsize 400 \\\n  --firth --approx --pThresh 0.01 \\\n  --pred 1kg_eas_step1_BT_pred.list \\\n  --out 1kg_eas_step1_BT\n
    "},{"location":"32_whole_genome_regression/#visualization","title":"Visualization","text":""},{"location":"32_whole_genome_regression/#reference","title":"Reference","text":""},{"location":"55_measure_of_effect/","title":"Measure of effect","text":""},{"location":"55_measure_of_effect/#concepts","title":"Concepts","text":""},{"location":"55_measure_of_effect/#risk","title":"Risk","text":"

Risk: the probability that a subject within a population will develop a given disease, or other health outcome, over a specified follow-up period. In the formula below, \\(E\\) is the number of subjects with the event and \\(N\\) the number without it.

    \\[ R = {{E}\\over{E + N}} \\] "},{"location":"55_measure_of_effect/#odds","title":"Odds","text":"

    Odds: the likelihood of a new event occurring rather than not occurring. It is the probability that an event will occur divided by the probability that the event will not occur.

    \\[ Odds = {E \\over N } \\]"},{"location":"55_measure_of_effect/#hazard","title":"Hazard","text":"

    Hazard function \\(h(t)\\): the event rate at time \\(t\\) conditional on survival until time \\(t\\) (namely, \\(T\u2265t\\))

\\[ h(t) = Pr(t \\le T < t+1 \\mid T \\ge t ) \\]

\\(T\\) is a discrete random variable indicating the time of occurrence of the event.
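A small R sketch of this definition (toy event times, assumed to be fully observed without censoring):

# empirical discrete-time hazard from observed event times\nT_event <- c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6)\nh <- sapply(1:6, function(t) sum(T_event == t) / sum(T_event >= t))\nround(h, 2)   # event rate at each t among subjects still at risk\n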

    "},{"location":"55_measure_of_effect/#relative-risk-rr-and-odds-ratio-or","title":"Relative risk (RR) and Odds ratio (OR)","text":""},{"location":"55_measure_of_effect/#22-contingency-table","title":"2\u00d72 Contingency Table","text":"Intervention I Control C Events E IE CE Non-events N IN CN"},{"location":"55_measure_of_effect/#relative-risk-rr","title":"Relative risk (RR)","text":"

    RR: relative risk (risk ratio), usually used in cohort studies.

\\[ RR = {{R_{Intervention}}\\over{R_{control}}}={{IE/(IE+IN)}\\over{CE/(CE+CN)}} \\]"},{"location":"55_measure_of_effect/#odds-ratio-or","title":"Odds ratio (OR)","text":"

    OR: usually used in case control studies.

\\[ OR = {{Odds_{Intervention}}\\over{Odds_{control}}}={{IE/IN}\\over{CE/CN}} = {{IE * CN}\\over{CE * IN}} \\]

    When the event occurs in less than 10% of the unexposed population, the OR provides a reasonable approximation of the RR.
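A worked example in R (a made-up 2\u00d72 table) that also illustrates the rare-event approximation just mentioned:

# assumed 2x2 table\nIE <- 30;  IN <- 970   # intervention: events, non-events\nCE <- 15;  CN <- 985   # control: events, non-events\n\nRR <- (IE / (IE + IN)) / (CE / (CE + CN))   # 2.00\nOR <- (IE * CN) / (CE * IN)                 # ~2.03, close to RR because events are rare\n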

    "},{"location":"55_measure_of_effect/#hazard-ratios-hr","title":"Hazard ratios (HR)","text":"

Hazard ratios (relative hazards) are usually estimated from the Cox proportional hazards model:

    \\[ h_i(t) = h_0(t) \\times e^{\\beta_0 + \\beta_1X_{i1} + ... + \\beta_nX_{in} } = h_0(t) \\times e^{X_i\\beta } \\]

    HR: the ratio of the hazard rates corresponding to the conditions characterised by two distinct levels of a treatment variable of interest.

    \\[ HR = {{h(t | X_i)}\\over{h(t|X_j)}} = {{h_0(t) \\times e^{X_i\\beta }}\\over{h_0(t) \\times e^{X_j\\beta }}} = e^{(X_i-X_j)\\beta} \\]"},{"location":"60_awk/","title":"AWK","text":""},{"location":"60_awk/#awk-introduction","title":"AWK Introduction","text":"

    'awk' is one of the most powerful text processing tools for tabular text files.

    "},{"location":"60_awk/#awk-syntax","title":"AWK syntax","text":"
    awk OPTION 'CONDITION {PROCESS}' FILENAME\n

Some special variables in awk: $0 (the whole current line), $1 to $NF (the individual fields), NR (the current row number), and NF (the number of fields).

    "},{"location":"60_awk/#examples","title":"Examples","text":"

    Using the sample sumstats, we will demonstrate some simple but useful one-liners.

# sample sumstats\nhead ../02_Linux_basics/sumstats.txt \n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872 2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238 2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055 1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036 0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"60_awk/#example-1","title":"Example 1","text":"

    Select variants on chromosome 2 (keeping the headers)

awk 'NR==1 ||  $1==2 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n2   22398   2:22398:C:T C   T   T   ADD 503 1.28754 0.161017 1.56962 0.116503    .\n2   24839   2:24839:C:T C   T   T   ADD 503 1.31817 0.179754 1.53679 0.124344    .\n2   26844   2:26844:C:T C   T   T   ADD 503 1.3173  0.161302    1.70851 0.0875413   .\n2   28786   2:28786:T:C T   C   C   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   30091   2:30091:C:G C   G   G   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   30762   2:30762:A:G A   G   A   ADD 503 1.09956 0.158614 0.598369    0.549594    .\n2   34503   2:34503:G:T G   T   T   ADD 503 1.32372 0.179789 1.55988 0.118789    .\n2   39340   2:39340:A:G A   G   G   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   55237   2:55237:T:C T   C   C   ADD 503 1.31486 0.161988 1.68983 0.0910614   .\n

The NR here means the row number. The condition NR==1 || $1==2 means: if it is the first row (the header) or the first column equals 2, run the process print $0, which prints the whole line.

    "},{"location":"60_awk/#example-2","title":"Example 2","text":"

Select all genome-wide significant variants (P < 5e-8). Note that P is the 12th column in this file; comparing the 13th column (ERRCODE, which is \".\") would evaluate to 0 numerically and print every row.

awk 'NR==1 ||  $12 <5e-8 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n
    "},{"location":"60_awk/#example-3","title":"Example 3","text":"

    Create a bed-like format for annotation

    awk 'NR>1 {print $1,$2,$2,$4,$5}' ../02_Linux_basics/sumstats.txt | head\n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
    "},{"location":"60_awk/#awk-workflow","title":"AWK workflow","text":"

    The workflow of awk can be summarized in the following figure:

    awk workflow

    "},{"location":"60_awk/#awk-variables","title":"AWK variables","text":"

    Frequently used awk variables

Variable Description NR The number of input records NF The number of input fields FS The input field separator. The default value is \" \" OFS The output field separator. The default value is \" \" RS The input record separator. The default value is \"\\n\" ORS The output record separator. The default value is \"\\n\" FILENAME The name of the current input file. FNR The current record number in the current file

    Handle csv and tsv files

    head ../03_Data_formats/sample_data.csv\n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
    awk -v FS=',' -v OFS=\"\\t\" '{print $1,$2}' sample_data.csv\n#CHROM  POS\n1       13273\n1       14599\n1       14604\n1       14930\n1       69897\n1       86331\n1       91581\n1       122872\n1       135163\n

    convert csv to tsv

    awk 'BEGIN { FS=\",\"; OFS=\"\\t\" } {$1=$1; print}' sample_data.csv\n

    Skip and replace headers

    awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"CHR\\tPOS\"} NR>1 {print $1,$2}' sample_data.csv\n\nCHR     POS\n1       13273\n1       14599\n1       14604\n1       14930\n1       69897\n1       86331\n1       91581\n1       122872\n1       135163\n

    Extract a line

    awk 'NR==4' sample_data.csv\n\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n

    Print the last two columns

    awk -v FS=',' '{print $(NF-1),$(NF)}' sample_data.csv\nP ERRCODE\n0.305961 .\n0.0104299 .\n0.0104299 .\n0.0269602 .\n0.0188466 .\n0.102694 .\n0.522847 .\n0.703856 .\n0.155079 .\n
    "},{"location":"60_awk/#awk-operators","title":"AWK operators","text":"

    Arithmetic Operators

Arithmetic Operators Description + add - subtract * multiply / divide % modulus division ** x**y : x raised to the y-th power

    Logical Operators

Logical Operators Description \|\| or && and ! not"},{"location":"60_awk/#awk-functions","title":"AWK functions","text":"

    Numeric functions in awk

    Convert OR and P to BETA and -log10(P)

    awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"SNPID\\tBETA\\tMLOG10P\"}NR>1{print $3,log($10),-log($13)/log(10)}' sample_data.csv\nSNPID   BETA    MLOG10P\n1:13273:G:C     -0.287458       0.514334\n1:14599:T:A     0.593172        1.98172\n1:14604:A:G     0.593172        1.98172\n1:14930:A:G     0.531446        1.56928\n1:69897:T:C     0.457438        1.72477\n1:86331:A:G     0.385303        0.988455\n1:91581:G:A     -0.0785866      0.281625\n1:122872:T:G    0.0687142       0.152516\n1:135163:C:T    -0.339927       0.809447\n

    String manipulating functions in awk

    "},{"location":"60_awk/#awk-options","title":"AWK options","text":"
    $ awk --help\nUsage: awk [POSIX or GNU style options] -f progfile [--] file ...\nUsage: awk [POSIX or GNU style options] [--] 'program' file ...\nPOSIX options:          GNU long options: (standard)\n        -f progfile             --file=progfile\n        -F fs                   --field-separator=fs\n        -v var=val              --assign=var=val\nShort options:          GNU long options: (extensions)\n        -b                      --characters-as-bytes\n        -c                      --traditional\n        -C                      --copyright\n        -d[file]                --dump-variables[=file]\n        -D[file]                --debug[=file]\n        -e 'program-text'       --source='program-text'\n        -E file                 --exec=file\n        -g                      --gen-pot\n        -h                      --help\n        -i includefile          --include=includefile\n        -l library              --load=library\n        -L[fatal|invalid]       --lint[=fatal|invalid]\n        -M                      --bignum\n        -N                      --use-lc-numeric\n        -n                      --non-decimal-data\n        -o[file]                --pretty-print[=file]\n        -O                      --optimize\n        -p[file]                --profile[=file]\n        -P                      --posix\n        -r                      --re-interval\n        -S                      --sandbox\n        -t                      --lint-old\n        -V                      --version\n\nTo report bugs, see node `Bugs' in `gawk.info', which is\nsection `Reporting Problems and Bugs' in the printed version.\n\ngawk is a pattern scanning and processing language.\nBy default it reads standard input and writes standard output.\n\nExamples:\n        gawk '{ sum += $1 }; END { print sum }' file\n        gawk -F: '{ print $1 }' /etc/passwd\n
    "},{"location":"60_awk/#reference","title":"Reference","text":""},{"location":"61_sed/","title":"sed","text":"

sed, short for stream editor, is also one of the most commonly used text-editing commands in Linux. The sed command edits text from standard input in a line-by-line manner.

    "},{"location":"61_sed/#sed-syntax","title":"sed syntax","text":"
    sed [OPTIONS] PROCESS [FILENAME]\n
    "},{"location":"61_sed/#examples","title":"Examples","text":""},{"location":"61_sed/#sample-input","title":"sample input","text":"
head ../02_Linux_basics/sumstats.txt\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872 2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238 2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055 1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036 0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"61_sed/#example-1-replacing-strings","title":"Example 1: Replacing strings","text":"

s for substitute, g for global

    Replacing strings

    \"Replace the separator from : to _\"

head 02_Linux_basics/sumstats.txt | sed 's/:/_/g'\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1_13273_G_C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1_14599_T_A T   A   A   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14604   1_14604_A_G A   G   G   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14930   1_14930_A_G A   G   G   ADD 503 1.64359 0.242872 2.04585 0.0407708   .\n1   69897   1_69897_T_C T   C   T   ADD 503 1.69142 0.200238 2.62471 0.00867216  .\n1   86331   1_86331_A_G A   G   G   ADD 503 1.41887 0.238055 1.46968 0.141649    .\n1   91581   1_91581_G_A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1_122872_T_G    T   G   G   ADD 503 1.04828 0.182036 0.259034    0.795609    .\n1   135163  1_135163_C_T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n

    "},{"location":"61_sed/#example-2-delete-headerthe-first-line","title":"Example 2: Delete header(the first line)","text":"

d for deletion

Delete the header (the first line)

head 02_Linux_basics/sumstats.txt | sed '1d'\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899 2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872 2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238 2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055 1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036 0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"69_resources/","title":"Resources","text":""},{"location":"69_resources/#sandbox","title":"Sandbox","text":"

    Sandbox provides tutorials for you to learn how to use bioinformatics tools right from your browser. Everything runs in a sandbox, so you can experiment all you want.

    "},{"location":"69_resources/#explain-shell","title":"Explain Shell","text":"

explainshell is a tool (with a web interface) capable of parsing man pages, extracting options, and explaining a given command line by matching each argument to the relevant help text in the man page.

    "},{"location":"71_python_resources/","title":"Python Resources","text":""},{"location":"71_python_resources/#python","title":"Python\u30d7\u30ed\u30b0\u30e9\u30df\u30f3\u30b0\u5165\u9580","text":""},{"location":"75_R_basics/","title":"R","text":""},{"location":"75_R_basics/#installing-r","title":"Installing R","text":""},{"location":"75_R_basics/#download-r-from-cran","title":"Download R from CRAN","text":"

    R can be downloaded from its official website CRAN (The Comprehensive R Archive Network).

    CRAN

    https://cran.r-project.org/

    "},{"location":"75_R_basics/#install-r-using-conda","title":"Install R using conda","text":"

    It is convenient to use conda to manage your R environment.

    conda install -c conda-forge r-base=4.x.x\n
    "},{"location":"75_R_basics/#ide-for-r-positrstudio","title":"IDE for R: Posit(Rstudio)","text":"

Posit (RStudio) is one of the most commonly used integrated development environments (IDEs) for R.

    https://posit.co/

    "},{"location":"75_R_basics/#use-r-in-interactive-mode","title":"Use R in interactive mode","text":"
    R\n
    "},{"location":"75_R_basics/#run-r-script","title":"Run R script","text":"
    Rscript mycode.R\n
    "},{"location":"75_R_basics/#installing-and-using-r-packages","title":"Installing and Using R packages","text":"
    install.packages(\"package_name\")\n\nlibrary(package_name)\n
    "},{"location":"75_R_basics/#basic-syntax","title":"Basic syntax","text":""},{"location":"75_R_basics/#assignment-and-evaluation","title":"Assignment and Evaluation","text":"
    > x <- 1\n\n> x\n[1] 1\n\n> print(x)\n[1] 1\n
    "},{"location":"75_R_basics/#data-types","title":"Data types","text":""},{"location":"75_R_basics/#atomic-data-types","title":"Atomic data types","text":"

    logical, integer, real, complex, string (or character)

Atomic data types Description Examples logical boolean TRUE, FALSE integer integer 1,2 numeric float number 0.01 complex complex number 1+0i string string or character abc"},{"location":"75_R_basics/#vectors","title":"Vectors","text":"
myvector <- c(1,2,3)\nmyvector <- 1:3\n\nmyvector <- c(TRUE,FALSE)\nmyvector <- c(0.01, 0.02)\nmyvector <- c(1+0i, 2+3i)\nmyvector <- c(\"a\",\"bc\")\n
    "},{"location":"75_R_basics/#matrices","title":"Matrices","text":"
    > mymatrix <- matrix(1:6, nrow = 2, ncol = 3)\n> mymatrix\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n\n> ncol(mymatrix)\n[1] 3\n> nrow(mymatrix)\n[1] 2\n> dim(mymatrix)\n[1] 2 3\n> length(mymatrix)\n[1] 6\n
    "},{"location":"75_R_basics/#list","title":"List","text":"

    list() is a special vector-like data type that can contain different data types.

    > mylist <- list(1, 0.02, \"a\", FALSE, c(1,2,3), matrix(1:6,nrow=2,ncol=3))\n> mylist\n[[1]]\n[1] 1\n\n[[2]]\n[1] 0.02\n\n[[3]]\n[1] \"a\"\n\n[[4]]\n[1] FALSE\n\n[[5]]\n[1] 1 2 3\n\n[[6]]\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n
    "},{"location":"75_R_basics/#dataframe","title":"Dataframe","text":"
    > df <- data.frame(score = c(90,80,70,60),  rank = c(\"a\", \"b\", \"c\", \"d\"))\n> df\n  score rank\n1    90    a\n2    80    b\n3    70    c\n4    60    d\n
    "},{"location":"75_R_basics/#subsetting","title":"Subsetting","text":"
    myvector\n[1] 1 2 3\n> myvector[0]\ninteger(0)\n> myvector[1]\n[1] 1\nmyvector[1:2]\n[1] 1 2\n> myvector[-1]\n[1] 2 3\n> myvector[-1:-2]\n[1] 3\n
    > mymatrix\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n> mymatrix[0]\ninteger(0)\n> mymatrix[1]\n[1] 1\n> mymatrix[1,]\n[1] 1 3 5\n> mymatrix[1,2]\n[1] 3\n> mymatrix[1:2,2]\n[1] 3 4\n> mymatrix[,2]\n[1] 3 4\n
    > df\n  score rank\n1    90    a\n2    80    b\n3    70    c\n4    60    d\n> df[score]\nError in `[.data.frame`(df, score) : object 'score' not found\n> df[[score]]\nError in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x,  :\n  object 'score' not found\n> df[[\"score\"]]\n[1] 90 80 70 60\n> df[\"score\"]\n  score\n1    90\n2    80\n3    70\n4    60\n> df[1, \"score\"]\n[1] 90\n> df[1:2, \"score\"]\n[1] 90 80\n> df[1:2,2]\n[1] \"a\" \"b\"\n> df[1:2,1]\n[1] 90 80\n> df[,c(\"rank\",\"score\")]\n  rank score\n1    a    90\n2    b    80\n3    c    70\n4    d    60\n
    "},{"location":"75_R_basics/#data-input-and-output","title":"Data Input and Output","text":"
    mydata <- read.table(\"data.txt\", header=T)\n\nwrite.table(mydata, \"data.txt\")\n
    "},{"location":"75_R_basics/#control-flow","title":"Control flow","text":""},{"location":"75_R_basics/#if","title":"if","text":"
    if (x > y){\n  print (\"x\")\n} else if (x < y){\n  print (\"y\")\n} else {\n  print(\"tie\")\n}\n
    "},{"location":"75_R_basics/#for","title":"for","text":"
    > for (x in 1:5) {\n    print(x)\n}\n\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n
    "},{"location":"75_R_basics/#while","title":"while","text":"
    x<-0\nwhile (x<5)\n{\n    x<-x+1\n    print(\"Hello world\")\n}\n\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n
    "},{"location":"75_R_basics/#functions","title":"Functions","text":"
myfunction <- function(x){\n  # actual code here\n  return(result)\n}\n\n> my_add_function <- function(x,y){\n  c = x + y\n  return(c)\n}\n> my_add_function(1,3)\n[1] 4\n
    "},{"location":"75_R_basics/#statistical-functions","title":"Statistical functions","text":""},{"location":"75_R_basics/#normal-distribution","title":"Normal distribution","text":"Function Description dnorm(x, mean = 0, sd = 1, log = FALSE) probability density function pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) cumulative density function qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) quantile function rnorm(n, mean = 0, sd = 1) generate random values from normal distribution
    > dnorm(1.96)\n[1] 0.05844094\n\n> pnorm(1.96)\n[1] 0.9750021\n\n> pnorm(1.96, lower.tail=FALSE)\n[1] 0.0249979\n\n> qnorm(0.975)\n[1] 1.959964\n\n> rnorm(10)\n [1] -0.05595019  0.83176199  0.58362601 -0.89434812  0.85722843  0.96199308\n [7]  0.47782706 -0.46322066  0.03525421 -1.00715141\n
    "},{"location":"75_R_basics/#chi-square-distribution","title":"Chi-square distribution","text":"Function Description dchisq(x, df, ncp = 0, log = FALSE) probability density function pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) cumulative density function qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) quantile function rchisq(n, df, ncp = 0) generate random values from normal distribution"},{"location":"75_R_basics/#regression","title":"Regression","text":"
lm(formula, data, subset, weights, na.action,\n   method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n   singular.ok = TRUE, contrasts = NULL, offset, \u2026)\n\n# linear regression\nresults <- lm(formula = y ~ x1 + x2)\n\n# logistic regression (use glm with a binomial family, not lm)\nresults <- glm(formula = y ~ x1 + x2, family = \"binomial\")\n

    Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

    "},{"location":"76_R_resources/","title":"R Resources","text":""},{"location":"80_anaconda/","title":"Anaconda","text":"

    Conda is an open-source package and environment management system.

    It is a very handy tool when you need to manage Python packages.

    "},{"location":"80_anaconda/#download","title":"Download","text":"

    https://www.anaconda.com/products/distribution

    For example, download the Linux version:

    wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh\n

    "},{"location":"80_anaconda/#install","title":"Install","text":"
    # give it permission to execute\nchmod +x Anaconda3-2021.11-Linux-x86_64.sh \n\n# install\nbash ./Anaconda3-2021.11-Linux-x86_64.sh\n

    Follow the instructions at: https://docs.anaconda.com/anaconda/install/linux/

    If everything goes well, you will see (base) before the prompt, which indicates the base environment:

    (base) [heyunye@gc019 ~]$\n

    For how to use conda, please check: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html

    Examples:

    # install a specific version of python package\nconda install pandas==1.5.2\n\n#create a new python 3.9 virtual environment with the name \"mypython39\"\nconda create -n mypython39 python=3.9\n\n#use environment.yml to create a virtual environment\nconda env create --file environment.yml\n\n# activate a virtual environment called ldsc\nconda activate ldsc\n\n# change back to base environment\nconda deactivate\n\n# list all packages in your current environment \nconda list\n\n# list all your current environments \nconda env list\n

    "},{"location":"81_jupyter_notebook/","title":"Jupyter notebook","text":"

    Usually, conda will install Jupyter Notebook (and ipykernel) by default.

    If not, use conda to install it:

    conda install jupyter\n

    "},{"location":"81_jupyter_notebook/#using-jupyter-notebook-on-a-local-or-remote-server","title":"Using Jupyter notebook on a local or remote server","text":""},{"location":"81_jupyter_notebook/#using-the-default-configuration","title":"Using the default configuration","text":""},{"location":"81_jupyter_notebook/#local-machine","title":"Local machine","text":"

    You could open it in the Anaconda interface or some other IDE.

    If using the terminal, just type:

    jupyter-lab --port 9000 &          \n

    Then open the link in the browser.

    http://localhost:9000/lab?token=???\nhttp://127.0.0.1:9000/lab?token=???\n

    "},{"location":"81_jupyter_notebook/#remote-server","title":"Remote server","text":"

    Start Jupyter from the command line of the remote server, specifying a port.

    jupyter-lab --ip 0.0.0.0 --port 9000 --no-browser &\n
    It will generate an address in the same form as above.

    Then, on the local machine, use ssh to forward the port.

    ssh -NfL localhost:9000:localhost:9000 user@host\n
    Note that localhost:9000:localhost:9000 means local_address:local_port:remote_address:remote_port, and user@host is the user ID and address of the remote server.

    When this is done, open the address above in the browser.

    "},{"location":"81_jupyter_notebook/#using-customized-configuration","title":"Using customized configuration","text":"

    Steps:

    "},{"location":"81_jupyter_notebook/#create-the-configuration-file","title":"Create the configuration file","text":"

    Create a Jupyter Notebook configuration file if one does not already exist:

    jupyter notebook --generate-config\n

    The file is usually stored at:

    ~/.jupyter/jupyter_notebook_config.py\n

    The first few lines of the configuration file look like this:

    head ~/.jupyter/jupyter_notebook_config.py\n# Configuration file for jupyter-notebook.\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
    "},{"location":"81_jupyter_notebook/#add-the-port-information","title":"Add the port information","text":"

    Simply add c.NotebookApp.port = 8889 to the configuration file and then save. Note: you can change this to whichever port you want to use.

    # Configuration file for jupyter-notebook.\n\nc.NotebookApp.port = 8889\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n

    "},{"location":"81_jupyter_notebook/#run-jupyter-notebook-server-on-remote-host","title":"Run jupyter notebook server on remote host","text":"

    On the host side, start the Jupyter Notebook server:

    jupyter notebook\n

    "},{"location":"81_jupyter_notebook/#use-ssh-tunnel-to-connect-to-the-remote-server-from-your-local-machine","title":"Use ssh tunnel to connect to the remote server from your local machine","text":"

    On your local machine, use an SSH tunnel to connect to the Jupyter Notebook server:

    ssh -N -f -L localhost:8889:localhost:8889 username@your_remote_host_name\n
    "},{"location":"81_jupyter_notebook/#use-jupyter-notebook-in-your-browser","title":"Use jupyter notebook in your browser","text":"

    Then you can access Jupyter Notebook in your local browser using the link generated by the Jupyter Notebook server. http://127.0.0.1:8889/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    "},{"location":"82_windows_linux_subsystem/","title":"Window Linux Subsystem","text":"

    In this section, we will briefly demonstrate how to install a Linux subsystem on Windows.

    "},{"location":"82_windows_linux_subsystem/#official-documents","title":"Official Documents","text":""},{"location":"82_windows_linux_subsystem/#prerequisites","title":"Prerequisites","text":"

    \"You must be running Windows 10 version 2004 and higher (Build 19041 and higher) or Windows 11.\"

    "},{"location":"82_windows_linux_subsystem/#steps","title":"Steps","text":"

    "},{"location":"83_git_and_github/","title":"Git and Github","text":""},{"location":"83_git_and_github/#git","title":"Git","text":"

    Git is a very powerful version control system. It can track the changes in all the files of your projects and allows collaboration among multiple contributors.

    For details, please check: https://git-scm.com/

    "},{"location":"83_git_and_github/#github","title":"Github","text":"

    Github is an online platform offering cloud-based Git repository hosting.

    https://github.com/

    "},{"location":"83_git_and_github/#create-a-new-id","title":"Create a new id","text":"

    Github signup page:

    https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home

    "},{"location":"83_git_and_github/#clone-a-repository","title":"Clone a repository","text":"

    Syntax: git clone <the url you just copied>

    Example: git clone https://github.com/Cloufield/GWASTutorial.git

    "},{"location":"83_git_and_github/#update-the-current-repository","title":"Update the current repository","text":"

    git pull

    "},{"location":"83_git_and_github/#git-setup","title":"git setup","text":"
    $ git config --global user.name \"myusername\"\n$ git config --global user.email myusername@myemail.com\n
    "},{"location":"83_git_and_github/#create-access-tokens","title":"Create access tokens","text":"

    Please see github official documents on how to create a personal token:

    https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

    Useful Resources

    "},{"location":"84_ssh/","title":"SSH","text":"

    SSH stands for Secure Shell Protocol, which enables you to connect to a remote server securely.

    "},{"location":"84_ssh/#login-to-remote-server","title":"Login to remote server","text":"
    ssh <username>@<host>\n

    Before you log in, you need to generate keys for the SSH connection:

    "},{"location":"84_ssh/#keys","title":"Keys","text":"

    ssh-keygen -t rsa -b 4096\n
    You will get two keys, a public one and a private one.

    Warning

    Don't share your private key with others.

    What you need to do is just add your local public key to ~/.ssh/authorized_keys on the host server.

    "},{"location":"84_ssh/#file-transfer","title":"File transfer","text":"

    Suppose you are using a local machine:

    Download files from the remote host to the local machine

    scp <username>@<host>:remote_path local_path\n

    Upload files from the local machine to the remote host

    scp local_path <username>@<host>:remote_path\n

    Info

    -r : copy recursively. This option is needed when you want to transfer an entire directory.

    Example

    Copy the local work directory to the remote home directory

    $ scp -r /home/gwaslab/work gwaslab@remote.com:/home/gwaslab \n

    "},{"location":"84_ssh/#ssh-tunneling","title":"SSH Tunneling","text":"

    Quote

    In this forwarding type, the SSH client listens on a given port and tunnels any connection to that port to the specified port on the remote SSH server, which then connects to a port on the destination machine. The destination machine can be the remote SSH server or any other machine. https://linuxize.com/post/how-to-setup-ssh-tunneling/

    -L : Local port forwarding

    ssh -L [local_IP:]local_PORT:destination:destination_PORT <username>@<host>\n
    "},{"location":"84_ssh/#other-ssh-options","title":"other SSH options","text":""},{"location":"85_job_scheduler/","title":"Job scheduling system","text":"

    (If needed) Try to use a job scheduling system to run a simple script:

    Two of the most commonly used job scheduling systems:

    "},{"location":"90_Recommended_Reading/","title":"Recommended reading","text":""},{"location":"90_Recommended_Reading/#textbooks","title":"Textbooks","text":"Year Category Reference 2020 Statistical Genetics An Introduction to Statistical Genetic Data Analysis By Melinda C. Mills, Nicola Barban and Felix C. Tropf https://mitpress.mit.edu/books/introduction-statistical-genetic-data-analysis 2019 Statistical Genetics Handbook of Statistical Genomics: Fourth Edition https://onlinelibrary.wiley.com/doi/book/10.1002/9781119487845 2009 Statistical Analysis and Machine Learning The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)introduction-statistical-genetic-data-analysis. Trevor Hastie, Robert Tibshirani, Jerome Friedman. https://hastie.su.domains/ElemStatLearn/ (PDF book is available)"},{"location":"90_Recommended_Reading/#overview-reviews","title":"Overview Reviews","text":"Year Reference Link 2021 Uffelmann, E., Huang, Q. Q., Munung, N. S., De Vries, J., Okada, Y., Martin, A. R., \u2026 & Posthuma, D. (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1(1), 1-21. Pubmed 2019 Tam, V., Patel, N., Turcotte, M., Boss\u00e9, Y., Par\u00e9, G., & Meyre, D. (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484. Pubmed 2017 Pasaniuc, B., & Price, A. L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nature reviews genetics, 18(2), 117-127. Pubmed 2023 Abdellaoui, A., Yengo, L., Verweij, K. J., & Visscher, P. M. (2023). 15 years of GWAS discovery: Realizing the promise. The American Journal of Human Genetics. Pubmed 2017 Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1), 5-22. Pubmed 2005 Hirschhorn, J. N., & Daly, M. J. (2005). Genome-wide association studies for common diseases and complex traits. Nature reviews genetics, 6(2), 95-108. Pubmed 2006 Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nature reviews genetics, 7(10), 781-791. Pubmed 2008 McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5), 356-369. Pubmed 2010 Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature reviews genetics, 11(7), 459-463. Pubmed 2009 Ioannidis, J., Thomas, G., & Daly, M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics, 10(5), 318-329. Pubmed"},{"location":"90_Recommended_Reading/#topic-specific","title":"Topic-specific","text":""},{"location":"90_Recommended_Reading/#ld","title":"LD","text":"Year Reference Link 2008 Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485. Pubmed"},{"location":"90_Recommended_Reading/#imputation","title":"Imputation","text":"Year Reference Link 2010 Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7), 499-511. Pubmed 2018 Das S, Abecasis GR, Browning BL. (2018). 
Genotype Imputation from Large Reference Panels. Annu. Rev. Genomics Hum. Genet. link"},{"location":"90_Recommended_Reading/#heritability","title":"Heritability","text":"Year Reference Link 2017 Yang, J., Zeng, J., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2017). Concepts, estimation and interpretation of SNP-based heritability. Nature genetics, 49(9), 1304-1310. Pubmed 2009 Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., \u2026 & Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461 (7265), 747-753. Pubmed"},{"location":"90_Recommended_Reading/#genetic-correlation","title":"Genetic correlation","text":"Year Reference Link 2019 Van Rheenen, W., Peyrot, W. J., Schork, A. J., Lee, S. H., & Wray, N. R. (2019). Genetic correlations of polygenic disease traits: from theory to practice. Nature Reviews Genetics, 20(10), 567-581. Pubmed"},{"location":"90_Recommended_Reading/#fine-mapping","title":"Fine-mapping","text":"Year Reference Link 2019 Schaid, D. J., Chen, W., & Larson, N. B. (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics, 19(8), 491-504. Pubmed 2023 \u738b \u9752\u6ce2, \u30b2\u30ce\u30e0\u30ef\u30a4\u30c9\u95a2\u9023\u89e3\u6790\u306e\u305d\u306e\u5148\u3078\uff1a\u7d71\u8a08\u7684fine-mapping\u306e\u57fa\u790e\u3068\u767a\u5c55, JSBi Bioinformatics Review, 2023, 4 \u5dfb, 1 \u53f7, p. 35-51 J-STAGE ### Polygenic risk scores Year Reference Link 2022 Wang, Y., Tsuo, K., Kanai, M., Neale, B. M., & Martin, A. R. (2022). Challenges and opportunities for developing more generalizable polygenic risk scores. Annual review of biomedical data science. link 2020 Choi, S. W., Mak, T. S. H., & O\u2019Reilly, P. F. (2020). Tutorial: a guide to performing polygenic risk score analyses. Nature protocols, 15(9), 2759-2772. Pubmed 2019 Martin, A. R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B. M., & Daly, M. J. (2019). Clinical use of current polygenic risk scores may exacerbate health disparities. Nature genetics, 51(4), 584-591. Pubmed"},{"location":"90_Recommended_Reading/#rare-variants","title":"Rare variants","text":"Year Reference Link 2014 Lee, S., Abecasis, G. R., Boehnke, M., & Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1), 5-23. Pubmed 2015 Auer, P. L., & Lettre, G. (2015). Rare variant association studies: considerations, challenges and opportunities. Genome medicine, 7(1), 1-11. Pubmed"},{"location":"90_Recommended_Reading/#genetic-architecture","title":"Genetic architecture","text":"Year Reference Link 2018 Timpson, N. J., Greenwood, C. M., Soranzo, N., Lawson, D. J., & Richards, J. B. (2018). Genetic architecture: the shape of the genetic contribution to human traits and disease. Nature Reviews Genetics, 19(2), 110-124. 
Pubmed"},{"location":"90_Recommended_Reading/#useful-websites","title":"Useful Websites","text":"Description Link A Bioinformatician's UNIX Toolbox http://lh3lh3.users.sourceforge.net/biounix.shtml Osaka university, Department of Statistical Genetics Homepage http://www.sg.med.osaka-u.ac.jp/school_2021.html Genome analysis wiki (Abecasis Group Wiki) https://genome.sph.umich.edu/wiki/Main_Page EPI 511, Advanced Population and Medical Genetics(Alkes Price, Harvard School of Public Health) https://alkesgroup.broadinstitute.org/EPI511 fiveMinuteStats(Matthew Stephens, Statistics and Human Genetics at the University of Chicago) https://stephens999.github.io/fiveMinuteStats Course homepage and digital textbook for Human Genome Variation with Computational Lab https://mccoy-lab.github.io/hgv_modules/"},{"location":"90_Recommended_Reading/#_1","title":"\u548c\u6587","text":"Year Category Reference 2015 Linux \u65b0\u3057\u3044Linux\u306e\u6559\u79d1\u66f8 \u5358\u884c\u672c \u2013 2015/6/6 \u4e09\u5b85 \u82f1\u660e (\u8457), \u5927\u89d2 \u7950\u4ecb (\u8457) 2012 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u306f\u3058\u3081\u3066\u306e\u30d1\u30bf\u30fc\u30f3\u8a8d\u8b58 \u5358\u884c\u672c\uff08\u30bd\u30d5\u30c8\u30ab\u30d0\u30fc\uff09 \u2013 2012/7/31 \u5e73\u4e95 \u6709\u4e09 (\u8457) 1991 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u7d71\u8a08\u5b66\u5165\u9580 (\u57fa\u790e\u7d71\u8a08\u5b66\u2160) \u5358\u884c\u672c \u2013 1991/7/9 \u6771\u4eac\u5927\u5b66\u6559\u990a\u5b66\u90e8\u7d71\u8a08\u5b66\u6559\u5ba4 (\u7de8\u96c6) 1992 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u81ea\u7136\u79d1\u5b66\u306e\u7d71\u8a08\u5b66 (\u57fa\u790e\u7d71\u8a08\u5b66) \u5358\u884c\u672c \u2013 1992/8/1 \u6771\u4eac\u5927\u5b66\u6559\u990a\u5b66\u90e8\u7d71\u8a08\u5b66\u6559\u5ba4 (\u7de8\u96c6) 2012 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u30c7\u30fc\u30bf\u89e3\u6790\u306e\u305f\u3081\u306e\u7d71\u8a08\u30e2\u30c7\u30ea\u30f3\u30b0\u5165\u9580\u2015\u2015\u4e00\u822c\u5316\u7dda\u5f62\u30e2\u30c7\u30eb\u30fb\u968e\u5c64\u30d9\u30a4\u30ba\u30e2\u30c7\u30eb\u30fbMCMC (\u78ba\u7387\u3068\u60c5\u5831\u306e\u79d1\u5b66) \u5358\u884c\u672c \u2013 2012/5/19 \u4e45\u4fdd \u62d3\u5f25 (\u8457) 2015 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u907a\u4f1d\u7d71\u8a08\u5b66\u5165\u9580 (\u5ca9\u6ce2\u30aa\u30f3\u30c7\u30de\u30f3\u30c9\u30d6\u30c3\u30af\u30b9) \u30aa\u30f3\u30c7\u30de\u30f3\u30c9 (\u30da\u30fc\u30d1\u30fc\u30d0\u30c3\u30af) \u2013 2015/12/10 \u938c\u8c37 \u76f4\u4e4b (\u8457) 2020 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u5b9f\u9a13\u533b\u5b66 2020\u5e743\u6708 Vol.38 No.4 GWAS\u3067\u8907\u96d1\u5f62\u8cea\u3092\u89e3\u304f\u305e! 
\u301c\u591a\u56e0\u5b50\u75be\u60a3\u30fb\u5f62\u8cea\u306e\u30d0\u30a4\u30aa\u30ed\u30b8\u30fc\u306b\u6311\u3080\u6b21\u4e16\u4ee3\u306e\u30b2\u30ce\u30e0\u533b\u79d1\u5b66 \u5358\u884c\u672c \u2013 2020/2/23 \u938c\u8c37 \u6d0b\u4e00\u90ce (\u8457) 2020 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u30bc\u30ed\u304b\u3089\u5b9f\u8df5\u3059\u308b \u907a\u4f1d\u7d71\u8a08\u5b66\u30bb\u30df\u30ca\u30fc\u301c\u75be\u60a3\u3068\u30b2\u30ce\u30e0\u3092\u7d50\u3073\u3064\u3051\u308b \u5358\u884c\u672c \u2013 2020/3/13 \u5ca1\u7530 \u968f\u8c61 (\u8457) ~ \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u907a\u4f1d\u5b50\u533b\u5b66 \u30b7\u30ea\u30fc\u30ba\u4f01\u753b Statistical Genetics\u3000\u3008\u907a\u4f1d\u7d71\u8a08\u5b66\u306e\u57fa\u790e\u3009 - \u938c\u8c37 \u6d0b\u4e00\u90ce + \u03b1"},{"location":"95_Assignment/","title":"Self training","text":""},{"location":"95_Assignment/#pca-using-1000-genome-project-dataset","title":"PCA using 1000 Genome Project Dataset","text":"

    In this self-learning module, we would like you to get hands-on experience with the 1000 Genomes Project data and apply the skills you have learned in this mini-project.

    Aim

    Aim:

    1. Download 1000 Genome VCF files.
    2. Perform PCA using 1000 Genome samples.
    3. Plot the PCs of these individuals.
    4. Interpret the results.

    Here is a brief overview of this mini project.

    The ultimate goal of this assignment is simple: to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.

    Tip

    Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.

    "},{"location":"95_Assignment/#1-download-the-publicly-available-1000-genome-vcf","title":"1. Download the publicly available 1000 Genome VCF","text":"

    Download the files we need from the 1000 Genomes Project FTP site:

    1. Autosome VCF files
    2. Ancestry information file
    3. Reference genome sequence
    4. Strict mask

    Tip

    Note

    If it takes too long or if you are using your local laptop, you can just download the files for chr1.

    Sample shell script for downloading the files

    #!/bin/bash\nfor chr in $(seq 1 22)  #Note: If it takes too long, you can download just chr1.\ndo\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi\ndone\n\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai\n\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed\n
    "},{"location":"95_Assignment/#2-re-align-normalize-and-remove-duplication","title":"2. Re-align, normalize and remove duplication","text":"

    We need to use bcftools to process the raw vcf files.

    Install bcftools

    http://www.htslib.org/download/

    Since the variants are not normalized and also have many duplications, we need to clean the vcf files.

    Re-align with the reference genome, normalize variants and remove duplications

    #!/bin/bash\nfor chr in $(seq 1 22)\ndo\n    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \\\n      ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \\\n      bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \\\n        bcftools norm -Ob --rm-dup both \\\n          > ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \n    bcftools index ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf\ndone\n
    "},{"location":"95_Assignment/#3-convert-vcf-files-to-plink-binary-format","title":"3. Convert VCF files to plink binary format","text":"

    Example

    #!/bin/bash\nfor chr in $(seq 1 22)\ndo\nplink \\\n      --bcf ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \\\n      --keep-allele-order \\\n      --vcf-idspace-to _ \\\n      --const-fid \\\n      --allow-extra-chr 0 \\\n      --split-x b37 no-fail \\\n      --make-bed \\\n      --out ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes\ndone\n
    "},{"location":"95_Assignment/#4-using-snps-only-in-strict-masks","title":"4. Using SNPs only in strict masks","text":"

    Strict masks are in this directory.

    Strict mask

    Regions overlapping this mask are \u201ccallable\u201d (i.e., contain credible variant calls). This mask was developed in the main 1KG paper and is well explained at https://www.biostars.org/p/219634/

    Tip

    Use the plink --make-set option with the BED files to extract SNPs in the strict mask.

    "},{"location":"95_Assignment/#5-qc-it-and-prune-it-to-100k-variants","title":"5. QC it and prune it to ~ 100K variants.","text":"

    Tip

    Use PLINK.

    QC: only SNPs (exclude indels), MAF>0.1

    Pruning: plink --indep-pairwise

    "},{"location":"95_Assignment/#6-perform-pca","title":"6. Perform PCA","text":"

    Tip

    plink --pca

    "},{"location":"95_Assignment/#7-visualization-and-interpretation","title":"7. Visualization and interpretation.","text":"

    Draw a PC1 - PC2 plot and color each individual by ancestry information (from the ALL.panel file). Interpret the result.

    Tip

    You can use R, Python, or any other tool you like (even Excel can do the job); a minimal R sketch is given below.

    (If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)
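    For instance, a minimal R sketch (assuming plink --pca wrote plink.eigenvec with FID and IID followed by 20 PC columns, and that the .ALL.panel file from step 1 is in the working directory; adjust file names to your own outputs):

    eigenvec <- read.table(\"plink.eigenvec\", col.names=c(\"FID\",\"IID\",paste0(\"PC\",1:20)))\npanel <- read.table(\"integrated_call_samples_v3.20130502.ALL.panel\", header=TRUE)\n\n# attach the super-population label of each individual\nmerged <- merge(eigenvec, panel, by.x=\"IID\", by.y=\"sample\")\n\npops <- as.factor(merged$super_pop)\nplot(merged$PC1, merged$PC2, col=pops, xlab=\"PC1\", ylab=\"PC2\")\nlegend(\"topright\", legend=levels(pops), col=seq_along(levels(pops)), pch=1)\n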

    "},{"location":"95_Assignment/#checklist","title":"Checklist","text":""},{"location":"95_Assignment/#reference","title":"Reference","text":""},{"location":"96_Assignment2/","title":"The final presentation for \u57fa\u790e\u6f14\u7fd2II","text":"

    Note

    "},{"location":"96_Assignment2/#outline","title":"Outline","text":"

    (Just an example, there is no need to strictly follow this.)

    "},{"location":"99_About/","title":"GWAS Tutorial - Fundamental Exercise II","text":"

    This tutorial is provided by the Laboratory of Complex Trait Genomics (Kamatani Lab) in the Department of Computational Biology and Medical Sciences at the University of Tokyo. This tutorial is designed for the graduate course Fundamental Exercise II.

    "},{"location":"99_About/#main-contributors","title":"Main Contributors","text":""},{"location":"99_About/#contact-us","title":"Contact Us","text":"

    This repository is currently maintained by Yunye He.

    If you have any questions or suggestions, please feel free to contact gwaslab@gmail.com.

    Enjoy this real \"Manhattan plot\"!

    "},{"location":"Imputation/","title":"Imputation","text":"

    Missing data imputation is not a task specific to genetic studies. By comparing the genotyping array (generally 500k\u20131M markers) with a reference panel (WGSed), the markers missing from the array are filled in. Generic tabular-data imputation methods could in principle be used to impute genotype data. However, haplotypes coalesce from common ancestors, and because of the recombination events during gametogenesis, each individual's haplotype is a mosaic of the haplotypes in a population. Given these properties, hidden Markov model (HMM) based methods usually outperform tabular-data-based ones.

    This HMM was first described in Li & Stephens 2003. Here we will not go through the many tools developed over the past 20 years; instead, we will introduce the concept and the usage of Minimac.

    "},{"location":"Imputation/#figure-illustration","title":"Figure illustration","text":"

    In the figure, each row in the top panel represents a reference haplotype. The middle panel shows the genotyping array. Genotyped markers are drawn as squares and WGS-only markers as circles. The two colors represent the ref and alt alleles; you could also think of them as different haplotype fragments. The red triangles indicate recombination hot spots, at which a crossover between the reference haplotypes is more likely to happen.

    Given the genotyped markers, matching probabilities are calculated for all potential paths through the reference haplotypes. Then, in this example (the real case is not this simple), we assume free recombination at each recombination hotspot. You will see that all paths chained by dark blue match 2 of the 4 genotyped markers, so these paths have equal probability.

    Finally, missing markers are filled with the probability-weighted alleles on each path. For the left three circles, two paths are cyan and one path is orange, so the imputation result will be 1/3 orange and 2/3 cyan.

    "},{"location":"Imputation/#how-to-do-imputation","title":"How to do imputation","text":"

    The simplest way, if you don't have your own WGS reference data, is to use the Michigan or TOPMed imputation server. Just prepare your VCF, submit it to the server, and select the preferred reference panel. The server has built-in phasing, liftover, and QC, but we would strongly suggest checking the data and doing these steps yourself. For example:

    Another way is to run the job locally. Recent tools are memory- and computation-efficient, so you may run them on a small in-house server or even a PC.

    A typical workflow of Minimac is:

    Parameter estimation (this step will create an m3vcf reference panel file):

    Minimac3 \\\n  --refHaps ./phased_reference.vcf.gz \\\n  --processReference \\\n  --prefix ./phased_reference \\\n  --log\n

    Imputation:

    minimac4 \\\n  --refHaps ./phased_reference.m3vcf \\\n  --haps ./phased_target.vcf.gz \\\n  --prefix ./result \\\n  --format GT,DS,HDS,GP,SD \\\n  --meta \\\n  --log \\\n  --cpus 10\n

    Details of the options.

    "},{"location":"Imputation/#after-imputation","title":"After imputation","text":"

    The output is a VCF file. First, we need to examine the imputation quality. That can be a long story and we will not explain it in detail here. Most of the time, the following criterion works well:

    The standard imputation quality metric, named Rsq, efficiently discriminates well-imputed variants at a threshold of 0.7 (you may loosen it to 0.3 to allow more variants into the GWAS).
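    As a sketch, Minimac-style info files report Rsq per variant; assuming a file result.info with SNP and Rsq columns (the file and column names here are assumptions; adjust to your own output), a list of well-imputed variants can be extracted in R:

    # hypothetical file name; adjust to your own Minimac output\ninfo <- read.table(\"result.info\", header=TRUE)\n\n# coerce Rsq to numeric in case of non-numeric placeholders\ninfo$Rsq <- suppressWarnings(as.numeric(as.character(info$Rsq)))\n\n# keep well-imputed variants (loosen 0.7 to 0.3 to allow more variants)\nwell_imputed <- subset(info, Rsq > 0.7)\n\nwrite.table(well_imputed$SNP, \"wellimputed.snplist\", quote=FALSE, row.names=FALSE, col.names=FALSE)\n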

    "},{"location":"Imputation/#before-gwas","title":"Before GWAS","text":"

    Three types of genotypes are widely used in GWAS -- best-guess genotype, allelic dosage, and genotype probability. Using dosage (DS) keeps the dataset smallest, and most association test software requires only this information.

    "},{"location":"PRS_evaluation/","title":"Polygenic risk scores evaluation","text":""},{"location":"PRS_evaluation/#regressions-for-evaluation-of-prs","title":"Regressions for evaluation of PRS","text":"\\[Phenotype \\sim PRS_{phenotype} + Covariates\\] \\[logit(P) \\sim PRS_{phenotype} + Covariates\\]

    Covariates usually include sex, age, and the top 10 PCs.
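    As a sketch in R (the data frame and its column names here are hypothetical):

    # quantitative phenotype\nlm(pheno ~ prs + sex + age + PC1 + PC2 + PC3 + PC4 + PC5, data=mydata)   # include up to PC10 as in the text\n\n# binary phenotype: logistic regression\nglm(pheno ~ prs + sex + age + PC1 + PC2 + PC3 + PC4 + PC5, family=binomial, data=mydata)\n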

    "},{"location":"PRS_evaluation/#evaluation","title":"Evaluation","text":""},{"location":"PRS_evaluation/#roc-aic-auc-and-c-index","title":"ROC, AIC, AUC, and C-index","text":"

    ROC

    ROC: the receiver operating characteristic curve shows the performance of a classification model at all classification thresholds.

    AUC

    AUC: area under the ROC Curve, a common measure for the performance of a classification model.

    AIC

    Akaike Information Criterion (AIC): a measure for comparing different statistical models.

    \\[AIC = 2k - 2ln(\\hat{L})\\]

    C-index

    C-index: Harrell\u2019s C-index (concordance index) is a metric to evaluate the predictive performance of models and is commonly used in survival analysis. It measures the probability that, for two randomly selected individuals \(i\) and \(j\), the scores \(M_i\) and \(M_j\) predicted by a model are in the reverse relative order of their true event times \(T_i, T_j\).

    \\[ C = Pr (M_j > M_i | T_j < T_i) \\]

    Interpretation: individuals with higher scores should have higher risks of the disease event.
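    A minimal sketch of computing Harrell's C-index in R, assuming the survival package is available (the simulated data are purely illustrative):

    library(survival)\n\nset.seed(1)\ntime <- rexp(100)                    # event times\nstatus <- rbinom(100, 1, 0.7)        # event indicators\nscore <- -log(time) + rnorm(100)     # higher score ~ earlier event\n\n# reverse=TRUE: larger scores are taken to predict shorter times (higher risk)\nconcordance(Surv(time, status) ~ score, reverse=TRUE)\n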

    "},{"location":"PRS_evaluation/#r2-and-pseudo-r2","title":"R2 and pseudo-R2","text":"

    Coefficient of determination

    \\(R^2\\) : coefficient of determination, which measures the amount of variance explained by the regression model.

    In linear regression:

    \\[ R^2 = 1 - {{RSS}\\over{TSS}} \\]

    Pseudo-R2 (Nagelkerke)

    In logistic regression,

    One of the most commonly used Pseudo-R2 for PRS analysis is Nagelkerke's \\(R^2\\)

    \\[R^2_{Nagelkerke} = {{1 - ({{L_0}\\over{L_M}})^{2/n}}\\over{1 - L_0^{2/n}}}\\] "},{"location":"PRS_evaluation/#r2-on-the-liability-scale-lee","title":"R2 on the liability scale (Lee)","text":"

    R2 on liability scale

    \\(R^2\\) on the liability scale for ascertained case-control studies

    \\[ R^2_l = {{R_o^2 C}\\over{1 + R_o^2 \\theta C }} \\]

    Reference : Lee, S. H., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2012). A better coefficient of determination for genetic profile analysis. Genetic epidemiology, 36(3), 214-224.

    The authors also provided R code for the calculation (unrelated code has been removed for simplicity):

    # R2 on the liability scale using the transformation\n\n# nt = total number of the sample\n# ncase = number of cases\n# ncont = number of controls\n# thd = the threshold on the normal distribution which truncates the proportion of disease prevalence\n# K = population prevalence\n# P = proportion of cases in the case-control samples\n\n#threshold\nthd = -qnorm(K,0,1)\n\n#value of standard normal density function at thd\nzv = dnorm(thd) \n\n#mean liability for case\nmv = zv/K \n\n#linear model\nlmv = lm(y ~ g) \n\n#R2O : R2 on the observed scale\nR2O = var(lmv$fitted.values)/(ncase/nt*ncont/nt)\n\n# calculate correction factors\ntheta = mv*(P-K)/(1-K)*(mv*(P-K)/(1-K)-thd) \ncv = K*(1-K)/zv^2*K*(1-K)/(P*(1-P)) \n\n# convert to R2 on the liability scale\nR2 = R2O*cv/(1+R2O*theta*cv)\n
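    A toy setup for the inputs used by the code above (purely illustrative; in practice K comes from the literature for your disease, and y and g come from your data):

    set.seed(1)\nncase <- 500\nncont <- 500\nnt <- ncase + ncont\nK <- 0.01                               # assumed population prevalence\nP <- ncase / nt                         # proportion of cases in the sample\n\ny <- c(rep(1, ncase), rep(0, ncont))    # case-control status\ng <- 0.2 * y + rnorm(nt)                # a genetic score weakly associated with y\n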
    "},{"location":"PRS_evaluation/#bootstrap-confidence-interval-methods-for-r2","title":"Bootstrap Confidence Interval Methods for R2","text":"

    Bootstrapping is a commonly used resampling method that generates a sampling distribution from the known sample dataset by repeatedly taking random samples from it with replacement.

    Steps:

    The percentile bootstrap interval is then defined as the interval between the \(100 \times \alpha /2\) and \(100 \times (1 - \alpha /2)\) percentiles of the parameter estimates obtained by bootstrapping. We can use this method to estimate a bootstrap interval for \(R^2\).
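    A minimal sketch of a percentile bootstrap interval for the \(R^2\) of a linear model (simulated data, \(\alpha = 0.05\)):

    set.seed(1)\nn <- 200\nx <- rnorm(n)\ny <- 0.5 * x + rnorm(n)\n\nB <- 1000\nr2 <- numeric(B)\nfor (b in 1:B) {\n    idx <- sample(n, replace=TRUE)                     # resample with replacement\n    r2[b] <- summary(lm(y[idx] ~ x[idx]))$r.squared\n}\n\n# 95% percentile bootstrap interval\nquantile(r2, c(0.025, 0.975))\n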

    "},{"location":"PRS_evaluation/#reference","title":"Reference","text":""},{"location":"Phasing/","title":"Phasing","text":"

    The human genome is diploid. The distribution of variants between homologous chromosomes can affect the interpretation of genotype data, for example in allele-specific expression, context-informed annotation, and loss-of-function compound heterozygous events.

    Example

    ( SHAPEIT5 )

    In the above illustration, when LoF variants are on both copies of a gene, the gene is considered knocked out.

    Trio data and long-read sequencing can solve the haplotyping problem, but they are not always available. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are independent, the unphased genotype can be constructed as the sum of two haplotypes.

    Recent methods have incorporated long IBD sharing, local haplotypes, etc., to make phasing tractable for large datasets. You could read about the following methods if you are interested.

    "},{"location":"Phasing/#how-to-do-phasing","title":"How to do phasing","text":"

    In most cases, phasing is just a preparatory step for imputation, and we do not care much about how the phasing itself goes. But there are several considerations, such as reference-based versus reference-free phasing, large versus small sample sizes, and the rare-variant cutoff. There is no single method that best fits all cases.

    Here I show one example using EAGLE2.

    eagle \\\n    --vcf=target.vcf.gz \\\n    --geneticMapFile=genetic_map_hg19_withX.txt.gz \\\n    --chrom=19 \\\n    --outPrefix=target.eagle \\\n    --numThreads=10\n
    "},{"location":"TwoSampleMR/","title":"TwoSampleMR Tutorial","text":"In\u00a0[1]: Copied!
    library(data.table)\nlibrary(TwoSampleMR)\n
    TwoSampleMR version 0.5.6 \n[>] New: Option to use non-European LD reference panels for clumping etc\n[>] Some studies temporarily quarantined to verify effect allele\n[>] See news(package='TwoSampleMR') and https://gwas.mrcieu.ac.uk for further details\n\n\n
    exp_raw <- fread(\"koges_bmi.txt.gz\")\n\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_raw$phenotype <- \"BMI\"\n\nexp_raw$n <- 72282\n\nexp_dat <- format_data( exp_raw,\n    type = \"exposure\",\n    snp_col = \"rsids\",\n    beta_col = \"beta\",\n    se_col = \"sebeta\",\n    effect_allele_col = \"alt\",\n    other_allele_col = \"ref\",\n    eaf_col = \"af\",\n    pval_col = \"pval\",\n    phenotype_col = \"phenotype\",\n    samplesize_col= \"n\"\n)\nclumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")\n
    exp_raw <- fread(\"koges_bmi.txt.gz\") exp_raw <- subset(exp_raw,exp_raw$pval<5e-8) exp_raw$phenotype <- \"BMI\" exp_raw$n <- 72282 exp_dat <- format_data( exp_raw, type = \"exposure\", snp_col = \"rsids\", beta_col = \"beta\", se_col = \"sebeta\", effect_allele_col = \"alt\", other_allele_col = \"ref\", eaf_col = \"af\", pval_col = \"pval\", phenotype_col = \"phenotype\", samplesize_col= \"n\" ) clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")
    Warning message in .fun(piece, ...):\n\u201cDuplicated SNPs present in exposure data for phenotype 'BMI. Just keeping the first instance:\nrs4665740\nrs7201608\n\u201d\nAPI: public: http://gwas-api.mrcieu.ac.uk/\n\nPlease look at vignettes for options on running this locally if you need to run many instances of this command.\n\nClumping rvi6Om, 2452 variants, using EAS population reference\n\nRemoving 2420 of 2452 variants due to LD with other variants or absence from LD reference panel\n\n
    out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n                    select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\"))\n\nout_raw$phenotype <- \"T2D\"\n\nout_dat <- format_data( out_raw,\n    type = \"outcome\",\n    snp_col = \"SNPID\",\n    beta_col = \"BETA\",\n    se_col = \"SE\",\n    effect_allele_col = \"Allele2\",\n    other_allele_col = \"Allele1\",\n    pval_col = \"p.value\",\n    phenotype_col = \"phenotype\",\n    samplesize_col= \"n\",\n    eaf_col=\"AF_Allele2\"\n)\n
    out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\", select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\")) out_raw$phenotype <- \"T2D\" out_dat <- format_data( out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", se_col = \"SE\", effect_allele_col = \"Allele2\", other_allele_col = \"Allele1\", pval_col = \"p.value\", phenotype_col = \"phenotype\", samplesize_col= \"n\", eaf_col=\"AF_Allele2\" )
    Warning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201ceffect_allele column has some values that are not A/C/T/G or an indel comprising only these characters or D/I. These SNPs will be excluded.\u201d\nWarning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201cThe following SNP(s) are missing required information for the MR tests and will be excluded\n1:1142714:t:<cn0>\n1:4288465:t:<ins:me:alu>\n1:4882232:t:<cn0>\n1:5172414:g:<cn0>\n1:5173809:t:<cn0>\n1:5934301:g:<ins:me:alu>\n1:6814818:a:<ins:me:alu>\n1:7921468:c:<cn2>\n1:8502010:t:<ins:me:alu>\n1:8924066:c:<cn0>\n1:9171841:c:<cn0>\n1:9403667:a:<cn2>\n1:9595360:a:<cn0>\n1:9846036:c:<cn0>\n1:10067190:g:<cn0>\n1:10482499:g:<cn0>\n1:11682873:t:<cn0>\n1:11830220:t:<ins:me:sva>\n1:11988599:c:<cn0>\n1:12475666:t:<ins:me:sva>\n1:12737575:a:<ins:me:alu>\n1:12842004:a:<cn0>\n1:14437074:t:<cn0>\n1:14437868:a:<cn0>\n1:14713511:t:<cn2>\n1:14735732:g:<cn0>\n1:15343948:g:<cn0>\n1:16151682:c:<cn0>\n1:16329336:t:<ins:me:sva>\n1:16358741:g:<cn0>\n1:17676165:a:<cn0>\n1:19486410:c:<ins:me:alu>\n1:19855608:a:<cn2>\n1:20257109:t:<ins:me:alu>\n1:20310746:g:<cn0>\n1:20496899:c:<cn0>\n1:20497183:c:<cn0>\n1:20864015:t:<cn0>\n1:20944751:c:<ins:me:alu>\n1:21346279:a:<cn0>\n1:21492591:c:<ins:me:alu>\n1:21786418:t:<cn0>\n1:22302473:t:<cn0>\n1:22901908:t:<ins:me:alu>\n1:23908383:g:<cn0>\n1:24223580:g:<cn0>\n1:24520350:g:<cn0>\n1:24804603:c:<cn0>\n1:25055152:g:<cn0>\n1:26460095:a:<cn0>\n1:26961278:g:<cn0>\n1:29373390:t:<ins:me:alu>\n1:31090520:t:<ins:me:alu>\n1:31316259:t:<cn0>\n1:31720009:a:<cn0>\n1:32535965:g:<cn0>\n1:32544371:a:<cn0>\n1:33785116:c:<cn0>\n1:35101427:c:<cn0>\n1:35177287:g:<cn0>\n1:35627104:t:<cn0>\n1:36474694:t:<ins:me:alu>\n1:36733282:t:<cn0>\n1:37215810:a:<ins:me:alu>\n1:37816478:a:<cn0>\n1:38132306:t:<cn0>\n1:39084231:a:<cn0>\n1:39677675:t:<ins:me:alu>\n1:40524704:t:<ins:me:alu>\n1:40552356:a:<cn0>\n1:40976681:g:<cn0>\n1:41021684:a:<cn0>\n1:41785500:a:<ins:me:line1>\n1:42390318:c:<ins:me:alu>\n1:43694061:t:<cn0>\n1:44059290:a:<inv>\n1:45021223:t:<cn0>\n1:45708588:a:<cn0>\n1:45822649:t:<cn0>\n1:46333195:a:<ins:me:alu>\n1:46794814:t:<ins:me:alu>\n1:47267517:t:<cn0>\n1:47346571:a:<cn0>\n1:47623401:a:<cn0>\n1:47913001:t:<cn0>\n1:48820285:t:<ins:me:alu>\n1:48972537:g:<ins:me:alu>\n1:49357693:t:<ins:me:alu>\n1:49428756:t:<ins:me:line1>\n1:49861993:g:<ins:me:alu>\n1:50912662:c:<ins:me:alu>\n1:51102445:t:<cn0>\n1:52146313:a:<cn0>\n1:53594175:t:<cn0>\n1:53595112:c:<cn0>\n1:55092043:g:<cn0>\n1:55341923:c:<cn0>\n1:55342224:g:<cn0>\n1:55927718:a:<cn0>\n1:56268665:t:<ins:me:line1>\n1:56405404:t:<ins:me:line1>\n1:56879062:t:<ins:me:alu>\n1:57100960:t:<ins:me:sva>\n1:57208746:a:<cn0>\n1:58722032:t:<cn2>\n1:58743910:a:<cn0>\n1:58795378:a:<cn0>\n1:59205317:t:<ins:me:alu>\n1:59591483:t:<ins:me:alu>\n1:59871876:t:<ins:me:alu>\n1:60046725:a:<cn0>\n1:60048628:c:<cn0>\n1:60470604:t:<ins:me:alu>\n1:60487912:t:<cn0>\n1:60715714:t:<ins:me:line1>\n1:61144594:c:<ins:me:alu>\n1:62082822:a:<cn0>\n1:62113386:c:<cn0>\n1:62479250:t:<cn0>\n1:62622902:g:<cn0>\n1:62654739:c:<cn0>\n1:63841704:c:<ins:me:alu>\n1:64720497:a:<cn0>\n1:64850193:a:<ins:me:sva>\n1:65346960:t:<ins:me:alu>\n1:65412505:a:<cn0>\n1:68375746:a:<cn0>\n1:70061670:g:<ins:me:alu>\n1:70091056:t:<ins:me:alu>\n1:70093557:c:<ins:me:alu>\n1:70412360:t:<ins:me:alu>\n1:70424730:t:<cn2>\n1:70820401:t:<cn0>\n1:70912433:g:<ins:me:alu>\n1:72449620:a:<cn0>\n1:72755694:t:<cn0>\n1:72766343:t:<cn0>\n1:72778537:g:<cn0>\n1:73
092779:c:<cn2>\n1:74312425:a:<cn0>\n1:75148055:t:<ins:me:alu>\n1:75192907:c:<ins:me:line1>\n1:75301685:t:<ins:me:alu>\n1:75557174:c:<ins:me:alu>\n1:76392967:t:<ins:me:alu>\n1:76416074:a:<ins:me:alu>\n1:76900598:c:<cn0>\n1:77577928:t:<ins:me:alu>\n1:77634327:a:<ins:me:alu>\n1:77764994:t:<ins:me:alu>\n1:77830614:t:<cn0>\n1:78446240:c:<ins:me:sva>\n1:78607067:t:<ins:me:alu>\n1:78649157:a:<cn0>\n1:78800902:t:<ins:me:line1>\n1:79108845:t:<ins:me:alu>\n1:79331208:c:<ins:me:alu>\n1:79582082:t:<ins:me:alu>\n1:79855600:c:<cn0>\n1:80221781:t:<cn0>\n1:80299106:t:<ins:me:alu>\n1:80504615:t:<cn0>\n1:80554065:t:<cn0>\n1:80955976:t:<ins:me:line1>\n1:81422415:c:<cn0>\n1:82312054:g:<ins:me:alu>\n1:82850409:g:<ins:me:alu>\n1:83041946:t:<cn0>\n1:84056670:a:<cn0>\n1:84388330:g:<cn0>\n1:84517858:a:<cn0>\n1:84712009:g:<cn0>\n1:84913274:c:<ins:me:alu>\n1:85293152:g:<ins:me:alu>\n1:85620127:t:<ins:me:alu>\n1:85910957:g:<cn0>\n1:86400829:t:<cn0>\n1:86696940:a:<ins:me:alu>\n1:87064962:c:<cn2>\n1:87096974:c:<cn0>\n1:87096990:t:<cn0>\n1:88813625:t:<ins:me:alu>\n1:89209563:t:<ins:me:alu>\n1:89733616:t:<ins:me:line1>\n1:89811425:g:<cn0>\n1:90370569:t:<ins:me:alu>\n1:90914512:g:<ins:me:line1>\n1:91878937:g:<cn0>\n1:92131841:g:<inv>\n1:92232051:t:<cn0>\n1:93291972:c:<cn0>\n1:93498232:t:<ins:me:alu>\n1:94288372:c:<cn0>\n1:95192010:a:<ins:me:line1>\n1:95342701:g:<ins:me:alu>\n1:95522242:t:<cn0>\n1:97458273:t:<inv>\n1:98605297:t:<ins:me:alu>\n1:99610528:a:<ins:me:alu>\n1:99698454:g:<ins:me:alu>\n1:100355940:a:<ins:me:alu>\n1:100645536:g:<ins:me:alu>\n1:100994221:g:<ins:me:alu>\n1:101693230:t:<cn0>\n1:101695346:a:<cn0>\n1:101770067:g:<ins:me:alu>\n1:101978980:t:<ins:me:line1>\n1:102568923:g:<ins:me:line1>\n1:102920544:t:<ins:me:alu>\n1:103054499:t:<ins:me:alu>\n1:104359763:g:<cn0>\n1:104443176:t:<cn0>\n1:104574487:t:<ins:me:alu>\n1:105054083:t:<ins:me:alu>\n1:105070244:c:<ins:me:alu>\n1:105138650:t:<ins:me:alu>\n1:105231111:t:<ins:me:alu>\n1:105832823:g:<cn0>\n1:106015797:t:<cn0>\n1:106978443:t:<cn0>\n1:107896853:g:<cn0>\n1:107949843:t:<ins:me:alu>\n1:108142479:t:<ins:me:alu>\n1:108369370:a:<cn0>\n1:108402972:a:<cn0>\n1:109366972:g:<cn0>\n1:109573240:a:<cn0>\n1:110187159:a:<cn0>\n1:110225019:c:<cn0>\n1:111013750:a:<cn0>\n1:111472607:g:<cn0>\n1:111802597:g:<ins:me:sva>\n1:111827762:a:<cn0>\n1:111896187:c:<ins:me:sva>\n1:112032284:t:<ins:me:alu>\n1:112123691:t:<ins:me:alu>\n1:112691740:a:<cn0>\n1:112736007:a:<ins:me:alu>\n1:112992009:t:<ins:me:alu>\n1:113799625:g:<cn0>\n1:114925678:t:<cn0>\n1:115178042:c:<cn0>\n1:116229468:c:<cn0>\n1:116983571:t:<ins:me:alu>\n1:117593370:a:<cn0>\n1:119526940:a:<cn0>\n1:119553366:c:<ins:me:line1>\n1:120012853:a:<cn0>\n1:152555495:g:<cn0>\n1:152643788:a:<cn0>\n1:152760084:c:<cn0>\n1:153133703:a:<cn0>\n1:154123770:t:<ins:me:alu>\n1:154324167:g:<cn0>\n1:154865017:g:<ins:me:alu>\n1:157173860:t:<cn0>\n1:157363502:t:<ins:me:alu>\n1:157540655:g:<cn0>\n1:157887236:t:<inv>\n1:158371473:a:<ins:me:alu>\n1:158488410:a:<cn0>\n1:158726918:a:<cn0>\n1:160979498:c:<cn0>\n1:162263027:t:<ins:me:alu>\n1:163088865:t:<ins:me:alu>\n1:163314443:g:<ins:me:alu>\n1:163639693:t:<ins:me:alu>\n1:165553149:t:<ins:me:line1>\n1:165861400:t:<ins:me:sva>\n1:166189445:t:<ins:me:alu>\n1:167506110:g:<ins:me:alu>\n1:167712862:g:<ins:me:alu>\n1:168926083:a:<ins:me:sva>\n1:169004356:c:<cn0>\n1:169042039:c:<cn0>\n1:169225213:t:<cn0>\n1:169524859:t:<ins:me:line1>\n1:170603451:a:<ins:me:alu>\n1:170991168:c:<ins:me:alu>\n1:171358314:t:<ins:me:alu>\n1:172177959:g:<cn0>\n1:172825753:g:<cn0>\n1:173811663:a:<cn0>\n1:174654509:g:<cn0>\n1:174796
517:t:<cn0>\n1:174894014:g:<cn0>\n1:175152408:g:<cn0>\n1:177509016:g:<cn0>\n1:177544393:g:<cn0>\n1:177946159:a:<cn0>\n1:178397612:t:<ins:me:alu>\n1:178495321:a:<cn0>\n1:178692798:t:<ins:me:alu>\n1:179491966:t:<ins:me:alu>\n1:179607260:a:<cn0>\n1:180272299:a:<cn0>\n1:180857564:c:<ins:me:alu>\n1:181043348:a:<cn0>\n1:181588360:t:<ins:me:alu>\n1:181601286:t:<ins:me:alu>\n1:181853551:g:<ins:me:alu>\n1:182420857:t:<ins:me:alu>\n1:183308627:a:<cn0>\n1:185009806:t:<cn0>\n1:185504717:c:<ins:me:alu>\n1:185584799:t:<ins:me:alu>\n1:185857064:a:<cn0>\n1:187464747:t:<cn0>\n1:187522081:g:<ins:me:alu>\n1:187609013:t:<cn0>\n1:187716053:g:<cn0>\n1:187932575:t:<cn0>\n1:187955397:c:<ins:me:alu>\n1:188174657:t:<ins:me:alu>\n1:188186464:t:<ins:me:alu>\n1:188438213:t:<ins:me:alu>\n1:188615934:g:<ins:me:alu>\n1:189247039:a:<ins:me:alu>\n1:190052658:t:<cn0>\n1:190309695:t:<cn0>\n1:190773296:t:<ins:me:alu>\n1:190874469:t:<ins:me:alu>\n1:191466954:t:<ins:me:line1>\n1:191580781:a:<ins:me:alu>\n1:191817437:c:<ins:me:alu>\n1:191916438:t:<cn0>\n1:192008678:t:<ins:me:line1>\n1:192262268:a:<ins:me:line1>\n1:193549655:c:<ins:me:line1>\n1:193675125:t:<ins:me:alu>\n1:193999047:t:<cn0>\n1:194067859:t:<ins:me:alu>\n1:194575585:t:<cn0>\n1:194675140:c:<ins:me:alu>\n1:195146820:c:<ins:me:alu>\n1:195746415:a:<ins:me:line1>\n1:195885406:g:<cn0>\n1:195904499:g:<cn0>\n1:196464453:a:<ins:me:line1>\n1:196602664:a:<cn0>\n1:196728877:g:<cn0>\n1:196734744:a:<cn0>\n1:196761370:t:<ins:me:alu>\n1:197756784:c:<inv>\n1:197894025:c:<cn0>\n1:198093872:c:<ins:me:alu>\n1:198243300:t:<ins:me:alu>\n1:198529696:t:<ins:me:line1>\n1:198757296:t:<cn0>\n1:198773749:t:<cn0>\n1:198815313:a:<ins:me:alu>\n1:202961159:t:<ins:me:alu>\n1:203684252:t:<cn0>\n1:204238474:c:<ins:me:alu>\n1:204345055:t:<ins:me:alu>\n1:204381864:c:<cn0>\n1:205178526:t:<inv>\u201d\n
    harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
    Harmonising BMI (rvi6Om) and T2D (ETcv15)\n\n
    harmonized_data\n
    harmonized_data A data.frame: 28 \u00d7 29 SNPeffect_allele.exposureother_allele.exposureeffect_allele.outcomeother_allele.outcomebeta.exposurebeta.outcomeeaf.exposureeaf.outcomeremove\u22efpval.exposurese.exposuresamplesize.exposureexposuremr_keep.exposurepval_origin.exposureid.exposureactionmr_keepsamplesize.outcome <chr><chr><chr><chr><chr><dbl><dbl><dbl><dbl><lgl>\u22ef<dbl><dbl><dbl><chr><lgl><chr><chr><dbl><lgl><lgl> 1rs10198356GAGA 0.044 0.0278218160.4500.46949841FALSE\u22ef1.5e-170.005172282BMITRUEreportedrvi6Om1TRUENA 2rs10209994CACA 0.030 0.0284334240.6400.65770918FALSE\u22ef2.0e-080.005472282BMITRUEreportedrvi6Om1TRUENA 3rs10824329AGAG 0.029 0.0182171190.5100.56240335FALSE\u22ef1.7e-080.005172282BMITRUEreportedrvi6Om1TRUENA 4rs10938397GAGA 0.036 0.0445547360.2800.29915686FALSE\u22ef1.0e-100.005672282BMITRUEreportedrvi6Om1TRUENA 5rs11066132TCTC-0.053-0.0319288060.1600.24197159FALSE\u22ef1.0e-130.007172282BMITRUEreportedrvi6Om1TRUENA 6rs12522139GTGT-0.037-0.0107492430.2700.24543922FALSE\u22ef1.8e-100.005772282BMITRUEreportedrvi6Om1TRUENA 7rs12591730AGAG 0.037 0.0330428120.2200.25367536FALSE\u22ef1.5e-080.006572282BMITRUEreportedrvi6Om1TRUENA 8rs13013021TCTC 0.070 0.1040752230.9070.90195307FALSE\u22ef1.9e-150.008872282BMITRUEreportedrvi6Om1TRUENA 9rs1955337 TGTG 0.036 0.0195935030.3000.24112816FALSE\u22ef7.4e-110.005672282BMITRUEreportedrvi6Om1TRUENA 10rs2076308 CGCG 0.037 0.0413520380.3100.31562874FALSE\u22ef3.4e-110.005572282BMITRUEreportedrvi6Om1TRUENA 11rs2278557 GCGC 0.034 0.0212111960.3200.29052039FALSE\u22ef7.4e-100.005572282BMITRUEreportedrvi6Om1TRUENA 12rs2304608 ACAC 0.031 0.0466695150.4700.44287320FALSE\u22ef1.1e-090.005172282BMITRUEreportedrvi6Om1TRUENA 13rs2531995 TCTC 0.031 0.0433160150.3700.33584772FALSE\u22ef5.2e-090.005372282BMITRUEreportedrvi6Om1TRUENA 14rs261967 CACA 0.032 0.0489708280.4400.39718313FALSE\u22ef3.5e-100.005172282BMITRUEreportedrvi6Om1TRUENA 15rs35332469CTCT-0.035 0.0080755980.2200.17678428FALSE\u22ef3.6e-080.006372282BMITRUEreportedrvi6Om1TRUENA 16rs35560038TATA-0.047 0.0739350890.5900.61936434FALSE\u22ef1.4e-190.005272282BMITRUEreportedrvi6Om1TRUENA 17rs3755804 TCTC 0.043 0.0228541340.2800.30750660FALSE\u22ef1.5e-140.005672282BMITRUEreportedrvi6Om1TRUENA 18rs4470425 ACAC-0.030-0.0208441370.4500.44152032FALSE\u22ef4.9e-090.005172282BMITRUEreportedrvi6Om1TRUENA 19rs476828 CTCT 0.067 0.0786518590.2700.25309742FALSE\u22ef2.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 20rs4883723 AGAG 0.039 0.0213709100.2800.22189601FALSE\u22ef8.3e-120.005772282BMITRUEreportedrvi6Om1TRUENA 21rs509325 GTGT 0.065 0.0356917590.2800.26816326FALSE\u22ef7.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 22rs55872725TCTC 0.090 0.1215170230.1200.20355108FALSE\u22ef1.8e-310.007772282BMITRUEreportedrvi6Om1TRUENA 23rs6089309 CTCT-0.033-0.0186698330.7000.65803267FALSE\u22ef3.5e-090.005672282BMITRUEreportedrvi6Om1TRUENA 24rs6265 TCTC-0.049-0.0316426960.4600.40541994FALSE\u22ef6.1e-220.005172282BMITRUEreportedrvi6Om1TRUENA 25rs6736712 GCGC-0.053-0.0297168990.9170.93023505FALSE\u22ef2.1e-080.009572282BMITRUEreportedrvi6Om1TRUENA 26rs7560832 CACA-0.150-0.0904811950.0120.01129784FALSE\u22ef2.0e-090.025072282BMITRUEreportedrvi6Om1TRUENA 27rs825486 TCTC-0.031 0.0190735540.6900.75485104FALSE\u22ef3.1e-080.005672282BMITRUEreportedrvi6Om1TRUENA 28rs9348441 ATAT-0.036 0.1792307940.4700.42502848FALSE\u22ef1.3e-120.005172282BMITRUEreportedrvi6Om1TRUENA In\u00a0[6]: Copied!
    res <- mr(harmonized_data)\n
    Analysing 'rvi6Om' on 'hff6sO'\n\n
    res\n
    res A data.frame: 5 \u00d7 9 id.exposureid.outcomeoutcomeexposuremethodnsnpbsepval <chr><chr><chr><chr><chr><int><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 281.33375800.694852606.596064e-02 rvi6Omhff6sOT2DBMIWeighted median 280.62989800.085163151.399605e-13 rvi6Omhff6sOT2DBMIInverse variance weighted280.55989560.232258061.592361e-02 rvi6Omhff6sOT2DBMISimple mode 280.60978420.133054299.340189e-05 rvi6Omhff6sOT2DBMIWeighted mode 280.59467780.126803557.011481e-05 In\u00a0[8]: Copied!
    mr_heterogeneity(harmonized_data)\n
    mr_heterogeneity(harmonized_data) A data.frame: 2 \u00d7 8 id.exposureid.outcomeoutcomeexposuremethodQQ_dfQ_pval <chr><chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 670.7022261.000684e-124 rvi6Omhff6sOT2DBMIInverse variance weighted706.6579271.534239e-131 In\u00a0[9]: Copied!
    mr_pleiotropy_test(harmonized_data)\n
    mr_pleiotropy_test(harmonized_data) A data.frame: 1 \u00d7 7 id.exposureid.outcomeoutcomeexposureegger_interceptsepval <chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMI-0.036036970.03052410.2484472 In\u00a0[10]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\n
    res_single\n
    res_single A data.frame: 30 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs10198356 0.63231400.20828372.398742e-03 2BMIT2Drvi6Omhff6sONArs10209994 0.94778080.32258143.302164e-03 3BMIT2Drvi6Omhff6sONArs10824329 0.62817650.32462145.297739e-02 4BMIT2Drvi6Omhff6sONArs10938397 1.23763160.27758548.251150e-06 5BMIT2Drvi6Omhff6sONArs11066132 0.60243030.22324016.963693e-03 6BMIT2Drvi6Omhff6sONArs12522139 0.29052010.28902403.148119e-01 7BMIT2Drvi6Omhff6sONArs12591730 0.89304900.30766873.700413e-03 8BMIT2Drvi6Omhff6sONArs13013021 1.48678890.22077771.646925e-11 9BMIT2Drvi6Omhff6sONArs1955337 0.54426400.29941466.910079e-02 10BMIT2Drvi6Omhff6sONArs2076308 1.11762260.26579692.613132e-05 11BMIT2Drvi6Omhff6sONArs2278557 0.62385870.29681843.556906e-02 12BMIT2Drvi6Omhff6sONArs2304608 1.50546820.29689053.961740e-07 13BMIT2Drvi6Omhff6sONArs2531995 1.39729080.31301578.045689e-06 14BMIT2Drvi6Omhff6sONArs261967 1.53033840.29211921.616714e-07 15BMIT2Drvi6Omhff6sONArs35332469 -0.23073140.34792195.072217e-01 16BMIT2Drvi6Omhff6sONArs35560038 -1.57308700.20189686.619637e-15 17BMIT2Drvi6Omhff6sONArs3755804 0.53149150.23250732.225933e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.69480460.30799442.407689e-02 19BMIT2Drvi6Omhff6sONArs476828 1.17390830.15685507.207355e-14 20BMIT2Drvi6Omhff6sONArs4883723 0.54797210.28550045.494141e-02 21BMIT2Drvi6Omhff6sONArs509325 0.54910400.15981965.908641e-04 22BMIT2Drvi6Omhff6sONArs55872725 1.35018910.12597918.419325e-27 23BMIT2Drvi6Omhff6sONArs6089309 0.56575250.33470099.096620e-02 24BMIT2Drvi6Omhff6sONArs6265 0.64576930.19018716.851804e-04 25BMIT2Drvi6Omhff6sONArs6736712 0.56069620.34487841.039966e-01 26BMIT2Drvi6Omhff6sONArs7560832 0.60320800.29049723.785077e-02 27BMIT2Drvi6Omhff6sONArs825486 -0.61527590.35003347.878772e-02 28BMIT2Drvi6Omhff6sONArs9348441 -4.97863320.25727821.992909e-83 29BMIT2Drvi6Omhff6sONAAll - Inverse variance weighted 0.55989560.23225811.592361e-02 30BMIT2Drvi6Omhff6sONAAll - MR Egger 1.33375800.69485266.596064e-02 In\u00a0[12]: Copied!
    res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n
    res_loo <- mr_leaveoneout(harmonized_data) res_loo A data.frame: 29 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs101983560.55628340.24249172.178871e-02 2BMIT2Drvi6Omhff6sONArs102099940.55205760.23881222.079526e-02 3BMIT2Drvi6Omhff6sONArs108243290.55853350.23902391.945341e-02 4BMIT2Drvi6Omhff6sONArs109383970.54126880.23887092.345460e-02 5BMIT2Drvi6Omhff6sONArs110661320.55806060.24172752.096381e-02 6BMIT2Drvi6Omhff6sONArs125221390.56671020.23950641.797373e-02 7BMIT2Drvi6Omhff6sONArs125917300.55248020.23909902.085075e-02 8BMIT2Drvi6Omhff6sONArs130130210.51897150.23868082.968017e-02 9BMIT2Drvi6Omhff6sONArs1955337 0.56026350.23945051.929468e-02 10BMIT2Drvi6Omhff6sONArs2076308 0.54313550.23944032.330758e-02 11BMIT2Drvi6Omhff6sONArs2278557 0.55836340.23949241.972992e-02 12BMIT2Drvi6Omhff6sONArs2304608 0.53725570.23773252.382639e-02 13BMIT2Drvi6Omhff6sONArs2531995 0.54190160.23797122.277590e-02 14BMIT2Drvi6Omhff6sONArs261967 0.53587610.23766862.415093e-02 15BMIT2Drvi6Omhff6sONArs353324690.57359070.23783451.587739e-02 16BMIT2Drvi6Omhff6sONArs355600380.67349060.22178042.391474e-03 17BMIT2Drvi6Omhff6sONArs3755804 0.56102150.24132492.008503e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.55689930.23926321.993549e-02 19BMIT2Drvi6Omhff6sONArs476828 0.50375550.24432243.922224e-02 20BMIT2Drvi6Omhff6sONArs4883723 0.56020500.23973251.945000e-02 21BMIT2Drvi6Omhff6sONArs509325 0.56084290.24685062.308693e-02 22BMIT2Drvi6Omhff6sONArs558727250.44194460.24547717.180543e-02 23BMIT2Drvi6Omhff6sONArs6089309 0.55978590.23889021.911519e-02 24BMIT2Drvi6Omhff6sONArs6265 0.55470680.24369102.282978e-02 25BMIT2Drvi6Omhff6sONArs6736712 0.55988150.23876021.902944e-02 26BMIT2Drvi6Omhff6sONArs7560832 0.55881130.23962291.969836e-02 27BMIT2Drvi6Omhff6sONArs825486 0.58000260.23675451.429330e-02 28BMIT2Drvi6Omhff6sONArs9348441 0.73789670.13668386.717515e-08 29BMIT2Drvi6Omhff6sONAAll 0.55989560.23225811.592361e-02 In\u00a0[29]: Copied!
    harmonized_data$\"r.outcome\" <- get_r_from_lor(\n  harmonized_data$\"beta.outcome\",\n  harmonized_data$\"eaf.outcome\",\n  45383,\n  132032,\n  0.26,\n  model = \"logit\",\n  correction = FALSE\n)\n
    harmonized_data$\"r.outcome\" <- get_r_from_lor( harmonized_data$\"beta.outcome\", harmonized_data$\"eaf.outcome\", 45383, 132032, 0.26, model = \"logit\", correction = FALSE ) In\u00a0[34]: Copied!
    out <- directionality_test(harmonized_data)\nout\n
    out <- directionality_test(harmonized_data) out
    r.exposure and/or r.outcome not present.\n\nCalculating approximate SNP-exposure and/or SNP-outcome correlations, assuming all are quantitative traits. Please pre-calculate r.exposure and/or r.outcome using get_r_from_lor() for any binary traits\n\n
    A data.frame: 1 \u00d7 8 id.exposureid.outcomeexposureoutcomesnp_r2.exposuresnp_r2.outcomecorrect_causal_directionsteiger_pval <chr><chr><chr><chr><dbl><dbl><lgl><dbl> rvi6OmETcv15BMIT2D0.021254530.005496427TRUENA In\u00a0[\u00a0]: Copied!
    res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
    res <- mr(harmonized_data) p1 <- mr_scatter_plot(res, harmonized_data) p1[[1]] In\u00a0[\u00a0]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
    res_single <- mr_singlesnp(harmonized_data) p2 <- mr_forest_plot(res_single) p2[[1]] In\u00a0[\u00a0]: Copied!
    res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
    res_loo <- mr_leaveoneout(harmonized_data) p3 <- mr_leaveoneout_plot(res_loo) p3[[1]] In\u00a0[\u00a0]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
    res_single <- mr_singlesnp(harmonized_data) p4 <- mr_funnel_plot(res_single) p4[[1]]
    "},{"location":"Visualization/","title":"Visualization by gwaslab","text":"In\u00a0[2]: Copied!
    import gwaslab as gl\n
    import gwaslab as gl In\u00a0[3]: Copied!
    sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")\n
    sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")
    Tue Dec 26 15:56:49 2023 GWASLab v3.4.22 https://cloufield.github.io/gwaslab/\nTue Dec 26 15:56:49 2023 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\nTue Dec 26 15:56:49 2023 Start to load format from formatbook....\nTue Dec 26 15:56:49 2023  -plink2 format meta info:\nTue Dec 26 15:56:49 2023   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear\nTue Dec 26 15:56:49 2023   - format_source  : https://www.cog-genomics.org/plink/2.0/formats\nTue Dec 26 15:56:49 2023   - format_version  : Alpha 3.3 final (3 Jun)\nTue Dec 26 15:56:49 2023   - last_check_date  :  20220806\nTue Dec 26 15:56:49 2023  -plink2 to gwaslab format dictionary:\nTue Dec 26 15:56:49 2023   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\nTue Dec 26 15:56:49 2023   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\nTue Dec 26 15:56:49 2023 Start to initiate from file :1kgeas.B1.glm.firth\nTue Dec 26 15:56:50 2023  -Reading columns          : REF,ID,ALT,POS,OR,LOG(OR)_SE,Z_STAT,OBS_CT,A1,#CHROM,P,A1_FREQ\nTue Dec 26 15:56:50 2023  -Renaming columns to      : REF,SNPID,ALT,POS,OR,SE,Z,N,EA,CHR,P,EAF\nTue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  12\nTue Dec 26 15:56:50 2023  -Initiating a status column: STATUS ...\nTue Dec 26 15:56:50 2023  NEA not available: assigning REF to NEA...\nTue Dec 26 15:56:50 2023  -EA,REF and ALT columns are available: assigning NEA...\nTue Dec 26 15:56:50 2023  -For variants with EA == ALT : assigning REF to NEA ...\nTue Dec 26 15:56:50 2023  -For variants with EA != ALT : assigning ALT to NEA ...\nTue Dec 26 15:56:50 2023 Start to reorder the columns...\nTue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  14\nTue Dec 26 15:56:50 2023  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\nTue Dec 26 15:56:50 2023 Finished sorting columns successfully!\nTue Dec 26 15:56:50 2023  -Column: SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \nTue Dec 26 15:56:50 2023  -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\nTue Dec 26 15:56:50 2023 Finished loading data successfully!\n
    In\u00a0[4]: Copied!
    sumstats.data\n
    sumstats.data Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 0 1:15774:G:A 1 15774 A G 0.028283 NaN NaN NaN NaN 495 9999999 G A 1 1:15777:A:G 1 15777 G A 0.073737 NaN NaN NaN NaN 495 9999999 A G 2 1:57292:C:T 1 57292 T C 0.104675 NaN NaN NaN NaN 492 9999999 C T 3 1:77874:G:A 1 77874 A G 0.019153 0.462750 0.249299 0.803130 1.122280 496 9999999 G A 4 1:87360:C:T 1 87360 T C 0.023139 NaN NaN NaN NaN 497 9999999 C T ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1128727 22:51217954:G:A 22 51217954 A G 0.033199 NaN NaN NaN NaN 497 9999999 G A 1128728 22:51218377:G:C 22 51218377 C G 0.033333 0.362212 -0.994457 0.320000 0.697534 495 9999999 G C 1128729 22:51218615:T:A 22 51218615 A T 0.033266 0.362476 -1.029230 0.303374 0.688618 496 9999999 T A 1128730 22:51222100:G:T 22 51222100 T G 0.039157 NaN NaN NaN NaN 498 9999999 G T 1128731 22:51239678:G:T 22 51239678 T G 0.034137 NaN NaN NaN NaN 498 9999999 G T

    1128732 rows \u00d7 14 columns

    In\u00a0[5]: Copied!
    sumstats.get_lead(sig_level=5e-8)\n
    sumstats.get_lead(sig_level=5e-8)
    Tue Dec 26 15:56:51 2023 Start to extract lead variants...\nTue Dec 26 15:56:51 2023  -Processing 1128732 variants...\nTue Dec 26 15:56:51 2023  -Significance threshold : 5e-08\nTue Dec 26 15:56:51 2023  -Sliding window size: 500  kb\nTue Dec 26 15:56:51 2023  -Found 43 significant variants in total...\nTue Dec 26 15:56:51 2023  -Identified 4 lead variants!\nTue Dec 26 15:56:51 2023 Finished extracting lead variants successfully!\n
    Out[5]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 54904 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A 113179 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T 549726 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G 1088750 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C In\u00a0[9]: Copied!
    sumstats.plot_mqq(skip=2,anno=True)\n
    sumstats.plot_mqq(skip=2,anno=True)
    Tue Dec 26 15:59:17 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:59:17 2023  -Genomic coordinates version: 99...\nTue Dec 26 15:59:17 2023    -WARNING!!! Genomic coordinates version is unknown...\nTue Dec 26 15:59:17 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:59:17 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:59:17 2023  -Plot layout mode is : mqq\nTue Dec 26 15:59:17 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:59:17 2023 Start conversion and sanity check:\nTue Dec 26 15:59:17 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:59:17 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:59:17 2023  -Removed 220793 variants with nan in P column ...\nTue Dec 26 15:59:17 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:59:17 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:59:17 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:59:17 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:59:17 2023 Finished data conversion and sanity check.\nTue Dec 26 15:59:17 2023 Start to create manhattan plot with 6866 variants:\nTue Dec 26 15:59:17 2023  -Found 4 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:17 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:17 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:59:17 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:17 2023 Start to create QQ plot with 6866 variants:\nTue Dec 26 15:59:17 2023 Expected range of P: (0,1.0)\nTue Dec 26 15:59:17 2023  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908\nTue Dec 26 15:59:17 2023 Finished creating QQ plot successfully!\nTue Dec 26 15:59:17 2023  -Skip saving figures!\n
    Out[9]:
    (<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
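To write the figure to disk instead of only displaying it, recent gwaslab versions accept a `save` argument (a sketch; the output file name here is arbitrary, and `save_args` is passed through to matplotlib's `savefig`):

```python
# Save the Manhattan/QQ plot to a PNG file; by default gwaslab skips saving
# (see the "Skip saving figures!" line in the log above).
sumstats.plot_mqq(skip=2, anno=True, save="1kgeas_mqq.png", save_args={"dpi": 300})
```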
    In\u00a0[6]: Copied!
    sumstats.basic_check()\n
    sumstats.basic_check()
    Tue Dec 27 23:08:13 2022 Start to check IDs...\nTue Dec 27 23:08:13 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:13 2022  -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)\nTue Dec 27 23:08:14 2022 Finished checking IDs successfully!\nTue Dec 27 23:08:14 2022 Start to fix chromosome notation...\nTue Dec 27 23:08:14 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:17 2022  -Vairants with standardized chromosome notation: 1122299\nTue Dec 27 23:08:19 2022  -All CHR are already fixed...\nTue Dec 27 23:08:21 2022 Finished fixing chromosome notation successfully!\nTue Dec 27 23:08:21 2022 Start to fix basepair positions...\nTue Dec 27 23:08:21 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:21 2022  -Converting to Int64 data type ...\nTue Dec 27 23:08:22 2022  -Position upper_bound is: 250,000,000\nTue Dec 27 23:08:24 2022  -Remove outliers: 0\nTue Dec 27 23:08:24 2022  -Converted all position to datatype Int64.\nTue Dec 27 23:08:24 2022 Finished fixing basepair position successfully!\nTue Dec 27 23:08:24 2022 Start to fix alleles...\nTue Dec 27 23:08:24 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:25 2022  -Detected 0 variants with alleles that contain bases other than A/C/T/G .\nTue Dec 27 23:08:25 2022  -Converted all bases to string datatype and UPPERCASE.\nTue Dec 27 23:08:27 2022 Finished fixing allele successfully!\nTue Dec 27 23:08:27 2022 Start sanity check for statistics ...\nTue Dec 27 23:08:27 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:27 2022  -Checking if  0 <=N<= inf  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad N.\nTue Dec 27 23:08:27 2022  -Checking if  -37.5 <Z< 37.5  ...\nTue Dec 27 23:08:27 2022  -Removed 14 variants with bad Z.\nTue Dec 27 23:08:27 2022  -Checking if  5e-300 <= P <= 1  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad P.\nTue Dec 27 23:08:27 2022  -Checking if  0 <SE< inf  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad SE.\nTue Dec 27 23:08:27 2022  -Checking if  -10 <log(OR)< 10  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad OR.\nTue Dec 27 23:08:27 2022  -Checking STATUS...\nTue Dec 27 23:08:28 2022  -Coverting STAUTUS to interger.\nTue Dec 27 23:08:28 2022  -Removed 14 variants with bad statistics in total.\nTue Dec 27 23:08:28 2022 Finished sanity check successfully!\nTue Dec 27 23:08:28 2022 Start to normalize variants...\nTue Dec 27 23:08:28 2022  -Current Dataframe shape : 1122285  x  11\nTue Dec 27 23:08:29 2022  -No available variants to normalize..\nTue Dec 27 23:08:29 2022 Finished normalizing variants successfully!\n
    In\u00a0[7]: Copied!
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\")\n#2:55513738\n
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\") #2:55513738
    Tue Dec 26 15:58:10 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:10 2023  -Genomic coordinates version: 19...\nTue Dec 26 15:58:10 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:10 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:58:10 2023  -Plot layout mode is : r\nTue Dec 26 15:58:10 2023  -Region to plot : chr2:54513738-56513738.\nTue Dec 26 15:58:10 2023  -Extract SNPs in region : chr2:54513738-56513738...\nTue Dec 26 15:58:10 2023  -Extract SNPs in specified regions: 865\nTue Dec 26 15:58:10 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:10 2023 Start conversion and sanity check:\nTue Dec 26 15:58:10 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:10 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:10 2023  -Removed 160 variants with nan in P column ...\nTue Dec 26 15:58:10 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:10 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:10 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:11 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:11 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:11 2023 Start to create manhattan plot with 705 variants:\nTue Dec 26 15:58:11 2023  -Extracting lead variant...\nTue Dec 26 15:58:11 2023  -Loading gtf files from:default\n
    INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
    Tue Dec 26 15:58:40 2023  -plotting gene track..\nTue Dec 26 15:58:40 2023  -Finished plotting gene track..\nTue Dec 26 15:58:40 2023  -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:58:40 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:58:40 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:58:40 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:58:40 2023  -Skip saving figures!\n
    Out[7]:
    (<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
    In\u00a0[8]: Copied!
    gl.download_ref(\"1kg_eas_hg19\")\n
    gl.download_ref(\"1kg_eas_hg19\")
    Tue Dec 27 22:44:52 2022 Start to download  1kg_eas_hg19  ...\nTue Dec 27 22:44:52 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 27 22:52:33 2022  -Updating record in config file...\nTue Dec 27 22:52:35 2022  -Updating record in config file...\nTue Dec 27 22:52:35 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi\nTue Dec 27 22:52:35 2022 Downloaded  1kg_eas_hg19  successfully!\n
    In\u00a0[8]: Copied!
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")\n
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")
    Tue Dec 26 15:58:41 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:41 2023  -Genomic coordinates version: 19...\nTue Dec 26 15:58:41 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:41 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:58:41 2023  -Plot layout mode is : r\nTue Dec 26 15:58:41 2023  -Region to plot : chr2:54531536-56731536.\nTue Dec 26 15:58:41 2023  -Checking prefix for chromosomes in vcf files...\nTue Dec 26 15:58:41 2023  -No prefix for chromosomes in the VCF files.\nTue Dec 26 15:58:41 2023  -Extract SNPs in region : chr2:54531536-56731536...\nTue Dec 26 15:58:41 2023  -Extract SNPs in specified regions: 967\nTue Dec 26 15:58:41 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:41 2023 Start conversion and sanity check:\nTue Dec 26 15:58:41 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:41 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:41 2023  -Removed 172 variants with nan in P column ...\nTue Dec 26 15:58:41 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:41 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:41 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:41 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:41 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:41 2023 Start to load reference genotype...\nTue Dec 26 15:58:41 2023  -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 26 15:58:43 2023  -Retrieving index...\nTue Dec 26 15:58:43 2023  -Ref variants in the region: 71908\nTue Dec 26 15:58:43 2023  -Matching variants using POS, NEA, EA ...\nTue Dec 26 15:58:43 2023  -Calculating Rsq...\nTue Dec 26 15:58:43 2023 Finished loading reference genotype successfully!\nTue Dec 26 15:58:43 2023 Start to create manhattan plot with 795 variants:\nTue Dec 26 15:58:43 2023  -Extracting lead variant...\nTue Dec 26 15:58:44 2023  -Loading gtf files from:default\n
    INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
    Tue Dec 26 15:59:12 2023  -plotting gene track..\nTue Dec 26 15:59:12 2023  -Finished plotting gene track..\nTue Dec 26 15:59:13 2023  -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:13 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:13 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:59:13 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:13 2023  -Skip saving figures!\n
    Out[8]:
    (<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
    "},{"location":"Visualization/#visualization-by-gwaslab","title":"Visualization by gwaslab\u00b6","text":""},{"location":"Visualization/#import-gwaslab-package","title":"Import gwaslab package\u00b6","text":""},{"location":"Visualization/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"Visualization/#check-the-lead-variants-in-significant-loci","title":"Check the lead variants in significant loci\u00b6","text":""},{"location":"Visualization/#create-mahattan-plot","title":"Create mahattan plot\u00b6","text":""},{"location":"Visualization/#qc-check","title":"QC check\u00b6","text":""},{"location":"Visualization/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"Visualization/#create-regional-plot-with-ld-information","title":"Create regional plot with LD information\u00b6","text":""},{"location":"finemapping_susie/","title":"Finemapping using susieR","text":"In\u00a0[1]: Copied!
    import gwaslab as gl\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n
    import gwaslab as gl import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt In\u00a0[2]: Copied!
    sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")\n
    sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")
    2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/\n2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\n2024/04/18 10:40:48 Start to load format from formatbook....\n2024/04/18 10:40:48  -plink2 format meta info:\n2024/04/18 10:40:48   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear\n2024/04/18 10:40:48   - format_source  : https://www.cog-genomics.org/plink/2.0/formats\n2024/04/18 10:40:48   - format_version  : Alpha 3.3 final (3 Jun)\n2024/04/18 10:40:48   - last_check_date  :  20220806\n2024/04/18 10:40:48  -plink2 to gwaslab format dictionary:\n2024/04/18 10:40:48   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\n2024/04/18 10:40:48   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\n2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz\n2024/04/18 10:40:49  -Reading columns          : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID\n2024/04/18 10:40:49  -Renaming columns to      : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID\n2024/04/18 10:40:49  -Current Dataframe shape : 1128732  x  12\n2024/04/18 10:40:49  -Initiating a status column: STATUS ...\n2024/04/18 10:40:49  #WARNING! Version of genomic coordinates is unknown...\n2024/04/18 10:40:49  NEA not available: assigning REF to NEA...\n2024/04/18 10:40:49  -EA,REF and ALT columns are available: assigning NEA...\n2024/04/18 10:40:49  -For variants with EA == ALT : assigning REF to NEA ...\n2024/04/18 10:40:49  -For variants with EA != ALT : assigning ALT to NEA ...\n2024/04/18 10:40:49 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:49  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:49  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:49 Finished reordering the columns.\n2024/04/18 10:40:49  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:40:49  -DType   : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\n2024/04/18 10:40:49  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       \n2024/04/18 10:40:50  -Current Dataframe memory usage: 106.06 MB\n2024/04/18 10:40:50 Finished loading data successfully!\n
    In\u00a0[3]: Copied!
    sumstats.basic_check()\n
    sumstats.basic_check()
    2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43\n2024/04/18 10:40:50  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:50  -Checking SNPID data type...\n2024/04/18 10:40:50  -Converting SNPID to pd.string data type...\n2024/04/18 10:40:50  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)\n2024/04/18 10:40:51 Finished checking SNPID/rsID.\n2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43\n2024/04/18 10:40:51  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:51  -Checking CHR data type...\n2024/04/18 10:40:51  -Variants with standardized chromosome notation: 1128732\n2024/04/18 10:40:51  -All CHR are already fixed...\n2024/04/18 10:40:52 Finished fixing chromosome notation (CHR).\n2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43\n2024/04/18 10:40:52  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB\n2024/04/18 10:40:52  -Converting to Int64 data type ...\n2024/04/18 10:40:53  -Position bound:(0 , 250,000,000)\n2024/04/18 10:40:53  -Removed outliers: 0\n2024/04/18 10:40:53 Finished fixing basepair positions (POS).\n2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43\n2024/04/18 10:40:53  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:53  -Converted all bases to string datatype and UPPERCASE.\n2024/04/18 10:40:53  -Variants with bad EA  : 0\n2024/04/18 10:40:54  -Variants with bad NEA : 0\n2024/04/18 10:40:54  -Variants with NA for EA or NEA: 0\n2024/04/18 10:40:54  -Variants with same EA and NEA: 0\n2024/04/18 10:40:54  -Detected 0 variants with alleles that contain bases other than A/C/T/G .\n2024/04/18 10:40:55 Finished fixing alleles (EA and NEA).\n2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43\n2024/04/18 10:40:55  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:55  -Comparison tolerance for floats: 1e-07\n2024/04/18 10:40:55  -Checking if 0 <= N <= 2147483647 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na N.\n2024/04/18 10:40:55  -Checking if -1e-07 < EAF < 1.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na EAF.\n2024/04/18 10:40:55  -Checking if -9999.0000001 < Z < 9999.0000001 ...\n2024/04/18 10:40:55   -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ...\n2024/04/18 10:40:55   -Examples of invalid values (Z): NA,NA,NA,NA,NA ...\n2024/04/18 10:40:55  -Removed 220793 variants with bad/na Z.\n2024/04/18 10:40:55  -Checking if -1e-07 < P < 1.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na P.\n2024/04/18 10:40:55  -Checking if -1e-07 < SE < inf ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na SE.\n2024/04/18 10:40:55  -Checking if -100.0000001 < OR < 100.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na OR.\n2024/04/18 10:40:55  -Checking STATUS and converting STATUS to categories....\n2024/04/18 10:40:56  -Removed 220793 variants with bad statistics in total.\n2024/04/18 10:40:56  -Data types for each column:\n2024/04/18 10:40:56  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:40:56  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:40:56  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        
T        T       \n2024/04/18 10:40:56 Finished sanity check for statistics.\n2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56  -Tolerance: 0.001 (Relative) and 0.001 (Absolute)\n2024/04/18 10:40:56  -No availalbe columns for data consistency checking...Skipping...\n2024/04/18 10:40:56 Finished checking data consistency across columns.\n2024/04/18 10:40:56 Start to normalize indels...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56  -No available variants to normalize..\n2024/04/18 10:40:56 Finished normalizing variants successfully!\n2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 Finished sorting coordinates.\n2024/04/18 10:40:56 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:56 Finished reordering the columns.\n

    Note: 220793 variants were removed due to NA Z values. This is caused by FIRTH_CONVERGE_FAIL when performing GWAS using PLINK2.
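As a quick sanity check (a minimal sketch; it assumes the raw PLINK2 output file `1kgeas.B1.glm.firth.gz` and its `Z_STAT` column, as shown in the loading log above), you can count the non-converged tests directly:

```python
import pandas as pd

# Variants with missing Z_STAT correspond to FIRTH_CONVERGE_FAIL tests;
# gwaslab's basic_check() removes them during the sanity check step.
raw = pd.read_csv("1kgeas.B1.glm.firth.gz", sep="\t")
print(raw["Z_STAT"].isna().sum())  # expected: 220793
```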

    In\u00a0[4]: Copied!
    sumstats.get_lead()\n
    sumstats.get_lead()
    2024/04/18 10:40:56 Start to extract lead variants...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56  -Processing 907939 variants...\n2024/04/18 10:40:56  -Significance threshold : 5e-08\n2024/04/18 10:40:56  -Sliding window size: 500  kb\n2024/04/18 10:40:56  -Using P for extracting lead variants...\n2024/04/18 10:40:56  -Found 43 significant variants in total...\n2024/04/18 10:40:56  -Identified 4 lead variants!\n2024/04/18 10:40:56 Finished extracting lead variants.\n
    Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 44298 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9960099 G A 91266 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9960099 C T 442239 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9960099 T G 875859 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9960099 T C In\u00a0[5]: Copied!
    sumstats.plot_mqq()\n
    sumstats.plot_mqq()
    2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43:\n2024/04/18 10:40:57  -Genomic coordinates version: 99...\n2024/04/18 10:40:57  #WARNING! Genomic coordinates version is unknown.\n2024/04/18 10:40:57  -Genome-wide significance level to plot is set to 5e-08 ...\n2024/04/18 10:40:57  -Raw input contains 907939 variants...\n2024/04/18 10:40:57  -MQQ plot layout mode is : mqq\n2024/04/18 10:40:57 Finished loading specified columns from the sumstats.\n2024/04/18 10:40:57 Start data conversion and sanity check:\n2024/04/18 10:40:57  -Removed 0 variants with nan in CHR or POS column ...\n2024/04/18 10:40:57  -Removed 0 variants with CHR <=0...\n2024/04/18 10:40:57  -Removed 0 variants with nan in P column ...\n2024/04/18 10:40:57  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\n2024/04/18 10:40:57  -Sumstats P values are being converted to -log10(P)...\n2024/04/18 10:40:57  -Sanity check: 0 na/inf/-inf variants will be removed...\n2024/04/18 10:40:57  -Converting data above cut line...\n2024/04/18 10:40:57  -Maximum -log10(P) value is 14.772946706439042 .\n2024/04/18 10:40:57 Finished data conversion and sanity check.\n2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants...\n2024/04/18 10:40:58  -Creating background plot...\n2024/04/18 10:40:59 Finished creating MQQ plot successfully!\n2024/04/18 10:40:59 Start to extract variants for annotation...\n2024/04/18 10:40:59  -Found 4 significant variants with a sliding window size of 500 kb...\n2024/04/18 10:40:59 Finished extracting variants for annotation...\n2024/04/18 10:40:59 Start to process figure arts.\n2024/04/18 10:40:59  -Processing X ticks...\n2024/04/18 10:40:59  -Processing X labels...\n2024/04/18 10:40:59  -Processing Y labels...\n2024/04/18 10:40:59  -Processing Y tick lables...\n2024/04/18 10:40:59  -Processing Y labels...\n2024/04/18 10:40:59  -Processing lines...\n2024/04/18 10:40:59 Finished processing figure arts.\n2024/04/18 10:40:59 Start to annotate variants...\n2024/04/18 10:40:59  -Skip annotating\n2024/04/18 10:40:59 Finished annotating variants.\n2024/04/18 10:40:59 Start to create QQ plot with 907939 variants:\n2024/04/18 10:40:59  -Plotting all variants...\n2024/04/18 10:40:59  -Expected range of P: (0,1.0)\n2024/04/18 10:40:59  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908\n2024/04/18 10:40:59  -Processing Y tick lables...\n2024/04/18 10:40:59 Finished creating QQ plot successfully!\n2024/04/18 10:40:59 Start to save figure...\n2024/04/18 10:40:59  -Skip saving figure!\n2024/04/18 10:40:59 Finished saving figure...\n2024/04/18 10:40:59 Finished creating plot successfully!\n
    Out[5]:
    (<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)
    In\u00a0[6]: Copied!
    locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')\n
    locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
    2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06  -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 Finished filtering values.\n
    In\u00a0[7]: Copied!
    locus.fill_data(to_fill=[\"BETA\"])\n
    locus.fill_data(to_fill=[\"BETA\"])
    2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43\n2024/04/18 10:41:06  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:41:06  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:41:06  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       \n2024/04/18 10:41:06  -Overwrite mode:  False\n2024/04/18 10:41:06   -Skipping columns:  []\n2024/04/18 10:41:06  -Filling columns:  ['BETA']\n2024/04/18 10:41:06   - Filling Columns iteratively...\n2024/04/18 10:41:06   - Filling BETA value using OR column...\n2024/04/18 10:41:06 Finished filling data using existing columns.\n2024/04/18 10:41:06 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:06  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:06  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:06 Finished reordering the columns.\n
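Here `fill_data` derives BETA from the OR column as the natural log of the odds ratio (see the "Filling BETA value using OR column" log line above). The equivalent manual conversion, as a sketch on the same `locus` object, would be:

```python
import numpy as np

# BETA = ln(OR): convert the odds ratio back to the log-odds effect size.
beta_manual = np.log(locus.data["OR"])
```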
    In\u00a0[8]: Copied!
    locus.data\n
    locus.data Out[8]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 91067 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960099 A T 91068 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960099 G A 91069 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960099 G A 91070 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960099 A C 91071 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960099 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 91441 2:56004219:G:T 2 56004219 G T 0.171717 0.148489 0.169557 0.875763 0.381159 1.160080 495 9960099 G T 91442 2:56007034:T:C 2 56007034 T C 0.260121 0.073325 0.145565 0.503737 0.614446 1.076080 494 9960099 T C 91443 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960099 C G 91444 2:56009480:A:T 2 56009480 A T 0.157258 0.135667 0.177621 0.763784 0.444996 1.145300 496 9960099 A T 91445 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960099 C T

    379 rows \u00d7 15 columns

    In\u00a0[9]: Copied!
    locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")\n
    locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")
    2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43\n2024/04/18 10:41:07  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:07  -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\n2024/04/18 10:41:07  -Loading fasta records:2  \n2024/04/18 10:41:19  -Checking records\n2024/04/18 10:41:19    -Building numpy fasta records from dict\n2024/04/18 10:41:20    -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )\n2024/04/18 10:41:20    -Checking records for ( len(NEA) > 4 or len(EA) > 4 )\n2024/04/18 10:41:20  -Finished checking records\n2024/04/18 10:41:20  -Variants allele on given reference sequence :  264\n2024/04/18 10:41:20  -Variants flipped :  115\n2024/04/18 10:41:20   -Raw Matching rate :  100.00%\n2024/04/18 10:41:20  -Variants inferred reverse_complement :  0\n2024/04/18 10:41:20  -Variants inferred reverse_complement_flipped :  0\n2024/04/18 10:41:20  -Both allele on genome + unable to distinguish :  0\n2024/04/18 10:41:20  -Variants not on given reference sequence :  0\n2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence.\n2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43\n2024/04/18 10:41:20  -Flipping 115 variants...\n2024/04/18 10:41:20  -Swapping column: NEA <=> EA...\n2024/04/18 10:41:20  -Flipping column: BETA = - BETA...\n2024/04/18 10:41:20  -Flipping column: Z = - Z...\n2024/04/18 10:41:20  -Flipping column: EAF = 1 - EAF...\n2024/04/18 10:41:20  -Flipping column: OR = 1 / OR...\n2024/04/18 10:41:20  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x\n2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code.\n2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Finished sorting coordinates.\n2024/04/18 10:41:20 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB\n2024/04/18 10:41:20  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:20 Finished reordering the columns.\n
    Out[9]:
    <gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>
    In\u00a0[10]: Copied!
    locus.data\n
    locus.data Out[10]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T

    379 rows \u00d7 15 columns

    In\u00a0[11]: Copied!
    locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\n
    locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None) locus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None) In\u00a0[12]: Copied!
    !plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt_r2\n
    !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r square \\ --extract sig_locus.snplist \\ --out sig_locus_mt !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract sig_locus.snplist \\ --out sig_locus_mt_r2
    PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to sig_locus_mt.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract sig_locus.snplist\n  --keep-allele-order\n  --out sig_locus_mt\n  --r square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r square to sig_locus_mt.ld ... 0% [processingwriting]          done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to sig_locus_mt_r2.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract sig_locus.snplist\n  --keep-allele-order\n  --out sig_locus_mt_r2\n  --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt_r2.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting]          done.\n
    In\u00a0[13]: Copied!
    import rpy2\nimport rpy2.robjects as ro\nfrom rpy2.robjects.packages import importr\nimport rpy2.robjects.numpy2ri as numpy2ri\nnumpy2ri.activate()\n
    import rpy2 import rpy2.robjects as ro from rpy2.robjects.packages import importr import rpy2.robjects.numpy2ri as numpy2ri numpy2ri.activate()
    INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY\nINFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R\nINFO:rpy2.situation:R library path: \nINFO:rpy2.situation:LD_LIBRARY_PATH: \nINFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save\nINFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.\n
    In\u00a0[14]: Copied!
    df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\")\ndf\n
    df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\") df Out[14]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T

    379 rows \u00d7 15 columns

    In\u00a0[15]: Copied!
    # import susieR as object\nsusieR = importr('susieR')\n
    # import susieR as object susieR = importr('susieR') In\u00a0[16]: Copied!
    # convert pd.DataFrame to numpy\nld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None)\nR_df = ld.values\nld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None)\nR_df2 = ld2.values\n
    # convert pd.DataFrame to numpy ld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None) R_df = ld.values ld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None) R_df2 = ld2.values In\u00a0[17]: Copied!
    R_df\n
    R_df Out[17]:
    array([[ 1.00000e+00,  9.58562e-01, -3.08678e-01, ...,  1.96204e-02,\n        -3.54602e-04, -7.14868e-03],\n       [ 9.58562e-01,  1.00000e+00, -2.97617e-01, ...,  2.47755e-02,\n        -1.49234e-02, -7.00509e-03],\n       [-3.08678e-01, -2.97617e-01,  1.00000e+00, ..., -3.49335e-02,\n        -1.37163e-02, -2.12828e-02],\n       ...,\n       [ 1.96204e-02,  2.47755e-02, -3.49335e-02, ...,  1.00000e+00,\n         5.26193e-02, -3.09069e-02],\n       [-3.54602e-04, -1.49234e-02, -1.37163e-02, ...,  5.26193e-02,\n         1.00000e+00, -3.01142e-01],\n       [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,\n        -3.01142e-01,  1.00000e+00]])
    In\u00a0[18]: Copied!
    plt.figure(figsize=(10,10),dpi=200)\nfig, ax = plt.subplots(ncols=2,figsize=(20,10))\nsns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0])\nsns.heatmap(data=R_df2,ax=ax[1])\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\n
    plt.figure(figsize=(10,10),dpi=200) fig, ax = plt.subplots(ncols=2,figsize=(20,10)) sns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0]) sns.heatmap(data=R_df2,ax=ax[1]) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[18]:
    Text(0.5, 1.0, 'LD r2 matrix')
    <Figure size 2000x2000 with 0 Axes>

    For details on fine-mapping with summary statistics using susieR, see: https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#fine-mapping-with-susier-using-summary-statistics

    In\u00a0[19]: Copied!
    ro.r('set.seed(123)')\nfit = susieR.susie_rss(\n    bhat = df[\"BETA\"].values.reshape((len(R_df),1)),\n    shat = df[\"SE\"].values.reshape((len(R_df),1)),\n    R = R_df,\n    L = 10,\n    n = 503\n)\n
    ro.r('set.seed(123)') fit = susieR.susie_rss( bhat = df[\"BETA\"].values.reshape((len(R_df),1)), shat = df[\"SE\"].values.reshape((len(R_df),1)), R = R_df, L = 10, n = 503 ) In\u00a0[20]: Copied!
    # show the results of susie_get_cs\nprint(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\n
    # show the results of susie_get_cs print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
    $L1\n[1] 200 218 221 224\n\n\n

    We found one credible set here (L1, containing four variants). Note that susie_get_cs returns 1-based R indices; the code below converts them to 0-based pandas indices.
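The content of a credible set depends on the coverage and purity (`min_abs_corr`) thresholds. As a quick sensitivity check, reusing the `fit` and `R_df` objects from above, you can recompute the sets at a stricter coverage:

```python
# Recompute credible sets at 99% coverage; stricter coverage requires more
# cumulative PIP, so the resulting sets are at least as large.
cs99 = susieR.susie_get_cs(fit, coverage=0.99, min_abs_corr=0.5, Xcorr=R_df)
print(cs99[0])
```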

    In\u00a0[21]: Copied!
    # add the information to dataframe for plotting\ndf[\"cs\"] = 0\nn_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\nfor i in range(n_cs):\n    cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]\n    df.loc[np.array(cs_index)-1,\"cs\"] = i + 1\ndf[\"pip\"] = np.array(susieR.susie_get_pip(fit))\n
    # add the information to dataframe for plotting df[\"cs\"] = 0 n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0]) for i in range(n_cs): cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i] df.loc[np.array(cs_index)-1,\"cs\"] = i + 1 df[\"pip\"] = np.array(susieR.susie_get_pip(fit)) In\u00a0[22]: Copied!
    fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))\ndf[\"MLOG10P\"] = -np.log10(df[\"P\"])\ncol_to_plot = \"MLOG10P\"\np=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot],\n           marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\naxes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\n           marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[0].set_xlabel(\"position\")\naxes[0].set_xlim((55400000, 55800000))\naxes[0].set_ylabel(col_to_plot)\naxes[0].legend()\n\np=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"],\n           marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[1].set_xlabel(\"position\")\naxes[1].set_xlim((55400000, 55800000))\naxes[1].set_ylabel(\"PIP\")\naxes[1].legend()\n
    fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1)) df[\"MLOG10P\"] = -np.log10(df[\"P\"]) col_to_plot = \"MLOG10P\" p=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2) axes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot], marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[0].set_xlabel(\"position\") axes[0].set_xlim((55400000, 55800000)) axes[0].set_ylabel(col_to_plot) axes[0].legend() p=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2) axes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[1].set_xlabel(\"position\") axes[1].set_xlim((55400000, 55800000)) axes[1].set_ylabel(\"PIP\") axes[1].legend()
    /tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x').  Matplotlib is ignoring the edgecolor in favor of the facecolor.  This behavior may change in the future.\n  axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\n
    Out[22]:
    <matplotlib.legend.Legend at 0x7fa6a330d5e0>

    The causal variant used in the simulation is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. Consequently, the credible set we identified does not include the bona fide causal variant.
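You can verify this directly: the simulated causal variant is absent from the locus sumstats (`df`) used for fine-mapping.

```python
# The simulated causal variant was dropped together with the other
# FIRTH_CONVERGE_FAIL variants, so it cannot enter any credible set.
print("2:55620927:G:A" in set(df["SNPID"]))  # expected: False
```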

    Let's then check the variants in the credible set.

    In\u00a0[23]: Copied!
    df.loc[np.array(cs_index)-1,:]\n
    df.loc[np.array(cs_index)-1,:] Out[23]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT cs pip MLOG10P 199 2:55513738:C:T 2 55513738 T C 0.623992 1.219516 0.153159 7.96244 1.686760e-15 3.385550 496 9960019 C T 1 0.325435 14.772947 217 2:55605943:A:G 2 55605943 G A 0.685484 1.321987 0.166688 7.93089 2.175840e-15 3.750867 496 9960019 A G 1 0.267953 14.662373 220 2:55612986:G:C 2 55612986 C G 0.685223 1.302133 0.166154 7.83691 4.617840e-15 3.677133 494 9960019 G C 1 0.150449 14.335561 223 2:55622624:G:A 2 55622624 A G 0.688508 1.324109 0.167119 7.92315 2.315640e-15 3.758833 496 9960019 G A 1 0.255449 14.635329 In\u00a0[24]: Copied!
    !echo \"2:55513738:C:T\" > credible.snplist\n!echo \"2:55605943:A:G\" >> credible.snplist\n!echo \"2:55612986:G:C\" >> credible.snplist\n!echo \"2:55620927:G:A\" >> credible.snplist\n!echo \"2:55622624:G:A\" >> credible.snplist\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract credible.snplist \\\n  --out credible_r\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract credible.snplist \\\n  --out credible_r2\n
    !echo \"2:55513738:C:T\" > credible.snplist !echo \"2:55605943:A:G\" >> credible.snplist !echo \"2:55612986:G:C\" >> credible.snplist !echo \"2:55620927:G:A\" >> credible.snplist !echo \"2:55622624:G:A\" >> credible.snplist !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r2
    PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to credible_r.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract credible.snplist\n  --keep-allele-order\n  --out credible_r\n  --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to credible_r.ld ... 0% [processingwriting]          done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to credible_r2.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract credible.snplist\n  --keep-allele-order\n  --out credible_r2\n  --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r2.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to credible_r2.ld ... 0% [processingwriting]          done.\n
    In\u00a0[25]: Copied!
    credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"]\nld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None)\nld.columns=credible_snplist\nld.index=credible_snplist\nld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None)\nld2.columns=credible_snplist\nld2.index=credible_snplist\n
    credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"] ld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None) ld.columns=credible_snplist ld.index=credible_snplist ld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None) ld2.columns=credible_snplist ld2.index=credible_snplist In\u00a0[26]: Copied!
    plt.figure(figsize=(10,10),dpi=200)\nfig, ax = plt.subplots(ncols=2,figsize=(20,10))\nsns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0)\nsns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1)\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\n
    plt.figure(figsize=(10,10),dpi=200) fig, ax = plt.subplots(ncols=2,figsize=(20,10)) sns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0) sns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[26]:
    Text(0.5, 1.0, 'LD r2 matrix')
    <Figure size 2000x2000 with 0 Axes>

    Variants in the credible set are in strong LD with the bona fide causal variant.

    This can also happen in real-world analyses, so always be cautious when interpreting fine-mapping results.
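To quantify this, read the r2 values between the dropped causal variant and the four credible set variants directly from the `ld2` matrix built above:

```python
# r2 of the simulated causal variant with the four credible set variants
print(ld2.loc["2:55620927:G:A",
              ["2:55513738:C:T", "2:55605943:A:G", "2:55612986:G:C", "2:55622624:G:A"]])
```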

    "},{"location":"finemapping_susie/#finemapping-using-susier","title":"Finemapping using susieR\u00b6","text":""},{"location":"finemapping_susie/#data-preparation","title":"Data preparation\u00b6","text":""},{"location":"finemapping_susie/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"finemapping_susie/#data-standardization-and-sanity-check","title":"Data standardization and sanity check\u00b6","text":""},{"location":"finemapping_susie/#extract-lead-variants","title":"Extract lead variants\u00b6","text":""},{"location":"finemapping_susie/#create-manhattan-plot-for-checking","title":"Create manhattan plot for checking\u00b6","text":""},{"location":"finemapping_susie/#extract-the-variants-around-255513738ct-for-finemapping","title":"Extract the variants around 2:55513738:C:T for finemapping\u00b6","text":""},{"location":"finemapping_susie/#convert-or-to-beta","title":"Convert OR to BETA\u00b6","text":""},{"location":"finemapping_susie/#align-nea-with-reference-sequence","title":"Align NEA with reference sequence\u00b6","text":""},{"location":"finemapping_susie/#output-the-sumstats-of-this-locus","title":"Output the sumstats of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-plink-to-get-ld-matrix-for-this-locus","title":"Run PLINK to get LD matrix for this locus\u00b6","text":""},{"location":"finemapping_susie/#finemapping","title":"Finemapping\u00b6","text":""},{"location":"finemapping_susie/#load-locus-sumstats","title":"Load locus sumstats\u00b6","text":""},{"location":"finemapping_susie/#import-sumsier","title":"Import sumsieR\u00b6","text":""},{"location":"finemapping_susie/#load-ld-matrix","title":"Load LD matrix\u00b6","text":""},{"location":"finemapping_susie/#visualize-the-ld-structure-of-this-locus","title":"Visualize the LD structure of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-finemapping-use-susier","title":"Run finemapping use susieR\u00b6","text":""},{"location":"finemapping_susie/#extract-credible-sets-and-pip","title":"Extract credible sets and PIP\u00b6","text":""},{"location":"finemapping_susie/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"finemapping_susie/#pitfalls","title":"Pitfalls\u00b6","text":""},{"location":"finemapping_susie/#check-ld-of-the-causal-variant-and-variants-in-the-credible-set","title":"Check LD of the causal variant and variants in the credible set\u00b6","text":""},{"location":"finemapping_susie/#load-ld-and-plot","title":"Load LD and plot\u00b6","text":""},{"location":"plot_PCA/","title":"Plotting PCA","text":"In\u00a0[1]: Copied!
    import pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n
    import pandas as pd import matplotlib.pyplot as plt import seaborn as sns In\u00a0[2]: Copied!
    pca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\")\npca\n
    pca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\") pca Out[2]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752

    500 rows \u00d7 14 columns

    In\u00a0[6]: Copied!
    ped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\")\nped\n
    ped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\") ped Out[6]: sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00096 GBR EUR male NaN NaN 1 HG00097 GBR EUR female NaN NaN 2 HG00099 GBR EUR female NaN NaN 3 HG00100 GBR EUR female NaN NaN 4 HG00101 GBR EUR male NaN NaN ... ... ... ... ... ... ... 2499 NA21137 GIH SAS female NaN NaN 2500 NA21141 GIH SAS female NaN NaN 2501 NA21142 GIH SAS female NaN NaN 2502 NA21143 GIH SAS female NaN NaN 2503 NA21144 GIH SAS female NaN NaN

    2504 rows \u00d7 6 columns

    In\u00a0[7]: Copied!
    pcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\")\npcaped\n
    pcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\") pcaped Out[7]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 HG00403 CHS EAS male NaN NaN 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 HG00404 CHS EAS female NaN NaN 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 HG00406 CHS EAS male NaN NaN 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 HG00407 CHS EAS female NaN NaN 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 HG00409 CHS EAS male NaN NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 NA19087 JPT EAS female NaN NaN 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 NA19088 JPT EAS male NaN NaN 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 NA19089 JPT EAS male NaN NaN 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 NA19090 JPT EAS female NaN NaN 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752 NA19091 JPT EAS male NaN NaN

    500 rows \u00d7 20 columns

    In\u00a0[8]: Copied!
    plt.figure(figsize=(10,10))\nsns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50)\n
    plt.figure(figsize=(10,10)) sns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50) Out[8]:
    <Axes: xlabel='PC1_AVG', ylabel='PC2_AVG'>
    "},{"location":"plot_PCA/#plotting-pca","title":"Plotting PCA\u00b6","text":""},{"location":"plot_PCA/#loading-files","title":"loading files\u00b6","text":""},{"location":"plot_PCA/#merge-pca-and-population-information","title":"Merge PCA and population information\u00b6","text":""},{"location":"plot_PCA/#plotting","title":"Plotting\u00b6","text":""},{"location":"prs_tutorial/","title":"PRS Tutorial","text":"In\u00a0[1]: Copied!
    import sys\nsys.path.insert(0,\"/Users/he/work/PRSlink/src\")\nimport prslink as pl\n
    import sys sys.path.insert(0,\"/Users/he/work/PRSlink/src\") import prslink as pl In\u00a0[2]: Copied!
    a= pl.PRS()\n
    a= pl.PRS() In\u00a0[3]: Copied!
    a.add_score(\"./1kgeas.0.1.profile\",  \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.2.profile\",  \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.3.profile\",  \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.4.profile\",  \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.5.profile\",  \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")\n
    a.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")
    - Dataset shape before loading : (0, 1)\n- Loading score data from file: ./1kgeas.0.1.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.1\n  - Overlapping IDs:0\n- Loading finished successfully!\n- Dataset shape after loading : (504, 2)\n- Dataset shape before loading : (504, 2)\n- Loading score data from file: ./1kgeas.0.05.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.05\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 3)\n- Dataset shape before loading : (504, 3)\n- Loading score data from file: ./1kgeas.0.2.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.2\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 4)\n- Dataset shape before loading : (504, 4)\n- Loading score data from file: ./1kgeas.0.3.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.3\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 5)\n- Dataset shape before loading : (504, 5)\n- Loading score data from file: ./1kgeas.0.4.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.4\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 6)\n- Dataset shape before loading : (504, 6)\n- Loading score data from file: ./1kgeas.0.5.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.5\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 7)\n- Dataset shape before loading : (504, 7)\n- Loading score data from file: ./1kgeas.0.001.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.01\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 8)\n
    In\u00a0[4]: Copied!
    a.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")\n
    a.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")
    - Dataset shape before loading : (504, 8)\n- Loading pheno data from file: ../01_Dataset/t2d/1kgeas_t2d.txt\n  - Setting ID:IID\n  - Loading pheno:T2D\n  - Loaded columns: T2D\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 9)\n
    In\u00a0[5]: Copied!
    a.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")\n
    a.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")
    - Dataset shape before loading : (504, 9)\n- Loading covar data from file: ./1kgeas.eigenvec\n  - Setting ID:IID\n  - Loading covar:PC1 PC2 PC3 PC4 PC5\n  - Loaded columns: PC1 PC2 PC3 PC4 PC5\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 14)\n
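    PLINK codes a binary phenotype as 1=control and 2=case, so the next cell recodes T2D to 0/1 for the logistic regression used later (records with other values, such as recoded missing entries, are excluded during evaluation, which is why N drops to 502 below). A quick sanity check you could add after that cell (our suggestion, not part of the original notebook):

    # after recoding, T2D should contain only 0 (control) and 1 (case)\nprint(a.data[\"T2D\"].value_counts())\n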
    In\u00a0[6]: Copied!
    a.data[\"T2D\"] = a.data[\"T2D\"]-1\n
    a.data[\"T2D\"] = a.data[\"T2D\"]-1 In\u00a0[7]: Copied!
    a.data\n
    a.data Out[7]: IID 0.1 0.05 0.2 0.3 0.4 0.5 0.01 T2D PC1 PC2 PC3 PC4 PC5 0 HG00403 -0.000061 -2.812450e-05 -0.000019 -2.131690e-05 -0.000024 -0.000022 0.000073 0 0.000107 0.039080 0.021048 0.016633 0.063373 1 HG00404 0.000025 4.460810e-07 0.000041 4.370760e-05 0.000024 0.000018 0.000156 1 -0.001216 0.045148 0.009013 0.028122 0.041474 2 HG00406 0.000011 2.369040e-05 -0.000009 2.928090e-07 -0.000010 -0.000008 -0.000188 0 0.005020 0.044668 0.016583 0.020077 -0.031782 3 HG00407 -0.000133 -1.326670e-04 -0.000069 -5.677710e-05 -0.000062 -0.000057 -0.000744 1 0.005408 0.034132 0.014955 0.003872 0.009794 4 HG00409 0.000010 -3.120730e-07 -0.000012 -1.873660e-05 -0.000025 -0.000023 -0.000367 1 -0.002121 0.031752 -0.048352 -0.043185 0.064674 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 499 NA19087 -0.000042 -6.215880e-05 -0.000038 -1.116230e-05 -0.000019 -0.000018 -0.000397 0 -0.067583 -0.040340 0.015038 0.039039 -0.010774 500 NA19088 0.000085 9.058670e-05 0.000047 2.666260e-05 0.000016 0.000014 0.000723 0 -0.069752 -0.047710 0.028578 0.036714 -0.000906 501 NA19089 -0.000067 -4.767610e-05 -0.000011 -1.393760e-05 -0.000019 -0.000016 -0.000126 0 -0.073989 -0.046706 0.040089 -0.034719 -0.062692 502 NA19090 0.000064 3.989030e-05 0.000022 7.445850e-06 0.000010 0.000003 -0.000149 0 -0.061156 -0.034606 0.032674 -0.016363 -0.065390 503 NA19091 0.000051 4.469220e-05 0.000043 3.089720e-05 0.000019 0.000016 0.000028 0 -0.067749 -0.052950 0.036908 -0.023856 -0.058515

    504 rows \u00d7 14 columns

    In\u00a0[13]: Copied!
    a.set_k({\"T2D\":0.2})\n
    a.set_k({\"T2D\":0.2}) In\u00a0[14]: Copied!
    a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)\n
    a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)
     - Binary trait: fitting logistic regression...\n - Binary trait: using records with phenotype being 0 or 1...\nOptimization terminated successfully.\n         Current function value: 0.668348\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.653338\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.657903\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654492\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654413\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.653085\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654681\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.661290\n         Iterations 5\n
    Out[14]: PHENO TYPE PRS N_CASE N BETA CI_L CI_U P R2_null R2_full Delta_R2 AUC_null AUC_full Delta_AUC R2_lia_null R2_lia_full Delta_R2_lia SE 0 T2D B 0.01 200 502 0.250643 0.064512 0.436773 0.008308 0.010809 0.029616 0.018808 0.536921 0.586821 0.049901 0.010729 0.029826 0.019096 NaN 1 T2D B 0.05 200 502 0.310895 0.119814 0.501976 0.001428 0.010809 0.038545 0.027736 0.536921 0.601987 0.065066 0.010729 0.038925 0.028196 NaN 2 T2D B 0.5 200 502 0.367803 0.169184 0.566421 0.000284 0.010809 0.046985 0.036176 0.536921 0.605397 0.068477 0.010729 0.047553 0.036824 NaN 3 T2D B 0.2 200 502 0.365641 0.169678 0.561604 0.000255 0.010809 0.047479 0.036670 0.536921 0.607318 0.070397 0.010729 0.048079 0.037349 NaN 4 T2D B 0.3 200 502 0.367788 0.171062 0.564515 0.000248 0.010809 0.047686 0.036877 0.536921 0.608493 0.071573 0.010729 0.048315 0.037585 NaN 5 T2D B 0.1 200 502 0.374750 0.181520 0.567979 0.000144 0.010809 0.050488 0.039679 0.536921 0.613957 0.077036 0.010729 0.051270 0.040540 NaN 6 T2D B 0.4 200 502 0.389232 0.189866 0.588597 0.000130 0.010809 0.051145 0.040336 0.536921 0.609238 0.072318 0.010729 0.051845 0.041116 NaN In\u00a0[15]: Copied!
    a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)\n
    a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)
    Optimization terminated successfully.\n         Current function value: 0.668348\n         Iterations 5\n
    In\u00a0[16]: Copied!
    a.plot_prs(a.score_cols)\n
    a.plot_prs(a.score_cols) In\u00a0[\u00a0]: Copied!
    \n
    "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"GWASTutorial","text":"

    Note: this tutorial is being updated to Version 2024

    This Github page aims to provide a hands-on tutorial on common analysis in Complex Trait Genomics. This tutorial is designed for the course Fundamental Exercise II provided by The Laboratory of Complex Trait Genomics at the University of Tokyo. For more information, please see About.

    This tutorial covers the minimum skills and knowledge required to perform a typical genome-wide association study (GWAS). The contents are categorized into the following groups. Additionally, for absolute beginners, we also prepared a section on command lines in Linux.

    If you have any questions or suggestions, please feel free to let us know in the Issue section of this repository.

    "},{"location":"#contents","title":"Contents","text":""},{"location":"#command-lines","title":"Command lines","text":""},{"location":"#pre-gwas","title":"Pre-GWAS","text":""},{"location":"#gwas","title":"GWAS","text":""},{"location":"#post-gwas","title":"Post-GWAS","text":"

    In these sections, we will briefly introduce the Post-GWAS analyses, which will dig deeper into the GWAS summary statistics. \u00a0

    "},{"location":"#topics","title":"Topics","text":"

    Introductions on GWAS-related issues

    "},{"location":"#others","title":"Others","text":""},{"location":"01_Dataset/","title":"Sample Dataset","text":"

    504 EAS individuals from 1000 Genomes Project Phase 3 version 5

    Url: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

    Genome build: human_g1k_v37.fasta (hg19)

    "},{"location":"01_Dataset/#genotype-data-processing","title":"Genotype Data Processing","text":""},{"location":"01_Dataset/#download","title":"Download","text":"

    Note

    The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip has been included in 01_Dataset when you clone the repository. There is no need to download it again if you clone this repository.

    You can also simply run download_sampledata.sh in 01_Dataset and the dataset will be downloaded and decompressed.

    ./download_sampledata.sh\n

    Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.

    or you can manually download it from this link.

    Unzip the dataset unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip, and you will get the following files:

    1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n
    "},{"location":"01_Dataset/#phenotype-simulation","title":"Phenotype Simulation","text":"

    Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.

    gcta  \\\n  --bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \\\n  --simu-cc 250 254  \\\n  --simu-causal-loci causal.snplist  \\\n  --simu-hsq 0.8  \\\n  --simu-k 0.5  \\\n  --simu-rep 1  \\\n  --out 1kgeas_binary\n
    $ cat causal.snplist\n2:55620927:G:A 3\n8:97094292:C:T 3\n20:42758834:T:C 3\n7:134326056:G:T 3\n1:167562605:G:A 3\n

    Warning

    This simulation is only used for showing the analysis pipeline and data formats. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the result itself is meaningless.

    Allele frequency and Effect size

    "},{"location":"01_Dataset/#reference","title":"Reference","text":""},{"location":"02_Linux_basics/","title":"Introduction","text":"

    This section is intended to provide a minimal introduction to the command line in a Linux system for handling genomic data. (If you are already familiar with Linux commands, it is completely fine to skip this section.)

    If you are a beginner with no programming background, it will be helpful to learn some basic commands before doing any analysis. In this section, we will introduce the most basic commands that enable you to handle genomic files in the terminal of a Linux system.

    For Mac users

    This tutorial should work with no problems. Simply open your terminal and follow along. (Note: a few commands might be different on macOS.)

    For Windows users

    You can simply install WSL to get a Linux environment. Please check here for how to install WSL.

    "},{"location":"02_Linux_basics/#table-of-contents","title":"Table of Contents","text":""},{"location":"02_Linux_basics/#linux-system-introduction","title":"Linux System Introduction","text":""},{"location":"02_Linux_basics/#what-is-linux","title":"What is Linux?","text":"Term Description Linux refers to a family of open-source Unix-like operating systems based on the Linux kernel. Linux kernel a free and open-source Unix-like operating system kernel, which controls the software and hardware of the computer. Linux distributions refer to\u00a0operating systems\u00a0made from a software collection that is based upon the\u00a0Linux kernel.

    Main functions of the Linux kernel

    Some of the most common Linux distributions

    Linux and Linus

    Linux is named after Linus Benedict Torvalds, a legendary Finnish software engineer who led the development of the Linux kernel. He also developed the amazing version control software, Git.

    Reference: https://en.wikipedia.org/wiki/Linux

    "},{"location":"02_Linux_basics/#how-do-we-interact-with-computers","title":"How do we interact with computers?","text":"

    GUI and CUI

    Shell

    "},{"location":"02_Linux_basics/#a-general-comparison-between-cui-and-gui","title":"A general comparison between CUI and GUI","text":"GUI CUI Interaction Graphics Command line Precision LOW HIGH Speed LOW HIGH Memory required HIGH LOW Ease of operation Easier DIFFICULT Flexibility MORE flexible LESS flexible

    Tip

    The reason we want to use a CUI for large-scale data analysis is that a CUI is better in terms of precision, memory usage, and processing speed.

    "},{"location":"02_Linux_basics/#overview-of-the-basic-commands-in-linux","title":"Overview of the basic commands in Linux","text":"

    Unlike clicking and dragging files in Windows or MacOS, in Linux, we usually handle files by typing commands in the terminal.

    Here is a list of the basic commands we are going to cover in this brief tutorial:

    Basic Linux commands

    Function group Commands Description Directories pwd, ls, mkdir, rmdir Commands for checking, creating and removing directories Files touch,cp,mv,rm Commands for creating, copying, moving and removing files Checking files cat,zcat,head,tail,less,more,wc Commands for inspecting files Archiving and compression tar,gzip,gunzip,zip,unzip Commands for archiving and compressing files Manipulating text sort,uniq,cut,join,tr Commands for manipulating text files Modifying permission chmod,chown, chgrp Commands for changing the permissions of files and directories Links ln Commands for creating symbolic and hard links Pipe, redirect and others pipe, >,>>,*,.,.. A group of miscellaneous commands Advanced text editing awk, sed Commands for more complicated text manipulation and editing"},{"location":"02_Linux_basics/#how-to-check-the-usage-of-a-command-using-man","title":"How to check the usage of a command using man:","text":"

    The first command we might want to learn is man, which shows the manual for a certain command. When you forget how to use a command, you can always use man to check.

    man : Check the manual of a command (e.g., man chmod), or use the --help option (e.g., chmod --help)

    For example, we want to check the usage of pwd:

    Use man to get the manual for commands

    $ man pwd\n
    Then you will see the manual of pwd in your terminal.
    PWD(1)                                              User     Commands                                              PWD(1)\n\nNAME\n       pwd - print name of current/working directory\n\nSYNOPSIS\n       pwd [OPTION]...\n\nDESCRIPTION\n       Print the full filename of the current working directory.\n....\n

    Explain shell

    Or you can use this wonderful website to get explanations for your commands.

    URL : https://explainshell.com/

    "},{"location":"02_Linux_basics/#commands","title":"Commands","text":""},{"location":"02_Linux_basics/#directories","title":"Directories","text":"

    The first set of commands are: pwd , cd , ls, mkdir and rmdir, which are related to directories (like the folders in a Windows system).

    "},{"location":"02_Linux_basics/#pwd","title":"pwd","text":"

    pwd : Print working directory, which means printing the path of the current directory (working directory)

    Use pwd to print the current directory you are in

    $ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n

    This command prints the absolute path.

    An example of Linux file system and file paths

    Type Description Example Absolute path path starting from root (the orange path) /home/User3/GWASTutorial/02_Linux_basics/README.md Relative path path starting from the current directory (the blue path) ./GWASTutorial/02_Linux_basics/README.md

    Tip: use readlink to obtain the absolute path of a file

    To get the absolute path of a file, you can use readlink -f [filename].

    $ readlink -f README.md \n/home/he/work/GWASTutorial/02_Linux_basics/README.md\n
    "},{"location":"02_Linux_basics/#cd","title":"cd","text":"

    cd: Change the current working directory.

    Use cd to change directory to 02_Linux_basics and then print the current directory

    $ cd 02_Linux_basics\n$ pwd\n/home/he/work/GWASTutorial/02_Linux_basics\n
    "},{"location":"02_Linux_basics/#ls","title":"ls","text":"

    ls : List the contents in the working directory

    Some frequently used options for ls :

    Simply list the files and directories in the current directory

    $ ls\nREADME.md  sumstats.txt\n

    List the files and directories with options -lha

    $ ls -lha\ndrwxr-xr-x   4 he  staff   128B Dec 23 14:07 .\ndrwxr-xr-x  17 he  staff   544B Dec 23 12:13 ..\n-rw-r--r--   1 he  staff     0B Oct 17 11:24 README.md\n-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt\n

    Tip: use tree to visualize the structure of a directory

    You can use the tree command to visualize the structure of a directory.

    $ tree ./02_Linux_basics/\n./02_Linux_basics/\n\u251c\u2500\u2500 README.md\n\u2514\u2500\u2500 sumstats.txt\n\n0 directories, 2 files\n
    "},{"location":"02_Linux_basics/#mkdir-rmdir","title":"mkdir & rmdir","text":"

    Make a directory and delete it

    $ mkdir new_directory\n$ ls\nnew_directory  README.md  sumstats.txt\n$ rmdir new_directory/\n$ ls\nREADME.md  sumstats.txt\n
    "},{"location":"02_Linux_basics/#manipulating-files","title":"Manipulating files","text":"

    This set of commands includes: touch, mv, rm, and cp

    "},{"location":"02_Linux_basics/#touch","title":"touch","text":"

    The touch command is used to create a new empty file.

    Create an empty text file called newfile.txt in this directory

    $ ls -l\ntotal 64048\n-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md\n-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt\n\n$ touch newfile.txt\n$ ls -l\ntotal 64048\n-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md\n-rw-r--r--  1 he  staff         0 Dec 23 14:14 newfile.txt\n-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt\n
    "},{"location":"02_Linux_basics/#mv","title":"mv","text":"

    mv has two functions: moving files or directories, and renaming them.

    The following command will create a new directory called new_directory and move sumstats.txt into that directory, just like dragging a file into a folder in a Windows system.

    Move a file to a different directory

    # make a new directory\n$ mkdir new_directory\n\n#move sumstats to the new directory\n$ mv sumstats.txt new_directory/\n\n# list the item in new_directory\n$ ls new_directory/\nsumstats.txt\n

    Now, let's move it back to the current directory and rename it to sumstats_new.txt.

    Rename a file using mv

    $ mv ./new_directory/sumstats.txt ./\n
    Note: ./ means the current directory. You can also use mv to rename a file:
    #rename\n$mv sumstats.txt sumstats_new.txt \n

    "},{"location":"02_Linux_basics/#rm","title":"rm","text":"

    rm : Remove files or directories

    Remove a file and a directory

    # remove a file\n$rm file\n\n#remove files in a directory (recursive mode)\n$rm -r directory/\n

    There is no trash can in the Linux command-line interface

    If you delete a file with rm, it will be very difficult to restore it. Please be careful when using rm.
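    A common safeguard (our suggestion, not part of the original text; the exact prompt wording depends on your rm version) is the interactive option -i, which asks for confirmation before each removal:

    # -i prompts before every removal\n$ rm -i newfile.txt\nrm: remove regular empty file 'newfile.txt'?\n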

    "},{"location":"02_Linux_basics/#cp","title":"cp","text":"

    The cp command is used to copy files or directories.

    Copy a file and a directory

    #cp files\n$cp file1 file2\n\n# copy directory\n$cp -r directory1/ directory2/\n
    "},{"location":"02_Linux_basics/#links","title":"Links","text":"

    A symbolic link is like a shortcut in a Windows system: a special type of file that points to another file.

    It is very useful when you want to organize your toolbox or workspace.

    You can use ln -s pathA pathB to create such a link.

    Create a symbolic link for plink

    Let's create a symbolic link for plink first.

    # /home/he/tools/plink/plink is the original file\n# /home/he/tools/bin is the path for the symbolic link \nln -s /home/he/tools/plink/plink /home/he/tools/bin\n

    And then check the link.

    cd /home/he/tools/bin\nls -lha\nlrwxr-xr-x  1 he  staff    27B Aug 30 11:30 plink -> /home/he/tools/plink/plink\n
    "},{"location":"02_Linux_basics/#archiving-and-compression","title":"Archiving and Compression","text":"

    Results for millions of variants are usually very large, sometimes >10GB, or consist of multiple files.

    To save space and make it easier to transfer, we need to archive and compress these files.

    Archiving and Compression

    Commonly used commands for archiving and compression:

    Extensions Create Extract Functions file.gz gzip gunzip compress files.tar tar -cvf tar -xvf archive files.tar.gz or files.tgz tar -czvf tar -xvzf archive and compress file.zip zip unzip archive and compress

    Compress and decompress a file using gzip and gunzip

    $ ls -lh\n-rw-r--r--  1 he  staff    31M Dec 23 14:07 sumstats.txt\n\n$ gzip sumstats.txt\n$ ls -lh\n-rw-r--r--  1 he  staff   9.9M Dec 23 14:07 sumstats.txt.gz\n\n$ gunzip sumstats.txt.gz\n$ ls -lh\n-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt\n
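    The tar commands listed in the table above work in a similar way. A short sketch (the directory and archive names are just for illustration):

    # archive and compress a directory into one .tar.gz file\n$ tar -czvf results.tar.gz results/\n\n# extract it again\n$ tar -xvzf results.tar.gz\n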
    "},{"location":"02_Linux_basics/#read-and-check-files","title":"Read and check files","text":"

    We have a group of handy commands for checking part of a file or the entire file, including cat, zcat, less, head, tail, and wc.

    "},{"location":"02_Linux_basics/#cat","title":"cat","text":"

    The cat command can print the contents of files or concatenate files.

    Create and then cat the file a_text_file.txt

    $ ls -lha > a_text_file.txt\n$ cat a_text_file.txt \ntotal 32M\ndrwxr-x---  2 he staff 4.0K Apr  2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..\n-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt\n-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md\n-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt\n

    Warning

    Be careful not to cat a text file with a huge number of lines. You can try to cat sumstats.txt and see what happens.

    By the way, > a_text_file.txt here means redirecting the output to the file a_text_file.txt.

    "},{"location":"02_Linux_basics/#zcat","title":"zcat","text":"

    zcat is similar to cat, but can only be applied to compressed files.

    cat and zcat a gzipped text file

    $ gzip a_text_file.txt \n$ cat a_text_file.txt.gz                                                         TGba_text_file.    txt\u044f\n@\u0231\u00bbO\ud8ac\udc19v\u0602\ud85e\udca9\u00bc\ud9c3\udce0bq}\udb06\udca4\\\ueee0\u00a4n\u0662\u00aa\uda40\udc2cn\u00bb\u06a1\u01ed\n                          w5J_\u00bd\ud88d\ude27P\u07c9=\u00ffK\n(\u05a3\u0530\u00a7\u04a4\u0176a\u0786                              \u00acM\u00adR\udbb5\udc8am\u00b3\u00fee\u00b8\u00a4\u00bc\u05cdSd\ufff1\u07f2\ub4e4\u00aa\u00adv\n       \u5a41                                                                                                               resize: unknown character, exiting.\n\n$ zcat a_text_file.txt.gz \ntotal 32M\ndrwxr-x---  2 he staff 4.0K Apr  2 00:37 .\ndrwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..\n-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt\n-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md\n-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt\n

    gzcat

    Use gzcat instead of zcat if your device is running macOS.

    "},{"location":"02_Linux_basics/#head","title":"head","text":"

    head: Print the first 10 lines.

    -n: option to change the number of lines.

    Check the first 10 lines and only the first line of the file sumstats.txt

    $ head sumstats.txt \nCHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   319 17  2   1   1   ADD 10000   1.04326 0.0495816   0.854176    0.393008    .\n1   319 22  1   2   2   ADD 10000   1.03347 0.0493972   0.666451    0.505123    .\n1   418 23  1   2   2   ADD 10000   1.02668 0.0498185   0.528492    0.597158    .\n1   537 30  1   2   2   ADD 10000   1.01341 0.0498496   0.267238    0.789286    .\n1   546 31  2   1   1   ADD 10000   1.02051 0.0336786   0.60284 0.546615    .\n1   575 33  2   1   1   ADD 10000   1.09795 0.0818305   1.14199 0.25346 .\n1   752 44  2   1   1   ADD 10000   1.02038 0.0494069   0.408395    0.682984    .\n1   913 50  2   1   1   ADD 10000   1.07852 0.0493585   1.53144 0.12566 .\n1   1356    77  2   1   1   ADD 10000   0.947521    0.0339805   -1.5864 0.112649    .\n\n$ head -n 1 sumstats.txt \nCHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n
    "},{"location":"02_Linux_basics/#tail","title":"tail","text":"

    Similar to head, you can use tail to check the last 10 lines. -n works in the same way.

    Check the last 10 lines of the file sumstats.txt

    $ tail sumstats.txt \n22  99996057    9959945 2   1   1   ADD 10000   1.03234 0.0335547   0.948413    0.342919.\n22  99996465    9959971 2   1   1   ADD 10000   1.04755 0.0337187   1.37769 0.1683  .\n22  99997041    9960013 2   1   1   ADD 10000   1.01942 0.0937548   0.205195    0.837419.\n22  99997608    9960051 2   1   1   ADD 10000   0.969928    0.0397711   -0.767722   0.    442652    .\n22  99997629    9960055 2   1   1   ADD 10000   0.986949    0.0395305   -0.332315   0.    739652    .\n22  99997742    9960061 2   1   1   ADD 10000   0.990829    0.0396614   -0.232298   0.    816307    .\n22  99998121    9960086 2   1   1   ADD 10000   1.04448 0.0335879   1.29555 0.19513 .\n22  99998455    9960106 2   1   1   ADD 10000   0.880953    0.152754    -0.829771   0.    406668    .\n22  99999208    9960146 2   1   1   ADD 10000   0.944604    0.065187    -0.874248   0.    381983    .\n22  99999382    9960164 2   1   1   ADD 10000   0.970509    0.033978    -0.881014   0.37831 .\n
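    tail also accepts -n +K, which prints a file from the K-th line onward. This is handy for skipping the header line of a sumstats file (output shortened; the two lines shown are taken from the head example above):

    # print everything except the header line\n$ tail -n +2 sumstats.txt | head -n 2\n1   319 17  2   1   1   ADD 10000   1.04326 0.0495816   0.854176    0.393008    .\n1   319 22  1   2   2   ADD 10000   1.03347 0.0493972   0.666451    0.505123    .\n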
    "},{"location":"02_Linux_basics/#wc","title":"wc","text":"

    wc: short for word count, which counts the lines, words, and characters in a file.

    For example,

    Count the lines, words, and characters in sumstats.txt

    $ wc sumstats.txt \n  445933  5797129 32790417 sumstats.txt\n
    This means that sumstats.txt has 445933 lines, 5797129 words, and 32790417 characters.
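    If you only need the number of lines, which is the usual question for sumstats files, use the -l option:

    $ wc -l sumstats.txt\n445933 sumstats.txt\n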

    "},{"location":"02_Linux_basics/#edit-files","title":"Edit files","text":"

    Vim is a handy text editor for the command line.

    Vim - text editor

    vim README.md\n

    Simple workflow using Vim

    1. vim file_to_edit.txt
    2. Press i to enter the INSERT mode.
    3. Edit the file.
    4. When finished, press the Esc key to leave the INSERT mode.
    5. Then type :wq to save the file and quit.

    Vim is a little hard for beginners to learn, but once you are familiar with it, it becomes a mighty and convenient tool. For more detailed tutorials on Vim, you can check: https://github.com/iggredible/Learn-Vim

    Other common command line text editors

    "},{"location":"02_Linux_basics/#permission","title":"Permission","text":"

    The permissions of a file or directory are represented as a 10-character string (1+3+3+3):

    For example, the string below represents a directory (the initial d) which is readable, writable, and executable for the owner (the first three characters: rwx), users in the same group (the three characters in the middle: rwx), and others (the last three characters: rwx).

    drwxrwxrwx

    -> d (directory or file) rwx (permissions for owner) rwx (permissions for users in the same group) rwx (permissions for other users)

    Notation Description r readable w writable x executable d directory - file

    Command for checking the permissions of files in the current directory: ls -l

    Commands for changing permissions and ownership: chmod, chown, chgrp

    Syntax:

    chmod [number notation] [path]\n

    Number notation Permission 3-digit Binary notation 7 rwx 111 6 rw- 110 5 r-x 101 4 r-- 100 3 -wx 011 2 -w- 010 1 --x 001 0 --- 000

    Change the permissions of the file README.md to 660

    # there is a readme file in the directory, and its permissions are -rw-r----- \n$ ls -lh\ntotal 4.0K\n-rw-r----- 1 he staff 2.1K Feb 24 01:16 README.md\n\n# let's change the permissions to 660, which is the number notation of -rw-rw---- based on the table above\n$ chmod 660 README.md \n\n# check again, and it was changed.\n$ ls -lh\ntotal 4.0K\n-rw-rw---- 1 he staff 2.1K Feb 24 01:16 README.md\n

    Note

    These commands are very important because we use genome data, which could raise severe ethical and privacy issues if there is a data leak.

    Warning

    Please always be cautious when handling human genomic data.

    "},{"location":"02_Linux_basics/#others","title":"Others","text":"

    There is a group of very handy and flexible commands that will greatly improve your efficiency. These include |, >, >>, *, ., .., ~, and -.

    "},{"location":"02_Linux_basics/#pipe","title":"| (pipe)","text":"

    The pipe is used to pass the output of the previous command to the next command as input, instead of printing it in the terminal. Using pipes, you can perform very complicated manipulations of files.

    An example of Pipe

    cat sumstats.txt | sort | uniq | wc\n
    This means: (1) print sumstats.txt, (2) sort the output, (3) keep only the unique lines, and finally (4) count the lines, words, and characters.

    "},{"location":"02_Linux_basics/#_1","title":">","text":"

    > redirects output to a new file (if the file already exists, it will be overwritten)

    Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt

    cat sumstats.txt | sort | uniq | wc > count.txt\n
    "},{"location":"02_Linux_basics/#_2","title":">>","text":"

    >> redirects output to a file by appending to the end of the file (if the file already exists, it will not be overwritten)

    Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt by appending

    cat sumstats.txt | sort | uniq | wc >> count.txt\n

    Other useful commands include:

    Command Description Example Code Example code meaning * represents zero or more characters - - ? represents a single character - - . the current directory - - .. the parent directory of the current directory. cd .. change to the parent directory of the current directory ~ the home directory cd ~ change to the current user's home directory - the last directory you are working in. cd - change to the last directory you are working in.

    Wildcards

    The asterisk * and the question mark ? are called wildcard characters, or wildcards, in Linux. They are special symbols that can represent other normal characters. Wildcards are especially useful when handling multiple files with similar patterns in their names, for example:
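    A quick illustration (the file name follows this tutorial's directory):

    # * matches any number of characters\n$ ls sumstats*\nsumstats.txt\n\n# ? matches exactly one character\n$ ls sumstats.??t\nsumstats.txt\n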

    Warning

    Be extremely careful when you use rm together with *. It can be disastrous if you mistakenly type rm *.

    "},{"location":"02_Linux_basics/#bash-scripts","title":"Bash scripts","text":"

    If you have a lot of commands to run, or if you want to automate some complex manipulations, bash scripts are a good way to address this issue.

    We can use vim to create a bash script called hello.sh

    A simple example of a bash script:

    Example

    hello.sh
    #!/bin/bash\necho \"Hello, world1\"\necho \"Hello, world2\"\n

    #! is called a shebang; it tells the system which interpreter to use to execute the shell script.

    Then use chmod to give it permission to execute.

    chmod +x hello.sh \n

    Now we can run the script with ./hello.sh:

    ./hello.sh\n\"Hello, world1\" \n\"Hello, world2\" \n
    "},{"location":"02_Linux_basics/#advanced-text-editing","title":"Advanced text editing","text":"

    (optional: awk, sed, cut, sort, join, uniq)

    Advanced commands:
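    As a small taste of what these commands can do with sumstats.txt (a hedged sketch under two assumptions not stated in the original: the file is tab-delimited with CHROM in column 1 and P in column 12, as in the head example above, and sed is GNU sed so the \t escape is understood):

    # count variants per chromosome: extract column 1, drop the header, sort, then count unique values\n$ cut -f 1 sumstats.txt | tail -n +2 | sort | uniq -c\n\n# keep only genome-wide significant lines (P < 5e-8) with awk, skipping the header\n$ awk 'NR>1 && $12 < 5e-8' sumstats.txt > signif.txt\n\n# convert tabs to commas with sed (a quick tsv-to-csv conversion)\n$ sed 's/\\t/,/g' sumstats.txt > sumstats.csv\n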

    "},{"location":"02_Linux_basics/#git-and-github","title":"Git and Github","text":"

    Git is a powerful version control tool, and GitHub is a platform where you can share your code.

    For now, you just need to learn git clone, which simply downloads an existing repository.

    git clone https://github.com/Cloufield/GWASTutorial.git

    You can also check here for more information.

    Quote

    "},{"location":"02_Linux_basics/#download","title":"Download","text":"

    We can use the wget [option] [url] command to download files to the local machine.

    The -O option specifies a new name for the downloaded file.

    Use wget to download the hg19 reference genome from UCSC

    # Download hg19 reference genome from UCSC\nwget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n\n# Download hg19 reference genome from UCSC and rename it to  my_refgenome.fa.gz\nwget -O my_refgenome.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz\n
    "},{"location":"02_Linux_basics/#exercise","title":"Exercise","text":"

    The questions are generated by Microsoft Bing!

    What is the command to list all files and directories in your current working directory?

    What is the command to create a new directory named \u201ctest\u201d?

    What is the command to copy a file named \u201cdata.txt\u201d from your current working directory to another directory named \u201cbackup\u201d?

    What is the command to display the first 10 lines of a file named \u201cresults.csv\u201d?

    What is the command to count the number of lines, words, and characters in a file named \u201creport.txt\u201d?

    What is the command to search for a pattern in a file named \u201clog.txt\u201d and print only the matching lines?

    What is the command to sort the contents of a file named \u201cnames.txt\u201d in alphabetical order and save the output to a new file named \u201csorted_names.txt\u201d?

    What is the command to display the difference between two files named \u201cold_version.py\u201d and \u201cnew_version.py\u201d?

    What is the command to change the permissions of a file named \u201cscript.sh\u201d to make it executable by everyone?

    What is the command to run a program named \u201cprogram.exe\u201d in the background and redirect its output to a file named \u201coutput.log\u201d?

    "},{"location":"03_Data_formats/","title":"Data format","text":"

    This section lists some of the most commonly used formats in complex trait genomic analysis.

    "},{"location":"03_Data_formats/#table-of-contents","title":"Table of Contents","text":""},{"location":"03_Data_formats/#data-formats-for-general-purposes","title":"Data formats for general purposes","text":""},{"location":"03_Data_formats/#txt","title":"txt","text":"

    Simple text file

    .txt

    cat sample_text.txt \nLorem ipsum dolor sit amet, consectetur adipiscing elit. In ut sem congue, tristique tortor et, ullamcorper elit. Nulla elementum, erat ac fringilla mattis, nisi tellus euismod dui, interdum laoreet orci velit vel leo. Vestibulum neque mi, pharetra in tempor id, malesuada at ipsum. Duis tellus enim, suscipit sit amet vestibulum in, ultricies vitae erat. Proin consequat id quam sed sodales. Ut a magna non tellus dictum aliquet vitae nec mi. Suspendisse potenti. Vestibulum mauris sem, viverra ac metus sed, scelerisque ornare arcu. Vivamus consequat, libero vitae aliquet tempor, lorem leo mattis arcu, et viverra erat ligula sit amet tortor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Praesent ut massa ac tortor lobortis placerat. Pellentesque aliquam tortor augue, at rutrum magna molestie et. Etiam congue nulla in venenatis congue. Nunc ac felis pharetra, cursus leo et, finibus eros.\n
    The random text was generated using https://www.lipsum.com/

    "},{"location":"03_Data_formats/#tsv","title":"tsv","text":"

    Tab-separated values: a tabular data format

    .tsv

    head sample_data.tsv\n#CHROM  POS ID  REF ALT A1  FIRTH?  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   N   ADD 503 0.750168    0.280794    -1.02373    0.305961    .\n1   14599   1:14599:T:A T   A   A   N   ADD 503 1.80972 0.231595    2.56124 0.0104299   .\n1   14604   1:14604:A:G A   G   G   N   ADD 503 1.80972 0.231595    2.56124 0.0104299   .\n1   14930   1:14930:A:G A   G   G   N   ADD 503 1.70139 0.240245    2.21209 0.0269602   .\n1   69897   1:69897:T:C T   C   T   N   ADD 503 1.58002 0.194774    2.34855 0.0188466   .\n1   86331   1:86331:A:G A   G   G   N   ADD 503 1.47006 0.236102    1.63193 0.102694    .\n1   91581   1:91581:G:A G   A   A   N   ADD 503 0.924422    0.122991    -0.638963   0.522847    .\n1   122872  1:122872:T:G    T   G   G   N   ADD 503 1.07113 0.180776    0.380121    0.703856    .\n1   135163  1:135163:C:T    C   T   T   N   ADD 503 0.711822    0.23908 -1.42182    0.155079    .\n
    "},{"location":"03_Data_formats/#csv","title":"csv","text":"

    Comma-separated values: a tabular data format

    .csv

    head sample_data.csv \n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
    "},{"location":"03_Data_formats/#data-formats-in-bioinformatics","title":"Data formats in bioinformatics","text":"

    A typical workflow for generating genotype data for genome-wide association analysis.

    "},{"location":"03_Data_formats/#sequence","title":"Sequence","text":""},{"location":"03_Data_formats/#fasta","title":"fasta","text":"

    A text-based format for representing either nucleotide sequences or amino acid (protein) sequences

    .fa or .fasta

    >SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n
    "},{"location":"03_Data_formats/#fastq","title":"fastq","text":"

    A text-based format for storing both a nucleotide sequence and its corresponding quality scores

    .fastq

    @SEQ_ID\nGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n+\n!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n
    Reference: https://en.wikipedia.org/wiki/FASTQ_format

    "},{"location":"03_Data_formats/#alingment","title":"Alingment","text":""},{"location":"03_Data_formats/#sambam","title":"SAM/BAM","text":"

    Sequence Alignment/Map Format is a TAB-delimited text file format consisting of a header section and an alignment section.

    .sam

    @HD VN:1.6 SO:coordinate\n@SQ SN:ref LN:45\nr001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *\nr002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *\nr003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;\nr004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *\nr003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;\nr001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1\n
    Reference : https://samtools.github.io/hts-specs/SAMv1.pdf

    "},{"location":"03_Data_formats/#variant-and-genotype","title":"Variant and genotype","text":""},{"location":"03_Data_formats/#vcf-vcfgz-vcfgztbi","title":"vcf / vcf.gz / vcf.gz.tbi","text":"

    VCF is a text file format consisting of meta-information lines, a header line, and then data lines. Each data line contains information about a variant in the genome (and the genotype information on samples for each variant).

    .vcf

    ##fileformat=VCFv4.2\n##fileDate=20090805\n##source=myImputationProgramV3.1\n##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta\n##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>\n##phasing=partial\n##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">\n##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">\n##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">\n##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">\n##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">\n##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">\n##FILTER=<ID=q10,Description=\"Quality below 10\">\n##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">\n##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">\n##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">\n#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003\n20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.\n20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3\n20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4\n20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2\n20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3\n
    Reference : https://samtools.github.io/hts-specs/VCFv4.2.pdf

    "},{"location":"03_Data_formats/#plink-format","title":"PLINK format","text":"

    The figure shows how genotypes are stored in files.

    We have 3 parts of information:

    1. Individual information
    2. Variant information
    3. Genotype matrix

    And there are different ways (format sets) to represent this information in PLINK1.9 and PLINK2:

    1. ped / map
    2. fam / bim / bed
    3. psam / pvar / pgen

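    These three format sets carry the same core information, and PLINK itself can convert between them. A minimal sketch (the fileset prefix mydata is illustrative; the flags are standard PLINK1.9/PLINK2 options):

    # ped/map -> bed/bim/fam with PLINK1.9\nplink --file mydata --make-bed --out mydata\n\n# bed/bim/fam -> pgen/psam/pvar with PLINK2\nplink2 --bfile mydata --make-pgen --out mydata\n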
    "},{"location":"03_Data_formats/#ped-map","title":"ped / map","text":"

    .ped (PLINK/MERLIN/Haploview text pedigree + genotype table)

    Original standard text format for sample pedigree information and genotype calls. Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

    .ped

    # check the first 16 rows and 16 columns of the ped file\ncut -d \" \" -f 1-16 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped | head\n0 HG00403 0 0 0 -9 G G T T A A G A C C\n0 HG00404 0 0 0 -9 G G T T A A G A T C\n0 HG00406 0 0 0 -9 G G T T A A G A T C\n0 HG00407 0 0 0 -9 G G T T A A A A C C\n0 HG00409 0 0 0 -9 G G T T A A G A C C\n0 HG00410 0 0 0 -9 G G T T A A G A C C\n0 HG00419 0 0 0 -9 G G T T A A A A T C\n0 HG00421 0 0 0 -9 G G T T A A G A C C\n0 HG00422 0 0 0 -9 G G T T A A G A C C\n0 HG00428 0 0 0 -9 G G T T A A G A C C\n0 HG00436 0 0 0 -9 G G A T G A A A C C\n0 HG00437 0 0 0 -9 C G T T A A G A C C\n0 HG00442 0 0 0 -9 G G T T A A G A C C\n0 HG00443 0 0 0 -9 G G T T A A G A C C\n0 HG00445 0 0 0 -9 G G T T A A G A C C\n0 HG00446 0 0 0 -9 C G T T A A G A T C\n

    .map (PLINK text fileset variant information file)

    Variant information file accompanying a .ped text pedigree + genotype table. A text file with no header line, and one line per variant with the following 3-4 fields: chromosome code, variant identifier, position in morgans or centimorgans (optional; '0' when unknown), and base-pair coordinate.

    .map

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n1       1:13273:G:C     0       13273\n1       1:14599:T:A     0       14599\n1       1:14604:A:G     0       14604\n1       1:14930:A:G     0       14930\n1       1:69897:T:C     0       69897\n1       1:86331:A:G     0       86331\n1       1:91581:G:A     0       91581\n1       1:122872:T:G    0       122872\n1       1:135163:C:T    0       135163\n1       1:233473:C:G    0       233473\n

    Reference: https://www.cog-genomics.org/plink/1.9/formats

    "},{"location":"03_Data_formats/#bed-fam-bim","title":"bed / fam /bim","text":"

    bed/fam/bim formats are the binary implementation of ped/map formats. bed/bim/fam files contain the same information as ped/map but are much smaller in size.

    -rw-r----- 1 yunye yunye 135M Dec 23 11:45 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed\n-rw-r----- 1 yunye yunye  36M Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n-rw-r----- 1 yunye yunye 9.4K Dec 23 11:46 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n-rw-r--r-- 1 yunye yunye  32M Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.map\n-rw-r--r-- 1 yunye yunye 2.2G Dec 27 17:51 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.ped\n

    .fam

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.fam\n0 HG00403 0 0 0 -9\n0 HG00404 0 0 0 -9\n0 HG00406 0 0 0 -9\n0 HG00407 0 0 0 -9\n0 HG00409 0 0 0 -9\n0 HG00410 0 0 0 -9\n0 HG00419 0 0 0 -9\n0 HG00421 0 0 0 -9\n0 HG00422 0 0 0 -9\n0 HG00428 0 0 0 -9\n

    .bim

    head 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bim\n1       1:13273:G:C     0       13273   C       G\n1       1:14599:T:A     0       14599   A       T\n1       1:14604:A:G     0       14604   G       A\n1       1:14930:A:G     0       14930   G       A\n1       1:69897:T:C     0       69897   C       T\n1       1:86331:A:G     0       86331   G       A\n1       1:91581:G:A     0       91581   A       G\n1       1:122872:T:G    0       122872  G       T\n1       1:135163:C:T    0       135163  T       C\n1       1:233473:C:G    0       233473  G       C\n

    .bed

    \"Primary representation of genotype calls at biallelic variants The first three bytes should be 0x6c, 0x1b, and 0x01 in that order. The rest of the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.\"

    hexdump -C 1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020.bed | head\n00000000  6c 1b 01 ff ff bf bf ff  ff ff ef fb ff ff ff fe  |l...............|\n00000010  ff ff ff ff fb ff bb ff  ff fb af ff ff fe fb ff  |................|\n00000020  ff ff ff fe ff ff ff ff  ff bf ff ff ef ff ff ef  |................|\n00000030  bb ff ff ff ff ff ff ff  fa ff ff ff ff ff ff ff  |................|\n00000040  ff ff ff fb ff ff ff ff  ff ff ff ff ff ff ff ef  |................|\n00000050  ff ff ff fb fe ef fe ff  ff ff ff eb ff ff fe fe  |................|\n00000060  ff ff fe ff bf ff fa fb  fb eb be ff ff 3b ff be  |.............;..|\n00000070  fe be bf ef fe ff ef ee  ff ff bf ea fe bf fe ff  |................|\n00000080  bf ff ff ef ff ff ff ff  ff fa ff ff eb ff ff ff  |................|\n00000090  ff ff fb fe af ff bf ff  ff ff ff ff ff ff ff ff  |................|\n

    Reference: https://www.cog-genomics.org/plink/1.9/formats

    "},{"location":"03_Data_formats/#imputation-dosage","title":"Imputation dosage","text":""},{"location":"03_Data_formats/#bgen-bgi","title":"bgen / bgi","text":"

    Reference: https://www.well.ox.ac.uk/~gav/bgen_format/

    "},{"location":"03_Data_formats/#pgenpsampvar","title":"pgen,psam,pvar","text":"

    Reference: https://www.cog-genomics.org/plink/2.0/formats#pgen

    NOTE: pgen only saves the dosage for each individual (a scalar ranging from 0 to 2). It cannot be converted back to the genotype probabilities (a vector of length 3) or the allele probabilities (a matrix of dimension 2 x 2) saved in bgen.

    "},{"location":"03_Data_formats/#summary","title":"Summary","text":""},{"location":"04_Data_QC/","title":"PLINK basics","text":"

    In this module, we will learn the basics of genotype data QC using PLINK, which is one of the most commonly used software tools in complex trait genomics. (Huge thanks to the developers: PLINK1.9 and PLINK2)

    "},{"location":"04_Data_QC/#table-of-contents","title":"Table of Contents","text":""},{"location":"04_Data_QC/#preparation","title":"Preparation","text":""},{"location":"04_Data_QC/#plink-192-installation","title":"PLINK 1.9&2 installation","text":"

    To prepare for genotype QC, we will need to make directories, download the software, and add the software to the environment path.

    First, we will simply create some directories to keep the tools we need to use.

    Create directories

    cd ~\nmkdir tools\ncd tools\nmkdir bin\nmkdir plink\nmkdir plink2\n

    You can download each tool into its corresponding directory.

    The bin directory here is for keeping all the symbolic links to the executable files of each tool.

    In this way, it is much easier to manage and organize the paths and tools. We will only add the bin directory here to the environment path.

    "},{"location":"04_Data_QC/#download-plink19-and-plink2-and-then-unzip","title":"Download PLINK1.9 and PLINK2 and then unzip","text":"

    Next, go to the PLINK webpage to download the software. We will need both PLINK1.9 and PLINK2.

    Download PLINK1.9 and PLINK2 from the following webpage to the corresponding directories:

    Info

    If you are using Mac or Windows, then please download the Mac or Windows version. In this tutorial, we will use a Linux system and the Linux version of PLINK.

    Find the suitable version on the PLINK website, right-click and copy the link address.

    Download PLINK2 (Linux AVX2 AMD)

    cd ~/tools/plink2\nwget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_amd_avx2_20231212.zip\nunzip plink2_linux_amd_avx2_20231212.zip\n

    Then do the same for PLINK1.9

    Download PLINK1.9 (Linux 64-bit)

    cd ~/tools/plink\nwget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip\nunzip plink_linux_x86_64_20231211.zip\n
    "},{"location":"04_Data_QC/#create-symbolic-links","title":"Create symbolic links","text":"

    After downloading and unzipping, we will create symbolic links for the plink binary files, and then move the link to ~/tools/bin/.

    Create symbolic links

    cd ~\nln -s ~/tools/plink2/plink2 ~/tools/bin/plink2\nln -s ~/tools/plink/plink ~/tools/bin/plink\n
    "},{"location":"04_Data_QC/#add-paths-to-the-environment-path","title":"Add paths to the environment path","text":"

    Then add ~/tools/bin/ to the environment path.

    Example

    export PATH=$PATH:~/tools/bin/\n
    This command will add the path to your current shell.

If you restart the terminal, the setting will be lost, so you may want to add it to your Bash configuration file instead. To do so, run

    echo \"export PATH=$PATH:~/tools/bin/\" >> ~/.bashrc\n

This will add a new line at the end of .bashrc, which will be run every time you open a new bash shell. (Single quotes are used so that $PATH is written literally and expanded when the shell starts, rather than frozen at its current value.)

All done. Let's test whether PLINK was installed successfully.

    Check if PLINK is installed successfully.

    ./plink\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\n\nplink <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink --help [flag name(s)...]\n\nCommands include --make-bed, --recode, --flip-scan, --merge-list,\n--write-snplist, --list-duplicate-vars, --freqx, --missing, --test-mishap,\n--hardy, --mendel, --ibc, --impute-sex, --indep-pairphase, --r2, --show-tags,\n--blocks, --distance, --genome, --homozyg, --make-rel, --make-grm-gz,\n--rel-cutoff, --cluster, --pca, --neighbour, --ibs-test, --regress-distance,\n--model, --bd, --gxe, --logistic, --dosage, --lasso, --test-missing,\n--make-perm-pheno, --tdt, --qfam, --annotate, --clump, --gene-report,\n--meta-analysis, --epistasis, --fast-epistasis, and --score.\n\n\"plink --help | more\" describes all functions (warning: long).\n
    ./plink2\nPLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023)       www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\n\nplink2 <input flag(s)...> [command flag(s)...] [other flag(s)...]\nplink2 --help [flag name(s)...]\n\nCommands include --rm-dup list, --make-bpgen, --export, --freq, --geno-counts,\n--sample-counts, --missing, --hardy, --het, --fst, --indep-pairwise, --ld,\n--sample-diff, --make-king, --king-cutoff, --pmerge, --pgen-diff,\n--write-samples, --write-snplist, --make-grm-list, --pca, --glm, --adjust-file,\n--gwas-ssf, --clump, --score, --variant-score, --genotyping-rate, --pgen-info,\n--validate, and --zst-decompress.\n\n\"plink2 --help | more\" describes all functions.\n

    Well done. We have successfully installed plink1.9 and plink2.

    "},{"location":"04_Data_QC/#download-genotype-data","title":"Download genotype data","text":"

Next, we need to download the sample genotype data. The way the sample data were created is described [here](https://cloufield.github.io/GWASTutorial/01_Dataset/). This dataset contains 504 EAS individuals from the 1000 Genomes Project Phase 3 v5 with around 1 million variants.

    Simply run download_sampledata.sh in 01_Dataset to download this dataset (from Dropbox). See here

    Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.

    Download sample data

    cd ../01_Dataset\n./download_sampledata.sh\n

    And you will get the following three PLINK files:

    -rw-r--r-- 1 yunye yunye 149M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed\n-rw-r--r-- 1 yunye yunye  40M Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n-rw-r--r-- 1 yunye yunye  13K Dec 26 13:25 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\n

    Check the bim file:

    head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim\n1       1:14930:A:G     0       14930   G       A\n1       1:15774:G:A     0       15774   A       G\n1       1:15777:A:G     0       15777   G       A\n1       1:57292:C:T     0       57292   T       C\n1       1:77874:G:A     0       77874   A       G\n1       1:87360:C:T     0       87360   T       C\n1       1:92917:T:A     0       92917   A       T\n1       1:104186:T:C    0       104186  T       C\n1       1:125271:C:T    0       125271  C       T\n1       1:232449:G:A    0       232449  A       G\n

    Check the fam file:

    head 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam\nHG00403 HG00403 0 0 0 -9\nHG00404 HG00404 0 0 0 -9\nHG00406 HG00406 0 0 0 -9\nHG00407 HG00407 0 0 0 -9\nHG00409 HG00409 0 0 0 -9\nHG00410 HG00410 0 0 0 -9\nHG00419 HG00419 0 0 0 -9\nHG00421 HG00421 0 0 0 -9\nHG00422 HG00422 0 0 0 -9\nHG00428 HG00428 0 0 0 -9\n

    "},{"location":"04_Data_QC/#plink-tutorial","title":"PLINK tutorial","text":"

    Detailed descriptions can be found on plink's website: PLINK1.9 and PLINK2.

    The functions we will learn in this tutorial:

    1. Calculating missing rate (call rate)
2. Calculating allele frequency
    3. Conducting Hardy-Weinberg equilibrium exact test
    4. Applying filters
    5. Conducting LD-Pruning
    6. Calculating inbreeding F coefficient
    7. Conducting sample & SNP filtering (extract/exclude/keep/remove)
    8. Estimating IBD / PI_HAT
    9. Calculating LD
    10. Data management (make-bed/recode)

All sample codes and results for this module are available in ./04_Data_QC

    "},{"location":"04_Data_QC/#qc-step-summary","title":"QC Step Summary","text":"

    QC Step Summary

| QC step | Option in PLINK | Commonly used threshold to exclude |
|---|---|---|
| SNP missing rate | --geno, --missing | missing rate > 0.01 (0.02, or 0.05) |
| Sample missing rate | --mind, --missing | missing rate > 0.01 (0.02, or 0.05) |
| Minor allele frequency | --freq, --maf | MAF < 0.01 |
| Sample relatedness | --genome | pi_hat > 0.2 to exclude second-degree relatives |
| Hardy-Weinberg equilibrium | --hwe, --hardy | HWE exact test P value < 1e-6 |
| Inbreeding F coefficient | --het | outside of 3 SD from the mean |

(Note: --geno filters variants by missing rate, while --mind filters samples; the two were swapped in an earlier version of this table.)

    First, we can calculate some basic statistics of our simulated data:

    "},{"location":"04_Data_QC/#missing-rate-call-rate","title":"Missing rate (call rate)","text":"

    The first thing we want to know is the missing rate of our data. Usually, we need to check the missing rate of samples and SNPs to decide a threshold to exclude low-quality samples and SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#missing)

    Missing rate and Call rate

    Suppose we have N samples and M SNPs for each sample.

    For sample \\(j\\) :

    \\[Sample\\ Missing\\ Rate_{j} = {{N_{missing\\ SNPs\\ for\\ j}}\\over{M}} = 1 - Call\\ Rate_{sample, j}\\]

    For SNP \\(i\\) :

    \\[SNP\\ Missing\\ Rate_{i} = {{N_{missing\\ samples\\ at\\ i}}\\over{N}} = 1 - Call\\ Rate_{SNP, i}\\]

The input is the PLINK bed/bim/fam file set. Usually, the three files share the same prefix, and we just need to pass the prefix to the --bfile option.

    "},{"location":"04_Data_QC/#plink-syntax","title":"PLINK syntax","text":"

    PLINK syntax

    To calculate the missing rate, we need the flag --missing, which tells PLINK to calculate the missing rate in the dataset specified by --bfile.

    Calculate missing rate

    cd ../04_Data_QC\ngenotypeFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" #!!! Please add your own path here.  \"1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" is the prefix of PLINK bed file. \n\nplink \\\n    --bfile ${genotypeFile} \\\n    --missing \\\n    --out plink_results\n
Remember to set the value for ${genotypeFile}.

    This code will generate two files plink_results.imiss and plink_results.lmiss, which contain the missing rate information for samples and SNPs respectively.

Take a look at the .imiss file. The last column shows the missing rate for each sample. Since we used a subset of the 1000 Genomes Project data this time, there were no missing genotypes in the original dataset. But for educational purposes, some genotypes were randomly set to missing.

    # missing rate for each sample\nhead plink_results.imiss\n    FID       IID MISS_PHENO   N_MISS   N_GENO   F_MISS\nHG00403   HG00403          Y    10020  1235116 0.008113\nHG00404   HG00404          Y     9192  1235116 0.007442\nHG00406   HG00406          Y    15751  1235116  0.01275\nHG00407   HG00407          Y    14653  1235116  0.01186\nHG00409   HG00409          Y     5667  1235116 0.004588\nHG00410   HG00410          Y     6066  1235116 0.004911\nHG00419   HG00419          Y    20000  1235116  0.01619\nHG00421   HG00421          Y    17542  1235116   0.0142\nHG00422   HG00422          Y    18608  1235116  0.01507\n
    # missing rate for each SNP\nhead plink_results.lmiss\n CHR              SNP   N_MISS   N_GENO   F_MISS\n   1      1:14930:A:G        2      504 0.003968\n   1      1:15774:G:A        3      504 0.005952\n   1      1:15777:A:G        3      504 0.005952\n   1      1:57292:C:T        6      504   0.0119\n   1      1:77874:G:A        3      504 0.005952\n   1      1:87360:C:T        1      504 0.001984\n   1      1:92917:T:A        7      504  0.01389\n   1     1:104186:T:C        3      504 0.005952\n   1     1:125271:C:T        2      504 0.003968\n
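For example, for sample HG00403 in the .imiss output above, the sample missing rate is \(10020 / 1235116 \approx 0.0081\), which matches the F_MISS column.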

    Distribution of sample missing rate and SNP missing rate

    Note: The missing values were simulated based on normal distributions for each individual.

    Sample missing rate

    SNP missing rate

    For the meaning of headers, please refer to PLINK documents.

    "},{"location":"04_Data_QC/#allele-frequency","title":"Allele Frequency","text":"

    One of the most important statistics of SNPs is their frequency in a certain population. Many downstream analyses are based on investigating differences in allele frequencies.

    Usually, variants can be categorized into 3 groups based on their Minor Allele Frequency (MAF):

    1. Common variants : MAF>=0.05
    2. Low-frequency variants : 0.01<=MAF<0.05
    3. Rare variants : MAF<0.01

    How to calculate Minor Allele Frequency (MAF)

Suppose the reference allele (REF) is A and the alternative allele (ALT) is B for a certain SNP. The possible genotypes are AA, AB, and BB. In a population of N samples (2N alleles), \(N = N_{AA} + N_{AB} + N_{BB}\), where \(N_{AA}\), \(N_{AB}\), and \(N_{BB}\) are the counts of each genotype.

So we can calculate the allele frequencies:

\[AF_{REF} = {{2 \times N_{AA} + N_{AB}}\over{2N}} , AF_{ALT} = {{2 \times N_{BB} + N_{AB}}\over{2N}} = 1 - AF_{REF}\]

    The MAF for this SNP in this specific population is defined as:

    \\(MAF = min( AF_{REF}, AF_{ALT} )\\)
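As a rough illustration, here is a minimal Python sketch of this calculation (the function name is ours; the genotype counts 4/407/91 for 1:14930:A:G are taken from the .hwe output shown later in this module):

def maf(n_aa, n_ab, n_bb):\n    # frequency of the B allele computed from genotype counts\n    af_b = (2 * n_bb + n_ab) / (2 * (n_aa + n_ab + n_bb))\n    return min(af_b, 1 - af_b)\n\n# genotype counts 4/407/91 for 1:14930:A:G\nprint(maf(4, 407, 91))  # ~0.4134, matching the MAF of 0.4133 reported by plink --freq\n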

    For different downstream analyses, we might use different sets of variants. For example, for PCA, we might use only common variants. For gene-based tests, we might use only rare variants.

    Using PLINK1.9 we can easily calculate the MAF of variants in the input data.

    Calculate the MAF of variants using PLINK1.9

    plink \\\n    --bfile ${genotypeFile} \\\n    --freq \\\n    --out plink_results\n
    # results from plink1.9\nhead plink_results.frq\nCHR              SNP   A1   A2          MAF  NCHROBS\n1      1:14930:A:G    G    A       0.4133     1004\n1      1:15774:G:A    A    G      0.02794     1002\n1      1:15777:A:G    G    A      0.07385     1002\n1      1:57292:C:T    T    C       0.1054      996\n1      1:77874:G:A    A    G      0.01996     1002\n1      1:87360:C:T    T    C      0.02286     1006\n1      1:92917:T:A    A    T     0.003018      994\n1     1:104186:T:C    T    C        0.499     1002\n1     1:125271:C:T    C    T      0.03088     1004\n

    Next, we use plink2 to run the same options to check the difference between the results.

    Calculate the alternative allele frequencies of variants using PLINK2

    plink2 \\\n        --bfile ${genotypeFile} \\\n        --freq \\\n        --out plink_results\n
    # results from plink2\nhead plink_results.afreq\n#CHROM  ID      REF     ALT     PROVISIONAL_REF?        ALT_FREQS       OBS_CT\n1       1:14930:A:G     A       G       Y       0.413347        1004\n1       1:15774:G:A     G       A       Y       0.0279441       1002\n1       1:15777:A:G     A       G       Y       0.0738523       1002\n1       1:57292:C:T     C       T       Y       0.105422        996\n1       1:77874:G:A     G       A       Y       0.0199601       1002\n1       1:87360:C:T     C       T       Y       0.0228628       1006\n1       1:92917:T:A     T       A       Y       0.00301811      994\n1       1:104186:T:C    T       C       Y       0.500998        1002\n1       1:125271:C:T    C       T       Y       0.969124        1004\n

    We need to pay attention to the concepts here.

In PLINK1.9, the two alleles are the minor (A1) and major (A2) alleles, while in PLINK2 they are the reference (REF) and alternative (ALT) alleles. So the reported frequencies can differ: for example, for 1:125271:C:T, PLINK1.9 reports the minor allele C with MAF 0.03088, while PLINK2 reports the ALT allele T with frequency 0.969124 (= 1 - 0.030876).

    "},{"location":"04_Data_QC/#hardy-weinberg-equilibrium-exact-test","title":"Hardy-Weinberg equilibrium exact test","text":"

    For SNP QC, besides checking the missing rate, we also need to check if the SNP is in Hardy-Weinberg equilibrium:

--hardy will perform the Hardy-Weinberg equilibrium exact test for each variant. Variants with low P values usually suggest genotyping errors, or may indicate evolutionary selection at these variants.

    The following command can calculate the Hardy-Weinberg equilibrium exact test statistics for all SNPs. (https://www.cog-genomics.org/plink/1.9/basic_stats#hardy)

    Info

Suppose we have N unrelated samples (2N alleles). Under HWE, the exact probability of observing \(n_{AB}\) samples with genotype AB among N samples is:

\[P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}N!}\over{n_{AA}!n_{AB}!n_{BB}!}} \times {{n_A!n_B!}\over{(2N)!}} \]

    To compute the Hardy-Weinberg equilibrium exact test statistics, we will sum up the probabilities of all configurations with probability equal to or less than the observed configuration :

\[P_{HWE} = \sum_{n^{*}_{AB}} I[P(N_{AB} = n_{AB} | N, n_A) \geq P(N_{AB} = n^{*}_{AB} | N, n_A)] \times P(N_{AB} = n^{*}_{AB} | N, n_A)\]

    \\(I(x)\\) is the indicator function. If x is true, \\(I(x) = 1\\); otherwise, \\(I(x) = 0\\).

    Reference : Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link

    Calculate the Hardy-Weinberg equilibrium exact test statistics for a single SNP using Python

This code is converted from here (Jeremy McRae) to Python. Original citation: Wigginton, JE, Cutler, DJ, and Abecasis, GR (2005) A Note on Exact Tests of Hardy-Weinberg Equilibrium. AJHG 76: 887-893

def snphwe(obs_hets, obs_hom1, obs_hom2):\n    # rarer and more common homozygote counts\n    obs_homr = min(obs_hom1, obs_hom2)\n    obs_homc = max(obs_hom1, obs_hom2)\n\n    # number of rare allele copies and total genotype count\n    rare = 2 * obs_homr + obs_hets\n    genotypes = obs_hets + obs_homc + obs_homr\n\n    probs = [0.0 for i in range(rare + 1)]\n\n    # start at the most probable heterozygote count under HWE\n    mid = rare * (2 * genotypes - rare) // (2 * genotypes)\n    if mid % 2 != rare % 2:\n        mid += 1\n\n    probs[mid] = 1.0\n    sum_p = 1  # probs[mid]\n\n    curr_homr = (rare - mid) // 2\n    curr_homc = genotypes - mid - curr_homr\n\n    # fill probabilities for heterozygote counts below mid\n    for curr_hets in range(mid, 1, -2):\n        probs[curr_hets - 2] = probs[curr_hets] * curr_hets * (curr_hets - 1.0) / (4.0 * (curr_homr + 1.0) * (curr_homc + 1.0))\n        sum_p += probs[curr_hets - 2]\n        curr_homr += 1\n        curr_homc += 1\n\n    curr_homr = (rare - mid) // 2\n    curr_homc = genotypes - mid - curr_homr\n\n    # fill probabilities for heterozygote counts above mid\n    for curr_hets in range(mid, rare - 1, 2):\n        probs[curr_hets + 2] = probs[curr_hets] * 4.0 * curr_homr * curr_homc / ((curr_hets + 2.0) * (curr_hets + 1.0))\n        sum_p += probs[curr_hets + 2]\n        curr_homr -= 1\n        curr_homc -= 1\n\n    # sum the probabilities of configurations no more likely than the observed one\n    target = probs[obs_hets]\n    p_hwe = 0.0\n    for p in probs:\n        if p <= target:\n            p_hwe += p / sum_p\n\n    return min(p_hwe, 1)\n
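A hypothetical usage sketch, with genotype counts taken from the first SNP (1:14930:A:G, GENO 4/407/91) in the PLINK .hwe output shown below:

p = snphwe(obs_hets=407, obs_hom1=4, obs_hom2=91)\nprint(p)  # expected to be ~4.9e-61, matching the P column of the .hwe output\n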

    Calculate the Hardy-Weinberg equilibrium exact test statistics using PLINK

    plink \\\n    --bfile ${genotypeFile} \\\n    --hardy \\\n    --out plink_results\n
    head plink_results.hwe\n    CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P\n1      1:14930:A:G  ALL(NP)    G    A             4/407/91   0.8108    0.485    4.864e-61\n1      1:15774:G:A  ALL(NP)    A    G             0/28/473  0.05589  0.05433            1\n1      1:15777:A:G  ALL(NP)    G    A             1/72/428   0.1437   0.1368       0.5053\n1      1:57292:C:T  ALL(NP)    T    C             3/99/396   0.1988   0.1886       0.3393\n1      1:77874:G:A  ALL(NP)    A    G             0/20/481  0.03992  0.03912            1\n1      1:87360:C:T  ALL(NP)    T    C             0/23/480  0.04573  0.04468            1\n1      1:92917:T:A  ALL(NP)    A    T              0/3/494 0.006036 0.006018            1\n1     1:104186:T:C  ALL(NP)    T    C            74/352/75   0.7026      0.5    6.418e-20\n1     1:125271:C:T  ALL(NP)    C    T             1/29/472  0.05777  0.05985       0.3798\n

    "},{"location":"04_Data_QC/#applying-filters","title":"Applying filters","text":"

    Previously we calculated the basic statistics using PLINK. But when performing certain analyses, we just want to exclude the bad-quality samples or SNPs instead of calculating the statistics for all samples and SNPs.

In this case, we can apply filters such as: --maf (exclude variants by minor allele frequency), --geno (exclude variants by missing rate), --mind (exclude samples by missing rate), and --hwe (exclude variants by HWE exact test P value).

We will apply these filters in the following example of LD-pruning.

    "},{"location":"04_Data_QC/#ld-pruning","title":"LD Pruning","text":"

There is often strong linkage disequilibrium (LD) among SNPs. For some analyses we do not need all SNPs, and we need to remove the redundant SNPs to avoid bias in genetic estimations. For example, for relatedness estimation, we will use only the LD-pruned SNP set.

    We can use --indep-pairwise 50 5 0.2 to filter out those in strong LD and keep only the independent SNPs.

Meaning of --indep-pairwise x y z

--indep-pairwise 50 5 0.2 means: consider a window of 50 SNPs, calculate pairwise r2 within the window, remove one SNP of each pair with r2 greater than 0.2, then shift the window forward by 5 SNPs and repeat. Please check https://www.cog-genomics.org/plink/1.9/ld#indep for details.

    Combined with the filters we just introduced, we can run:

    Example

    plink \\\n    --bfile ${genotypeFile} \\\n    --maf 0.01 \\\n    --geno 0.02 \\\n    --mind 0.02 \\\n    --hwe 1e-6 \\\n    --indep-pairwise 50 5 0.2 \\\n    --out plink_results\n
This command generates two outputs: plink_results.prune.in and plink_results.prune.out. plink_results.prune.in contains the set of independent SNPs that we will use in the following analyses.

    You can check the PLINK log for how many variants were removed based on the filters you applied:

    Total genotyping rate in remaining samples is 0.993916.\n108837 variants removed due to missing genotype data (--geno).\n--hwe: 9754 variants removed due to Hardy-Weinberg exact test.\n87149 variants removed due to minor allele threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1029376 variants and 501 people pass filters and QC.\n

    Let's take a look at the LD-pruned SNP file. Basically, it just contains one SNP id per line.

    head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
    "},{"location":"04_Data_QC/#inbreeding-f-coefficient","title":"Inbreeding F coefficient","text":"

    Next, we can check the heterozygosity F of samples (https://www.cog-genomics.org/plink/1.9/basic_stats#ibc) :

The --het option will compute observed and expected autosomal homozygous genotype counts for each sample. Usually, we need to exclude individuals whose heterozygosity F coefficients are outliers, as this may indicate sample contamination or inbreeding.

    Inbreeding F coefficient calculation by PLINK

    \\[F = {{O(HOM) - E(HOM)}\\over{ M - E(HOM)}}\\]

    High F may indicate a relatively high level of inbreeding.

    Low F may suggest the sample DNA was contaminated.

Perform LD-pruning beforehand, since these calculations do not take LD into account.

    Calculate inbreeding F coefficient

    plink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --het \\\n    --out plink_results\n

    Check the output:

    head plink_results.het\n    FID       IID       O(HOM)       E(HOM)        N(NM)            F\nHG00403   HG00403       180222    1.796e+05       217363      0.01698\nHG00404   HG00404       180127    1.797e+05       217553      0.01023\nHG00406   HG00406       178891    1.789e+05       216533   -0.0001138\nHG00407   HG00407       178992     1.79e+05       216677   -0.0008034\nHG00409   HG00409       179918    1.801e+05       218045    -0.006049\nHG00410   HG00410       179782    1.801e+05       218028    -0.009268\nHG00419   HG00419       178362    1.783e+05       215849     0.001315\nHG00421   HG00421       178222    1.785e+05       216110    -0.008288\nHG00422   HG00422       178316    1.784e+05       215938      -0.0022\n
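As a sanity check, for HG00403 in the output above: \(F = (180222 - 1.796 \times 10^{5}) / (217363 - 1.796 \times 10^{5}) \approx 0.016\), roughly matching the reported 0.01698 (the small difference comes from E(HOM) being rounded in the display).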

    A commonly used method is to exclude samples with heterozygosity F deviating more than 3 standard deviations (SD) from the mean. Some studies used a fixed value such as +-0.15 or +-0.2.

    Usually we will use only LD-pruned SNPs for the calculation of F.

    We can plot the distribution of F:

    Distribution of \\(F_{het}\\) in sample data

    Here we use +-0.1 as the \\(F_{het}\\) threshold for convenience.

    Create sample list of individuals with extreme F using awk

# only one sample\nawk 'NR>1 && ($6>0.1 || $6<-0.1) {print $1,$2}' plink_results.het > high_het.sample\n
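As an alternative to the fixed cutoff, here is a minimal pandas sketch of the mean +- 3 SD approach (assuming the plink_results.het file generated above; the output format matches the FID/IID list expected by --remove):

import pandas as pd\n\nhet = pd.read_csv(\"plink_results.het\", sep=r\"\\s+\")\n\n# flag samples whose F deviates more than 3 SD from the mean\nm, s = het[\"F\"].mean(), het[\"F\"].std()\noutliers = het[(het[\"F\"] > m + 3 * s) | (het[\"F\"] < m - 3 * s)]\n\n# write FID and IID for use with plink --remove\noutliers[[\"FID\", \"IID\"]].to_csv(\"high_het.sample\", sep=\" \", index=False, header=False)\n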
    "},{"location":"04_Data_QC/#sample-snp-filtering-extractexcludekeepremove","title":"Sample & SNP filtering (extract/exclude/keep/remove)","text":"

Sometimes we will use only a subset of samples or SNPs included in the original dataset. In this case, we can use --extract or --exclude to select or exclude SNPs from the analysis, and --keep or --remove to select or exclude samples.

    For --keep or --remove, the input is the filename of a sample FID and IID file. For --extract or --exclude, the input is the filename of an SNP list file.

    head plink_results.prune.in\n1:15774:G:A\n1:15777:A:G\n1:77874:G:A\n1:87360:C:T\n1:125271:C:T\n1:232449:G:A\n1:533113:A:G\n1:565697:A:G\n1:566933:A:G\n1:567092:T:C\n
    "},{"location":"04_Data_QC/#ibd-pi_hat-kinship-coefficient","title":"IBD / PI_HAT / kinship coefficient","text":"

--genome will estimate IBS/IBD. Usually, for this analysis, we need to prune our data first, since strong LD will cause bias in the results. (This step is computationally intensive.)

    Combined with the --extract, we can run:

    How PLINK estimates IBD

    The prior probability of IBS sharing can be modeled as:

    \\[P(I=i) = \\sum^{z=i}_{z=0}P(I=i|Z=z)P(Z=z)\\]

    So the proportion of alleles shared IBD (\\(\\hat{\\pi}\\)) can be estimated by:

    \\[\\hat{\\pi} = {{P(Z=1)}\\over{2}} + P(Z=2)\\]

    Estimate IBD

    plink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --genome \\\n    --out plink_results\n

    PI_HAT is the IBD estimation. Please check https://www.cog-genomics.org/plink/1.9/ibd for more details.

    head plink_results.genome\n    FID1     IID1     FID2     IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO\nHG00403  HG00403  HG00404  HG00404 UN    NA  1.0000  0.0000  0.0000  0.0000  -1  0.858562  0.3679  1.9774\nHG00403  HG00403  HG00406  HG00406 UN    NA  0.9805  0.0044  0.0151  0.0173  -1  0.858324  0.8183  2.0625\nHG00403  HG00403  HG00407  HG00407 UN    NA  0.9790  0.0000  0.0210  0.0210  -1  0.857794  0.8034  2.0587\nHG00403  HG00403  HG00409  HG00409 UN    NA  0.9912  0.0000  0.0088  0.0088  -1  0.857024  0.2637  1.9578\nHG00403  HG00403  HG00410  HG00410 UN    NA  0.9699  0.0235  0.0066  0.0184  -1  0.858194  0.6889  2.0335\nHG00403  HG00403  HG00419  HG00419 UN    NA  1.0000  0.0000  0.0000  0.0000  -1  0.857643  0.8597  2.0745\nHG00403  HG00403  HG00421  HG00421 UN    NA  0.9773  0.0218  0.0010  0.0118  -1  0.857276  0.2186  1.9484\nHG00403  HG00403  HG00422  HG00422 UN    NA  0.9880  0.0000  0.0120  0.0120  -1  0.857224  0.8277  2.0652\nHG00403  HG00403  HG00428  HG00428 UN    NA  0.9801  0.0069  0.0130  0.0164  -1  0.858162  0.9812  2.1471\n
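For example, for the pair HG00403-HG00406 in the output above: \(\hat{\pi} = 0.0044/2 + 0.0151 = 0.0173\), which matches the PI_HAT column.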

    KING-robust kinship estimator

PLINK2 uses the KING-robust kinship estimator, which is more robust in the presence of population substructure. See here.

    Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W. M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

    Since the samples are unrelated, we do not need to remove any samples at this step. But remember to check this for your dataset.

    "},{"location":"04_Data_QC/#ld-calculation","title":"LD calculation","text":"

    We can also use our data to estimate the LD between a pair of SNPs.

    Details on LD can be found here

The --chr option in PLINK allows us to include only SNPs on a specific chromosome. To calculate LD r2 for SNPs on chr22, we can run:

    Example

    plink \\\n        --bfile ${genotypeFile} \\\n        --chr 22 \\\n        --r2 \\\n        --out plink_results\n
    head plink_results.ld\n CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2\n22     16069141   22:16069141:C:G     22     16071624   22:16071624:A:G     0.771226\n22     16069784   22:16069784:A:T     22     16149743   22:16149743:T:A     0.217197\n22     16069784   22:16069784:A:T     22     16150589   22:16150589:C:A     0.224992\n22     16069784   22:16069784:A:T     22     16159060   22:16159060:G:A       0.2289\n22     16149743   22:16149743:T:A     22     16150589   22:16150589:C:A     0.965109\n22     16149743   22:16149743:T:A     22     16152606   22:16152606:T:C     0.692157\n22     16149743   22:16149743:T:A     22     16159060   22:16159060:G:A     0.721796\n22     16149743   22:16149743:T:A     22     16193549   22:16193549:C:T     0.336477\n22     16149743   22:16149743:T:A     22     16212542   22:16212542:C:T     0.442424\n
    "},{"location":"04_Data_QC/#data-management-make-bedrecode","title":"Data management (make-bed/recode)","text":"

So far, the input data we have used are in binary form, but sometimes we may want a text version.

    Info

    To convert the formats, we can run:

    Convert PLINK formats

#extract the pruned SNPs and make a new bed file.\nplink \\\n    --bfile ${genotypeFile} \\\n    --extract plink_results.prune.in \\\n    --make-bed \\\n    --out plink_1000_pruned\n\n#convert the bed/bim/fam to ped/map\nplink \\\n        --bfile plink_1000_pruned \\\n        --recode \\\n        --out plink_1000_pruned\n
    "},{"location":"04_Data_QC/#apply-all-the-filters-to-obtain-a-clean-dataset","title":"Apply all the filters to obtain a clean dataset","text":"

    We can then apply the filters and remove samples with high \\(F_{het}\\) to get a clean dataset for later use.

    plink \\\n        --bfile ${genotypeFile} \\\n        --maf 0.01 \\\n        --geno 0.02 \\\n        --mind 0.02 \\\n        --hwe 1e-6 \\\n        --remove high_het.sample \\\n        --keep-allele-order \\\n        --make-bed \\\n        --out sample_data.clean\n
    1224104 variants and 500 people pass filters and QC.\n
    -rw-r--r--  1 yunye yunye 146M Dec 26 15:40 sample_data.clean.bed\n-rw-r--r--  1 yunye yunye  39M Dec 26 15:40 sample_data.clean.bim\n-rw-r--r--  1 yunye yunye  13K Dec 26 15:40 sample_data.clean.fam\n
    "},{"location":"04_Data_QC/#other-common-qc-steps-not-included-in-this-tutorial","title":"Other common QC steps not included in this tutorial","text":""},{"location":"04_Data_QC/#exercise","title":"Exercise","text":""},{"location":"04_Data_QC/#additional-resources","title":"Additional resources","text":""},{"location":"04_Data_QC/#reference","title":"Reference","text":""},{"location":"05_PCA/","title":"Principle component analysis (PCA)","text":"

PCA aims to find the orthogonal directions of maximum variance and project the data onto a new subspace with equal or fewer dimensions than the original one. Simply speaking, the GRM (genetic relationship matrix; covariance matrix) is first estimated, and then PCA is applied to this matrix to generate eigenvectors and eigenvalues. Finally, the \(k\) eigenvectors with the largest eigenvalues are used to transform the genotypes into a new feature subspace.

    Genetic relationship matrix (GRM)

    Citation: Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.

    A simple PCA

    Source data:

import numpy as np\n\ncov = np.array([[6, -3], [-3, 3.5]])\npts = np.random.multivariate_normal([0, 0], cov, size=800)\n

    The red arrow shows the first principal component axis (PC1) and the blue arrow shows the second principal component axis (PC2). The two axes are orthogonal.

    Interpretation of PCs

    The first principal component of a set of p variables, presumed to be jointly normally distributed, is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p iterations until all the variance is explained.

PCA is by far the most commonly used dimension reduction approach in population genetics; it can identify differences in ancestry among the sample individuals. Population outliers can then be identified and excluded from the main cluster. For GWAS, we also need to include the top PCs as covariates to adjust for population stratification.

    Please read the following paper on how we apply PCA to genetic data: Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904\u2013909 (2006). https://doi.org/10.1038/ng1847 https://www.nature.com/articles/ng1847

So before the association analysis, we will first learn how to run a PCA analysis.

    PCA workflow

    "},{"location":"05_PCA/#preparation","title":"Preparation","text":""},{"location":"05_PCA/#exclude-snps-in-high-ld-or-hla-regions","title":"Exclude SNPs in high-LD or HLA regions","text":"

    For PCA, we first exclude SNPs in high-LD or HLA regions from the genotype data.

The reason why we want to exclude such high-LD or HLA regions: variants in long-range high-LD regions (such as inversions and the HLA region) can dominate the top PCs, so that the PCs capture local LD structure instead of genome-wide ancestry.

    "},{"location":"05_PCA/#download-bed-like-files-for-high-ld-or-hla-regions","title":"Download BED-like files for high-LD or HLA regions","text":"

You can simply copy the list of high-LD or HLA regions for the matching genome build (in a BED-like format) into a text file, e.g. high-ld.txt (here we use high-ld-hg19.txt for hg19).

    High LD regions were obtained from

    https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)

    High LD regions of hg19

    high-ld-hg19.txt
    1   48000000    52000000    highld\n2   86000000    100500000   highld\n2   134500000   138000000   highld\n2   183000000   190000000   highld\n3   47500000    50000000    highld\n3   83500000    87000000    highld\n3   89000000    97500000    highld\n5   44500000    50500000    highld\n5   98000000    100500000   highld\n5   129000000   132000000   highld\n5   135500000   138500000   highld\n6   25000000    35000000    highld\n6   57000000    64000000    highld\n6   140000000   142500000   highld\n7   55000000    66000000    highld\n8   7000000 13000000    highld\n8   43000000    50000000    highld\n8   112000000   115000000   highld\n10  37000000    43000000    highld\n11  46000000    57000000    highld\n11  87500000    90500000    highld\n12  33000000    40000000    highld\n12  109500000   112000000   highld\n20  32000000    34500000    highld\n
    "},{"location":"05_PCA/#create-a-list-of-snps-in-high-ld-or-hla-regions","title":"Create a list of SNPs in high-LD or HLA regions","text":"

    Next, use high-ld.txt to extract all SNPs that are located in the regions described in the file using the code as follows:

plink --bfile ${plinkFile} --make-set high-ld.txt --write-set --out hild\n

    Create a list of SNPs in the regions specified in high-ld.txt

    plinkFile=\"../04_Data_QC/sample_data.clean\"\n\nplink \\\n    --bfile ${plinkFile} \\\n    --make-set high-ld-hg19.txt \\\n    --write-set \\\n    --out hild\n

    And all SNPs in the regions will be extracted to hild.set.

    $head hild.set\nhighld\n1:48000156:C:G\n1:48002096:C:G\n1:48003081:T:C\n1:48004776:C:T\n1:48006500:A:G\n1:48006546:C:T\n1:48008102:T:G\n1:48009994:C:T\n1:48009997:C:A\n

    For downstream analysis, we can exclude these SNPs using --exclude hild.set.

    "},{"location":"05_PCA/#pca-steps","title":"PCA steps","text":"

    Steps to perform a typical genomic PCA analysis

    MAF filter for LD-pruning and PCA

    For LD-pruning and PCA, we usually only use variants with MAF > 0.01 or MAF>0.05 ( --maf 0.01 or --maf 0.05) for robust estimation.

    "},{"location":"05_PCA/#sample-codes","title":"Sample codes","text":"

    Sample codes for performing PCA

    plinkFile=\"\" #please set this to your own path\noutPrefix=\"plink_results\"\nthreadnum=2\nhildset = hild.set \n\n# LD-pruning, excluding high-LD and HLA regions\nplink2 \\\n        --bfile ${plinkFile} \\\n        --maf 0.01 \\\n        --threads ${threadnum} \\\n        --exclude ${hildset} \\ \n        --indep-pairwise 500 50 0.2 \\\n        --out ${outPrefix}\n\n# Remove related samples using king-cuttoff\nplink2 \\\n        --bfile ${plinkFile} \\\n        --extract ${outPrefix}.prune.in \\\n        --king-cutoff 0.0884 \\\n        --threads ${threadnum} \\\n        --out ${outPrefix}\n\n# PCA after pruning and removing related samples\nplink2 \\\n        --bfile ${plinkFile} \\\n        --keep ${outPrefix}.king.cutoff.in.id \\\n        --extract ${outPrefix}.prune.in \\\n        --freq counts \\\n        --threads ${threadnum} \\\n        --pca approx allele-wts 10 \\     \n        --out ${outPrefix}\n\n# Projection (related and unrelated samples)\nplink2 \\\n        --bfile ${plinkFile} \\\n        --threads ${threadnum} \\\n        --read-freq ${outPrefix}.acount \\\n        --score ${outPrefix}.eigenvec.allele 2 5 header-read no-mean-imputation variance-standardize \\\n        --score-col-nums 6-15 \\\n        --out ${outPrefix}_projected\n

    --pca and --pca approx

For step 3, please note that the approx flag is only recommended for analyses of >5000 samples. (It was applied in the sample code anyway, because in real analyses you usually have a much larger sample size, though the sample size of our data is just ~500.)

In step 3, the allele-wts 10 modifier requests an additional one-line-per-allele .eigenvec.allele file with the first 10 PCs expressed as allele weights instead of sample weights.

    We will get the plink_results.eigenvec.allele file, which will be used to project onto all samples along with an allele count plink_results.acount file.

In the projection, score ${outPrefix}.eigenvec.allele 2 5 sets the ID (2nd column) and A1 (5th column), and score-col-nums 6-15 sets the first 10 PCs to be projected. Note: in newer PLINK2 versions that output the PROVISIONAL_REF? column (as in the example file below), A1 is the 6th column and the PCs are in columns 7-16, so the column numbers may need to be adjusted accordingly.

    Please check https://www.cog-genomics.org/plink/2.0/score#pca_project for more details on the projection.

    Allele weight and count files

    plink_results.eigenvec.allele
#CHROM  ID      REF     ALT     PROVISIONAL_REF?        A1      PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10\n1       1:15774:G:A     G       A       Y       G       0.57834 -1.03002        0.744557        -0.161887       0.389223    -0.0514592      0.133195        -0.0336162      -0.846376       0.0542876\n1       1:15774:G:A     G       A       Y       A       -0.57834        1.03002 -0.744557       0.161887        -0.389223   0.0514592       -0.133195       0.0336162       0.846376        -0.0542876\n1       1:15777:A:G     A       G       Y       A       -0.585215       0.401872        -0.393071       -1.79583   0.89579  -0.700882       -0.103729       -0.694495       -0.007313       0.513223\n1       1:15777:A:G     A       G       Y       G       0.585215        -0.401872       0.393071        1.79583 -0.89579    0.700882        0.103729        0.694495        0.007313        -0.513223\n1       1:57292:C:T     C       T       Y       C       -0.123768       0.912046        -0.353606       -0.220148  -0.893017        -0.374505       -0.141002       -0.249335       0.625097        0.206104\n1       1:57292:C:T     C       T       Y       T       0.123768        -0.912046       0.353606        0.220148   0.893017 0.374505        0.141002        0.249335        -0.625097       -0.206104\n1       1:77874:G:A     G       A       Y       G       1.49202 -1.12567        1.19915 0.0755314       0.401134   -0.015842        0.0452086       0.273072        -0.00716098     0.237545\n1       1:77874:G:A     G       A       Y       A       -1.49202        1.12567 -1.19915        -0.0755314      -0.401134   0.015842        -0.0452086      -0.273072       0.00716098      -0.237545\n1       1:87360:C:T     C       T       Y       C       -0.191803       0.600666        -0.513208       -0.0765155 -0.656552        0.0930399       -0.0238774      -0.330449       -0.192037       -0.727729\n
    plink_results.acount
    #CHROM  ID      REF     ALT     PROVISIONAL_REF?        ALT_CTS OBS_CT\n1       1:15774:G:A     G       A       Y       28      994\n1       1:15777:A:G     A       G       Y       73      994\n1       1:57292:C:T     C       T       Y       104     988\n1       1:77874:G:A     G       A       Y       19      994\n1       1:87360:C:T     C       T       Y       23      998\n1       1:125271:C:T    C       T       Y       967     996\n1       1:232449:G:A    G       A       Y       185     996\n1       1:533113:A:G    A       G       Y       129     992\n1       1:565697:A:G    A       G       Y       334     996\n

    Eventually, we will get the PCA results for all samples.

    PCA results for all samples

    plink_results_projected.sscore
    #FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG     PC9_AVG PC10_AVG\nHG00403 HG00403 390256  390256  0.00290265      -0.0248649      0.0100408       0.00957591      0.00694349      -0.00222251 0.0082228       -0.00114937     0.00335249      0.00437471\nHG00404 HG00404 390696  390696  -0.000141221    -0.027965       0.025389        -0.00582538     -0.00274707     0.00658501  0.0113803       0.0077766       0.0159976       0.0178927\nHG00406 HG00406 388524  388524  0.00707397      -0.0315445      -0.00437011     -0.0012621      -0.0114932      -0.00539483 -0.00620153     0.00452379      -0.000870627    -0.00227979\nHG00407 HG00407 388808  388808  0.00683977      -0.025073       -0.00652723     0.00679729      -0.0116 -0.0102328 0.0139572        0.00618677      0.0138063       0.00825269\nHG00409 HG00409 391646  391646  0.000398695     -0.0290334      -0.0189352      -0.00135977     0.0290436       0.00942829  -0.0171194      -0.0129637      0.0253596       0.022907\nHG00410 HG00410 391600  391600  0.00277094      -0.0280021      -0.0209991      -0.00799085     0.0318038       -0.00284209 -0.031517       -0.0010026      0.0132541       0.0357565\nHG00419 HG00419 387118  387118  0.00684154      -0.0326244      0.00237159      0.0167284       -0.0119737      -0.0079637  -0.0144339      0.00712756      0.0114292       0.00404426\nHG00421 HG00421 387720  387720  0.00157095      -0.0338115      -0.00690541     0.0121058       0.00111378      0.00530794  -0.0017545      -0.00121793     0.00393407      0.00414204\nHG00422 HG00422 387466  387466  0.00439167      -0.0332386      0.000741526     0.0124843       -0.00362248     -0.00343393 -0.00735112     0.00944759      -0.0107516      0.00376537\n
    "},{"location":"05_PCA/#plotting-the-pcs","title":"Plotting the PCs","text":"

    You can now create scatterplots of the PCs using R or Python.

    For plotting using Python: plot_PCA.ipynb

    Scatter plot of PC1 and PC2 using 1KG EAS individuals

    Note : We only used a small proportion of all available variants. This figure only very roughly shows the population structure in East Asia.

Requirements: python>3, with numpy, pandas, seaborn, and matplotlib installed.
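A minimal Python sketch for such a scatter plot (assuming the plink_results_projected.sscore file generated above; the column names follow the .sscore header shown earlier):

import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# load the projected PC scores (whitespace-delimited PLINK2 .sscore file)\npcs = pd.read_csv(\"plink_results_projected.sscore\", sep=r\"\\s+\")\n\n# scatter plot of the first two PCs\nsns.scatterplot(data=pcs, x=\"PC1_AVG\", y=\"PC2_AVG\")\nplt.xlabel(\"PC1\")\nplt.ylabel(\"PC2\")\nplt.savefig(\"pca_plot.png\", dpi=300)\n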

    "},{"location":"05_PCA/#pca-umap","title":"PCA-UMAP","text":"

(Optional) We can also apply another non-linear dimension reduction algorithm called UMAP to the PCs to further identify local structures (PCA-UMAP).

    For more details, please check: - https://umap-learn.readthedocs.io/en/latest/index.html

    An example of PCA and PCA-UMAP for population genetics: - Sakaue, S., Hirata, J., Kanai, M., Suzuki, K., Akiyama, M., Lai Too, C., ... & Okada, Y. (2020). Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nature communications, 11(1), 1-11.

    "},{"location":"05_PCA/#references","title":"References","text":""},{"location":"06_Association_tests/","title":"Association test","text":""},{"location":"06_Association_tests/#overview","title":"Overview","text":""},{"location":"06_Association_tests/#genetic-models","title":"Genetic models","text":"

    To test the association between a phenotype and genotypes, we need to group the genotypes based on genetic models.

    There are three basic genetic models:

    Three genetic models

    For example, suppose we have a biallelic SNP whose reference allele is A and the alternative allele is G.

    There are three possible genotypes for this SNP: AA, AG, and GG.

    This table shows how we group different genotypes under each genetic model for association tests using linear or logistic regressions.

| Genetic model | AA | AG | GG |
|---|---|---|---|
| Additive model | 0 | 1 | 2 |
| Dominant model | 0 | 1 | 1 |
| Recessive model | 0 | 0 | 1 |
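A minimal Python sketch of this coding (for the example SNP with reference allele A and alternative allele G):

# numeric coding of genotypes under each genetic model (REF = A, ALT = G)\nadditive  = {\"AA\": 0, \"AG\": 1, \"GG\": 2}  # count of ALT alleles\ndominant  = {\"AA\": 0, \"AG\": 1, \"GG\": 1}  # ALT carriers vs non-carriers\nrecessive = {\"AA\": 0, \"AG\": 0, \"GG\": 1}  # ALT homozygotes vs the rest\n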

    Contingency table and non-parametric tests

A simple way to test associations is to use 2x2 or 2x3 contingency tables. For the dominant and recessive models, Chi-square tests are performed on the 2x2 table. For the additive model, the Cochran-Armitage trend test is performed on the 2x3 table. However, these non-parametric tests do not adjust for the bias caused by other covariates like sex, age, and so forth.

    "},{"location":"06_Association_tests/#association-testing-basics","title":"Association testing basics","text":"

    For quantitative traits, we can employ a simple linear regression model to test associations:

    \\[ y = G\\beta_G + X\\beta_X + e \\]

    Interpretation of linear regression

    For binary traits, we can utilize the logistic regression model to test associations:

\[ logit(p) = G\beta_G + X\beta_X \]

    Linear regression and logistic regression

    "},{"location":"06_Association_tests/#file-preparation","title":"File Preparation","text":"

    To perform genome-wide association tests, usually, we need the following files:

    Phenotype and covariate files

    Phenotype file for a simulated binary trait; B1 is the phenotype name; 1 means the control, 2 means the case.

    1kgeas_binary.txt
FID IID B1\nHG00403 HG00403 1\nHG00404 HG00404 2\nHG00406 HG00406 1\nHG00407 HG00407 1\nHG00409 HG00409 2\nHG00410 HG00410 2\nHG00419 HG00419 1\nHG00421 HG00421 1\nHG00422 HG00422 1\n

Covariate file (only top PCs calculated in the previous PCA section)

plink_results_projected.sscore
#FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG\nHG00403 HG00403 390256  390256  0.00290265      -0.0248649      -0.0100407      0.00957595      0.00694056      0.00222996      0.00823028      0.00116497      -0.00334937     0.00434627\nHG00404 HG00404 390696  390696  -0.000141221    -0.027965       -0.025389       -0.00582553     -0.00274711     -0.00657958     0.0113769       -0.00778919     -0.0159685      0.0180678\nHG00406 HG00406 388524  388524  0.00707397      -0.0315445      0.00437013      -0.00126195     -0.0114938      0.00538932      -0.00619657     -0.00454686     0.000969112     -0.00217617\nHG00407 HG00407 388808  388808  0.00683977      -0.025073       0.00652723      0.00679731      -0.0116001      0.0102403       0.0139674       -0.00621948     -0.013797       0.00827744\nHG00409 HG00409 391646  391646  0.000398695     -0.0290334      0.0189352       -0.00135996     0.0290464       -0.00941851     -0.0171911      0.01293 -0.0252628      0.0230819\nHG00410 HG00410 391600  391600  0.00277094      -0.0280021      0.0209991       -0.00799089     0.0318043       0.00283456      -0.0315157      0.000978664     -0.0133768      0.0356721\nHG00419 HG00419 387118  387118  0.00684154      -0.0326244      -0.00237159     0.0167284       -0.0119684      0.00795149      -0.0144241      -0.00716183     -0.0115059      0.0038652\nHG00421 HG00421 387720  387720  0.00157095      -0.0338115      0.00690542      0.0121058       0.00111448      -0.00531714     -0.00175494     0.00118513      -0.00391494     0.00414682\nHG00422 HG00422 387466  387466  0.00439167      -0.0332386      -0.000741482    0.0124843       -0.00362885     0.00342491      -0.0073205      -0.00939123     0.010718        0.00360906\n
    "},{"location":"06_Association_tests/#association-tests-using-plink","title":"Association tests using PLINK","text":"

    Please check https://www.cog-genomics.org/plink/2.0/assoc for more details.

    We will perform logistic regression with firth correction for a simulated binary trait under the additive model using the 1KG East Asian individuals.

    Firth correction

    Adding a penalty term to the log-likelihood function when fitting the logistic model results in less bias. - Firth, David. \"Bias reduction of maximum likelihood estimates.\" Biometrika 80.1 (1993): 27-38.

    Quantitative traits

For quantitative traits, linear regression will be performed, and in this case, we do not need to add firth (since Firth correction is not applicable).

    Sample codes for association test using plink for binary traits

    genotypeFile=\"../04_Data_QC/sample_data.clean\" # the clean dataset we generated in previous section\nphenotypeFile=\"../01_Dataset/1kgeas_binary.txt\" # the phenotype file\ncovariateFile=\"../05_PCA/plink_results_projected.sscore\" # the PC score file\n\ncovariateCols=6-10\ncolName=\"B1\"\nthreadnum=2\n\nplink2 \\\n    --bfile ${genotypeFile} \\\n    --pheno ${phenotypeFile} \\\n    --pheno-name ${colName} \\\n    --maf 0.01 \\\n    --covar ${covariateFile} \\\n    --covar-col-nums ${covariateCols} \\\n    --glm hide-covar firth  firth-residualize single-prec-cc \\\n    --threads ${threadnum} \\\n    --out 1kgeas\n

    Note

With the latest versions of PLINK2, you need to add firth-residualize single-prec-cc to generate these results. (The algorithm and precision for Firth regression have been changed since 2023.)

    You will see a similar log like:

    Log

    1kgeas.log
    PLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023)       www.cog-genomics.org/plink/2.0/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to 1kgeas.log.\nOptions in effect:\n--bfile ../04_Data_QC/sample_data.clean\n--covar ../05_PCA/plink_results_projected.sscore\n--covar-col-nums 6-10\n--glm hide-covar firth firth-residualize single-prec-cc\n--maf 0.01\n--out 1kgeas\n--pheno ../01_Dataset/1kgeas_binary.txt\n--pheno-name B1\n--threads 2\n\nStart time: Tue Dec 26 15:52:10 2023\n31934 MiB RAM detected, ~30479 available; reserving 15967 MiB for main\nworkspace.\nUsing up to 2 compute threads.\n500 samples (0 females, 0 males, 500 ambiguous; 500 founders) loaded from\n../04_Data_QC/sample_data.clean.fam.\n1224104 variants loaded from ../04_Data_QC/sample_data.clean.bim.\n1 binary phenotype loaded (248 cases, 250 controls).\n5 covariates loaded from ../05_PCA/plink_results_projected.sscore.\nCalculating allele frequencies... done.\n95372 variants removed due to allele frequency threshold(s)\n(--maf/--max-maf/--mac/--max-mac).\n1128732 variants remaining after main filters.\n--glm Firth regression on phenotype 'B1': done.\nResults written to 1kgeas.B1.glm.firth .\nEnd time: Tue Dec 26 15:53:49 2023\n

    Let's check the first lines of the output:

    Association test results

    1kgeas.B1.glm.firth
        #CHROM  POS     ID      REF     ALT     PROVISIONAL_REF?        A1      OMITTED A1_FREQ TEST    OBS_CT  OR      LOG(OR)_SE  Z_STAT  P       ERRCODE\n1       15774   1:15774:G:A     G       A       Y       A       G       0.0282828       ADD     495     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       15777   1:15777:A:G     A       G       Y       G       A       0.0737374       ADD     495     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       57292   1:57292:C:T     C       T       Y       T       C       0.104675        ADD     492     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       77874   1:77874:G:A     G       A       Y       A       G       0.0191532       ADD     496     1.12228 0.46275     0.249299        0.80313 .\n1       87360   1:87360:C:T     C       T       Y       T       C       0.0231388       ADD     497     NA      NA NA       NA      FIRTH_CONVERGE_FAIL\n1       125271  1:125271:C:T    C       T       Y       C       T       0.0292339       ADD     496     1.53387 0.373358    1.1458  0.25188 .\n1       232449  1:232449:G:A    G       A       Y       A       G       0.185484        ADD     496     0.884097   0.168961 -0.729096       0.465943        .\n1       533113  1:533113:A:G    A       G       Y       G       A       0.129555        ADD     494     0.90593 0.196631    -0.50243        0.615365        .\n1       565697  1:565697:A:G    A       G       Y       G       A       0.334677        ADD     496     1.04653 0.15286     0.297509        0.766078        .\n

    Usually, other options are added to enhance the sumstats

    "},{"location":"06_Association_tests/#genomic-control","title":"Genomic control","text":"

    Genomic control (GC) is a basic method for controlling for confounding factors including population stratification.

We will calculate the genomic control factor (lambda GC) to evaluate inflation. The genomic control factor is calculated by dividing the median of the observed Chi-square statistics by the median of the Chi-square distribution with 1 degree of freedom (which is approximately 0.455).

    \\[ \\lambda_{GC} = {median(\\chi^{2}_{observed}) \\over median(\\chi^{2}_1)} \\]

Then, we can use the genomic control factor to correct the observed Chi-square statistics.

    \\[ \\chi^{2}_{corrected} = {\\chi^{2}_{observed} \\over \\lambda_{GC}} \\]

Genomic control is based on the idea that most variants are not associated with the trait, so there should be no deviation between the observed and expected Chi-square distributions, except for the spikes at the tail. However, if the trait is highly polygenic, this assumption may be violated.

    Reference: Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.
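A minimal Python sketch of this calculation (the function name and the placeholder P values are ours; in practice p would be the P column of the sumstats):

import numpy as np\nfrom scipy import stats\n\ndef lambda_gc(p):\n    # convert two-sided P values to 1-df chi-square statistics\n    chisq = stats.chi2.isf(p, df=1)\n    # divide the observed median by the median of the chi2(1) distribution (~0.455)\n    return np.median(chisq) / stats.chi2.ppf(0.5, df=1)\n\np = np.random.uniform(size=100000)  # placeholder: replace with real P values\nprint(lambda_gc(p))  # ~1.0 under the null; values well above 1 suggest inflation\n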

    "},{"location":"06_Association_tests/#significant-loci","title":"Significant loci","text":"

    Please check Visualization using gwaslab

    Loci that reached genome-wide significance threshold (P value < 5e-8) :

    SNPID   CHR POS EA  NEA EAF SE  Z   P   OR  N   STATUS  REF ALT\n1:167562605:G:A 1   167562605   A   G   0.391481    0.159645    7.69462 1.419150e-14    3.415780    493 9999999 G   A\n2:55513738:C:T  2   55513738    C   T   0.376008    0.153159    -7.96244    1.686760e-15    0.295373    496 9999999 C   T\n7:134368632:T:G 7   134368632   G   T   0.138105    0.225526    6.89025 5.569440e-12    4.730010    496 9999999 T   G\n20:42758834:T:C 20  42758834    T   C   0.227273    0.184323    -7.76902    7.909780e-15    0.238829    495 9999999 T   C\n

    Warning

This is just to show the analysis pipeline and data format. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the results themselves are meaningless.

    Allele frequency and Effect size

    "},{"location":"06_Association_tests/#visualization","title":"Visualization","text":"

    To visualize the sumstats, we will create the Manhattan plot, QQ plot and regional plot.

    Please check for codes : Visualization using gwaslab

    "},{"location":"06_Association_tests/#manhattan-plot","title":"Manhattan plot","text":"

The Manhattan plot is the most classic visualization of GWAS summary statistics. It is a form of scatter plot. Each dot represents the test result for one variant. Variants are sorted by their genome coordinates and aligned along the X axis. The Y axis shows the -log10(P value) of each variant's test in the GWAS.

    Note

    This kind of plot was named after Manhattan in New York City since it resembles the Manhattan skyline.

    A real Manhattan plot

    I took this photo in 2020 just before the COVID-19 pandemic. It was a cloudy and misty day. Those birds formed a significance threshold line. And the skyscrapers above that line resembled the significant signals in your GWAS. I believe you could easily get how the GWAS Manhattan plot was named.

Data we need from the sumstats to create Manhattan plots: the genome coordinates (chromosome and base-pair position) and the P value of each variant.

    Steps to create Manhattan plot

    1. sort the variants by genome coordinates.
    2. map the genome coordinates of variants to the x axis.
3. convert the P values to -log10(P).
4. create the scatter plot (see the sketch below).
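A minimal matplotlib sketch of these steps (the file name sumstats.txt and the CHR/POS/P column names are hypothetical):

import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\nsumstats = pd.read_csv(\"sumstats.txt\", sep=r\"\\s+\")\n\n# 1-2. sort by genome coordinates and map them to a cumulative x axis\nsumstats = sumstats.sort_values([\"CHR\", \"POS\"]).reset_index(drop=True)\nsumstats[\"x\"] = sumstats.index\n\n# 3. convert P values to -log10(P)\nsumstats[\"mlog10p\"] = -np.log10(sumstats[\"P\"])\n\n# 4. scatter plot, one color per chromosome, with the genome-wide significance line\nfor chrom, g in sumstats.groupby(\"CHR\"):\n    plt.scatter(g[\"x\"], g[\"mlog10p\"], s=2)\nplt.axhline(-np.log10(5e-8), color=\"grey\", linestyle=\"--\")\nplt.xlabel(\"Genomic position\")\nplt.ylabel(\"-log10(P)\")\nplt.savefig(\"manhattan.png\", dpi=300)\n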
    "},{"location":"06_Association_tests/#quantile-quantile-plot","title":"Quantile-quantile plot","text":"

    Quantile-quantile plot (also known as Q-Q plot), is commonly used to compare an observed distribution with its expected distribution. For a specific point (x,y) on Q-Q plot, its y coordinate corresponds to one of the quantiles of the observed distribution, while its x coordinate corresponds to the same quantile of the expected distribution.

    Quantile-quantile plot is used to check if there is any significant inflation in P value distribution, which usually indicates population stratification or cryptic relatedness.

Data we need from the sumstats to create the Q-Q plot: the P values.

    Steps to create Q-Q plot

    Suppose we have n variants in our sumstats,

1. convert the n P values to -log10(P).
2. sort the -log10(P) values in ascending order.
3. get n numbers from (0,1) with equal intervals.
4. convert the n numbers to -log10(P) and sort in ascending order.
5. create the scatter plot using the sorted -log10(P) of the sumstats as Y and the sorted expected -log10(P) as X (see the sketch below).
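A minimal Python sketch of these steps (the placeholder P values are ours; in practice p would come from the sumstats):

import numpy as np\nimport matplotlib.pyplot as plt\n\np = np.random.uniform(size=100000)  # placeholder: replace with real P values\nn = len(p)\n\n# 1-2. observed -log10(P), sorted in ascending order\nobs = np.sort(-np.log10(p))\n\n# 3-4. expected -log10(P) under the uniform null, sorted in ascending order\nexp = np.sort(-np.log10(np.arange(1, n + 1) / (n + 1)))\n\n# 5. observed vs expected quantiles, with a y = x reference line\nplt.scatter(exp, obs, s=2)\nplt.plot([0, exp.max()], [0, exp.max()], color=\"grey\")\nplt.xlabel(\"Expected -log10(P)\")\nplt.ylabel(\"Observed -log10(P)\")\nplt.savefig(\"qq.png\", dpi=300)\n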

    Note

    The expected distribution of P value is a Uniform distribution from 0 to 1.

    \\[P_{expected} \\sim U(0,1)\\]"},{"location":"06_Association_tests/#regional-plot","title":"Regional plot","text":"

The Manhattan plot is very useful for checking the overview of our sumstats. But if we want to inspect a specific genomic locus, we need a plot with finer resolution. This kind of plot is called a regional plot. It is basically a Manhattan plot of only a small region of the genome, with points colored by their LD r2 with the lead variant in the region.

    Such a plot is especially helpful to understand the signal and loci, e.g., LD structure, independent signals, and genes.

The regional plot for the locus around 2:55513738:C:T.

    Please check Visualization using gwaslab

    "},{"location":"06_Association_tests/#gwas-ssf","title":"GWAS-SSF","text":"

To standardize the format of GWAS summary statistics for sharing, the GWAS-SSF format was proposed in 2022. This format is now used as the standard format for the GWAS Catalog.

    GWAS-SSF consists of :

    1. a tab-separated data file with well-defined fields (shown in the following figure)
    2. an accompanying metadata file describing the study (such as sample ancestry, genotyping method, md5sum, and so forth)

    Schematic representation of GWAS-SSF data file

    GWAS-SSF

    Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv, 2022-07.

    For details, please check:

    "},{"location":"07_Annotation/","title":"Variant Annotation","text":""},{"location":"07_Annotation/#table-of-contents","title":"Table of Contents","text":""},{"location":"07_Annotation/#annovar","title":"ANNOVAR","text":"

    ANNOVAR is a simple and efficient command line tool for variant annotation.

    In this tutorial, we will use ANNOVAR to annotate the variants in our summary statistics (hg19).

    "},{"location":"07_Annotation/#install","title":"Install","text":"

Download ANNOVAR from here (registration required; freely available for personal, academic and non-profit use only).

    You will receive an email with the download link after registration. Download it and decompress:

    tar -xvzf annovar.latest.tar.gz\n

    For refGene annotation for hg19, we do not need to download additional files.

    "},{"location":"07_Annotation/#format-input-file","title":"Format input file","text":"

    The default input file for ANNOVAR is a 1-based coordinate file.

    We will only use the first 100000 variants as an example.

    annovar_input

awk 'NR>1 && NR<100000 {print $1,$2,$2,$4,$5}' ../06_Association_tests/1kgeas.B1.glm.logistic.hybrid > annovar_input.txt\n
    head annovar_input.txt \n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n

With the -vcfinput option, ANNOVAR can accept input files in VCF format.

    "},{"location":"07_Annotation/#annotation","title":"Annotation","text":"

    Annotate the variants with gene information.

    A minimal example of annotation using refGene

input=annovar_input.txt\nhumandb=/home/he/tools/annovar/annovar/humandb\ntable_annovar.pl ${input} ${humandb} -buildver hg19 -out myannotation -remove -protocol refGene -operation g -nastring . -polish\n
Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene\n1   13273   13273   G   C   ncRNA_exonic    DDX11L1;LOC102725121    .   .   .\n1   14599   14599   T   A   ncRNA_exonic    WASH7P  .   .   .\n1   14604   14604   A   G   ncRNA_exonic    WASH7P  .   .   .\n1   14930   14930   A   G   ncRNA_intronic  WASH7P  .   .   .\n1   69897   69897   T   C   exonic  OR4F5   .   synonymous SNV  OR4F5:NM_001005484:exon1:c.T807C:p.S269S\n1   86331   86331   A   G   intergenic  OR4F5;LOC729737 dist=16323;dist=48442   .   .\n1   91581   91581   G   A   intergenic  OR4F5;LOC729737 dist=21573;dist=43192   .   .\n1   122872  122872  T   G   intergenic  OR4F5;LOC729737 dist=52864;dist=11901   .   .\n1   135163  135163  C   T   ncRNA_exonic    LOC729737   .   .   .\n
    "},{"location":"07_Annotation/#additional-databases","title":"Additional databases","text":"

ANNOVAR supports a wide range of commonly used databases including dbsnp, dbnsfp, clinvar, gnomad, 1000g, cadd and so forth. For details, please check ANNOVAR's official documentation.

    You can check the Table Name listed in the link above and download the database you need using the following command.

    Example: Downloading avsnp150 for hg19 from ANNOVAR

    annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb/\n

    An example of annotation using multiple databases

# input file is in vcf format\ntable_annovar.pl \\\n  ${in_vcf} \\\n  ${humandb} \\\n  -buildver hg19 \\\n  -protocol refGene,avsnp150,clinvar_20200316,gnomad211_exome \\\n  -operation g,f,f,f \\\n  -remove \\\n  -out ${out_prefix} \\\n  -vcfinput\n
    "},{"location":"07_Annotation/#vep-under-construction","title":"VEP (under construction)","text":""},{"location":"07_Annotation/#install_1","title":"Install","text":"
    git clone https://github.com/Ensembl/ensembl-vep.git\ncd ensembl-vep\nperl INSTALL.pl\n
    Hello! This installer is configured to install v108 of the Ensembl API for use by the VEP.\nIt will not affect any existing installations of the Ensembl API that you may have.\n\nIt will also download and install cache files from Ensembl's FTP server.\n\nChecking for installed versions of the Ensembl API...done\n\nSetting up directories\nDestination directory ./Bio already exists.\nDo you want to overwrite it (if updating VEP this is probably OK) (y/n)? y\n - fetching BioPerl\n - unpacking ./Bio/tmp/release-1-6-924.zip\n - moving files\n\nDownloading required Ensembl API files\n - fetching ensembl\n - unpacking ./Bio/tmp/ensembl.zip\n - moving files\n - getting version information\n - fetching ensembl-variation\n - unpacking ./Bio/tmp/ensembl-variation.zip\n - moving files\n - getting version information\n - fetching ensembl-funcgen\n - unpacking ./Bio/tmp/ensembl-funcgen.zip\n - moving files\n - getting version information\n - fetching ensembl-io\n - unpacking ./Bio/tmp/ensembl-io.zip\n - moving files\n - getting version information\n\nTesting VEP installation\n - OK!\n\nThe VEP can either connect to remote or local databases, or use local cache files.\nUsing local cache files is the fastest and most efficient way to run the VEP\nCache files will be stored in /home/he/.vep\nDo you want to install any cache files (y/n)? y\n\nThe following species/files are available; which do you want (specify multiple separated by spaces or 0 for all): \n1 : acanthochromis_polyacanthus_vep_108_ASM210954v1.tar.gz (69 MB)\n2 : accipiter_nisus_vep_108_Accipiter_nisus_ver1.0.tar.gz (55 MB)\n...\n466 : homo_sapiens_merged_vep_108_GRCh37.tar.gz (16 GB)\n467 : homo_sapiens_merged_vep_108_GRCh38.tar.gz (26 GB)\n468 : homo_sapiens_refseq_vep_108_GRCh37.tar.gz (13 GB)\n469 : homo_sapiens_refseq_vep_108_GRCh38.tar.gz (22 GB)\n470 : homo_sapiens_vep_108_GRCh37.tar.gz (14 GB)\n471 : homo_sapiens_vep_108_GRCh38.tar.gz (22 GB)\n\n  Total: 221 GB for all 471 files\n\n? 470\n - downloading https://ftp.ensembl.org/pub/release-108/variation/indexed_vep_cache/homo_sapiens_vep_108_GRCh37.tar.gz\n
    "},{"location":"08_LDSC/","title":"LD score regression","text":""},{"location":"08_LDSC/#table-of-contents","title":"Table of Contents","text":""},{"location":"08_LDSC/#introduction","title":"Introduction","text":"

LDSC is one of the most commonly used command line tools to estimate inflation, heritability, genetic correlation, and cell/tissue type specificity from GWAS summary statistics.

    "},{"location":"08_LDSC/#ld-linkage-disequilibrium","title":"LD: Linkage disequilibrium","text":"

    Linkage disequilibrium (LD) : non-random association of alleles at different loci in a given population. (Wiki)

    "},{"location":"08_LDSC/#ld-score","title":"LD score","text":"

    LD score \\(l_j\\) for a SNP \\(j\\) is defined as the sum of \\(r^2\\) for the SNP and other SNPs in a region.

    \\[ l_j= \\Sigma_k{r^2_{j,k}} \\]"},{"location":"08_LDSC/#ld-score-regression_1","title":"LD score regression","text":"

Key idea: a variant will have a higher test statistic if it is in LD with a causal variant, and the elevation is proportional to its correlation ( \(r^2\) ) with the causal variant.

    \\[ E[\\chi^2|l_j] = {{Nh^2l_j}\\over{M}} + Na + 1 \\]

For more details of LD score regression, please refer to: Bulik-Sullivan, Brendan K., et al. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies." Nature genetics 47.3 (2015): 291-295.

    "},{"location":"08_LDSC/#install-ldsc","title":"Install LDSC","text":"

    LDSC can be downloaded from github (GPL-3.0 license): https://github.com/bulik/ldsc

For ldsc, we need Anaconda to create a virtual environment (for Python 2). If you haven't installed Anaconda, please check how to install anaconda.

    # change to your directory for tools\ncd ~/tools\n\n# clone the ldsc github repository\ngit clone https://github.com/bulik/ldsc.git\n\n# create a virtual environment for ldsc (python2)\ncd ldsc\nconda env create --file environment.yml  \n\n# activate ldsc environment\nconda activate ldsc\n
    "},{"location":"08_LDSC/#data-preparation","title":"Data Preparation","text":"

In this tutorial, we will use sample summary statistics for HDLC and LDLC from Jenger: Kanai, Masahiro, et al. "Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases." Nature genetics 50.3 (2018): 390-400.

    The Miami plot for the two traits:

    "},{"location":"08_LDSC/#download-sample-summary-statistics","title":"Download sample summary statistics","text":"
    # HDL-c and LDL-c in Biobank Japan\nwget -O BBJ_LDLC.txt.gz http://jenger.riken.jp/61analysisresult_qtl_download/\nwget -O BBJ_HDLC.txt.gz http://jenger.riken.jp/47analysisresult_qtl_download/\n
    "},{"location":"08_LDSC/#download-reference-files","title":"Download reference files","text":"

    # change to your ldsc directory\ncd ~/tools/ldsc\nmkdir resource\ncd ./resource\n\n# snplist\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/w_hm3.snplist.bz2\n\n# EAS ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/eas_ldscores.tar.bz2\n\n# EAS weight\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz\n\n# EAS frequency\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_plinkfiles.tgz\n\n# EAS baseline model\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz\n\n# Cell type ld score files\nwget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/LDSC_SEG_ldscores/Cahoy_EAS_1000Gv3_ldscores.tar.gz\n
    You can then decompress the files and organize them.

    "},{"location":"08_LDSC/#munge-sumstats","title":"Munge sumstats","text":"

    Before the analysis, we need to format and clean the raw sumstats.

    Note

Rsid is used here. If the sumstats only contain IDs like CHR:POS:REF:ALT, annotate them with rsIDs first.

snplist=~/tools/ldsc/resource/w_hm3.snplist\nmunge_sumstats.py \\\n    --sumstats BBJ_HDLC.txt.gz \\\n    --merge-alleles $snplist \\\n    --a1 ALT \\\n    --a2 REF \\\n    --chunksize 500000 \\\n    --out BBJ_HDLC\nmunge_sumstats.py \\\n    --sumstats BBJ_LDLC.txt.gz \\\n    --merge-alleles $snplist \\\n    --a1 ALT \\\n    --a2 REF \\\n    --chunksize 500000 \\\n    --out BBJ_LDLC\n

    After munging, you will get two munged and formatted files:

    BBJ_HDLC.sumstats.gz\nBBJ_LDLC.sumstats.gz\n
    And these are the files we will use to run LD score regression.

    "},{"location":"08_LDSC/#ld-score-regression_2","title":"LD score regression","text":"

Univariate LD score regression is utilized to estimate the heritability and confounding factors (cryptic relatedness and population stratification) of a certain trait.

    Using the munged sumstats, we can run:

    ldsc.py \\\n  --h2 BBJ_HDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_HDLC\n\nldsc.py \\\n  --h2 BBJ_LDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_LDLC\n

Let's check the results for HDLC:

    cat BBJ_HDLC.log\n*********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--h2 BBJ_HDLC.sumstats.gz \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Sat Dec 24 20:40:34 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nUsing two-step estimator with cutoff at 30.\nTotal Observed scale h2: 0.1583 (0.0281)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.0563 (0.0114)\nRatio: 0.1981 (0.0402)\nAnalysis finished at Sat Dec 24 20:40:41 2022\nTotal time elapsed: 6.57s\n

From the log, we can see:

According to the LDSC documentation, Ratio measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of Ratio should be close to zero, though in practice values of 10-20% are not uncommon. For example, plugging the log above into the formula below gives (1.0563 - 1)/(1.2843 - 1) ≈ 0.198, matching the reported Ratio of 0.1981.

    \\[ Ratio = {{intercept-1}\\over{mean(\\chi^2)-1}} \\]"},{"location":"08_LDSC/#distribution-of-h2-and-intercept-across-traits-in-ukb","title":"Distribution of h2 and intercept across traits in UKB","text":"

The Neale Lab estimated SNP heritability using LDSC across more than 4,000 primary GWAS in UKB. You can check the distributions of SNP heritability and intercept estimates using the following link to get an idea of what you can expect from LD score regression:

    https://nealelab.github.io/UKBB_ldsc/viz_h2.html

    "},{"location":"08_LDSC/#cross-trait-ld-score-regression","title":"Cross-trait LD score regression","text":"

    Cross-trait LD score regression is employed to estimate the genetic correlation between a pair of traits.

Key idea: replace \(\chi^2\) in univariate LD score regression with the product of z scores \(z_{1j}z_{2j}\); the relationship with LD score still holds.

    \\[ E[z_{1j}z_{2j}] = {{\\sqrt{N_1N_2}\\rho_g}\\over{M}}l_j + {{\\rho N_s}\\over{\\sqrt{N_1N_2}}} \\]

    Then we can get the genetic correlation by :

    \\[ r_g = {{\\rho_g}\\over{\\sqrt{h_1^2h_2^2}}} \\]

    ldsc.py \\\n  --rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \\\n  --out BBJ_HDLC_LDLC\n
    Let's check the results:

    *********************************************************************\n* LD Score Regression (LDSC)\n* Version 1.0.1\n* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane\n* Broad Institute of MIT and Harvard / MIT Department of Mathematics\n* GNU General Public License v3\n*********************************************************************\nCall: \n./ldsc.py \\\n--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \\\n--out BBJ_HDLC_LDLC \\\n--rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \\\n--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \n\nBeginning analysis at Thu Dec 29 21:02:37 2022\nReading summary statistics from BBJ_HDLC.sumstats.gz ...\nRead summary statistics for 1020377 SNPs.\nReading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead reference panel LD Scores for 1208050 SNPs.\nRemoving partitioned LD Scores with zero variance.\nReading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)\nRead regression weight LD Scores for 1208050 SNPs.\nAfter merging with reference panel LD, 1012040 SNPs remain.\nAfter merging with regression SNP LD, 1012040 SNPs remain.\nComputing rg for phenotype 2/2\nReading summary statistics from BBJ_LDLC.sumstats.gz ...\nRead summary statistics for 1217311 SNPs.\nAfter merging with summary statistics, 1012040 SNPs remain.\n1012040 SNPs with valid alleles.\n\nHeritability of phenotype 1\n---------------------------\nTotal Observed scale h2: 0.1054 (0.0383)\nLambda GC: 1.1523\nMean Chi^2: 1.2843\nIntercept: 1.1234 (0.0607)\nRatio: 0.4342 (0.2134)\n\nHeritability of phenotype 2/2\n-----------------------------\nTotal Observed scale h2: 0.0543 (0.0211)\nLambda GC: 1.0833\nMean Chi^2: 1.1465\nIntercept: 1.0583 (0.0335)\nRatio: 0.398 (0.2286)\n\nGenetic Covariance\n------------------\nTotal Observed scale gencov: 0.0121 (0.0106)\nMean z1*z2: -0.001\nIntercept: -0.0198 (0.0121)\n\nGenetic Correlation\n-------------------\nGenetic Correlation: 0.1601 (0.1821)\nZ-score: 0.8794\nP: 0.3792\n\n\nSummary of Genetic Correlation Results\np1                    p2      rg      se       z       p  h2_obs  h2_obs_se  h2_int  h2_int_se  gcov_int  gcov_int_se\nBBJ_HDLC.sumstats.gz  BBJ_LDLC.sumstats.gz  0.1601  0.1821  0.8794  0.3792  0.0543     0.0211  1.0583     0.0335   -0.0198       0.0121\n\nAnalysis finished at Thu Dec 29 21:02:47 2022\nTotal time elapsed: 10.39s\n
    "},{"location":"08_LDSC/#partitioned-ld-regression","title":"Partitioned LD regression","text":"

Partitioned LD regression is utilized to evaluate the contribution of each functional group to the total SNP heritability.

    \\[ E[\\chi^2] = N \\sum\\limits_C \\tau_C l(j,C) + Na + 1 \\]
    ldsc.py \\\n  --h2 BBJ_HDLC.sumstats.gz \\\n  --overlap-annot \\\n  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n  --frqfile-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_plinkfiles/1000G.EAS.QC. \\\n  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n  --out BBJ_HDLC_baseline\n
    "},{"location":"08_LDSC/#celltype-specificity-ld-regression","title":"Celltype specificity LD regression","text":"

    LDSC-SEG : LD score regression applied to specifically expressed genes

    An extension of Partitioned LD regression. Categories are defined by tissue or cell-type specific genes.

    ldsc.py \\\n  --h2-cts BBJ_HDLC.sumstats.gz \\\n  --ref-ld-chr-cts ~/tools/ldsc/resource/Cahoy_EAS_1000Gv3_ldscores/Cahoy.EAS.ldcts \\\n  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \\\n  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \\\n  --out BBJ_HDLC_baseline_cts\n
    "},{"location":"08_LDSC/#reference","title":"Reference","text":""},{"location":"09_Gene_based_analysis/","title":"Gene and gene-set analysis","text":""},{"location":"09_Gene_based_analysis/#table-of-contents","title":"Table of Contents","text":""},{"location":"09_Gene_based_analysis/#magma-introduction","title":"MAGMA Introduction","text":"

MAGMA is one of the most commonly used tools for gene-based and gene-set analysis.

    Gene-level analysis in MAGMA uses two models:

1. Multiple linear principal components regression

MAGMA employs a multiple linear principal components regression and an F-test to obtain P values for genes. The multiple linear principal components regression:

    \\[ Y = \\alpha_{0,g} + X_g \\alpha_g + W \\beta_g + \\epsilon_g \\]

    \\(X_g\\) is obtained by first projecting the variant matrix of a gene onto its PC, and removing PCs with samll eigenvalues.

    Note

    The linear principal components regression model requires raw genotype data.

2. SNP-wise models

SNP-wise Mean: performs tests on the mean SNP association.

    Note

SNP-wise models use summary statistics and a reference LD panel.

    Gene-set analysis

    Quote

    Competitive gene-set analysis tests whether the genes in a gene-set are more strongly associated with the phenotype of interest than other genes.

P values for each gene are converted to Z scores to perform gene-set level analysis.

    \\[ Z = \\beta_{0,S} + S_S \\beta_S + \\epsilon \\] "},{"location":"09_Gene_based_analysis/#install-magma","title":"Install MAGMA","text":"

Download MAGMA for your operating system from the following URL:

    MAGMA: https://ctg.cncr.nl/software/magma

    For example:

    cd ~/tools\nmkdir MAGMA\ncd MAGMA\nwget https://ctg.cncr.nl/software/MAGMA/prog/magma_v1.10.zip\nunzip magma_v1.10.zip\n
    Add magma to your environment path.

    Test if it is successfully installed.

    $ magma --version\nMAGMA version: v1.10 (linux)\n

    "},{"location":"09_Gene_based_analysis/#download-reference-files","title":"Download reference files","text":"

We need the following reference files:

    The gene location files and LD reference panel can be downloaded from magma website.

    -> https://ctg.cncr.nl/software/magma

The third one can be downloaded from MSigDB.

    -> https://www.gsea-msigdb.org/gsea/msigdb/

    "},{"location":"09_Gene_based_analysis/#format-input-files","title":"Format input files","text":"
    zcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,$2,$3}' > HDLC_chr3.magma.input.snp.chr.pos.txt\nzcat ../08_LDSC/BBJ_HDLC.txt.gz | awk 'NR>1 && $2==3 {print $1,10^(-$11)}' >  HDLC_chr3.magma.input.p.txt\n
    "},{"location":"09_Gene_based_analysis/#annotate-snps","title":"Annotate SNPs","text":"
    snploc=./HDLC_chr3.magma.input.snp.chr.pos.txt\nncbi37=~/tools/magma/NCBI37/NCBI37.3.gene.loc\nmagma --annotate \\\n      --snp-loc ${snploc} \\\n      --gene-loc ${ncbi37} \\\n      --out HDLC_chr3\n

    Tip

    Usually to capture the variants in the regulatory regions, we will add windows upstream and downstream of the genes with --annotate window.

For example, --annotate window=35,10 sets a 35 kilobase (kb) upstream and 10 kb downstream window.

    "},{"location":"09_Gene_based_analysis/#gene-based-analysis","title":"Gene-based analysis","text":"
    ref=~/tools/magma/g1000_eas/g1000_eas\nmagma \\\n    --bfile $ref \\\n    --pval ./HDLC_chr3.magma.input.p.txt N=70657 \\\n    --gene-annot HDLC_chr3.genes.annot \\\n    --out HDLC_chr3\n
    "},{"location":"09_Gene_based_analysis/#gene-set-level-analysis","title":"Gene-set level analysis","text":"
    geneset=/home/he/tools/magma/MSigDB/msigdb_v2022.1.Hs_files_to_download_locally/msigdb_v2022.1.Hs_GMTs/msigdb.v2022.1.Hs.entrez.gmt\nmagma \\\n    --gene-results HDLC_chr3.genes.raw \\\n    --set-annot ${geneset} \\\n    --out HDLC_chr3\n
    "},{"location":"09_Gene_based_analysis/#reference","title":"Reference","text":""},{"location":"10_PRS/","title":"Polygenic risk scores","text":""},{"location":"10_PRS/#definition","title":"Definition","text":"

Polygenic risk score (PRS), also known as polygenic score (PGS) or genetic risk score (GRS), is a score that summarizes the effect sizes of genetic variants on a certain disease or trait (a weighted sum of disease/trait-associated alleles).

    To calculate the PRS for sample j,

    \\[PRS_j = \\sum_{i=0}^{i=M} x_{i,j} \\beta_{i}\\] "},{"location":"10_PRS/#prs-analysis-workflow","title":"PRS Analysis Workflow","text":"
    1. Developing PRS model using base data
    2. Performing validation to obtain best-fit parameters
    3. Evaluation in an independent population
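
Returning to the weighted-sum definition above, here is a toy NumPy sketch (all dosages and effect sizes are made up for illustration):

```python
import numpy as np

# toy data: 3 samples x 4 variants; x = effect-allele dosage (0/1/2)
x = np.array([
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
])
beta = np.array([0.12, -0.05, 0.30, 0.08])  # per-allele effect sizes

# PRS_j = sum_i x_ij * beta_i  -> one score per sample
prs = x @ beta
print(prs)  # [0.63 0.54 0.23]
```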
    "},{"location":"10_PRS/#methods","title":"Methods","text":"Category Description Representative Methods P value thresholding P + T C+T, PRSice Beta shrinkage genome-wide PRS model LDpred, PRS-CS

    In this tutorial, we will first briefly introduce how to develop PRS model using the sample data and then demonstrate how we can download PRS models from PGS Catalog and apply to our sample genotype data.

    "},{"location":"10_PRS/#ctpt-using-plink","title":"C+T/P+T using PLINK","text":"

P+T stands for Pruning + Thresholding, also known as Clumping and Thresholding (C+T), which is a very simple and straightforward approach to constructing PRS models.

    Clumping

Clumping: LD-pruning based on P value. It is an approach to select variants when there are multiple significant associations in high LD in the same region.

The three important parameters for clumping in PLINK are --clump-p1 (the P value threshold for index variants), --clump-r2 (the LD r2 threshold for clumping), and --clump-kb (the physical distance threshold for clumping).

    Clumping using PLINK

    #!/bin/bash\n\nplinkFile=../04_Data_QC/sample_data.clean\nsumStats=../06_Association_tests/1kgeas.B1.glm.firth\n\nplink \\\n    --bfile ${plinkFile} \\\n    --clump-p1 0.0001 \\\n    --clump-r2 0.1 \\\n    --clump-kb 250 \\\n    --clump ${sumStats} \\\n    --clump-snp-field ID \\\n    --clump-field P \\\n    --out 1kg_eas\n

    log

    --clump: 40 clumps formed from 307 top variants.\n
Check only the header and the first "clump" of SNPs:

    head -n 2 1kg_eas.clumped\n  CHR    F              SNP         BP        P    TOTAL   NSIG    S05    S01   S001  S0001    SP2\n2    1   2:55513738:C:T   55513738   1.69e-15       52      0      3      1      6     42 2:55305475:A:T(1),2:55338196:T:C(1),2:55347135:G:A(1),2:55351853:A:G(1),2:55363460:G:A(1),2:55395372:A:G(1),2:55395578:G:A(1),2:55395807:C:T(1),2:55405847:C:A(1),2:55408556:C:A(1),2:55410835:C:T(1),2:55413644:C:G(1),2:55435439:C:T(1),2:55449464:T:C(1),2:55469819:A:T(1),2:55492154:G:A(1),2:55500529:A:G(1),2:55502651:A:G(1),2:55508333:G:C(1),2:55563020:A:G(1),2:55572944:T:C(1),2:55585915:A:G(1),2:55599810:C:T(1),2:55605943:A:G(1),2:55611766:T:C(1),2:55612986:G:C(1),2:55619923:C:T(1),2:55622624:G:A(1),2:55624520:C:T(1),2:55628936:G:C(1),2:55638830:T:C(1),2:55639023:A:T(1),2:55639980:C:T(1),2:55640649:G:A(1),2:55641045:G:A(1),2:55642887:C:T(1),2:55647729:A:G(1),2:55650512:G:A(1),2:55659155:A:G(1),2:55665620:A:G(1),2:55667476:G:T(1),2:55670729:A:G(1),2:55676257:C:T(1),2:55685927:C:A(1),2:55689569:A:T(1),2:55689913:T:C(1),2:55693097:C:G(1),2:55707583:T:C(1),2:55720135:C:G(1)\n
    "},{"location":"10_PRS/#beta-shrinkage-using-prs-cs","title":"Beta shrinkage using PRS-CS","text":"\\[ \\beta_j | \\Phi_j \\sim N(0,\\phi\\Phi_j) , \\Phi_j \\sim g \\]

    Reference: Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications, 10(1), 1-10.

    "},{"location":"10_PRS/#parameter-tuning","title":"Parameter tuning","text":"Method Description Cross-validation 10-fold cross validation. This method usually requires large-scale genotype dataset. Independent population Perform validation in an independent population of the same ancestry. Pseudo-validation A few methods can estimate a single optimal shrinkage parameter using only the base GWAS summary statistics."},{"location":"10_PRS/#pgs-catalog","title":"PGS Catalog","text":"

Just like the GWAS Catalog, you can now download published PRS models from the PGS Catalog.

    URL: http://www.pgscatalog.org/

    Reference: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.

    "},{"location":"10_PRS/#calculate-prs-using-plink","title":"Calculate PRS using PLINK","text":"
    plink --score <score_filename> [variant ID col.] [allele col.] [score col.] ['header']\n

    Please check here for detailed documents on plink --score.

    Example

# genotype data\nplinkFile=../04_Data_QC/sample_data.clean\n# summary statistics for scoring\nsumStats=./t2d_plink_reduced.txt\n# SNPs after clumping\nawk 'NR!=1{print $3}' 1kg_eas.clumped > 1kgeas.valid.snp\n\nplink \\\n    --bfile ${plinkFile} \\\n    --score ${sumStats} 1 2 3 header \\\n    --extract 1kgeas.valid.snp \\\n    --out 1kgeas\n

    For thresholding using P values, we can create a range file and a p-value file.

    The options we use:

    --q-score-range <range file> <data file> [variant ID col.] [data col.] ['header']\n

    Example

    # SNP - P value file for thresholding\nawk '{print $1,$4}' ${sumStats} > SNP.pvalue\n\n# create a range file with 3 columns: range label, p-value lower bound, p-value upper bound\nhead range_list\npT0.001 0 0.001\npT0.05 0 0.05\npT0.1 0 0.1\npT0.2 0 0.2\npT0.3 0 0.3\npT0.4 0 0.4\npT0.5 0 0.5\n

    and then calculate the scores using the p-value ranges:

plink2 \\\n  --bfile ${plinkFile} \\\n  --score ${sumStats} 1 2 3 header cols=nallele,scoreavgs,denom,scoresums \\\n  --q-score-range range_list SNP.pvalue \\\n  --extract 1kgeas.valid.snp \\\n  --out 1kgeas\n

    You will get the following files:

    1kgeas.pT0.001.sscore\n1kgeas.pT0.05.sscore\n1kgeas.pT0.1.sscore\n1kgeas.pT0.2.sscore\n1kgeas.pT0.3.sscore\n1kgeas.pT0.4.sscore\n1kgeas.pT0.5.sscore\n

    Take a look at the files:

    head 1kgeas.pT0.1.sscore\n#IID    ALLELE_CT       DENOM   SCORE1_AVG      SCORE1_SUM\nHG00403 54554   54976   2.84455e-05     1.56382\nHG00404 54574   54976   5.65172e-05     3.10709\nHG00406 54284   54976   -3.91872e-05    -2.15436\nHG00407 54348   54976   -9.87606e-05    -5.42946\nHG00409 54760   54976   1.67157e-05     0.918963\nHG00410 54656   54976   3.74405e-05     2.05833\nHG00419 54052   54976   -6.4035e-05     -3.52039\nHG00421 54210   54976   -1.55942e-05    -0.857305\nHG00422 54102   54976   5.28824e-05     2.90726\n
    "},{"location":"10_PRS/#meta-scoring-methods-for-prs","title":"Meta-scoring methods for PRS","text":"

It has been shown recently that PRS models generated from multiple traits using a meta-scoring method potentially outperform PRS models generated from a single trait. Inouye et al. first used this approach to generate a PRS model for CAD from multiple PRS models.

    Potential advantages of meta-score for PRS generation

    Reference: Inouye, M., Abraham, G., Nelson, C. P., Wood, A. M., Sweeting, M. J., Dudbridge, F., ... & UK Biobank CardioMetabolic Consortium CHD Working Group. (2018). Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology, 72(16), 1883-1893.

Elastic net

Elastic net is a common approach for variable selection when there are highly correlated variables (for example, PRS of correlated diseases are often highly correlated). When fitting linear or logistic models, L1 and L2 penalties are added (regularization).

    \\[ \\hat{\\beta} \\equiv argmin({\\parallel y- X \\beta \\parallel}^2 + \\lambda_2{\\parallel \\beta \\parallel}^2 + \\lambda_1{\\parallel \\beta \\parallel} ) \\]

After validation, a meta-PRS can be generated from the distinct PRS of other genetically correlated diseases:

    \\[PRS_{meta} = {w_1}PRS_{Trait1} + {w_2}PRS_{Trait2} + {w_3}PRS_{Trait3} + ... \\]
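
A minimal scikit-learn sketch of this idea, learning the mixing weights on simulated validation data (the PRS matrix, outcome, and penalty parameters are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# toy validation data: PRS for 3 correlated traits (columns) in 200 samples
prs_matrix = rng.normal(size=(200, 3))
y = prs_matrix @ np.array([0.5, 0.3, 0.0]) + rng.normal(size=200)

# fit with L1 + L2 penalties to learn the weights w1, w2, w3
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(prs_matrix, y)
print(model.coef_)  # estimated mixing weights

# PRS_meta = weighted sum of the trait-specific PRS
prs_meta = prs_matrix @ model.coef_
```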

    An example: Abraham, G., Malik, R., Yonova-Doing, E., Salim, A., Wang, T., Danesh, J., ... & Dichgans, M. (2019). Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nature communications, 10(1), 1-10.

    "},{"location":"10_PRS/#reference","title":"Reference","text":""},{"location":"11_meta_analysis/","title":"Meta-analysis","text":""},{"location":"11_meta_analysis/#aims","title":"Aims","text":"

    Meta-analysis is one of the most commonly used statistical methods to combine the evidence from multiple studies into a single result.

    Potential problems for small-scale genome-wide association studies

To address these problems, meta-analysis is a powerful approach to integrate multiple GWAS summary statistics, especially as more and more summary statistics become publicly available. This method allows us to obtain increases in statistical power as sample size increases.

    What we could achieve by conducting meta-analysis

    "},{"location":"11_meta_analysis/#a-typical-workflow-of-meta-analysis","title":"A typical workflow of meta-analysis","text":""},{"location":"11_meta_analysis/#harmonization-and-qc-for-gwa-meta-analysis","title":"Harmonization and QC for GWA meta-analysis","text":"

    Before performing any type of meta-analysis, we need to make sure our datasets contain sufficient information and the datasets are QCed and harmonized. It is important to perform this step to avoid any unexpected errors and heterogeneity.

    Key points for Dataset selection

    Key points for Quality control

    Key points for Harmonization

    "},{"location":"11_meta_analysis/#fixed-effects-meta-analysis","title":"Fixed effects meta-analysis","text":"

Simply put, the fixed effects mentioned here mean that the between-study variance is zero. Under the fixed effect model, we assume a common effect size across studies for a certain SNP.

    Fixed effect model

    \\[ \\bar{\\beta_{ij}} = {{\\sum_{i=1}^{k} {w_{ij} \\beta_{ij}}}\\over{\\sum_{i=1}^{k} {w_{ij}}}} \\] "},{"location":"11_meta_analysis/#heterogeneity-test","title":"Heterogeneity test","text":"

    Cochran's Q test and \\(I^2\\)

    \\[ Q = \\sum_{i=1}^{k} {w_i (\\beta_i - \\bar{\\beta})^2} \\] \\[ I_j^2 = {{Q_j - df_j}\\over{Q_j}}\\times 100% = {{Q - (k - 1)}\\over{Q}}\\times 100% \\]"},{"location":"11_meta_analysis/#metal","title":"METAL","text":"

    METAL is one of the most commonly used tools for GWA meta-analysis. Its official documentation can be found here. METAL supports two models: (1) Sample size based approach and (2) Inverse variance based approach.

    A minimal example of meta-analysis using the IVW method

    metal_script.txt
    # classical approach, uses effect size estimates and standard errors\nSCHEME STDERR  \n\n# === DESCRIBE AND PROCESS THE FIRST INPUT FILE ===\nMARKER SNP\nALLELE REF_ALLELE OTHER_ALLELE\nEFFECT BETA\nPVALUE PVALUE \nSTDERR SE \nPROCESS inputfile1.txt\n\n# === THE SECOND INPUT FILE HAS THE SAME FORMAT AND CAN BE PROCESSED IMMEDIATELY ===\nPROCESS inputfile2.txt\n\nANALYZE\n

    Then, just run the following command to execute the metal script.

metal metal_script.txt\n
    "},{"location":"11_meta_analysis/#random-effects-meta-analysis","title":"Random effects meta-analysis","text":"

    On the other hand, random effects mean that we need to model the between-study variance, which is not zero in this case. Under the random effect model, we assume the true effect size for a certain SNP varies across studies.

    If heterogeneity of effects exists across studies, we need to model the between-study variance to correct for the deflation of variance in fixed-effect estimates.

    "},{"location":"11_meta_analysis/#gwama","title":"GWAMA","text":"

    Random effect model

    The random effect variance component can be estimated by:

\[ r_j^2 = max\left(0, {{Q_j - (N_j -1)}\over{\sum_iw_{ij} - ({{\sum_iw_{ij}^2} \over {\sum_iw_{ij}}})}}\right)\]

    Then the effect size for SNP j can be obtained by:

    \\[ \\bar{\\beta_j}^* = {{\\sum_{i=1}^{k} {w_{ij}^* \\beta_i}}\\over{\\sum_{i=1}^{k} {w_{ij}^*}}} \\]

    The weights are estimated by:

    \\[w_{ij}^* = {{1}\\over{r_j^2 + Var(\\beta_{ij})}} \\]
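
A minimal NumPy sketch of these random-effect formulas for one SNP, using more heterogeneous toy betas so the between-study variance component is non-zero (inverse-variance weights are again assumed):

```python
import numpy as np

# toy per-study estimates with visible heterogeneity
beta = np.array([0.05, 0.25, -0.02])
se = np.array([0.03, 0.05, 0.04])
w = 1 / se**2

# fixed-effect estimate and Cochran's Q, as above
beta_fe = np.sum(w * beta) / np.sum(w)
Q = np.sum(w * (beta - beta_fe) ** 2)

# between-study variance component (N = number of studies)
N = len(beta)
r2 = max(0.0, (Q - (N - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# random-effect weights and estimate
w_star = 1 / (r2 + se**2)
beta_re = np.sum(w_star * beta) / np.sum(w_star)
print(r2, beta_re)
```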

    The random effect model was implemented in GWAMA, which is another very popular GWA meta-analysis tool. Its official documentation can be found here.

    A minimal example of random effect meta-analysis using GWAMA

The input file for GWAMA contains the paths to each set of sumstats. Column names need to be standardized.

    GWAMA_script.in
    Pop1.txt\nPop2.txt\nPop3.txt\n
    GWAMA \\\n    -i GWAMA_script.in \\\n    --random \\\n    -o myresults\n
    "},{"location":"11_meta_analysis/#cross-ancestry-meta-analysis","title":"Cross-ancestry meta-analysis","text":""},{"location":"11_meta_analysis/#mantra","title":"MANTRA","text":"

    MANTRA (Meta-ANalysis of Transethnic Association studies) is one of the early efforts to address the heterogeneity for cross-ancestry meta-analysis.

    MANTRA implements a Bayesian partition model where GWASs were clustered into ancestry clusters based on a prior model of similarity between them. MANTRA then uses Markov chain Monte Carlo (MCMC) algorithms to approximate the posterior distribution of parameters (which might be quite computationally intensive). MANTRA has been shown to increase power and mapping resolution over random-effects meta-analysis over a range of models of heterogeneity situations.

    "},{"location":"11_meta_analysis/#mr-mega","title":"MR-MEGA","text":"

    MR-MEGA employs meta-regression to model the heterogeneity in effect sizes across ancestries. Its official documentation can be found here (The same first author as GWAMA).

    Meta-regression implemented in MR-MEGA

MR-MEGA first constructs a matrix \(D\) of pairwise Euclidean distances between GWAS across autosomal variants. The elements of \(D\), \(d_{k'k}\) for a pair of studies, can be expressed as follows. For each variant \(j\), \(p_{kj}\) is the allele frequency of \(j\) in study \(k\); then:

    \\[d_{k'k} = {{\\sum_jI_j(p_{kj}-p_{k'j})^2}\\over{\\sum_jI_j}}\\]
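
A toy NumPy sketch of this distance matrix, assuming for simplicity that every variant is present in all studies (so I_j = 1 and the denominator is just the number of variants):

```python
import numpy as np

# made-up allele frequencies: 4 studies (rows) x 5 shared variants (columns)
p = np.array([
    [0.10, 0.30, 0.50, 0.20, 0.40],
    [0.12, 0.28, 0.55, 0.22, 0.38],
    [0.40, 0.10, 0.20, 0.50, 0.60],
    [0.38, 0.12, 0.25, 0.48, 0.58],
])

K = p.shape[0]
D = np.zeros((K, K))
for k in range(K):
    for kk in range(K):
        # mean squared allele-frequency difference between studies k and k'
        D[k, kk] = np.mean((p[k] - p[kk]) ** 2)

print(D)  # studies 1-2 and 3-4 form two close pairs; MDS on D gives the axes of variation
```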

    Then multi-dimensional scaling (MDS) will be performed to derive T axes of genetic variation (\\(x_k\\) for study k)

    For each variant j, the effect size of the reference allele can be modeled in a linear regression model as :

    \\[E[\\beta_{kj}] = \\beta_j + \\sum_{t=1}^T\\beta_{tj}x_{kj}\\]

    A minimal example of meta-analysis using MR-MEGA

    The input file for MR-MEGA contains the path to each sumstats. Column names need to be standardized like GWAMA.

    MRMEGA_script.in
    Pop1.txt.gz\nPop2.txt.gz\nPop3.txt.gz\nPop4.txt.gz\nPop5.txt.gz\nPop6.txt.gz\nPop7.txt.gz\nPop8.txt.gz\n
    MR-MEGA \\\n    -i MRMEGA_script.in \\\n    --pc 4 \\\n    -o myresults\n
    "},{"location":"11_meta_analysis/#global-biobank-meta-analysis-initiative-gbmi","title":"Global Biobank Meta-analysis Initiative (GBMI)","text":"

    As a recent success achieved by meta-analysis, GBMI showed an example of the improvement of our understanding of diseases by taking advantage of large-scale meta-analyses.

For more details, you can check here.

    "},{"location":"11_meta_analysis/#reference","title":"Reference","text":""},{"location":"12_fine_mapping/","title":"Fine-mapping","text":""},{"location":"12_fine_mapping/#introduction","title":"Introduction","text":"

Fine-mapping aims to identify the causal variant(s) within a locus for a disease, given the evidence of significant association of the locus (or genomic region) in the GWAS of a disease.

    Fine-mapping using individual data is usually performed by fitting the multiple linear regression model:

    \\[y = Xb + e\\]

    Fine-mapping (using Bayesian methods) aims to estimate the PIP (posterior inclusion probability), which indicates the evidence for SNP j having a non-zero effect (namely, causal).

    PIP(Posterior Inclusion Probability)

    PIP is often calculated by the sum of the posterior probabilities over all models that include variant j as causal.

    \\[ PIP_j:=Pr(b_j\\neq0|X,y) \\]

    Bayesian methods and Posterior probability

    \\[ Pr(M_m | O) = {{Pr(O | M_m) Pr(M_m)}\\over{\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}} \\]

    \\(O\\) : Observed data

    \\(M\\) : Models (the configurations of causal variants in the context of fine-mapping).

    \\(Pr(M_m | O)\\): Posterior Probability of Model m

    \\(Pr(O | M_m)\\): Likelihood (the probability of observing your dataset given Model m is true.)

    \\(Pr(M_m)\\): Prior distribution of Model m (the probability of Model m being true)

    \\({\\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}\\): Evidence (the probability of observing your dataset), namely \\(Pr(O)\\)

    Credible sets

    A credible set refers to the minimum set of variants that contains all causal SNPs with probability \\(\u03b1\\). (Under the single-causal-variant-per-locus assumption, the credible set is calculated by ranking variants based on their posterior probabilities, and then summing these until the cumulative sum is \\(>\u03b1\\)). We usually report 95% credible sets (\u03b1=95%) for fine-mapping analysis.
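
A minimal Python sketch of this construction under the single-causal-variant assumption (the posterior probabilities are made up and sum to 1):

```python
import numpy as np

# made-up posterior probabilities for 6 variants in a locus
pp = np.array([0.02, 0.55, 0.25, 0.10, 0.05, 0.03])

alpha = 0.95
order = np.argsort(pp)[::-1]   # rank variants by posterior probability
cumsum = np.cumsum(pp[order])  # accumulate until the sum reaches alpha
n_in_set = np.searchsorted(cumsum, alpha) + 1
credible_set = order[:n_in_set]
print(credible_set)  # variants 1, 2, 3, 4: together they reach ~0.95
```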

    Commonly used tools for fine-mapping

    Methods assuming only one causal variant in the locus

    Methods assuming multiple causal variants in the locus

    Methods assuming a small number of larger causal effects with a large number of infinitesimal effects

    Methods for Cross-ancestry fine-mapping

    You can check here for more information.

In this tutorial, we will introduce SuSiE as an example. SuSiE stands for the \u201cSum of Single Effects\u201d model.

    The key idea behind SuSiE is :

    \\[b = \\sum_{l=1}^L b_l \\]

where each vector \(b_l = (b_{l1}, \u2026, b_{lJ})^T\) is a so-called single effect vector (a vector with only one non-zero element). L is the upper bound on the number of causal variants. This model can be fitted using Iterative Bayesian Stepwise Selection (IBSS).

For fine-mapping with summary statistics using SuSiE (SuSiE-RSS), IBSS was modified (IBSS-ss) to take sufficient statistics (which can be computed from other combinations of summary statistics) as input. SuSiE will then approximate the sufficient statistics to run fine-mapping.

    Quote

    For details of SuSiE and SuSiE-RSS, please check : Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the \u201cSum of Single Effects\u201d model. PLoS Genetics, 18(7), e1010299. Link

    "},{"location":"12_fine_mapping/#file-preparation","title":"File Preparation","text":"

Use Python to check the novel loci and extract the files.

    import gwaslab as gl\nimport pandas as pd\nimport numpy as np\n\nsumstats = gl.Sumstats(\"../06_Association_tests/1kgeas.B1.glm.firth\",fmt=\"plink2\")\n...\n\nsumstats.basic_check()\n...\n\nsumstats.get_lead()\n\nFri Jan 13 23:31:43 2023 Start to extract lead variants...\nFri Jan 13 23:31:43 2023  -Processing 1122285 variants...\nFri Jan 13 23:31:43 2023  -Significance threshold : 5e-08\nFri Jan 13 23:31:43 2023  -Sliding window size: 500  kb\nFri Jan 13 23:31:44 2023  -Found 59 significant variants in total...\nFri Jan 13 23:31:44 2023  -Identified 3 lead variants!\nFri Jan 13 23:31:44 2023 Finished extracting lead variants successfully!\n\nSNPID CHR POS EA  NEA SE  Z P OR  N STATUS\n110723  2:55574452:G:C  2 55574452  C G 0.160948  -5.98392  2.178320e-09  0.381707  503 9960099\n424615  6:29919659:T:C  6 29919659  T C 0.155457  -5.89341  3.782970e-09  0.400048  503 9960099\n635128  9:36660672:A:G  9 36660672  G A 0.160275  5.63422 1.758540e-08  2.467060  503 9960099\n
We will perform fine-mapping for the first significant locus, whose lead variant is 2:55574452:G:C.

# filter in the variants in this locus.\n\nlocus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452')\nlocus.fill_data(to_fill=["BETA"])\nlocus.harmonize(basic_check=False, ref_seq="/Users/he/mydata/Reference/Genome/human_g1k_v37.fasta")\nlocus.data.to_csv("sig_locus.tsv",sep="\t",index=None)\nlocus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)\n

    check in terminal:

    head sig_locus.tsv\nSNPID   CHR     POS     EA      NEA     BETA    SE      Z       P       OR      N       STATUS\n2:54535206:C:T  2       54535206        T       C       0.30028978      0.142461        2.10786 0.0350429       1.35025 503     9960099\n2:54536167:C:G  2       54536167        G       C       0.14885099      0.246871        0.602952        0.546541        1.1605  503     9960099\n2:54539096:A:G  2       54539096        G       A       -0.0038474211   0.288489        -0.0133355      0.98936 0.99616 503     9960099\n2:54540264:G:A  2       54540264        A       G       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540614:G:T  2       54540614        T       G       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540621:A:G  2       54540621        G       A       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n2:54540970:T:C  2       54540970        C       T       -0.049506452    0.149053        -0.332144       0.739781        0.951699        503     9960099\n2:54544229:T:C  2       54544229        C       T       -0.14338203     0.151172        -0.948468       0.342891        0.866423        503     9960099\n2:54545593:T:C  2       54545593        C       T       -0.1536723      0.165879        -0.926409       0.354234        0.857553        503     9960099\n\nhead  sig_locus.snplist\n2:54535206:C:T\n2:54536167:C:G\n2:54539096:A:G\n2:54540264:G:A\n2:54540614:G:T\n2:54540621:A:G\n2:54540970:T:C\n2:54544229:T:C\n2:54545593:T:C\n2:54546032:C:G\n

    "},{"location":"12_fine_mapping/#ld-matrix-calculation","title":"LD Matrix Calculation","text":"

    Example

    #!/bin/bash\n\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\n\n# LD r matrix\nplink \\\n  --bfile ${plinkFile} \\\n  --keep-allele-order \\\n  --r square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt\n\n# LD r2 matrix\nplink \\\n  --bfile ${plinkFile} \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt_r2\n
    Take a look at the LD matrix (first 5 rows and columns)

    head -5 sig_locus_mt.ld | cut -f 1-5\n1       -0.145634       0.252616        -0.0876317      -0.0876317\n-0.145634       1       -0.0916734      -0.159635       -0.159635\n0.252616        -0.0916734      1       0.452333        0.452333\n-0.0876317      -0.159635       0.452333        1       1\n-0.0876317      -0.159635       0.452333        1       1\n\nhead -5 sig_locus_mt_r2.ld | cut -f 1-5\n1       0.0212091       0.0638148       0.00767931      0.00767931\n0.0212091       1       0.00840401      0.0254833       0.0254833\n0.0638148       0.00840401      1       0.204605        0.204605\n0.00767931      0.0254833       0.204605        1       1\n0.00767931      0.0254833       0.204605        1       1\n
    Heatmap of the LD matrix:

    "},{"location":"12_fine_mapping/#fine-mapping-with-summary-statistics-using-susier","title":"Fine-mapping with summary statistics using SusieR","text":"

    Note

install.packages("susieR")\nlibrary(susieR)\n\n# Fine-mapping with summary statistics\n# bhat/shat: effect estimates and standard errors; R: LD matrix; n: sample size\nfitted_rss2 = susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, R = R, n = n, L = 10)\n

Arguments of susie_rss: R: a p x p LD r matrix. n: sample size. bhat: alternative summary data giving the estimated effects (a vector of length p); this, together with shat, may be provided instead of z. shat: alternative summary data giving the standard errors of the estimated effects (a vector of length p); this, together with bhat, may be provided instead of z. L: maximum number of non-zero effects in the susie regression model (default: L = 10).

    Quote

For details, please check SusieR tutorial - Fine-mapping with susieR using summary statistics

    Use susieR in jupyter notebook (with Python):

    Please check : https://github.com/Cloufield/GWASTutorial/blob/main/12_fine_mapping/finemapping_susie.ipynb

    "},{"location":"12_fine_mapping/#reference","title":"Reference","text":""},{"location":"13_heritability/","title":"Heritability","text":"

    Heritability is a term used in genetics to describe how much phenotypic variation can be explained by genetic variation.

    For any phenotype, its variation \\(Var(P)\\) can be modeled as the combination of genetic effects \\(Var(G)\\) and environmental effects \\(Var(E)\\).

    \\[ Var(P) = Var(G) + Var(E) \\]"},{"location":"13_heritability/#broad-sense-heritability","title":"Broad-sense Heritability","text":"

The broad-sense heritability \(H^2_{broad-sense}\) is mathematically defined as:

    \\[ H^2_{broad-sense} = {Var(G)\\over{Var(P)}} \\]"},{"location":"13_heritability/#narrow-sense-heritability","title":"Narrow-sense Heritability","text":"

    Genetic effects \\(Var(G)\\) is composed of multiple effects including additive effects \\(Var(A)\\), dominant effects, recessive effects, epistatic effects and so forth.

Narrow-sense heritability is defined as:

    \\[ h^2_{narrow-sense} = {Var(A)\\over{Var(P)}} \\]"},{"location":"13_heritability/#snp-heritability","title":"SNP Heritability","text":"

    SNP heritability \\(h^2_{SNP}\\) : the proportion of phenotypic variance explained by tested SNPs in a GWAS.

Common methods to estimate SNP heritability include:

    "},{"location":"13_heritability/#liability-and-threshold-model","title":"Liability and Threshold model","text":""},{"location":"13_heritability/#observed-scale-heritability-and-liability-scaled-heritability","title":"Observed-scale heritability and liability-scaled heritability","text":"

Issues for binary traits:

    The scale issue for binary traits

Conversion formula (Equation 23 from Lee et al. 2011), where K is the population prevalence of the disease, P is the proportion of cases in the sample, and Z is the height of the standard normal density at the liability threshold:

\[ h^2_{liability-scale} = h^2_{observed-scale} \times {{K(1-K)}\over{Z^2}} \times {{K(1-K)}\over{P(1-P)}} \] "},{"location":"13_heritability/#further-reading","title":"Further Reading","text":""},{"location":"14_gcta_greml/","title":"SNP-Heritability estimation by GCTA-GREML","text":""},{"location":"14_gcta_greml/#introduction","title":"Introduction","text":"

    The basic model behind GCTA-GREML is the linear mixed model (LMM):

    \\[y = X\\beta + Wu + e\\] \\[ Var(y) = V = WW^{'}\\delta^2_u + I \\delta^2_e\\]

    GCTA defines \\(A = WW^{'}/N\\) and \\(\\delta^2_g\\) as the variance explained by SNPs.

So the original model can be written as:

    \\[y = X\\beta + g + e\\] \\[ Var(y) = V = A\\delta^2_g + I \\delta^2_e\\]

    Quote

    For details, please check Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82. link.

    "},{"location":"14_gcta_greml/#donwload","title":"Donwload","text":"

    Download the version of GCTA for your system from : https://yanglab.westlake.edu.cn/software/gcta/#Download

    Example

    wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip\nunzip gcta-1.94.1-linux-kernel-3-x86_64.zip\ncd gcta-1.94.1-linux-kernel-3-x86_64.zip\n\n./gcta-1.94.1\n*******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 12:22:19 JST on Sun Jan 15 2023.\nHostname: Home-Desktop\n\nError: no analysis has been launched by the option(s)\nPlease see online documentation at https://yanglab.westlake.edu.cn/software/gcta/\n

    Tip

    Add GCTA to your environment

    "},{"location":"14_gcta_greml/#make-grm","title":"Make GRM","text":"
    #!/bin/bash\nplinkFile=\"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\"\ngcta \\\n  --bfile ${plinkFile} \\\n  --autosome \\\n  --maf 0.01 \\\n  --make-grm \\\n  --out 1kg_eas\n
    *******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:21:24 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nOptions:\n\n--bfile ../04_Data_QC/sample_data.clean\n--autosome\n--maf 0.01\n--make-grm\n--out 1kg_eas\n\nNote: GRM is computed using the SNPs on the autosomes.\nReading PLINK FAM file from [../04_Data_QC/sample_data.clean.fam]...\n500 individuals to be included from FAM file.\n500 individuals to be included. 0 males, 0 females, 500 unknown.\nReading PLINK BIM file from [../04_Data_QC/sample_data.clean.bim]...\n1224104 SNPs to be included from BIM file(s).\nThreshold to filter variants: MAF > 0.010000.\nComputing the genetic relationship matrix (GRM) v2 ...\nSubset 1/1, no. subject 1-500\n  500 samples, 1224104 markers, 125250 GRM elements\nIDs for the GRM file have been saved in the file [1kg_eas.grm.id]\nComputing GRM...\n  100% finished in 7.4 sec\n1224104 SNPs have been processed.\n  Used 1128732 valid SNPs.\nThe GRM computation is completed.\nSaving GRM...\nGRM has been saved in the file [1kg_eas.grm.bin]\nNumber of SNPs in each pair of individuals has been saved in the file [1kg_eas.grm.N.bin]\n\nAnalysis finished at 17:21:32 JST on Tue Dec 26 2023\nOverall computational time: 8.51 sec.\n
    "},{"location":"14_gcta_greml/#estimation","title":"Estimation","text":"
#!/bin/bash\n\n# the GRM we calculated in step 1\nGRM=1kg_eas\n\n# phenotype file\nphenotypeFile=../01_Dataset/1kgeas_binary_gcta.txt\n\n# disease prevalence used for conversion to liability-scale heritability\nprevalence=0.5\n\n# use 5 PCs as covariates\nawk '{print $1,$2,$5,$6,$7,$8,$9}' ../05_PCA/plink_results_projected.sscore > 5PCs.txt\n\ngcta \\\n  --grm ${GRM} \\\n  --pheno ${phenotypeFile} \\\n  --prevalence ${prevalence} \\\n  --qcovar 5PCs.txt \\\n  --reml \\\n  --out 1kg_eas\n
    "},{"location":"14_gcta_greml/#results","title":"Results","text":"

    Warning

    This is just to show the analysis pipeline. The trait was simulated under an unreal condition (effect size is extremely large) so the result is meaningless here.

    For real analysis, you need a larger sample size to get robust estimation. Please see the GCTA FAQ

    *******************************************************************\n* Genome-wide Complex Trait Analysis (GCTA)\n* version v1.94.1 Linux\n* Built at Nov 15 2022 21:14:25, by GCC 8.5\n* (C) 2010-present, Yang Lab, Westlake University\n* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>\n*******************************************************************\nAnalysis started at 17:36:37 JST on Tue Dec 26 2023.\nHostname: Yunye\n\nAccepted options:\n--grm 1kg_eas\n--pheno ../01_Dataset/1kgeas_binary_gcta.txt\n--prevalence 0.5\n--qcovar 5PCs.txt\n--reml\n--out 1kg_eas\n\nNote: This is a multi-thread program. You could specify the number of threads by the --thread-num option to speed up the computation if there are multiple processors in your machine.\n\nReading IDs of the GRM from [1kg_eas.grm.id].\n500 IDs are read from [1kg_eas.grm.id].\nReading the GRM from [1kg_eas.grm.bin].\nGRM for 500 individuals are included from [1kg_eas.grm.bin].\nReading phenotypes from [../01_Dataset/1kgeas_binary_gcta.txt].\nNon-missing phenotypes of 503 individuals are included from [../01_Dataset/1kgeas_binary_gcta.txt].\nReading quantitative covariate(s) from [5PCs.txt].\n5 quantitative covariate(s) of 501 individuals are included from [5PCs.txt].\nAssuming a disease phenotype for a case-control study: 248 cases and 250 controls\n5 quantitative variable(s) included as covariate(s).\n498 individuals are in common in these files.\n\nPerforming  REML analysis ... (Note: may take hours depending on sample size).\n498 observations, 6 fixed effect(s), and 2 variance component(s)(including residual variance).\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values:  0.12498 0.124846\nlogL: 95.34\nRunning AI-REML algorithm ...\nIter.   logL    V(G)    V(e)\n1       95.34   0.14264 0.10708\n2       95.37   0.18079 0.06875\n3       95.40   0.18071 0.06888\n4       95.40   0.18071 0.06888\nLog-likelihood ratio converged.\n\nCalculating the logLikelihood for the reduced model ...\n(variance component 1 is dropped from the model)\nCalculating prior values of variance components by EM-REML ...\nUpdated prior values: 0.24901\nlogL: 94.78319\nRunning AI-REML algorithm ...\nIter.   logL    V(e)\n1       94.79   0.24900\n2       94.79   0.24899\nLog-likelihood ratio converged.\n\nSummary result of REML analysis:\nSource  Variance        SE\nV(G)    0.180708        0.164863\nV(e)    0.068882        0.162848\nVp      0.249590        0.016001\nV(G)/Vp 0.724021        0.654075\nThe estimate of variance explained on the observed scale is transformed to that on the underlying liability scale:\n(Proportion of cases in the sample = 0.497992; User-specified disease prevalence = 0.500000)\nV(G)/Vp_L       1.137308        1.027434\n\nSampling variance/covariance of the estimates of variance components:\n2.717990e-02    -2.672171e-02\n-2.672171e-02   2.651955e-02\n\nSummary result of REML analysis has been saved in the file [1kg_eas.hsq].\n\nAnalysis finished at 17:36:38 JST on Tue Dec 26 2023\nOverall computational time: 0.08 sec.\n
    "},{"location":"14_gcta_greml/#reference","title":"Reference","text":""},{"location":"15_winners_curse/","title":"Winner's curse","text":""},{"location":"15_winners_curse/#winners-curse-definition","title":"Winner's curse definition","text":"

Winner's curse refers to the phenomenon that genetic effects are systematically overestimated by the thresholding or selection process in genetic association studies.

    Winner's curse in auctions

    This term was initially used to describe a phenomenon that occurs in auctions. The winning bid is very likely to overestimate the intrinsic value of an item even if all the bids are unbiased (the auctioned item is of equal value to all bidders). The thresholding process in GWAS resembles auctions, where the lead variants are the winning bids.

    Reference:

    "},{"location":"15_winners_curse/#wc-correction","title":"WC correction","text":"

    The asymptotic distribution of \\(\\beta_{Observed}\\) is:

    \\[\\beta_{Observed} \\sim N(\\beta_{True},\\sigma^2)\\]

    An example of distribution of \\(\\beta_{Observed}\\)

    It is equivalent to:

    \\[{{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}} \\sim N(0,1)\\]

    An example of distribution of \\({{\\beta_{Observed} - \\beta_{True}}\\over{\\sigma}}\\)

    We can obtain the asymptotic sampling distribution (which is a truncated normal distribution) for \\(\\beta_{Observed}\\) by:

    \\[f(x,\\beta_{True}) ={{1}\\over{\\sigma}} {{\\phi({{{x - \\beta_{True}}\\over{\\sigma}}})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]

    when

    \\[|{{x}\\over{\\sigma}}|\\geq c\\]

    From the asymptotic sampling distribution, the expectation of effect sizes for the selected variants can then be approximated by:

    \\[ E(\\beta_{Observed}; \\beta_{True}) = \\beta_{True} + \\sigma {{\\phi({{{\\beta_{True}}\\over{\\sigma}}-c}) - \\phi({{{-\\beta_{True}}\\over{\\sigma}}-c})} \\over {\\Phi({{{\\beta_{True}}\\over{\\sigma}}-c}) + \\Phi({{{-\\beta_{True}}\\over{\\sigma}}-c})}}\\]

    Derivation of this equation can be found in the Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.
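As a rough numeric illustration of this expectation, consider the following R sketch (the true effect, SE and significance threshold are assumed values):

wc_expected_beta <- function(beta_true, se, alpha = 5e-8) {\n  c <- qnorm(1 - alpha / 2)  # selection threshold on |z|\n  z <- beta_true / se\n  # expectation of the observed effect under selection (equation above)\n  beta_true + se * (dnorm(z - c) - dnorm(-z - c)) / (pnorm(z - c) + pnorm(-z - c))\n}\n\nwc_expected_beta(beta_true = 0.05, se = 0.02)  # ~0.115, inflated relative to the true 0.05\n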

    Reference:

    Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

    "},{"location":"16_mendelian_randomization/","title":"Mendelian randomization","text":""},{"location":"16_mendelian_randomization/#mendelian-randomization-introduction","title":"Mendelian randomization introduction","text":"

    Comparison between RCT and MR

    "},{"location":"16_mendelian_randomization/#fundamental-assumption-gene-environment-equivalence","title":"Fundamental assumption: gene-environment equivalence","text":"

    (cited from George Davey Smith Mendelian Randomization - 25th April 2024)

The fundamental assumption of Mendelian randomization (MR) is gene-environment equivalence. MR reflects the phenocopy/genocopy dialectic (Goldschmidt, Schmalhausen): the idea is that any environmental effect can be mimicked by one or several mutations (Zuckerkandl and Villet, PNAS 1988).

    Gene-environment equivalence

    If we consider BMI as the outcome, let's think about whether genetic variants related to the following exposures meet the gene-environment equivalence assumption:

    "},{"location":"16_mendelian_randomization/#methods-instrumental-variables-iv","title":"Methods: Instrumental Variables (IV)","text":"

An instrumental variable (IV) can be defined as a variable that is correlated with the exposure X and uncorrelated with the error term \(\epsilon\) in the following regression:

    \\[ Y = X\\beta + \\epsilon \\]

    "},{"location":"16_mendelian_randomization/#iv-assumptions","title":"IV Assumptions","text":"

    Key Assumptions

| Assumption | Description |
| --- | --- |
| Relevance | Instrumental variables are strongly associated with the exposure (IVs are not independent of X). |
| Exclusion restriction | Instrumental variables do not affect the outcome except through the exposure (IV is independent of Y, conditional on X and C). |
| Independence | There are no confounders of the instrumental variables and the outcome (IV is independent of C). |
| Monotonicity | Variants affect the exposure in the same direction for all individuals. |
| No assortative mating | Assortative mating might cause bias in MR. |
"},{"location":"16_mendelian_randomization/#two-stage-least-squares-2sls","title":"Two-stage least-squares (2SLS)","text":"\[ X = \mu_1 + \beta_{IV} IV + \epsilon_1 \] \[ Y = \mu_2 + \beta_{2SLS} \hat{X} + \epsilon_2 \]"},{"location":"16_mendelian_randomization/#two-sample-mr","title":"Two-sample MR","text":"

Two-sample MR refers to the approach in which the genetic effects of the instruments on the exposure are estimated in a sample independent of the one used to estimate the effects of the instruments on the outcome. As more and more GWAS summary statistics become publicly available, the scope of MR keeps expanding with two-sample MR methods.

    \\[ \\hat{\\beta}_{X,Y} = {{\\hat{\\beta}_{IV,Y}}\\over{\\hat{\\beta}_{IV,X}}} \\]

    Caveats

    For two-sample MR, there is an additional key assumption:

The two samples used for MR are drawn from the same underlying population. (The effect sizes of the instruments on the exposure should be the same in both samples.)

Therefore, for two-sample MR, we usually use non-overlapping datasets from similar populations, in terms of not only ancestry but also contextual factors.

    "},{"location":"16_mendelian_randomization/#iv-selection","title":"IV selection","text":"

    One of the first things to do when you plan to perform any type of MR is to check the associations of instrumental variables with the exposure to avoid bias caused by weak IVs.

    The most commonly used method here is the F-statistic, which tests the association of instrumental variables with the exposure.
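For a single instrument, a commonly used approximation is \(F \approx (\beta / SE)^2\), with F > 10 as the usual rule of thumb for excluding weak instruments. A minimal sketch using the beta.exposure and se.exposure columns produced by format_data in the practice below:

# approximate per-variant F-statistics\nf_stat <- (exp_dat$beta.exposure / exp_dat$se.exposure)^2\nsummary(f_stat)\n\n# keep instruments with F > 10\nexp_dat <- exp_dat[f_stat > 10, ]\n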

    "},{"location":"16_mendelian_randomization/#practice","title":"Practice","text":"

    In this tutorial, we will walk you through how to perform a minimal TwoSampleMR analysis. We will use the R package TwoSampleMR, which provides easy-to-use functions for formatting, clumping and harmonizing GWAS summary statistics.

    This package integrates a variety of commonly used MR methods for analysis, including:

    > mr_method_list()\n                             obj\n1                  mr_wald_ratio\n2               mr_two_sample_ml\n3            mr_egger_regression\n4  mr_egger_regression_bootstrap\n5               mr_simple_median\n6             mr_weighted_median\n7   mr_penalised_weighted_median\n8                         mr_ivw\n9                  mr_ivw_radial\n10                    mr_ivw_mre\n11                     mr_ivw_fe\n12                mr_simple_mode\n13              mr_weighted_mode\n14         mr_weighted_mode_nome\n15           mr_simple_mode_nome\n16                       mr_raps\n17                       mr_sign\n18                        mr_uwr\n\n                                                        name PubmedID\n1                                                 Wald ratio\n2                                         Maximum likelihood\n3                                                   MR Egger 26050253\n4                                       MR Egger (bootstrap) 26050253\n5                                              Simple median\n6                                            Weighted median\n7                                  Penalised weighted median\n8                                  Inverse variance weighted\n9                                                 IVW radial\n10 Inverse variance weighted (multiplicative random effects)\n11                 Inverse variance weighted (fixed effects)\n12                                               Simple mode\n13                                             Weighted mode\n14                                      Weighted mode (NOME)\n15                                        Simple mode (NOME)\n16                      Robust adjusted profile score (RAPS)\n17                                     Sign concordance test\n18                                     Unweighted regression\n

    "},{"location":"16_mendelian_randomization/#inverse-variance-weighted-fixed-effects","title":"Inverse variance weighted (fixed effects)","text":"

    Assumption: the underlying 'true' effect is fixed across variants

Weight for the effect of the i-th variant:

\[w_i = {1 \over Var(\beta_i)}\]

    Effect size:

    \\[\\beta = {{\\sum_{i=1}^N{w_i \\beta_i}}\\over{\\sum_{i=1}^Nw_i}}\\]

    SE:

    \\[SE = {\\sqrt{{1}\\over{\\sum_{i=1}^Nw_i}}}\\]"},{"location":"16_mendelian_randomization/#file-preparation","title":"File Preparation","text":"

To perform a two-sample MR analysis, we need summary statistics for the exposure and the outcome, generated from two independent samples of the same ancestry.

    In this tutorial, we will use sumstats from Biobank Japan pheweb and KoGES pheweb.

    "},{"location":"16_mendelian_randomization/#r-package-twosamplemr","title":"R package TwoSampleMR","text":"

First, to use TwoSampleMR, we need R >= 4.1. To install the package, run:

    library(remotes)\ninstall_github(\"MRCIEU/TwoSampleMR\")\n
    "},{"location":"16_mendelian_randomization/#loading-package","title":"Loading package","text":"
    library(TwoSampleMR)\n
    "},{"location":"16_mendelian_randomization/#reading-exposure-sumstats","title":"Reading exposure sumstats","text":"
# format exposure dataset\nlibrary(data.table)\n\nexp_raw <- fread(\"koges_bmi.txt.gz\")\n
    "},{"location":"16_mendelian_randomization/#extracting-instrumental-variables","title":"Extracting instrumental variables","text":"
# select only genome-wide significant variants\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_dat <- format_data( exp_raw,\n    type = \"exposure\",\n    snp_col = \"rsids\",\n    beta_col = \"beta\",\n    se_col = \"sebeta\",\n    effect_allele_col = \"alt\",\n    other_allele_col = \"ref\",\n    eaf_col = \"af\",\n    pval_col = \"pval\"\n)\n
    "},{"location":"16_mendelian_randomization/#clumping-exposure-variables","title":"Clumping exposure variables","text":"
    clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\") \n
    "},{"location":"16_mendelian_randomization/#outcome","title":"outcome","text":"
out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n                    select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\"))\nout_dat <- format_data( out_raw,\n    type = \"outcome\",\n    snp_col = \"SNPID\",\n    beta_col = \"BETA\",\n    se_col = \"SE\",\n    effect_allele_col = \"Allele2\",\n    other_allele_col = \"Allele1\",\n    pval_col = \"p.value\"\n)\n
    "},{"location":"16_mendelian_randomization/#harmonizing-data","title":"Harmonizing data","text":"
# action = 1: assume all alleles are coded on the forward strand (no palindrome handling)\nharmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
    "},{"location":"16_mendelian_randomization/#perform-mr-analysis","title":"Perform MR analysis","text":"
    res <- mr(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    method  nsnp    b   se  pval\n<chr>   <chr>   <chr>   <chr>   <chr>   <int>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    MR Egger    28  1.3337580   0.69485260  6.596064e-02\n9J8pv4  IyUv6b  outcome exposure    Weighted median 28  0.6298980   0.09401352  2.083081e-11\n9J8pv4  IyUv6b  outcome exposure    Inverse variance weighted   28  0.5598956   0.23225806  1.592361e-02\n9J8pv4  IyUv6b  outcome exposure    Simple mode 28  0.6097842   0.15180476  4.232158e-04\n9J8pv4  IyUv6b  outcome exposure    Weighted mode   28  0.5946778   0.12820220  8.044488e-05\n
    "},{"location":"16_mendelian_randomization/#sensitivity-analysis","title":"Sensitivity analysis","text":""},{"location":"16_mendelian_randomization/#heterogeneity","title":"Heterogeneity","text":"

Test whether there is heterogeneity among the causal effects of X on Y estimated from each variant.

    mr_heterogeneity(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    method  Q   Q_df    Q_pval\n<chr>   <chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    MR Egger    670.7022    26  1.000684e-124\n9J8pv4  IyUv6b  outcome exposure    Inverse variance weighted   706.6579    27  1.534239e-131\n
    "},{"location":"16_mendelian_randomization/#horizontal-pleiotropy","title":"Horizontal Pleiotropy","text":"

    Intercept in MR-Egger

    mr_pleiotropy_test(harmonized_data)\n\nid.exposure id.outcome  outcome exposure    egger_intercept se  pval\n<chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <dbl>\n9J8pv4  IyUv6b  outcome exposure    -0.03603697 0.0305241   0.2484472\n
    "},{"location":"16_mendelian_randomization/#single-snp-mr-and-leave-one-out-mr","title":"Single SNP MR and leave-one-out MR","text":"

    Single SNP MR

    res_single <- mr_singlesnp(harmonized_data)\nres_single\n\nexposure    outcome id.exposure id.outcome  samplesize  SNP b   se  p\n<chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <dbl>   <dbl>   <dbl>\n1   exposure    outcome 9J8pv4  IyUv6b  NA  rs10198356  0.6323140   0.2082837   2.398742e-03\n2   exposure    outcome 9J8pv4  IyUv6b  NA  rs10209994  0.9477808   0.3225814   3.302164e-03\n3   exposure    outcome 9J8pv4  IyUv6b  NA  rs10824329  0.6281765   0.3246214   5.297739e-02\n4   exposure    outcome 9J8pv4  IyUv6b  NA  rs10938397  1.2376316   0.2775854   8.251150e-06\n5   exposure    outcome 9J8pv4  IyUv6b  NA  rs11066132  0.6024303   0.2232401   6.963693e-03\n6   exposure    outcome 9J8pv4  IyUv6b  NA  rs12522139  0.2905201   0.2890240   3.148119e-01\n7   exposure    outcome 9J8pv4  IyUv6b  NA  rs12591730  0.8930490   0.3076687   3.700413e-03\n8   exposure    outcome 9J8pv4  IyUv6b  NA  rs13013021  1.4867889   0.2207777   1.646925e-11\n9   exposure    outcome 9J8pv4  IyUv6b  NA  rs1955337   0.5442640   0.2994146   6.910079e-02\n10  exposure    outcome 9J8pv4  IyUv6b  NA  rs2076308   1.1176226   0.2657969   2.613132e-05\n11  exposure    outcome 9J8pv4  IyUv6b  NA  rs2278557   0.6238587   0.2968184   3.556906e-02\n12  exposure    outcome 9J8pv4  IyUv6b  NA  rs2304608   1.5054682   0.2968905   3.961740e-07\n13  exposure    outcome 9J8pv4  IyUv6b  NA  rs2531995   1.3972908   0.3130157   8.045689e-06\n14  exposure    outcome 9J8pv4  IyUv6b  NA  rs261967    1.5303384   0.2921192   1.616714e-07\n15  exposure    outcome 9J8pv4  IyUv6b  NA  rs35332469  -0.2307314  0.3479219   5.072217e-01\n16  exposure    outcome 9J8pv4  IyUv6b  NA  rs35560038  -1.5730870  0.2018968   6.619637e-15\n17  exposure    outcome 9J8pv4  IyUv6b  NA  rs3755804   0.5314915   0.2325073   2.225933e-02\n18  exposure    outcome 9J8pv4  IyUv6b  NA  rs4470425   0.6948046   0.3079944   2.407689e-02\n19  exposure    outcome 9J8pv4  IyUv6b  NA  rs476828    1.1739083   0.1568550   7.207355e-14\n20  exposure    outcome 9J8pv4  IyUv6b  NA  rs4883723   0.5479721   0.2855004   5.494141e-02\n21  exposure    outcome 9J8pv4  IyUv6b  NA  rs509325    0.5491040   0.1598196   5.908641e-04\n22  exposure    outcome 9J8pv4  IyUv6b  NA  rs55872725  1.3501891   0.1259791   8.419325e-27\n23  exposure    outcome 9J8pv4  IyUv6b  NA  rs6089309   0.5657525   0.3347009   9.096620e-02\n24  exposure    outcome 9J8pv4  IyUv6b  NA  rs6265  0.6457693   0.1901871   6.851804e-04\n25  exposure    outcome 9J8pv4  IyUv6b  NA  rs6736712   0.5606962   0.3448784   1.039966e-01\n26  exposure    outcome 9J8pv4  IyUv6b  NA  rs7560832   0.6032080   0.2904972   3.785077e-02\n27  exposure    outcome 9J8pv4  IyUv6b  NA  rs825486    -0.6152759  0.3500334   7.878772e-02\n28  exposure    outcome 9J8pv4  IyUv6b  NA  rs9348441   -4.9786332  0.2572782   1.992909e-83\n29  exposure    outcome 9J8pv4  IyUv6b  NA  All - Inverse variance weighted 0.5598956   0.2322581   1.592361e-02\n30  exposure    outcome 9J8pv4  IyUv6b  NA  All - MR Egger  1.3337580   0.6948526   6.596064e-02\n

    leave-one-out MR

    res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n\nexposure    outcome id.exposure id.outcome  samplesize  SNP b   se  p\n<chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <dbl>   <dbl>   <dbl>\n1   exposure    outcome 9J8pv4  IyUv6b  NA  rs10198356  0.5562834   0.2424917   2.178871e-02\n2   exposure    outcome 9J8pv4  IyUv6b  NA  rs10209994  0.5520576   0.2388122   2.079526e-02\n3   exposure    outcome 9J8pv4  IyUv6b  NA  rs10824329  0.5585335   0.2390239   1.945341e-02\n4   exposure    outcome 9J8pv4  IyUv6b  NA  rs10938397  0.5412688   0.2388709   2.345460e-02\n5   exposure    outcome 9J8pv4  IyUv6b  NA  rs11066132  0.5580606   0.2417275   2.096381e-02\n6   exposure    outcome 9J8pv4  IyUv6b  NA  rs12522139  0.5667102   0.2395064   1.797373e-02\n7   exposure    outcome 9J8pv4  IyUv6b  NA  rs12591730  0.5524802   0.2390990   2.085075e-02\n8   exposure    outcome 9J8pv4  IyUv6b  NA  rs13013021  0.5189715   0.2386808   2.968017e-02\n9   exposure    outcome 9J8pv4  IyUv6b  NA  rs1955337   0.5602635   0.2394505   1.929468e-02\n10  exposure    outcome 9J8pv4  IyUv6b  NA  rs2076308   0.5431355   0.2394403   2.330758e-02\n11  exposure    outcome 9J8pv4  IyUv6b  NA  rs2278557   0.5583634   0.2394924   1.972992e-02\n12  exposure    outcome 9J8pv4  IyUv6b  NA  rs2304608   0.5372557   0.2377325   2.382639e-02\n13  exposure    outcome 9J8pv4  IyUv6b  NA  rs2531995   0.5419016   0.2379712   2.277590e-02\n14  exposure    outcome 9J8pv4  IyUv6b  NA  rs261967    0.5358761   0.2376686   2.415093e-02\n15  exposure    outcome 9J8pv4  IyUv6b  NA  rs35332469  0.5735907   0.2378345   1.587739e-02\n16  exposure    outcome 9J8pv4  IyUv6b  NA  rs35560038  0.6734906   0.2217804   2.391474e-03\n17  exposure    outcome 9J8pv4  IyUv6b  NA  rs3755804   0.5610215   0.2413249   2.008503e-02\n18  exposure    outcome 9J8pv4  IyUv6b  NA  rs4470425   0.5568993   0.2392632   1.993549e-02\n19  exposure    outcome 9J8pv4  IyUv6b  NA  rs476828    0.5037555   0.2443224   3.922224e-02\n20  exposure    outcome 9J8pv4  IyUv6b  NA  rs4883723   0.5602050   0.2397325   1.945000e-02\n21  exposure    outcome 9J8pv4  IyUv6b  NA  rs509325    0.5608429   0.2468506   2.308693e-02\n22  exposure    outcome 9J8pv4  IyUv6b  NA  rs55872725  0.4419446   0.2454771   7.180543e-02\n23  exposure    outcome 9J8pv4  IyUv6b  NA  rs6089309   0.5597859   0.2388902   1.911519e-02\n24  exposure    outcome 9J8pv4  IyUv6b  NA  rs6265  0.5547068   0.2436910   2.282978e-02\n25  exposure    outcome 9J8pv4  IyUv6b  NA  rs6736712   0.5598815   0.2387602   1.902944e-02\n26  exposure    outcome 9J8pv4  IyUv6b  NA  rs7560832   0.5588113   0.2396229   1.969836e-02\n27  exposure    outcome 9J8pv4  IyUv6b  NA  rs825486    0.5800026   0.2367545   1.429330e-02\n28  exposure    outcome 9J8pv4  IyUv6b  NA  rs9348441   0.7378967   0.1366838   6.717515e-08\n29  exposure    outcome 9J8pv4  IyUv6b  NA  All 0.5598956   0.2322581   1.592361e-02\n
    "},{"location":"16_mendelian_randomization/#visualization","title":"Visualization","text":""},{"location":"16_mendelian_randomization/#scatter-plot","title":"Scatter plot","text":"
    res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
    "},{"location":"16_mendelian_randomization/#single-snp","title":"Single SNP","text":"
    res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
    "},{"location":"16_mendelian_randomization/#leave-one-out","title":"Leave-one-out","text":"
    res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
    "},{"location":"16_mendelian_randomization/#funnel-plot","title":"Funnel plot","text":"
    res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
    "},{"location":"16_mendelian_randomization/#mr-steiger-directionality-test","title":"MR Steiger directionality test","text":"

    MR Steiger directionality test is a method to test the causal direction.

Steiger test: tests whether the SNP-exposure correlation is greater than the SNP-outcome correlation, as expected when the exposure causally affects the outcome.

# get_r_from_lor arguments: log odds ratios, effect allele frequencies,\n# number of cases, number of controls, disease prevalence\nharmonized_data$\"r.outcome\" <- get_r_from_lor(\n  harmonized_data$\"beta.outcome\",\n  harmonized_data$\"eaf.outcome\",\n  45383,   # ncase\n  132032,  # ncontrol\n  0.26,    # prevalence\n  model = \"logit\",\n  correction = FALSE\n)\n\nout <- directionality_test(harmonized_data)\nout\n\nid.exposure id.outcome  exposure    outcome snp_r2.exposure snp_r2.outcome  correct_causal_direction    steiger_pval\n<chr>   <chr>   <chr>   <chr>   <dbl>   <dbl>   <lgl>   <dbl>\nrvi6Om  ETcv15  BMI T2D 0.02125453  0.005496427 TRUE    NA\n

    Reference: Hemani, G., Tilling, K., & Davey Smith, G. (2017). Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS genetics, 13(11), e1007081.

    "},{"location":"16_mendelian_randomization/#mr-base-web-app","title":"MR-Base (web app)","text":"

    MR-Base web app

    "},{"location":"16_mendelian_randomization/#strobe-mr","title":"STROBE-MR","text":"

Before reporting any MR results, please check the STROBE-MR Checklist first, which consists of 20 items that should be addressed when reporting a Mendelian randomization study.

    "},{"location":"16_mendelian_randomization/#references","title":"References","text":""},{"location":"17_colocalization/","title":"Colocalization","text":""},{"location":"17_colocalization/#co-localization","title":"Co-localization","text":""},{"location":"17_colocalization/#coloc-assuming-a-single-causal-variant","title":"Coloc assuming a single causal variant","text":"

Coloc assumes 0 or 1 causal variant for each trait in the region, and tests whether the two traits share the same causal variant.

    Note

Note that this assumption differs from fine-mapping. In fine-mapping, the aim is to identify the putative causal variants, which are determined at birth. In colocalization, the aim is to find overlapping signals that support a causal inference, such as eQTL --> trait. It is possible that the causal variants differ between the two traits.

    Datasets used:

    Result interpretation:

Basically, posterior probabilities are calculated for five configurations:

    ## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf \n##  1.73e-08  7.16e-07  2.61e-05  8.20e-05  1.00e+00 \n## [1] \"PP abf for shared variant: 100%\"\n

    \\(H_0\\): neither trait has a genetic association in the region

    \\(H_1\\): only trait 1 has a genetic association in the region

    \\(H_2\\): only trait 2 has a genetic association in the region

    \\(H_3\\): both traits are associated, but with different causal variants

    \\(H_4\\): both traits are associated and share a single causal variant

PP.H4.abf is the posterior probability that the two traits share the same causal variant.

Then, conditional on H4 being true, a 95% credible set can be constructed (since a shared causal variant does not point to one specific variant).
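For reference, a minimal coloc.abf call that would produce the my.res object used below might look like the following sketch (the input vectors, sample sizes and case fraction are assumptions; see the coloc documentation for the required fields):

library(coloc)\n\nmy.res <- coloc.abf(\n  dataset1 = list(beta = beta1, varbeta = se1^2, snp = snps, type = \"quant\", N = 10000, sdY = 1),\n  dataset2 = list(beta = beta2, varbeta = se2^2, snp = snps, type = \"cc\", s = 0.4, N = 8000)\n)\nmy.res$summary  # PP.H0.abf ... PP.H4.abf\n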

# order variants by their posterior probability of being causal under H4\no <- order(my.res$results$SNP.PP.H4,decreasing=TRUE)\n# cumulative posterior probability in that order\ncs <- cumsum(my.res$results$SNP.PP.H4[o])\n# smallest set of variants reaching 95% cumulative probability\nw <- which(cs > 0.95)[1]\nmy.res$results[o,][1:w,]$snp\n

    References:

    Coloc: a package for colocalisation analyses

    "},{"location":"17_colocalization/#coloc-assuming-multiple-causal-variants-or-multiple-signals","title":"Coloc assuming multiple causal variants or multiple signals","text":"

When the single-causal-variant assumption is violated, several approaches can be used to relax it.

1. Assume multiple causal variants with the SuSiE-Coloc pipeline. In this pipeline, putative causal variants are fine-mapped first, and each signal is then passed to the coloc engine.

2. Perform conditional analysis with the GCTA-COJO-Coloc pipeline. In this pipeline, signals are separated by conditioning and then passed to the coloc engine.

    "},{"location":"17_colocalization/#other-pipelines","title":"Other pipelines","text":"

Many other strategies and pipelines are available for colocalization and for prioritizing variants/genes/traits, for example HyPrColoc and OpenTargets.

    "},{"location":"18_Conditioning_analysis/","title":"Conditioning analysis","text":"

Multiple association signals can exist in one locus, especially when complex LD structures are observed in the regional plot. Conditioning on one signal allows independent signals to be separated.

    Several ways to perform the conditioning analysis:

    "},{"location":"18_Conditioning_analysis/#adding-the-lead-variant-to-the-covariates","title":"Adding the lead variant to the covariates","text":"

First, extract the individual genotypes (dosages) to a text file. Then add them to the covariates.

    plink2 \\\n  --pfile chr1.dose.Rsq0.3 vzs \\\n  --extract chr1.list \\\n  --threads 1 \\\n  --export A \\\n  --out genotype/chr1\n

    The exported format could be found in Export non-PLINK 2 fileset.

    Note

By default, the major allele dosage is exported. If ref-first is added, the REF allele dosage is exported instead. Either works when used as a covariate.

Then paste it into the covariate table and run the association test; a hypothetical merging sketch follows.
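A hypothetical merging sketch in R (file names are assumed; the .raw file exported by --export A starts with FID, IID, PAT, MAT, SEX, PHENOTYPE, followed by one dosage column per variant):

library(data.table)\n\ndosage <- fread(\"genotype/chr1.raw\")  # exported by plink2 --export A\ncovar  <- fread(\"covariates.txt\")     # existing covariate table (assumed file name)\n\n# keep FID, IID and the first variant's dosage column, then merge by sample ID\nmerged <- merge(covar, dosage[, c(1, 2, 7), with = FALSE], by = c(\"FID\", \"IID\"))\nfwrite(merged, \"covariates_with_dosage.txt\", sep = \"\\t\")\n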

    Note

Some association software also provides options for conditional analysis. For example, in PLINK you can use --condition <variant ID> to condition on a single variant, or provide a list of variant IDs to condition on multiple variants.

    "},{"location":"18_Conditioning_analysis/#gcta-cojo","title":"GCTA-COJO","text":"

If raw genotypes and phenotypes are not available, GCTA-COJO performs conditional analysis using summary statistics and an external LD reference.

--cojo-top-SNPs 10 performs a stepwise model selection to select the top 10 independently associated SNPs (including non-significant ones).

    gcta \\\n  --bfile chr1 \\\n  --chr 1 \\\n  --maf 0.001 \\\n  --cojo-file chr1_cojo.input \\\n  --cojo-top-SNPs 10 \\\n  --extract-region-bp 1 152383617 5000 \\\n  --out chr1_cojo.output\n

    Note

bfile is used to estimate LD. A sample size of >4,000 unrelated individuals is suggested. LD estimation in GCTA is based on hard-call genotypes.

Input file format (less chr1_cojo.input):

    ID      ALLELE1 ALLELE0 A1FREQ  BETA    SE      P       N\nchr1:11171:CCTTG:C      C       CCTTG   0.0831407       -0.0459889      0.0710074       0.5172  180590\nchr1:13024:G:A  A       G       1.63957e-05     -3.2714 3.26302 0.3161  180590\n
Here ALLELE1 is the effect allele.

Then --cojo-cond can be used to generate new sumstats conditioned on the selected variant(s).

    Reference:

    "},{"location":"19_ld/","title":"Linkage disequilibrium(LD)","text":""},{"location":"19_ld/#ld-definition","title":"LD Definition","text":"

During meiosis, homologous chromosomes recombine. Recombination rates are not equal across DNA regions, so ancestral haplotype fragments persist and can be detected after tens of generations. This causes linkage disequilibrium (LD), the non-random association of alleles at different loci.

    Factors affecting LD

    "},{"location":"19_ld/#ld-estimation","title":"LD Estimation","text":"

    Suppose we have two SNPs whose alleles are \\(A/a\\) and \\(B/b\\).

    The haplotype frequencies are:

| Haplotype | Frequency |
| --- | --- |
| AB | \(p_{AB}\) |
| Ab | \(p_{Ab}\) |
| aB | \(p_{aB}\) |
| ab | \(p_{ab}\) |

    The allele frequencies are:

| Allele | Frequency |
| --- | --- |
| A | \(p_A=p_{AB}+p_{Ab}\) |
| a | \(p_a=p_{aB}+p_{ab}\) |
| B | \(p_B=p_{AB}+p_{aB}\) |
| b | \(p_b=p_{Ab}+p_{ab}\) |

D: the level of LD between A and B can be estimated using the coefficient of linkage disequilibrium (D), defined as:

    \\[D_{AB} = p_{AB} - p_Ap_B\\]

    If A and B are in linkage equilibrium, we can get

    \\[D_{AB} = p_{AB} - p_Ap_B = 0\\]

    which means the coefficient of linkage disequilibrium is 0 in this case.

    D can be calculated for each pair of alleles and their relationships can be expressed as:

    \\[D_{AB} = -D_{Ab} = -D_{aB} = D_{ab} \\]

    So we can simply denote \\(D = D_{AB}\\), and the relationship between haplotype frequencies and allele frequencies can be summarized in the following table.

| Allele | A | a | Total |
| --- | --- | --- | --- |
| B | \(p_{AB}=p_Ap_B+D\) | \(p_{aB}=p_ap_B-D\) | \(p_B\) |
| b | \(p_{Ab}=p_Ap_b-D\) | \(p_{ab}=p_ap_b+D\) | \(p_b\) |
| Total | \(p_A\) | \(p_a\) | 1 |

The range of possible values of D depends on the allele frequencies, which makes D unsuitable for comparisons between different pairs of alleles.

Lewontin suggested a method for the normalization of D (the resulting measure is commonly denoted \(D'\)):

    \\[D_{normalized} = {{D}\\over{D_{max}}}\\]

    where

    \\[ D_{max} = \\begin{cases} max\\{-p_Ap_B, -(1-p_A)(1-p_B)\\} & \\text{when } D \\lt 0 \\\\ min\\{ p_A(1-p_B), p_B(1-p_A) \\} & \\text{when } D \\gt 0 \\\\ \\end{cases} \\]

It measures the proportion of haplotypes that have undergone recombination.

    In practice, the most commonly used alternative metric to \\(D_{normalized}\\) is \\(r^2\\), the correlation coefficient, which can be obtained by:

    \\[ r^2 = {{D^2}\\over{p_A(1-p_A)p_B(1-p_B)}} \\]
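A small R sketch computing \(D\), \(D_{normalized}\) (\(D'\)) and \(r^2\) from assumed haplotype frequencies:

p_AB <- 0.4; p_Ab <- 0.1; p_aB <- 0.1; p_ab <- 0.4  # assumed haplotype frequencies\np_A <- p_AB + p_Ab\np_B <- p_AB + p_aB\n\nD <- p_AB - p_A * p_B  # 0.15\nD_max <- if (D < 0) max(-p_A * p_B, -(1 - p_A) * (1 - p_B)) else min(p_A * (1 - p_B), p_B * (1 - p_A))\nD_prime <- D / D_max   # 0.6\nr2 <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))  # 0.36\n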

    Reference: Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485.

    "},{"location":"19_ld/#ld-calculation-using-software","title":"LD Calculation using software","text":""},{"location":"19_ld/#ldstore2","title":"LDstore2","text":"

    LDstore2: http://www.christianbenner.com/#

Reference: Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).

    "},{"location":"19_ld/#plink-ld","title":"PLINK LD","text":"

    Please check Calculate LD using PLINK.

    "},{"location":"19_ld/#ld-lookup-using-ldlink","title":"LD Lookup using LDlink","text":"

    LDlink

    LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.

    https://ldlink.nci.nih.gov/?tab=home

    Reference: Machiela, M. J., & Chanock, S. J. (2015). LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31(21), 3555-3557.

    LDlink is a very useful tool for quick lookups of any information related to LD.

    "},{"location":"19_ld/#ldlink-ldpair","title":"LDlink-LDpair","text":"

    LDpair

    "},{"location":"19_ld/#ldlink-ldproxy","title":"LDlink-LDproxy","text":"

    LDproxy for rs671

    "},{"location":"19_ld/#query-in-batch-using-ldlink-api","title":"Query in batch using LDlink API","text":"

LDlink provides an API for queries from the command line.

    You need to register and get a token first.

    https://ldlink.nci.nih.gov/?tab=apiaccess

    Query LD proxies for variants using LDproxy API

curl -k -X GET 'https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs3&pop=MXL&r2_d=r2&window=500000&genome_build=grch37&token=faketoken123'\n
    "},{"location":"19_ld/#ldlinkr","title":"LDlinkR","text":"

    There is also a related R package for LDlink.

    Query LD proxies for variants using LDlinkR

    install.packages(\"LDlinkR\")\n\nlibrary(LDlinkR)\n\nmy_proxies <- LDproxy(snp = \"rs671\", \n                      pop = \"EAS\", \n                      r2d = \"r2\", \n                      token = \"YourTokenHere123\",\n                      genome_build = \"grch38\"\n                     )\n

    Reference: Myers, T. A., Chanock, S. J., & Machiela, M. J. (2020). LDlinkR: an R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Frontiers in genetics, 11, 157.

    "},{"location":"19_ld/#ld-pruning","title":"LD-pruning","text":"

    Please check LD-pruning

    "},{"location":"19_ld/#ld-clumping","title":"LD-clumping","text":"

    Please check LD-clumping

    "},{"location":"19_ld/#ld-score","title":"LD score","text":"

    Definition: https://cloufield.github.io/GWASTutorial/08_LDSC/#ld-score

    "},{"location":"19_ld/#ldsc","title":"LDSC","text":"

    LD score can be estimated with LDSC using PLINK format genotype data as the reference panel.

plinkPrefix=chr22\n\npython ldsc.py \\\n    --bfile ${plinkPrefix} \\\n    --l2 \\\n    --ld-wind-cm 1 \\\n    --out ${plinkPrefix}\n

    Check here for details.

    "},{"location":"19_ld/#gcta","title":"GCTA","text":"

    GCTA also provides a function to estimate LD scores using PLINK format genotype data.

    plinkPrefix=chr22\n\ngcta64 \\\n    --bfile  ${plinkPrefix} \\\n    --ld-score \\\n    --ld-wind 1000 \\\n    --ld-rsq-cutoff 0.01 \\\n    --out  ${plinkPrefix}\n

    Check here for details.

    "},{"location":"19_ld/#ld-score-regression","title":"LD score regression","text":"

    Please check LD score regression

    "},{"location":"19_ld/#reference","title":"Reference","text":""},{"location":"20_power_analysis/","title":"Power analysis for GWAS","text":""},{"location":"20_power_analysis/#type-i-type-ii-errors-and-statistical-power","title":"Type I, type II errors and Statistical power","text":"

    This table shows the relationship between the null hypothesis \\(H_0\\) and the results of a statistical test (whether or not to reject the null hypothesis \\(H_0\\) ).

| Decision | H0 is True | H0 is False |
| --- | --- | --- |
| Do Not Reject | True negative : \(1 - \alpha\) | Type II error (false negative) : \(\beta\) |
| Reject | Type I error (false positive) : \(\alpha\) | True positive : \(1 - \beta\) |

    \\(\\alpha\\) : significance level

    By definition, the statistical power of a test refers to the probability that the test will correctly reject the null hypothesis, namely the True positive rate in the table above.

\(Power = Pr(Reject\ H_0 \mid H_0\ is\ False) = 1 - \beta\)

    Power

    Factors affecting power

    "},{"location":"20_power_analysis/#non-centrality-parameter","title":"Non-centrality parameter","text":"

The non-centrality parameter (NCP) describes the degree of difference between the alternative hypothesis \(H_1\) and the null hypothesis \(H_0\).

    Consider a simple linear regression model:

    \\[y = \\mu +\\beta x + \\epsilon\\]

    The variance of the error term:

    \\[\\sigma^2 = Var(y) - Var(x)\\beta^2\\]

Usually, the phenotypic variance that a single SNP can explain is very limited, so we can approximate \(\sigma^2\) by:

    \\[ \\sigma^2 \\thickapprox Var(y)\\]

    Under Hardy-Weinberg equilibrium, we can get:

    \\[Var(x) = 2f(1-f)\\]

So the non-centrality parameter (NCP) \(\lambda\) for the \(\chi^2\) distribution with 1 degree of freedom is:

    \\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2\\]"},{"location":"20_power_analysis/#power-for-quantitative-traits","title":"Power for quantitative traits","text":"\\[ \\lambda = ({{\\beta}\\over{SE_{\\beta}}})^2 \\thickapprox N \\times {{Var(x)\\beta^2}\\over{\\sigma^2}} \\thickapprox N \\times {{2f(1-f) \\beta^2 }\\over {Var(y)}} \\]

    Significance threshold: \\(C = CDF_{\\chi^2}^{-1}(1 - \\alpha,df=1)\\)

    \\[ Power = Pr(\\lambda > C ) = 1 - CDF_{\\chi^2}(C, ncp = \\lambda,df=1) \\] "},{"location":"20_power_analysis/#power-for-large-scale-case-control-genome-wide-association-studies","title":"Power for large-scale case-control genome-wide association studies","text":"

    Denote :

    Null hypothesis : \\(P_{case} = P_{control}\\)

    To test whether one proportion \\(P_{case}\\) equals the other proportion \\(P_{control}\\), the test statistic is:

    \\[z = {{P_{case} - P_{control}}\\over {\\sqrt{ {{P_{case}(1 - P_{case})}\\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\\over{2N_{control}}} }}}\\]

    Significance threshold: \\(C = \\Phi^{-1}(1 - \\alpha / 2 )\\)

\[ Power = Pr(|Z|>C) = 1 - \Phi(C-z) + \Phi(-C-z)\]
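A short R sketch of this calculation (allele frequencies and sample sizes are assumed values):

power_cc <- function(p_case, p_ctrl, n_case, n_ctrl, alpha = 5e-8) {\n  se <- sqrt(p_case * (1 - p_case) / (2 * n_case) + p_ctrl * (1 - p_ctrl) / (2 * n_ctrl))\n  z  <- (p_case - p_ctrl) / se\n  c_thresh <- qnorm(1 - alpha / 2)\n  1 - pnorm(c_thresh - z) + pnorm(-c_thresh - z)\n}\n\npower_cc(p_case = 0.32, p_ctrl = 0.30, n_case = 50000, n_ctrl = 50000)\n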

    GAS power calculator

The GAS power calculator implements this method; you can easily calculate power using its website.

    "},{"location":"20_power_analysis/#reference","title":"Reference:","text":""},{"location":"21_twas/","title":"TWAS","text":""},{"location":"21_twas/#background","title":"Background","text":"

    Most variants identified in GWAS are located in regulatory regions, and these genetic variants could potentially affect complex traits through gene expression.

However, due to limited sample sizes and high costs, it is difficult to measure gene expression at a large scale. Consequently, many expression-trait associations have not been detected, especially those with small effect sizes.

    To address these issues, alternative approaches have been proposed and transcriptome-wide association study (TWAS) has become a common and easy-to-perform approach to identify genes whose expression is significantly associated with complex traits in individuals without directly measured expression levels.

    GWAS and TWAS

    "},{"location":"21_twas/#definition","title":"Definition","text":"

    TWAS is a method to identify significant expression-trait associations using expression imputation from genetic data or summary statistics.

    Individual-level and summary-level TWAS

    "},{"location":"21_twas/#fusion","title":"FUSION","text":"

    In this tutorial, we will introduce FUSION, which is one of the most commonly used tools for performing transcriptome-wide association studies (TWAS) using summary-level data.

    url : http://gusevlab.org/projects/fusion/

    FUSION trains predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. (http://gusevlab.org/projects/fusion/)

    Quote

    Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.

    "},{"location":"21_twas/#algorithm-for-imputing-expression-into-gwas-summary-statistics","title":"Algorithm for imputing expression into GWAS summary statistics","text":"

The ImpG-Summary algorithm was extended to impute the Z scores of the cis genetic component of expression.

    FUSION statistical model

    \\(Z\\) : a vector of standardized effect sizes (z scores) of SNPs for the target trait at a given locus

We impute the Z score of the expression-trait association as a linear combination of the elements of \(Z\) with weights \(W\).

    \\[ W = \\Sigma_{e,s}\\Sigma_{s,s}^{-1} \\]

    Both \\(\\Sigma_{e,s}\\) and \\(\\Sigma_{s,s}\\) are estimated from reference datsets.

    \\[ Z \\sim N(0, \\Sigma_{s,s} ) \\]

    The variance of \\(WZ\\) (imputed z score of expression and trait)

    \\[ Var(WZ) = W\\Sigma_{s,s}W^t \\]

The imputed Z score can be obtained by:

    \\[ {{WZ}\\over{W\\Sigma_{s,s}W^t}^{1/2}} \\]

    ImpG-Summary algorithm

    Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., ... & Price, A. L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.

    "},{"location":"21_twas/#installation","title":"Installation","text":"

    Download FUSION from github and install

    wget https://github.com/gusevlab/fusion_twas/archive/master.zip\nunzip master.zip\ncd fusion_twas-master\n

    Download and unzip the LD reference data (1000 genome)

    wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2\ntar xjvf LDREF.tar.bz2\n

    Download and unzip plink2R

    wget https://github.com/gabraham/plink2R/archive/master.zip\nunzip master.zip\n

    Install R packages

    # R >= 4.0\nR\n\ninstall.packages(c('optparse','RColorBrewer'))\ninstall.packages('plink2R-master/plink2R/',repos=NULL)\n

    "},{"location":"21_twas/#example","title":"Example","text":"

    FUSION framework

    Input:

    1. GWAS summary statistics (in LDSC format)
    2. pre-computed gene expression weights (from http://gusevlab.org/projects/fusion/)

Input GWAS sumstats format

    1. SNP (rsID)
    2. A1 (effect allele)
    3. A2 (non-effect allele)
    4. Z (Z score)

    Example:

    SNP A1  A2  N   CHISQ   Z\nrs6671356   C   T   70100.0 0.172612905312  0.415467092935\nrs6604968   G   A   70100.0 0.291125788806  0.539560736902\nrs4970405   A   G   70100.0 0.102204513891  0.319694407037\nrs12726255  G   A   70100.0 0.312418295691  0.558943911042\nrs4970409   G   A   70100.0 0.0524226849517 0.228960007319\n

    Get sample sumstats and weights

    wget https://data.broadinstitute.org/alkesgroup/FUSION/SUM/PGC2.SCZ.sumstats\n\nmkdir WEIGHTS\ncd WEIGHTS\nwget https://data.broadinstitute.org/alkesgroup/FUSION/WGT/GTEx.Whole_Blood.tar.bz2\ntar xjf GTEx.Whole_Blood.tar.bz2\n

    WEIGHTS

    files in each WEIGHTS folder

    RDat weight files for each gene in a tissue type

    GTEx.Whole_Blood.ENSG00000002549.8.LAP3.wgt.RDat         GTEx.Whole_Blood.ENSG00000166394.10.CYB5R2.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002822.11.MAD1L1.wgt.RDat      GTEx.Whole_Blood.ENSG00000166435.11.XRRA1.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002919.10.SNX11.wgt.RDat       GTEx.Whole_Blood.ENSG00000166436.11.TRIM66.wgt.RDat\nGTEx.Whole_Blood.ENSG00000002933.3.TMEM176A.wgt.RDat     GTEx.Whole_Blood.ENSG00000166444.13.ST5.wgt.RDat\nGTEx.Whole_Blood.ENSG00000003137.4.CYP26B1.wgt.RDat      GTEx.Whole_Blood.ENSG00000166471.6.TMEM41B.wgt.RDat\n...\n

    Expression imputation

    Rscript FUSION.assoc_test.R \\\n--sumstats PGC2.SCZ.sumstats \\\n--weights ./WEIGHTS/GTEx.Whole_Blood.pos \\\n--weights_dir ./WEIGHTS/ \\\n--ref_ld_chr ./LDREF/1000G.EUR. \\\n--chr 22 \\\n--out PGC2.SCZ.22.dat\n

    Results

    head PGC2.SCZ.22.dat\nPANEL   FILE    ID  CHR P0  P1  HSQ BEST.GWAS.ID    BEST.GWAS.Z EQTL.ID EQTL.R2 EQTL.Z  EQTL.GWAS.Z NSNP    NWGT    MODEL   MODELCV.R2  MODELCV.PV  TWAS.Z  TWAS.P\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000273311.1.DGCR11.wgt.RDat DGCR11  22  19033675    19035888    0.0551  rs2238767   -2.98   rs2283641    0.013728     4.33   2.5818 408  1  top1    0.014   0.018    2.5818 9.83e-03\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000100075.5.SLC25A1.wgt.RDat    SLC25A1 22  19163095    19166343    0.0740  rs2238767   -2.98   rs762523     0.080367     5.36  -1.8211 406  1  top1    0.08    7.2e-08 -1.8216.86e-02\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000070371.11.CLTCL1.wgt.RDat    CLTCL1  22  19166986    19279239    0.1620  rs4819843    3.04   rs809901     0.072193     5.53  -1.9928 456 19  enet    0.085   2.8e-08 -1.8806.00e-02\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000232926.1.AC000078.5.wgt.RDat AC000078.5  22  19874812    19875493    0.2226  rs5748555   -3.15   rs13057784   0.052796     5.60  -0.1652 514 44  enet    0.099   2e-09  0.0524   9.58e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000185252.13.ZNF74.wgt.RDat ZNF74   22  20748405    20762745    0.1120  rs595272     4.09   rs1005640    0.001422     3.44  -1.3677 301  8  enet    0.008   0.054   -0.8550 3.93e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000099940.7.SNAP29.wgt.RDat SNAP29  22  21213771    21245506    0.1286  rs595272     4.09   rs4820575    0.061763     5.94  -1.1978 416 27  enet    0.079   9.4e-08 -1.0354 3.00e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000272600.1.AC007308.7.wgt.RDat AC007308.7  22  21243494    21245502    0.2076  rs595272     4.09   rs165783     0.100625     6.79  -0.8871 408 12  lasso   0.16    5.4e-1-1.2049   2.28e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000183773.11.AIFM3.wgt.RDat AIFM3   22  21319396    21335649    0.0676  rs595272     4.09   rs565979     0.036672     4.50  -0.4474 362  1  top1    0.037   0.00024 -0.4474 6.55e-01\nNA  ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000230513.1.THAP7-AS1.wgt.RDat  THAP7-AS1   22  21356175    21357118    0.2382  rs595272     4.09   rs2239961    0.105307    -7.04  -0.3783 347  5  lasso   0.15    7.6e-1 0.2292   8.19e-01\n

    Descriptions of the output (cited from http://gusevlab.org/projects/fusion/ )

| Column number | Column header | Value | Usage |
| --- | --- | --- | --- |
| 1 | FILE | \u2026 | Full path to the reference weight file used |
| 2 | ID | FAM109B | Feature/gene identifier, taken from --weights file |
| 3 | CHR | 22 | Chromosome |
| 4 | P0 | 42470255 | Gene start (from --weights) |
| 5 | P1 | 42475445 | Gene end (from --weights) |
| 6 | HSQ | 0.0447 | Heritability of the gene |
| 7 | BEST.GWAS.ID | rs1023500 | rsID of the most significant GWAS SNP in locus |
| 8 | BEST.GWAS.Z | -5.94 | Z-score of the most significant GWAS SNP in locus |
| 9 | EQTL.ID | rs5758566 | rsID of the best eQTL in the locus |
| 10 | EQTL.R2 | 0.058680 | cross-validation R2 of the best eQTL in the locus |
| 11 | EQTL.Z | -5.16 | Z-score of the best eQTL in the locus |
| 12 | EQTL.GWAS.Z | -5.0835 | GWAS Z-score for this eQTL |
| 13 | NSNP | 327 | Number of SNPs in the locus |
| 14 | MODEL | lasso | Best performing model |
| 15 | MODELCV.R2 | 0.058870 | cross-validation R2 of the best performing model |
| 16 | MODELCV.PV | 3.94e-06 | cross-validation P-value of the best performing model |
| 17 | TWAS.Z | 5.1100 | TWAS Z-score (our primary statistic of interest) |
| 18 | TWAS.P | 3.22e-07 | TWAS P-value |
"},{"location":"21_twas/#limitations","title":"Limitations","text":"
1. Significant loci identified in TWAS may contain multiple trait-associated genes. GWAS often identifies multiple variants in LD; similarly, TWAS frequently identifies multiple genes in a locus.

    2. Co-regulation may cause false positive results. Just like SNPs are correlated due to LD, gene expressions are often correlated due to co-regulation.

    3. Sometimes even when co-regulation is not captured, the shared variants (or variants in strong LD) in different expression prediction models may cause false positive results.

4. Predicted expression accounts for only a limited portion of total gene expression. Total expression is affected not only by genetic components such as cis-eQTLs but also by environmental and technical factors.

    5. Other factors. For example, the window size for selecting variants may affect association results.

    "},{"location":"21_twas/#criticism","title":"Criticism","text":"

TWAS aims to test the relationship of the phenotype with the genetic component of gene expression. But under the current framework, TWAS only tests the relationship of the phenotype with the predicted gene expression, without accounting for the uncertainty in that prediction. The key point is that the current framework omits the fact that the gene expression data are themselves the result of a sampling process.

    \"Consequently, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level.\"

    Quote

    de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08.

    "},{"location":"21_twas/#reference","title":"Reference","text":""},{"location":"32_whole_genome_regression/","title":"Whole-genome regression : REGENIE","text":""},{"location":"32_whole_genome_regression/#concepts","title":"Concepts","text":""},{"location":"32_whole_genome_regression/#overview","title":"Overview","text":"

    Overview of REGENIE

    Reference: https://rgcgithub.github.io/regenie/overview/

    "},{"location":"32_whole_genome_regression/#whole-genome-model","title":"Whole genome model","text":""},{"location":"32_whole_genome_regression/#stacked-regressions","title":"Stacked regressions","text":""},{"location":"32_whole_genome_regression/#firth-correction","title":"Firth correction","text":""},{"location":"32_whole_genome_regression/#tutorial","title":"Tutorial","text":""},{"location":"32_whole_genome_regression/#installation","title":"Installation","text":"

    Please check here

    "},{"location":"32_whole_genome_regression/#step1","title":"Step1","text":"

    Sample codes for running step 1

    plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\n# revise the header of covariate file\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n  --step 1 \\\n  --bed ${plinkFile} \\\n  --extract ${extract} \\\n  --phenoFile ${phenoFile} \\\n  --covarFile ${covarFile} \\\n  --covarColList ${covarList} \\\n  --bt \\\n  --bsize 1000 \\\n  --lowmem \\\n  --lowmem-prefix tmpdir/regenie_tmp_preds \\\n  --out 1kg_eas_step1_BT\n
    "},{"location":"32_whole_genome_regression/#step2","title":"Step2","text":"

    Sample codes for running step 2

    plinkFile=../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020\nphenoFile=../01_Dataset/1kgeas_binary_regenie.txt\ncovarFile=../05_PCA/plink_results_projected.sscore\ncovarList=\"PC1_AVG,PC2_AVG,PC3_AVG,PC4_AVG,PC5_AVG,PC6_AVG,PC7_AVG,PC8_AVG,PC9_AVG,PC10_AVG\"\nextract=../05_PCA/plink_results.prune.in\n\nsed -i 's/#FID/FID/' ../05_PCA/plink_results_projected.sscore\nmkdir tmpdir\n\nregenie \\\n  --step 2 \\\n  --bed ${plinkFile} \\\n  --ref-first \\\n  --phenoFile ${phenoFile} \\\n  --covarFile ${covarFile} \\\n  --covarColList ${covarList} \\\n  --bt \\\n  --bsize 400 \\\n  --firth --approx --pThresh 0.01 \\\n  --pred 1kg_eas_step1_BT_pred.list \\\n  --out 1kg_eas_step1_BT\n
    "},{"location":"32_whole_genome_regression/#visualization","title":"Visualization","text":""},{"location":"32_whole_genome_regression/#reference","title":"Reference","text":""},{"location":"55_measure_of_effect/","title":"Measure of effect","text":""},{"location":"55_measure_of_effect/#concepts","title":"Concepts","text":""},{"location":"55_measure_of_effect/#risk","title":"Risk","text":"

    Risk: the probability that a subject within a population will develop a given disease, or other health outcome, over a specified follow-up period.

    \\[ R = {{E}\\over{E + N}} \\] "},{"location":"55_measure_of_effect/#odds","title":"Odds","text":"

    Odds: the likelihood of a new event occurring rather than not occurring. It is the probability that an event will occur divided by the probability that the event will not occur.

    \\[ Odds = {E \\over N } \\]"},{"location":"55_measure_of_effect/#hazard","title":"Hazard","text":"

    Hazard function \\(h(t)\\): the event rate at time \\(t\\) conditional on survival until time \\(t\\) (namely, \\(T\u2265t\\))

\[ h(t) = Pr(t \le T < t+1 \mid T \ge t) \]

\(T\) is a discrete random variable indicating the time of occurrence of the event.

    "},{"location":"55_measure_of_effect/#relative-risk-rr-and-odds-ratio-or","title":"Relative risk (RR) and Odds ratio (OR)","text":""},{"location":"55_measure_of_effect/#22-contingency-table","title":"2\u00d72 Contingency Table","text":"Intervention I Control C Events E IE CE Non-events N IN CN"},{"location":"55_measure_of_effect/#relative-risk-rr","title":"Relative risk (RR)","text":"

    RR: relative risk (risk ratio), usually used in cohort studies.

\[ RR = {{R_{Intervention}}\over{R_{Control}}}={{IE/(IE+IN)}\over{CE/(CE+CN)}} \]"},{"location":"55_measure_of_effect/#odds-ratio-or","title":"Odds ratio (OR)","text":"

OR: usually used in case-control studies.

\[ OR = {{Odds_{Intervention}}\over{Odds_{Control}}}={{IE/IN}\over{CE/CN}} = {{IE * CN}\over{CE * IN}} \]

    When the event occurs in less than 10% of the unexposed population, the OR provides a reasonable approximation of the RR.
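A toy 2×2 example in R (the counts are assumed) illustrating both measures; note how the OR diverges from the RR here because the event is common:

IE <- 30; IN <- 70  # intervention group: events / non-events (assumed)\nCE <- 15; CN <- 85  # control group: events / non-events (assumed)\n\nRR <- (IE / (IE + IN)) / (CE / (CE + CN))  # 2.0\nOR <- (IE * CN) / (CE * IN)                # ~2.43\nc(RR = RR, OR = OR)\n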

    "},{"location":"55_measure_of_effect/#hazard-ratios-hr","title":"Hazard ratios (HR)","text":"

Hazard ratios (relative hazards) are usually estimated from the Cox proportional hazards model:

    \\[ h_i(t) = h_0(t) \\times e^{\\beta_0 + \\beta_1X_{i1} + ... + \\beta_nX_{in} } = h_0(t) \\times e^{X_i\\beta } \\]

    HR: the ratio of the hazard rates corresponding to the conditions characterised by two distinct levels of a treatment variable of interest.

    \\[ HR = {{h(t | X_i)}\\over{h(t|X_j)}} = {{h_0(t) \\times e^{X_i\\beta }}\\over{h_0(t) \\times e^{X_j\\beta }}} = e^{(X_i-X_j)\\beta} \\]"},{"location":"60_awk/","title":"AWK","text":""},{"location":"60_awk/#awk-introduction","title":"AWK Introduction","text":"

    'awk' is one of the most powerful text processing tools for tabular text files.

    "},{"location":"60_awk/#awk-syntax","title":"AWK syntax","text":"
    awk OPTION 'CONDITION {PROCESS}' FILENAME\n

    Some special variables in awk:

    "},{"location":"60_awk/#examples","title":"Examples","text":"

    Using the sample sumstats, we will demonstrate some simple but useful one-liners.

# sample sumstats\nhead ../02_Linux_basics/sumstats.txt \n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872    2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238    2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055    1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036    0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"60_awk/#example-1","title":"Example 1","text":"

    Select variants on chromosome 2 (keeping the headers)

awk 'NR==1 ||  $1==2 {print $0}' ../02_Linux_basics/sumstats.txt | head\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n2   22398   2:22398:C:T C   T   T   ADD 503 1.28754 0.161017    1.56962 0.116503    .\n2   24839   2:24839:C:T C   T   T   ADD 503 1.31817 0.179754    1.53679 0.124344    .\n2   26844   2:26844:C:T C   T   T   ADD 503 1.3173  0.161302    1.70851 0.0875413   .\n2   28786   2:28786:T:C T   C   C   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   30091   2:30091:C:G C   G   G   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   30762   2:30762:A:G A   G   A   ADD 503 1.09956 0.158614    0.598369    0.549594    .\n2   34503   2:34503:G:T G   T   T   ADD 503 1.32372 0.179789    1.55988 0.118789    .\n2   39340   2:39340:A:G A   G   G   ADD 503 1.3043  0.161184    1.64822 0.0993082   .\n2   55237   2:55237:T:C T   C   C   ADD 503 1.31486 0.161988    1.68983 0.0910614   .\n

NR here means the record (row) number. The condition NR==1 || $1==2 means: if it is the first row (the header) or the first column equals 2, run the process print $0, which prints all columns.

    "},{"location":"60_awk/#example-2","title":"Example 2","text":"

    Select all genome-wide significant variants (p<5e-8)

awk 'NR==1 ||  $12 <5e-8 {print $0}' ../02_Linux_basics/sumstats.txt | head\n
Note that in this sumstats file P is the 12th column ($12) and ERRCODE is $13. Because the ERRCODE value "." compares as 0, filtering on $13 would wrongly print every line; with $12, the output contains the header plus any variants passing the threshold.
    "},{"location":"60_awk/#example-3","title":"Example 3","text":"

    Create a bed-like format for annotation

    awk 'NR>1 {print $1,$2,$2,$4,$5}' ../02_Linux_basics/sumstats.txt | head\n1 13273 13273 G C\n1 14599 14599 T A\n1 14604 14604 A G\n1 14930 14930 A G\n1 69897 69897 T C\n1 86331 86331 A G\n1 91581 91581 G A\n1 122872 122872 T G\n1 135163 135163 C T\n1 233473 233473 C G\n
    "},{"location":"60_awk/#awk-workflow","title":"AWK workflow","text":"

    The workflow of awk can be summarized in the following figure:

    awk workflow

    "},{"location":"60_awk/#awk-variables","title":"AWK variables","text":"

    Frequently used awk variables

| Variable | Description |
| --- | --- |
| NR | The number of input records |
| NF | The number of input fields |
| FS | The input field separator. The default value is \" \" |
| OFS | The output field separator. The default value is \" \" |
| RS | The input record separator. The default value is \"\\n\" |
| ORS | The output record separator. The default value is \"\\n\" |
| FILENAME | The name of the current input file |
| FNR | The current record number in the current file |

    Handle csv and tsv files

    head ../03_Data_formats/sample_data.csv\n#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE\n1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.\n1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.\n1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.\n1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.\n1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.\n1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.\n1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.\n
    awk -v FS=',' -v OFS=\"\\t\" '{print $1,$2}' sample_data.csv\n#CHROM  POS\n1       13273\n1       14599\n1       14604\n1       14930\n1       69897\n1       86331\n1       91581\n1       122872\n1       135163\n

    Convert csv to tsv

    awk 'BEGIN { FS=\",\"; OFS=\"\\t\" } {$1=$1; print}' sample_data.csv\n

    Skip and replace headers

    awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"CHR\\tPOS\"} NR>1 {print $1,$2}' sample_data.csv\n\nCHR     POS\n1       13273\n1       14599\n1       14604\n1       14930\n1       69897\n1       86331\n1       91581\n1       122872\n1       135163\n

    Extract a line

    awk 'NR==4' sample_data.csv\n\n1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.\n

    Print the last two columns

    awk -v FS=',' '{print $(NF-1),$(NF)}' sample_data.csv\nP ERRCODE\n0.305961 .\n0.0104299 .\n0.0104299 .\n0.0269602 .\n0.0188466 .\n0.102694 .\n0.522847 .\n0.703856 .\n0.155079 .\n
    "},{"location":"60_awk/#awk-operators","title":"AWK operators","text":"

    Arithmetic Operators

    Arithmetic Operators Description + add - subtract * multiply / divide % modulus (remainder) ** x**y : x raised to the y-th power

    Logical Operators

    Logical Operators Description \|\| or && and ! not"},{"location":"60_awk/#awk-functions","title":"AWK functions","text":"

    Numeric functions in awk

    Convert OR and P to BETA and -log10(P)

    awk -v FS=',' -v OFS=\"\\t\" 'BEGIN{print \"SNPID\\tBETA\\tMLOG10P\"}NR>1{print $3,log($10),-log($13)/log(10)}' sample_data.csv\nSNPID   BETA    MLOG10P\n1:13273:G:C     -0.287458       0.514334\n1:14599:T:A     0.593172        1.98172\n1:14604:A:G     0.593172        1.98172\n1:14930:A:G     0.531446        1.56928\n1:69897:T:C     0.457438        1.72477\n1:86331:A:G     0.385303        0.988455\n1:91581:G:A     -0.0785866      0.281625\n1:122872:T:G    0.0687142       0.152516\n1:135163:C:T    -0.339927       0.809447\n

    String manipulating functions in awk
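
    Frequently used awk string functions:

    Function Description length(string) return the length of the string substr(string, start, length) extract a substring toupper(string) convert to uppercase tolower(string) convert to lowercase index(string, target) return the position of target within string (0 if absent) split(string, array, separator) split the string into an array sub(regex, replacement, string) replace the first match gsub(regex, replacement, string) replace all matches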

    "},{"location":"60_awk/#awk-options","title":"AWK options","text":"
    $ awk --help\nUsage: awk [POSIX or GNU style options] -f progfile [--] file ...\nUsage: awk [POSIX or GNU style options] [--] 'program' file ...\nPOSIX options:          GNU long options: (standard)\n        -f progfile             --file=progfile\n        -F fs                   --field-separator=fs\n        -v var=val              --assign=var=val\nShort options:          GNU long options: (extensions)\n        -b                      --characters-as-bytes\n        -c                      --traditional\n        -C                      --copyright\n        -d[file]                --dump-variables[=file]\n        -D[file]                --debug[=file]\n        -e 'program-text'       --source='program-text'\n        -E file                 --exec=file\n        -g                      --gen-pot\n        -h                      --help\n        -i includefile          --include=includefile\n        -l library              --load=library\n        -L[fatal|invalid]       --lint[=fatal|invalid]\n        -M                      --bignum\n        -N                      --use-lc-numeric\n        -n                      --non-decimal-data\n        -o[file]                --pretty-print[=file]\n        -O                      --optimize\n        -p[file]                --profile[=file]\n        -P                      --posix\n        -r                      --re-interval\n        -S                      --sandbox\n        -t                      --lint-old\n        -V                      --version\n\nTo report bugs, see node `Bugs' in `gawk.info', which is\nsection `Reporting Problems and Bugs' in the printed version.\n\ngawk is a pattern scanning and processing language.\nBy default it reads standard input and writes standard output.\n\nExamples:\n        gawk '{ sum += $1 }; END { print sum }' file\n        gawk -F: '{ print $1 }' /etc/passwd\n
    "},{"location":"60_awk/#reference","title":"Reference","text":""},{"location":"61_sed/","title":"sed","text":"

    sed is also one of the most commonly used text-editing commands in Linux; its name is short for stream editor. The sed command edits text from standard input line by line.

    "},{"location":"61_sed/#sed-syntax","title":"sed syntax","text":"
    sed [OPTIONS] PROCESS [FILENAME]\n
    "},{"location":"61_sed/#examples","title":"Examples","text":""},{"location":"61_sed/#sample-input","title":"sample input","text":"
    head ../02_Linux_basics/sumstats.txt\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872    2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238    2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055    1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036    0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"61_sed/#example-1-replacing-strings","title":"Example 1: Replacing strings","text":"

    s for substitute, g for global

    Replacing strings

    \"Replace the separator from : to _\"

    head ../02_Linux_basics/sumstats.txt | sed 's/:/_/g'\n#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE\n1   13273   1_13273_G_C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1_14599_T_A T   A   A   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14604   1_14604_A_G A   G   G   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14930   1_14930_A_G A   G   G   ADD 503 1.64359 0.242872    2.04585 0.0407708   .\n1   69897   1_69897_T_C T   C   T   ADD 503 1.69142 0.200238    2.62471 0.00867216  .\n1   86331   1_86331_A_G A   G   G   ADD 503 1.41887 0.238055    1.46968 0.141649    .\n1   91581   1_91581_G_A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1_122872_T_G    T   G   G   ADD 503 1.04828 0.182036    0.259034    0.795609    .\n1   135163  1_135163_C_T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n

    "},{"location":"61_sed/#example-2-delete-headerthe-first-line","title":"Example 2: Delete header(the first line)","text":"

    d for deletion

    Delete the header (the first line)

    head ../02_Linux_basics/sumstats.txt | sed '1d'\n1   13273   1:13273:G:C G   C   C   ADD 503 0.746149    0.282904    -1.03509    0.300628    .\n1   14599   1:14599:T:A T   A   A   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14604   1:14604:A:G A   G   G   ADD 503 1.67693 0.240899    2.14598 0.0318742   .\n1   14930   1:14930:A:G A   G   G   ADD 503 1.64359 0.242872    2.04585 0.0407708   .\n1   69897   1:69897:T:C T   C   T   ADD 503 1.69142 0.200238    2.62471 0.00867216  .\n1   86331   1:86331:A:G A   G   G   ADD 503 1.41887 0.238055    1.46968 0.141649    .\n1   91581   1:91581:G:A G   A   A   ADD 503 0.931304    0.123644    -0.575598   0.564887    .\n1   122872  1:122872:T:G    T   G   G   ADD 503 1.04828 0.182036    0.259034    0.795609    .\n1   135163  1:135163:C:T    C   T   T   ADD 503 0.676666    0.242611    -1.60989    0.107422    .\n
    "},{"location":"69_resources/","title":"Resources","text":""},{"location":"69_resources/#sandbox","title":"Sandbox","text":"

    Sandbox provides tutorials for you to learn how to use bioinformatics tools right from your browser. Everything runs in a sandbox, so you can experiment all you want.

    "},{"location":"69_resources/#explain-shell","title":"Explain Shell","text":"

    explainshell is a tool (with a web interface) capable of parsing man pages, extracting options, and explaining a given command line by matching each argument to the relevant help text in the man page.

    "},{"location":"71_python_resources/","title":"Python Resources","text":""},{"location":"71_python_resources/#python","title":"Python\u30d7\u30ed\u30b0\u30e9\u30df\u30f3\u30b0\u5165\u9580","text":""},{"location":"75_R_basics/","title":"R","text":""},{"location":"75_R_basics/#installing-r","title":"Installing R","text":""},{"location":"75_R_basics/#download-r-from-cran","title":"Download R from CRAN","text":"

    R can be downloaded from its official website CRAN (The Comprehensive R Archive Network).

    CRAN

    https://cran.r-project.org/

    "},{"location":"75_R_basics/#install-r-using-conda","title":"Install R using conda","text":"

    It is convenient to use conda to manage your R environment.

    conda install -c conda-forge r-base=4.x.x\n
    "},{"location":"75_R_basics/#ide-for-r-positrstudio","title":"IDE for R: Posit(Rstudio)","text":"

    Posit (RStudio) is one of the most commonly used integrated development environments (IDEs) for R.

    https://posit.co/

    "},{"location":"75_R_basics/#use-r-in-interactive-mode","title":"Use R in interactive mode","text":"
    R\n
    "},{"location":"75_R_basics/#run-r-script","title":"Run R script","text":"
    Rscript mycode.R\n
    "},{"location":"75_R_basics/#installing-and-using-r-packages","title":"Installing and Using R packages","text":"
    install.packages(\"package_name\")\n\nlibrary(package_name)\n
    "},{"location":"75_R_basics/#basic-syntax","title":"Basic syntax","text":""},{"location":"75_R_basics/#assignment-and-evaluation","title":"Assignment and Evaluation","text":"
    > x <- 1\n\n> x\n[1] 1\n\n> print(x)\n[1] 1\n
    "},{"location":"75_R_basics/#data-types","title":"Data types","text":""},{"location":"75_R_basics/#atomic-data-types","title":"Atomic data types","text":"

    logical, integer, real, complex, string (or character)

    Atomic data types Description Examples logical boolean TRUE, FALSE integer integer 1,2 numeric float number 0.01 complex complex number 1+0i string string or character abc"},{"location":"75_R_basics/#vectors","title":"Vectors","text":"
    myvector <- c(1,2,3)\nmyvector <- 1:3\n\nmyvector <- c(TRUE,FALSE)\nmyvector <- c(0.01, 0.02)\nmyvector <- c(1+0i, 2+3i)\nmyvector <- c(\"a\",\"bc\")\n
    "},{"location":"75_R_basics/#matrices","title":"Matrices","text":"
    > mymatrix <- matrix(1:6, nrow = 2, ncol = 3)\n> mymatrix\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n\n> ncol(mymatrix)\n[1] 3\n> nrow(mymatrix)\n[1] 2\n> dim(mymatrix)\n[1] 2 3\n> length(mymatrix)\n[1] 6\n
    "},{"location":"75_R_basics/#list","title":"List","text":"

    list() is a special vector-like data type that can contain different data types.

    > mylist <- list(1, 0.02, \"a\", FALSE, c(1,2,3), matrix(1:6,nrow=2,ncol=3))\n> mylist\n[[1]]\n[1] 1\n\n[[2]]\n[1] 0.02\n\n[[3]]\n[1] \"a\"\n\n[[4]]\n[1] FALSE\n\n[[5]]\n[1] 1 2 3\n\n[[6]]\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n
    "},{"location":"75_R_basics/#dataframe","title":"Dataframe","text":"
    > df <- data.frame(score = c(90,80,70,60),  rank = c(\"a\", \"b\", \"c\", \"d\"))\n> df\n  score rank\n1    90    a\n2    80    b\n3    70    c\n4    60    d\n
    "},{"location":"75_R_basics/#subsetting","title":"Subsetting","text":"
    > myvector\n[1] 1 2 3\n> myvector[0]\ninteger(0)\n> myvector[1]\n[1] 1\n> myvector[1:2]\n[1] 1 2\n> myvector[-1]\n[1] 2 3\n> myvector[-1:-2]\n[1] 3\n
    > mymatrix\n     [,1] [,2] [,3]\n[1,]    1    3    5\n[2,]    2    4    6\n> mymatrix[0]\ninteger(0)\n> mymatrix[1]\n[1] 1\n> mymatrix[1,]\n[1] 1 3 5\n> mymatrix[1,2]\n[1] 3\n> mymatrix[1:2,2]\n[1] 3 4\n> mymatrix[,2]\n[1] 3 4\n
    > df\n  score rank\n1    90    a\n2    80    b\n3    70    c\n4    60    d\n> df[score]\nError in `[.data.frame`(df, score) : object 'score' not found\n> df[[score]]\nError in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x,  :\n  object 'score' not found\n> df[[\"score\"]]\n[1] 90 80 70 60\n> df[\"score\"]\n  score\n1    90\n2    80\n3    70\n4    60\n> df[1, \"score\"]\n[1] 90\n> df[1:2, \"score\"]\n[1] 90 80\n> df[1:2,2]\n[1] \"a\" \"b\"\n> df[1:2,1]\n[1] 90 80\n> df[,c(\"rank\",\"score\")]\n  rank score\n1    a    90\n2    b    80\n3    c    70\n4    d    60\n
    "},{"location":"75_R_basics/#data-input-and-output","title":"Data Input and Output","text":"
    mydata <- read.table(\"data.txt\", header=T)\n\nwrite.table(mydata, \"data.txt\")\n
    "},{"location":"75_R_basics/#control-flow","title":"Control flow","text":""},{"location":"75_R_basics/#if","title":"if","text":"
    if (x > y){\n  print (\"x\")\n} else if (x < y){\n  print (\"y\")\n} else {\n  print(\"tie\")\n}\n
    "},{"location":"75_R_basics/#for","title":"for","text":"
    > for (x in 1:5) {\n    print(x)\n}\n\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n
    "},{"location":"75_R_basics/#while","title":"while","text":"
    x<-0\nwhile (x<5)\n{\n    x<-x+1\n    print(\"Hello world\")\n}\n\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n[1] \"Hello world\"\n
    "},{"location":"75_R_basics/#functions","title":"Functions","text":"
    myfunction <- function(x){\n  # actual code here\n  return(result)\n}\n\n> my_add_function <- function(x,y){\n  c = x + y\n  return(c)\n}\n> my_add_function(1,3)\n[1] 4\n
    "},{"location":"75_R_basics/#statistical-functions","title":"Statistical functions","text":""},{"location":"75_R_basics/#normal-distribution","title":"Normal distribution","text":"Function Description dnorm(x, mean = 0, sd = 1, log = FALSE) probability density function pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) cumulative density function qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) quantile function rnorm(n, mean = 0, sd = 1) generate random values from normal distribution
    > dnorm(1.96)\n[1] 0.05844094\n\n> pnorm(1.96)\n[1] 0.9750021\n\n> pnorm(1.96, lower.tail=FALSE)\n[1] 0.0249979\n\n> qnorm(0.975)\n[1] 1.959964\n\n> rnorm(10)\n [1] -0.05595019  0.83176199  0.58362601 -0.89434812  0.85722843  0.96199308\n [7]  0.47782706 -0.46322066  0.03525421 -1.00715141\n
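
    For example, a two-sided GWAS p-value can be recovered from a Z-statistic (a small sketch, reusing the Z_STAT value 1.56962 from the sumstats example earlier):

    z <- 1.56962\n\n# two-sided p-value from the standard normal distribution\n2 * pnorm(abs(z), lower.tail = FALSE)   # ~0.1165, matching the P column for this Z_STAT\n\n# equivalently, from a 1-df chi-square distribution\npchisq(z^2, df = 1, lower.tail = FALSE)\n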
    "},{"location":"75_R_basics/#chi-square-distribution","title":"Chi-square distribution","text":"Function Description dchisq(x, df, ncp = 0, log = FALSE) probability density function pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) cumulative density function qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) quantile function rchisq(n, df, ncp = 0) generate random values from normal distribution"},{"location":"75_R_basics/#regression","title":"Regression","text":"
    lm(formula, data, subset, weights, na.action,\n   method = \"qr\", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,\n   singular.ok = TRUE, contrasts = NULL, offset, \u2026)\n\n# linear regression\nresults <- lm(formula = y ~ x1 + x2)\n\n# logistic regression\nresults <- glm(formula = y ~ x1 + x2, family = \"binomial\")\n

    Reference: - https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

    "},{"location":"76_R_resources/","title":"R Resources","text":""},{"location":"80_anaconda/","title":"Anaconda","text":"

    Conda is an open-source package and environment management system.

    It is a very handy tool when you need to manage python packages.

    "},{"location":"80_anaconda/#download","title":"Download","text":"

    https://www.anaconda.com/products/distribution

    For example, download the latest Linux version:

    wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh\n

    "},{"location":"80_anaconda/#install","title":"Install","text":"
    # give it permission to execute\nchmod +x Anaconda3-2021.11-Linux-x86_64.sh \n\n# install\nbash ./Anaconda3-2021.11-Linux-x86_64.sh\n

    Follow the instructions at: https://docs.anaconda.com/anaconda/install/linux/

    If everything goes well, you can then see (base) before the prompt, which indicates the base environment:

    (base) [heyunye@gc019 ~]$\n

    For how to use conda, please check: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html

    Examples:

    # install a specific version of python package\nconda install pandas==1.5.2\n\n#create a new python 3.9 virtual environment with the name \"mypython39\"\nconda create -n mypython39 python=3.9\n\n#use environment.yml to create a virtual environment\nconda env create --file environment.yml\n\n# activate a virtual environment called ldsc\nconda activate ldsc\n\n# change back to base environment\nconda deactivate\n\n# list all packages in your current environment \nconda list\n\n# list all your current environments \nconda env list\n

    "},{"location":"81_jupyter_notebook/","title":"Jupyter notebook","text":"

    Usually, conda will install jupyter notebook (and the ipykernel) by default.

    If not, use conda to install it:

    conda install jupyter\n

    "},{"location":"81_jupyter_notebook/#using-jupyter-notebook-on-a-local-or-remote-server","title":"Using Jupyter notebook on a local or remote server","text":""},{"location":"81_jupyter_notebook/#using-the-default-configuration","title":"Using the default configuration","text":""},{"location":"81_jupyter_notebook/#local-machine","title":"Local machine","text":"

    You could open it in the Anaconda interface or some other IDE.

    If using the terminal, just type:

    jupyter-lab --port 9000 &          \n

    Then open the link in the browser.

    http://localhost:9000/lab?token=???\nhttp://127.0.0.1:9000/lab?token=???\n

    "},{"location":"81_jupyter_notebook/#remote-server","title":"Remote server","text":"

    Start jupyter-lab in the command line of the remote server, specifying a port.

    jupyter-lab --ip 0.0.0.0 --port 9000 --no-browser &\n
    It will generate an address in the same form as above.

    Then, on the local machine, use ssh to forward the port.

    ssh -NfL localhost:9000:localhost:9000 user@host\n
    Note that localhost:9000:localhost:9000 stands for localmachine:localport:remotemachine:remoteport, and user@host is the user ID and address of the remote server.

    When this is finished, open the above link in the browser.

    "},{"location":"81_jupyter_notebook/#using-customized-configuration","title":"Using customized configuration","text":"

    Steps:

    "},{"location":"81_jupyter_notebook/#create-the-configuration-file","title":"Create the configuration file","text":"

    Create a jupyter notebook configuration file if there is no such file

    jupyter notebook --generate-config\n

    The file is usually stored at:

    ~/.jupyter/jupyter_notebook_config.py\n

    The first few lines of the configuration file look like this:

    head ~/.jupyter/jupyter_notebook_config.py\n# Configuration file for jupyter-notebook.\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n
    "},{"location":"81_jupyter_notebook/#add-the-port-information","title":"Add the port information","text":"

    Simply add c.NotebookApp.port = 8889 to the configuration file and then save it. Note: you can change this to the port you want to use.

    # Configuration file for jupyter-notebook.\n\nc.NotebookApp.port = 8889\n\n#------------------------------------------------------------------------------\n# Application(SingletonConfigurable) configuration\n#------------------------------------------------------------------------------\n\n## This is an application.\n

    "},{"location":"81_jupyter_notebook/#run-jupyter-notebook-server-on-remote-host","title":"Run jupyter notebook server on remote host","text":"

    On the host side, set up the jupyter notebook server:

    jupyter notebook\n

    "},{"location":"81_jupyter_notebook/#use-ssh-tunnel-to-connect-to-the-remote-server-from-your-local-machine","title":"Use ssh tunnel to connect to the remote server from your local machine","text":"

    On your local machine, use ssh tunnel to connect to the jupyter notebook server:

    ssh -N -f -L localhost:8889:localhost:8889 username@your_remote_host_name\n
    "},{"location":"81_jupyter_notebook/#use-jupyter-notebook-in-your-browser","title":"Use jupyter notebook in your browser","text":"

    Then you can access Jupyter Notebook in your local browser using the link generated by the jupyter notebook server. http://127.0.0.1:8889/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    "},{"location":"82_windows_linux_subsystem/","title":"Window Linux Subsystem","text":"

    In this section, we will briefly demonstrate how to install a Linux subsystem on Windows.

    "},{"location":"82_windows_linux_subsystem/#official-documents","title":"Official Documents","text":""},{"location":"82_windows_linux_subsystem/#prerequisites","title":"Prerequisites","text":"

    \"You must be running Windows 10 version 2004 and higher (Build 19041 and higher) or Windows 11.\"

    "},{"location":"82_windows_linux_subsystem/#steps","title":"Steps","text":"

    "},{"location":"83_git_and_github/","title":"Git and Github","text":""},{"location":"83_git_and_github/#git","title":"Git","text":"

    Git is very powerful version control software. Git can track the changes in all the files of your projects and allows collaboration of multiple contributors.

    For details, please check: https://git-scm.com/

    "},{"location":"83_git_and_github/#github","title":"Github","text":"

    Github is an online platform, offering a cloud-based Git repository.

    https://github.com/

    "},{"location":"83_git_and_github/#create-a-new-id","title":"Create a new id","text":"

    Github signup page:

    https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home

    "},{"location":"83_git_and_github/#clone-a-repository","title":"Clone a repository","text":"

    Syntax: git clone <the url you just copied>

    Example: git clone https://github.com/Cloufield/GWASTutorial.git

    "},{"location":"83_git_and_github/#update-the-current-repository","title":"Update the current repository","text":"

    git pull

    "},{"location":"83_git_and_github/#git-setup","title":"git setup","text":"
    $ git config --global user.name \"myusername\"\n$ git config --global user.email myusername@myemail.com\n
    "},{"location":"83_git_and_github/#create-access-tokens","title":"Create access tokens","text":"

    Please see github official documents on how to create a personal token:

    https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

    Useful Resources

    "},{"location":"84_ssh/","title":"SSH","text":"

    SSH stands for Secure Shell Protocol, which enables you to connect to a remote server securely.

    "},{"location":"84_ssh/#login-to-remote-server","title":"Login to remote server","text":"
    ssh <username>@<host>\n

    Before you log in, you need to generate keys for the ssh connection:

    "},{"location":"84_ssh/#keys","title":"Keys","text":"

    ssh-keygen -t rsa -b 4096\n
    You will get two keys, a public one and a private one.

    Warning

    Don't share your private key with others.

    What you need to do is just add your local public key to ~/.ssh/authorized_keys on the host server.

    "},{"location":"84_ssh/#file-transfer","title":"File transfer","text":"

    Suppose you are using a local machine:

    Download files from the remote host to the local machine

    scp <username>@<host>:remote_path local_path\n

    Upload files from the local machine to the remote host

    scp local_path <username>@<host>:remote_path\n

    Info

    -r : copy recursively. This option is needed when you want to transfer an entire directory.

    Example

    Copy the local work directory to remote home directory

    $ scp -r /home/gwaslab/work gwaslab@remote.com:/home/gwaslab \n

    "},{"location":"84_ssh/#ssh-tunneling","title":"SSH Tunneling","text":"

    Quote

    In this forwarding type, the SSH client listens on a given port and tunnels any connection to that port to the specified port on the remote SSH server, which then connects to a port on the destination machine. The destination machine can be the remote SSH server or any other machine. https://linuxize.com/post/how-to-setup-ssh-tunneling/

    -L : Local port forwarding

    ssh -L [local_IP:]local_PORT:destination:destination_PORT <username>@<host>\n
    "},{"location":"84_ssh/#other-ssh-options","title":"other SSH options","text":""},{"location":"85_job_scheduler/","title":"Job scheduling system","text":"

    (If needed) Try to use a job scheduling system to run a simple script:

    Two of the most commonly used job scheduling systems:

    "},{"location":"90_Recommended_Reading/","title":"Recommended reading","text":""},{"location":"90_Recommended_Reading/#textbooks","title":"Textbooks","text":"Year Category Reference 2020 Statistical Genetics An Introduction to Statistical Genetic Data Analysis By Melinda C. Mills, Nicola Barban and Felix C. Tropf https://mitpress.mit.edu/books/introduction-statistical-genetic-data-analysis 2019 Statistical Genetics Handbook of Statistical Genomics: Fourth Edition https://onlinelibrary.wiley.com/doi/book/10.1002/9781119487845 2009 Statistical Analysis and Machine Learning The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)introduction-statistical-genetic-data-analysis. Trevor Hastie, Robert Tibshirani, Jerome Friedman. https://hastie.su.domains/ElemStatLearn/ (PDF book is available)"},{"location":"90_Recommended_Reading/#overview-reviews","title":"Overview Reviews","text":"Year Reference Link 2021 Uffelmann, E., Huang, Q. Q., Munung, N. S., De Vries, J., Okada, Y., Martin, A. R., \u2026 & Posthuma, D. (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1(1), 1-21. Pubmed 2019 Tam, V., Patel, N., Turcotte, M., Boss\u00e9, Y., Par\u00e9, G., & Meyre, D. (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484. Pubmed 2017 Pasaniuc, B., & Price, A. L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nature reviews genetics, 18(2), 117-127. Pubmed 2023 Abdellaoui, A., Yengo, L., Verweij, K. J., & Visscher, P. M. (2023). 15 years of GWAS discovery: Realizing the promise. The American Journal of Human Genetics. Pubmed 2017 Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1), 5-22. Pubmed 2005 Hirschhorn, J. N., & Daly, M. J. (2005). Genome-wide association studies for common diseases and complex traits. Nature reviews genetics, 6(2), 95-108. Pubmed 2006 Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nature reviews genetics, 7(10), 781-791. Pubmed 2008 McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5), 356-369. Pubmed 2010 Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature reviews genetics, 11(7), 459-463. Pubmed 2009 Ioannidis, J., Thomas, G., & Daly, M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics, 10(5), 318-329. Pubmed"},{"location":"90_Recommended_Reading/#topic-specific","title":"Topic-specific","text":""},{"location":"90_Recommended_Reading/#ld","title":"LD","text":"Year Reference Link 2008 Slatkin, M. (2008). Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477-485. Pubmed"},{"location":"90_Recommended_Reading/#imputation","title":"Imputation","text":"Year Reference Link 2010 Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7), 499-511. Pubmed 2018 Das S, Abecasis GR, Browning BL. (2018). 
Genotype Imputation from Large Reference Panels. Annu. Rev. Genomics Hum. Genet. link"},{"location":"90_Recommended_Reading/#heritability","title":"Heritability","text":"Year Reference Link 2017 Yang, J., Zeng, J., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2017). Concepts, estimation and interpretation of SNP-based heritability. Nature genetics, 49(9), 1304-1310. Pubmed 2009 Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., \u2026 & Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461 (7265), 747-753. Pubmed"},{"location":"90_Recommended_Reading/#genetic-correlation","title":"Genetic correlation","text":"Year Reference Link 2019 Van Rheenen, W., Peyrot, W. J., Schork, A. J., Lee, S. H., & Wray, N. R. (2019). Genetic correlations of polygenic disease traits: from theory to practice. Nature Reviews Genetics, 20(10), 567-581. Pubmed"},{"location":"90_Recommended_Reading/#fine-mapping","title":"Fine-mapping","text":"Year Reference Link 2019 Schaid, D. J., Chen, W., & Larson, N. B. (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics, 19(8), 491-504. Pubmed 2023 \u738b \u9752\u6ce2, \u30b2\u30ce\u30e0\u30ef\u30a4\u30c9\u95a2\u9023\u89e3\u6790\u306e\u305d\u306e\u5148\u3078\uff1a\u7d71\u8a08\u7684fine-mapping\u306e\u57fa\u790e\u3068\u767a\u5c55, JSBi Bioinformatics Review, 2023, 4 \u5dfb, 1 \u53f7, p. 35-51 J-STAGE ### Polygenic risk scores Year Reference Link 2022 Wang, Y., Tsuo, K., Kanai, M., Neale, B. M., & Martin, A. R. (2022). Challenges and opportunities for developing more generalizable polygenic risk scores. Annual review of biomedical data science. link 2020 Choi, S. W., Mak, T. S. H., & O\u2019Reilly, P. F. (2020). Tutorial: a guide to performing polygenic risk score analyses. Nature protocols, 15(9), 2759-2772. Pubmed 2019 Martin, A. R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B. M., & Daly, M. J. (2019). Clinical use of current polygenic risk scores may exacerbate health disparities. Nature genetics, 51(4), 584-591. Pubmed"},{"location":"90_Recommended_Reading/#rare-variants","title":"Rare variants","text":"Year Reference Link 2014 Lee, S., Abecasis, G. R., Boehnke, M., & Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1), 5-23. Pubmed 2015 Auer, P. L., & Lettre, G. (2015). Rare variant association studies: considerations, challenges and opportunities. Genome medicine, 7(1), 1-11. Pubmed"},{"location":"90_Recommended_Reading/#genetic-architecture","title":"Genetic architecture","text":"Year Reference Link 2018 Timpson, N. J., Greenwood, C. M., Soranzo, N., Lawson, D. J., & Richards, J. B. (2018). Genetic architecture: the shape of the genetic contribution to human traits and disease. Nature Reviews Genetics, 19(2), 110-124. 
Pubmed"},{"location":"90_Recommended_Reading/#useful-websites","title":"Useful Websites","text":"Description Link A Bioinformatician's UNIX Toolbox http://lh3lh3.users.sourceforge.net/biounix.shtml Osaka university, Department of Statistical Genetics Homepage http://www.sg.med.osaka-u.ac.jp/school_2021.html Genome analysis wiki (Abecasis Group Wiki) https://genome.sph.umich.edu/wiki/Main_Page EPI 511, Advanced Population and Medical Genetics(Alkes Price, Harvard School of Public Health) https://alkesgroup.broadinstitute.org/EPI511 fiveMinuteStats(Matthew Stephens, Statistics and Human Genetics at the University of Chicago) https://stephens999.github.io/fiveMinuteStats Course homepage and digital textbook for Human Genome Variation with Computational Lab https://mccoy-lab.github.io/hgv_modules/"},{"location":"90_Recommended_Reading/#_1","title":"\u548c\u6587","text":"Year Category Reference 2015 Linux \u65b0\u3057\u3044Linux\u306e\u6559\u79d1\u66f8 \u5358\u884c\u672c \u2013 2015/6/6 \u4e09\u5b85 \u82f1\u660e (\u8457), \u5927\u89d2 \u7950\u4ecb (\u8457) 2012 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u306f\u3058\u3081\u3066\u306e\u30d1\u30bf\u30fc\u30f3\u8a8d\u8b58 \u5358\u884c\u672c\uff08\u30bd\u30d5\u30c8\u30ab\u30d0\u30fc\uff09 \u2013 2012/7/31 \u5e73\u4e95 \u6709\u4e09 (\u8457) 1991 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u7d71\u8a08\u5b66\u5165\u9580 (\u57fa\u790e\u7d71\u8a08\u5b66\u2160) \u5358\u884c\u672c \u2013 1991/7/9 \u6771\u4eac\u5927\u5b66\u6559\u990a\u5b66\u90e8\u7d71\u8a08\u5b66\u6559\u5ba4 (\u7de8\u96c6) 1992 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u81ea\u7136\u79d1\u5b66\u306e\u7d71\u8a08\u5b66 (\u57fa\u790e\u7d71\u8a08\u5b66) \u5358\u884c\u672c \u2013 1992/8/1 \u6771\u4eac\u5927\u5b66\u6559\u990a\u5b66\u90e8\u7d71\u8a08\u5b66\u6559\u5ba4 (\u7de8\u96c6) 2012 \u7d71\u8a08\u89e3\u6790\uff08\u3068\u5c11\u3057\u6a5f\u68b0\u5b66\u7fd2\uff09 \u30c7\u30fc\u30bf\u89e3\u6790\u306e\u305f\u3081\u306e\u7d71\u8a08\u30e2\u30c7\u30ea\u30f3\u30b0\u5165\u9580\u2015\u2015\u4e00\u822c\u5316\u7dda\u5f62\u30e2\u30c7\u30eb\u30fb\u968e\u5c64\u30d9\u30a4\u30ba\u30e2\u30c7\u30eb\u30fbMCMC (\u78ba\u7387\u3068\u60c5\u5831\u306e\u79d1\u5b66) \u5358\u884c\u672c \u2013 2012/5/19 \u4e45\u4fdd \u62d3\u5f25 (\u8457) 2015 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u907a\u4f1d\u7d71\u8a08\u5b66\u5165\u9580 (\u5ca9\u6ce2\u30aa\u30f3\u30c7\u30de\u30f3\u30c9\u30d6\u30c3\u30af\u30b9) \u30aa\u30f3\u30c7\u30de\u30f3\u30c9 (\u30da\u30fc\u30d1\u30fc\u30d0\u30c3\u30af) \u2013 2015/12/10 \u938c\u8c37 \u76f4\u4e4b (\u8457) 2020 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u5b9f\u9a13\u533b\u5b66 2020\u5e743\u6708 Vol.38 No.4 GWAS\u3067\u8907\u96d1\u5f62\u8cea\u3092\u89e3\u304f\u305e! 
\u301c\u591a\u56e0\u5b50\u75be\u60a3\u30fb\u5f62\u8cea\u306e\u30d0\u30a4\u30aa\u30ed\u30b8\u30fc\u306b\u6311\u3080\u6b21\u4e16\u4ee3\u306e\u30b2\u30ce\u30e0\u533b\u79d1\u5b66 \u5358\u884c\u672c \u2013 2020/2/23 \u938c\u8c37 \u6d0b\u4e00\u90ce (\u8457) 2020 \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u30bc\u30ed\u304b\u3089\u5b9f\u8df5\u3059\u308b \u907a\u4f1d\u7d71\u8a08\u5b66\u30bb\u30df\u30ca\u30fc\u301c\u75be\u60a3\u3068\u30b2\u30ce\u30e0\u3092\u7d50\u3073\u3064\u3051\u308b \u5358\u884c\u672c \u2013 2020/3/13 \u5ca1\u7530 \u968f\u8c61 (\u8457) ~ \u907a\u4f1d\u7d71\u8a08\u5b66\u5168\u822c \uff08\u57fa\u790e\u304b\u3089\u767a\u5c55\u307e\u3067\uff09 \u907a\u4f1d\u5b50\u533b\u5b66 \u30b7\u30ea\u30fc\u30ba\u4f01\u753b Statistical Genetics\u3000\u3008\u907a\u4f1d\u7d71\u8a08\u5b66\u306e\u57fa\u790e\u3009 - \u938c\u8c37 \u6d0b\u4e00\u90ce + \u03b1"},{"location":"95_Assignment/","title":"Self training","text":""},{"location":"95_Assignment/#pca-using-1000-genome-project-dataset","title":"PCA using 1000 Genome Project Dataset","text":"

    In this self-learning module, we would like you to get your hands on the 1000 Genomes Project data and apply the skills you have learned to this mini-project.

    Aim

    Aim:

    1. Download 1000 Genome VCF files.
    2. Perform PCA using 1000 Genome samples.
    3. Plot the PCs of these individuals.
    4. Interpret the results.

    Here is a brief overview of this mini project.

    The ultimate goal of this assignment is simple, which is to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.

    Tip

    Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.

    "},{"location":"95_Assignment/#1-download-the-publicly-available-1000-genome-vcf","title":"1. Download the publicly available 1000 Genome VCF","text":"

    Download the files we need from 1000 Genomes Project FTP site:

    1. Autosome VCF files
    2. Ancestry information file
    3. Reference genome sequence
    4. Strict mask

    Tip

    Note

    If it takes too long or if you are using your local laptop, you can just download the files for chr1.

    Sample shell script for downloading the files

    #!/bin/bash\nfor chr in $(seq 1 22)  #Note: If it takes too long, you can download just chr1.\ndo\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi\ndone\n\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz\nwget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai\n\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel\nwget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed\n
    "},{"location":"95_Assignment/#2-re-align-normalize-and-remove-duplication","title":"2. Re-align, normalize and remove duplication","text":"

    We need to use bcftools to process the raw vcf files.

    Install bcftools

    http://www.htslib.org/download/

    Since the variants are not normalized and also have many duplications, we need to clean the vcf files.

    Re-align with the reference genome, normalize variants and remove duplications

    #!/bin/bash\nfor chr in $(seq 1 22)\ndo\n    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \\\n      ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \\\n      bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \\\n        bcftools norm -Ob --rm-dup both \\\n          > ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \n    bcftools index ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf\ndone\n
    "},{"location":"95_Assignment/#3-convert-vcf-files-to-plink-binary-format","title":"3. Convert VCF files to plink binary format","text":"

    Example

    #!/bin/bash\nfor chr in $(seq 1 22)\ndo\nplink \\\n      --bcf ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \\\n      --keep-allele-order \\\n      --vcf-idspace-to _ \\\n      --const-fid \\\n      --allow-extra-chr 0 \\\n      --split-x b37 no-fail \\\n      --make-bed \\\n      --out ALL.chr\"${chr}\".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes\ndone\n
    "},{"location":"95_Assignment/#4-using-snps-only-in-strict-masks","title":"4. Using SNPs only in strict masks","text":"

    Strict masks are in this directory.

    Strict mask

    Regions overlapping this mask are \u201ccallable\u201d (i.e., yield credible variant calls). This mask was developed in the 1KG main paper, and it is well explained at https://www.biostars.org/p/219634/

    Tip

    Use plink --make-set option with the BED files to extract SNPs in the strict mask.

    "},{"location":"95_Assignment/#5-qc-it-and-prune-it-to-100k-variants","title":"5. QC it and prune it to ~ 100K variants.","text":"

    Tip

    Use PLINK.

    QC: only SNPs (exclude indels), MAF>0.1

    Pruning: plink --indep-pairwise

    "},{"location":"95_Assignment/#6-perform-pca","title":"6. Perform PCA","text":"

    Tip

    plink --pca

    "},{"location":"95_Assignment/#7-visualization-and-interpretation","title":"7. Visualization and interpretation.","text":"

    Draw PC1 - PC2 plot and color each individual by ancestry information (from ALL.panel file). Interpret the result.

    Tip

    You can use R, python, or any other tools you like (even Excel can do the job.)

    (If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)

    "},{"location":"95_Assignment/#checklist","title":"Checklist","text":""},{"location":"95_Assignment/#reference","title":"Reference","text":""},{"location":"96_Assignment2/","title":"The final presentation for \u57fa\u790e\u6f14\u7fd2II","text":"

    Note

    "},{"location":"96_Assignment2/#outline","title":"Outline","text":"

    (Just an example, there is no need to strictly follow this.)

    "},{"location":"99_About/","title":"GWAS Tutorial - Fundamental Exercise II","text":"

    This tutorial is provided by the Laboratory of Complex Trait Genomics (Kamatani Lab) in the Department of Computational Biology and Medical Sciences at the University of Tokyo. This tutorial is designed for the graduate course Fundamental Exercise II.

    "},{"location":"99_About/#main-contributors","title":"Main Contributors","text":""},{"location":"99_About/#contact-us","title":"Contact Us","text":"

    This repository is currently maintained by Yunye He.

    If you have any questions or suggestions, please feel free to contact gwaslab@gmail.com.

    Enjoy this real \"Manhattan plot\"!

    "},{"location":"Imputation/","title":"Imputation","text":"

    Missing data imputation is not a task specific to genetic studies. By comparing the genotyping array (generally 500k\u20131M markers) to a reference panel (WGSed), missing markers on the array are filled in. Tabular data imputation methods could be used to impute the genotype data. However, because haplotypes coalesce from common ancestors and recombination events occur during gametogenesis, each individual's haplotype is a mosaic of all haplotypes in a population. Given these properties, hidden Markov model (HMM) based methods usually outperform tabular data-based ones.

    This HMM was first described in Li & Stephens 2003. Here we will not go through all the tools developed over the past 20 years; we will introduce the concept and the usage of Minimac.

    "},{"location":"Imputation/#figure-illustration","title":"Figure illustration","text":"

    In the figure, each row in the top panel represents a reference haplotype. The middle panel shows the genotyping array. Genotyped markers are shown as squares and WGS-only markers as circles. The two colors represent the ref and alt alleles. You could also think of them as representing different haplotype fragments. The red triangles indicate the recombination hot spots, where a crossover between the reference haplotypes is more likely to happen.

    Given the genotyped markers, matching probabilities are calculated for all potential paths through the reference haplotypes. Then, in this example (the real case is not this simple), we assume a free recombination at each recombination hotspot. You will see that all paths chained by dark blue match 2 of the 4 genotyped markers, so these paths have equal probability.

    Finally, missing markers are filled with the probability-weighted alleles on each path. For the left three circles, two paths are cyan and one path is orange, so the imputation result will be 1/3 orange and 2/3 cyan.
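
    To make the weighting concrete, here is a toy sketch in R (the numbers are the ones assumed in the figure example; the orange allele is coded as 1 and cyan as 0):

    path_prob <- c(1/3, 1/3, 1/3)   # three equally likely paths\npath_allele <- c(0, 0, 1)       # allele carried by each path at the missing marker\nsum(path_prob * path_allele)    # probability of orange = 1/3, so cyan = 2/3\n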

    "},{"location":"Imputation/#how-to-do-imputation","title":"How to do imputation","text":"

    The simplest way is to use the Michigan or TOPMed imputation server, if you don't have WGS reference data of your own. Just make your vcf, submit it to the server, and select your preferred reference panel. There are built-in phasing, liftover, and QC steps on the server, but we would strongly suggest checking the data and doing these steps by yourself. For example:

    Another way is to run the job locally. Recent tools are memory- and computation-efficient, so you may run them on a small in-house server or even a PC.

    A typical workflow of Minimac is:

    Parameter estimation (this step will create an m3vcf reference panel file):

    Minimac3 \\\n  --refHaps ./phased_reference.vcf.gz \\\n  --processReference \\\n  --prefix ./phased_reference \\\n  --log\n

    Imputation:

    minimac4 \\\n  --refHaps ./phased_reference.m3vcf \\\n  --haps ./phased_target.vcf.gz \\\n  --prefix ./result \\\n  --format GT,DS,HDS,GP,SD \\\n  --meta \\\n  --log \\\n  --cpus 10\n

    Details of the options.

    "},{"location":"Imputation/#after-imputation","title":"After imputation","text":"

    The output is a vcf file. First, we need to examine the imputation quality. This can be a long story and we will not explain it in detail here. Most of the time, the following criterion works well:

    The standard imputation quality metric, named Rsq, efficiently discriminates the well-imputed variants at a threshold of 0.7 (which may be loosened to 0.3 to allow more variants into the GWAS).

    "},{"location":"Imputation/#before-gwas","title":"Before GWAS","text":"

    Three types of genotypes are widely used in GWAS -- best-guess genotype, allelic dosage, and genotype probability. Using dosage (DS) keeps the dataset smallest, while most association test software requires only this information.

    "},{"location":"PRS_evaluation/","title":"Polygenic risk scores evaluation","text":""},{"location":"PRS_evaluation/#regressions-for-evaluation-of-prs","title":"Regressions for evaluation of PRS","text":"\\[Phenotype \\sim PRS_{phenotype} + Covariates\\] \\[logit(P) \\sim PRS_{phenotype} + Covariates\\]

    Covariates usually include sex, age, and the top 10 PCs.
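
    A minimal sketch of these two regressions in R, assuming a data frame d with placeholder columns pheno, prs, sex, age and PC1-PC10 (these names are illustrative assumptions, not from the tutorial data):

    # build the covariate part of the formula: PC1 + PC2 + ... + PC10\ncovars <- paste0(\"PC\", 1:10, collapse = \" + \")\nf <- as.formula(paste(\"pheno ~ prs + sex + age +\", covars))\n\n# quantitative phenotype: linear regression\nfit_quant <- lm(f, data = d)\n\n# binary phenotype: logistic regression (logit link)\nfit_bin <- glm(f, family = \"binomial\", data = d)\nsummary(fit_bin)\n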

    "},{"location":"PRS_evaluation/#evaluation","title":"Evaluation","text":""},{"location":"PRS_evaluation/#roc-aic-auc-and-c-index","title":"ROC, AIC, AUC, and C-index","text":"

    ROC

    ROC: receiver operating characteristic curve shows the performance of a classification model at all thresholds.

    AUC

    AUC: area under the ROC Curve, a common measure for the performance of a classification model.

    AIC

    Akaike Information Criterion (AIC): a measure for comparison of different statistical models.

    \[AIC = 2k - 2\ln(\hat{L})\]

    C-index

    C-index: Harrell\u2019s C-index (concordance index), a metric to evaluate the predictive performance of models, commonly used in survival analysis. It is the probability that, for two randomly selected individuals \(i\) and \(j\), the predicted scores \(M_i\) and \(M_j\) have the reverse relative order of their true event times \(T_i, T_j\).

    \\[ C = Pr (M_j > M_i | T_j < T_i) \\]

    Interpretation: individuals with higher scores should have higher risks of the disease events.
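
    For a binary phenotype, the C-index reduces to the AUC, which can be computed in base R from the rank-sum (Mann-Whitney) statistic; a small sketch, reusing the placeholder model fit_bin from the sketch above:

    score <- predict(fit_bin)   # linear predictor used as the risk score\ny <- fit_bin$y              # observed 0/1 outcome\n\n# AUC via the rank-sum statistic; equals the C-index for a binary outcome\nn1 <- sum(y == 1); n0 <- sum(y == 0)\n(sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)\n\n# AIC for comparing models, e.g. with vs without the PRS term\nAIC(fit_bin)\n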

    "},{"location":"PRS_evaluation/#r2-and-pseudo-r2","title":"R2 and pseudo-R2","text":"

    Coefficient of determination

    \\(R^2\\) : coefficient of determination, which measures the amount of variance explained by the regression model.

    In linear regression:

    \\[ R^2 = 1 - {{RSS}\\over{TSS}} \\]

    Pseudo-R2 (Nagelkerke)

    In logistic regression,

    One of the most commonly used Pseudo-R2 for PRS analysis is Nagelkerke's \\(R^2\\)

    \\[R^2_{Nagelkerke} = {{1 - ({{L_0}\\over{L_M}})^{2/n}}\\over{1 - L_0^{2/n}}}\\] "},{"location":"PRS_evaluation/#r2-on-the-liability-scale-lee","title":"R2 on the liability scale (Lee)","text":"

    R2 on liability scale

    \\(R^2\\) on the liability scale for ascertained case-control studies

    \\[ R^2_l = {{R_o^2 C}\\over{1 + R_o^2 \\theta C }} \\]

    Reference : Lee, S. H., Goddard, M. E., Wray, N. R., & Visscher, P. M. (2012). A better coefficient of determination for genetic profile analysis. Genetic epidemiology, 36(3), 214-224.

    The authors also provided R code for the calculation (unrelated code removed for simplicity):

    # R2 on the liability scale using the transformation\n\n# nt    = total number of samples\n# ncase = number of cases\n# ncont = number of controls\n# thd   = the threshold on the normal distribution which truncates the proportion of disease prevalence\n# K     = population prevalence\n# P     = proportion of cases in the case-control samples\n\n#threshold\nthd = -qnorm(K,0,1)\n\n#value of standard normal density function at thd\nzv = dnorm(thd) \n\n#mean liability for case\nmv = zv/K \n\n#linear model\nlmv = lm(y~g) \n\n#R2O : R2 on the observed scale\nR2O = var(lmv$fitted.values)/(ncase/nt*ncont/nt)\n\n# calculate correction factors\ntheta = mv*(P-K)/(1-K)*(mv*(P-K)/(1-K)-thd) \ncv = K*(1-K)/zv^2*K*(1-K)/(P*(1-P)) \n\n# convert to R2 on the liability scale\nR2 = R2O*cv/(1+R2O*theta*cv)\n
    "},{"location":"PRS_evaluation/#bootstrap-confidence-interval-methods-for-r2","title":"Bootstrap Confidence Interval Methods for R2","text":"

    Bootstrap is a commonly used resampling method that generates a sampling distribution from the observed dataset by repeatedly taking random samples with replacement from it.

    Steps: (1) draw a random sample with replacement, of the same size as the original dataset; (2) compute the parameter of interest (for example \(R^2\)) on this bootstrap sample; (3) repeat the first two steps many times to obtain the distribution of the parameter.

    The percentile bootstrap interval is then defined as the interval between \\(100 \\times \\alpha /2\\) and \\(100 \\times (1 - \\alpha /2)\\) percentiles of the parameters estimated by bootstrapping. We can use this method to estimate the bootstrap interval for \\(R^2\\).
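
    A base-R sketch combining the two ideas, bootstrapping Nagelkerke's \(R^2\) (implemented from the formula in the previous section) for a PRS logistic model; the data frame d with placeholder columns pheno and prs is an assumption:

    nagelkerke_r2 <- function(d){\n  fit0 <- glm(pheno ~ 1, family = \"binomial\", data = d)     # null model\n  fit1 <- glm(pheno ~ prs, family = \"binomial\", data = d)   # model with PRS\n  ll0 <- as.numeric(logLik(fit0)); ll1 <- as.numeric(logLik(fit1)); n <- nobs(fit1)\n  r2cs <- 1 - exp(2 * (ll0 - ll1) / n)   # Cox-Snell R2\n  r2cs / (1 - exp(2 * ll0 / n))          # rescale to 0-1: Nagelkerke R2\n}\n\nset.seed(1)\nr2_boot <- replicate(1000, nagelkerke_r2(d[sample(nrow(d), replace = TRUE), ]))\n\n# 95% percentile bootstrap interval\nquantile(r2_boot, c(0.025, 0.975))\n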

    "},{"location":"PRS_evaluation/#reference","title":"Reference","text":""},{"location":"Phasing/","title":"Phasing","text":"

    The human genome is diploid. The distribution of variants between homologous chromosomes can affect the interpretation of genotype data, for example in allele-specific expression, context-informed annotation, and loss-of-function compound heterozygous events.

    Example

    ( SHAPEIT5 )

    In the above illustration, when LoF variants are on both copies of a gene, the gene is considered knocked out.

    Trio data and long-read sequencing can solve the haplotyping problem, but they are not always available. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are independent, the unphased genotype can be constructed as the sum of two haplotypes, as the toy sketch below shows.
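
    A toy sketch of this last point (an illustration, not from the tutorial):

    h_mat <- c(0, 1, 1, 0)  # maternal haplotype (0 = REF allele, 1 = ALT allele)\nh_pat <- c(0, 0, 1, 1)  # paternal haplotype\ng <- h_mat + h_pat      # unphased genotype: ALT allele counts\ng                       # 0 1 2 1\n\n# only heterozygous sites (g == 1) are ambiguous to phase:\n# k het sites allow 2^(k - 1) distinct phasings\nsum(g == 1)             # 2\n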

    Recent methods have incorporated long IBD sharing, local haplotypes, etc., to make phasing tractable for large datasets. You could read about the following methods if you are interested.

    "},{"location":"Phasing/#how-to-do-phasing","title":"How to do phasing","text":"

    In most cases, phasing is just a pre-step of imputation, and we do not care much about how the phasing goes. But there are several considerations, such as reference-based versus reference-free phasing, large versus small sample sizes, and the rare-variant cutoff. There is no single method that best fits all cases.

    Here I show one example using EAGLE2.

    eagle \\\n    --vcf=target.vcf.gz \\\n    --geneticMapFile=genetic_map_hg19_withX.txt.gz \\\n    --chrom=19 \\\n    --outPrefix=target.eagle \\\n    --numThreads=10\n
    "},{"location":"TwoSampleMR/","title":"TwoSampleMR Tutorial","text":"In\u00a0[1]: Copied!
    library(data.table)\nlibrary(TwoSampleMR)\n
    TwoSampleMR version 0.5.6 \n[>] New: Option to use non-European LD reference panels for clumping etc\n[>] Some studies temporarily quarantined to verify effect allele\n[>] See news(package='TwoSampleMR') and https://gwas.mrcieu.ac.uk for further details\n\n\n
    In\u00a0[2]:
    exp_raw <- fread(\"koges_bmi.txt.gz\")\n\nexp_raw <- subset(exp_raw,exp_raw$pval<5e-8)\n\nexp_raw$phenotype <- \"BMI\"\n\nexp_raw$n <- 72282\n\nexp_dat <- format_data( exp_raw,\n    type = \"exposure\",\n    snp_col = \"rsids\",\n    beta_col = \"beta\",\n    se_col = \"sebeta\",\n    effect_allele_col = \"alt\",\n    other_allele_col = \"ref\",\n    eaf_col = \"af\",\n    pval_col = \"pval\",\n    phenotype_col = \"phenotype\",\n    samplesize_col= \"n\"\n)\nclumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")\n
    exp_raw <- fread(\"koges_bmi.txt.gz\") exp_raw <- subset(exp_raw,exp_raw$pval<5e-8) exp_raw$phenotype <- \"BMI\" exp_raw$n <- 72282 exp_dat <- format_data( exp_raw, type = \"exposure\", snp_col = \"rsids\", beta_col = \"beta\", se_col = \"sebeta\", effect_allele_col = \"alt\", other_allele_col = \"ref\", eaf_col = \"af\", pval_col = \"pval\", phenotype_col = \"phenotype\", samplesize_col= \"n\" ) clumped_exp <- clump_data(exp_dat,clump_r2=0.01,pop=\"EAS\")
    Warning message in .fun(piece, ...):\n\u201cDuplicated SNPs present in exposure data for phenotype 'BMI'. Just keeping the first instance:\nrs4665740\nrs7201608\n\u201d\nAPI: public: http://gwas-api.mrcieu.ac.uk/\n\nPlease look at vignettes for options on running this locally if you need to run many instances of this command.\n\nClumping rvi6Om, 2452 variants, using EAS population reference\n\nRemoving 2420 of 2452 variants due to LD with other variants or absence from LD reference panel\n\n
    In\u00a0[16]:
    out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\",\n                    select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\"))\n\nout_raw$phenotype <- \"T2D\"\n\nout_dat <- format_data( out_raw,\n    type = \"outcome\",\n    snp_col = \"SNPID\",\n    beta_col = \"BETA\",\n    se_col = \"SE\",\n    effect_allele_col = \"Allele2\",\n    other_allele_col = \"Allele1\",\n    pval_col = \"p.value\",\n    phenotype_col = \"phenotype\",\n    samplesize_col= \"n\",\n    eaf_col=\"AF_Allele2\"\n)\n
    out_raw <- fread(\"hum0197.v3.BBJ.T2D.v1/GWASsummary_T2D_Japanese_SakaueKanai2020.auto.txt.gz\", select=c(\"SNPID\",\"Allele1\",\"Allele2\",\"BETA\",\"SE\",\"p.value\",\"N\",\"AF_Allele2\")) out_raw$phenotype <- \"T2D\" out_dat <- format_data( out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", se_col = \"SE\", effect_allele_col = \"Allele2\", other_allele_col = \"Allele1\", pval_col = \"p.value\", phenotype_col = \"phenotype\", samplesize_col= \"n\", eaf_col=\"AF_Allele2\" )
    Warning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201ceffect_allele column has some values that are not A/C/T/G or an indel comprising only these characters or D/I. These SNPs will be excluded.\u201d\nWarning message in format_data(out_raw, type = \"outcome\", snp_col = \"SNPID\", beta_col = \"BETA\", :\n\u201cThe following SNP(s) are missing required information for the MR tests and will be excluded\n1:1142714:t:<cn0>\n1:4288465:t:<ins:me:alu>\n1:4882232:t:<cn0>\n... (several hundred further chromosome-1 structural variants -- <cn0>, <cn2>, <inv> and <ins:me:...> records -- omitted here for brevity) ...\n1:205178526:t:<inv>\u201d\n
    In\u00a0[17]: Copied!
    # action = 1: assume all alleles are coded on the forward strand (no palindromic-SNP handling)\nharmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)\n
    harmonized_data <- harmonise_data(clumped_exp,out_dat,action=1)
    Harmonising BMI (rvi6Om) and T2D (ETcv15)\n\n
    In\u00a0[18]: Copied!
    harmonized_data\n
    harmonized_data A data.frame: 28 \u00d7 29 SNPeffect_allele.exposureother_allele.exposureeffect_allele.outcomeother_allele.outcomebeta.exposurebeta.outcomeeaf.exposureeaf.outcomeremove\u22efpval.exposurese.exposuresamplesize.exposureexposuremr_keep.exposurepval_origin.exposureid.exposureactionmr_keepsamplesize.outcome <chr><chr><chr><chr><chr><dbl><dbl><dbl><dbl><lgl>\u22ef<dbl><dbl><dbl><chr><lgl><chr><chr><dbl><lgl><lgl> 1rs10198356GAGA 0.044 0.0278218160.4500.46949841FALSE\u22ef1.5e-170.005172282BMITRUEreportedrvi6Om1TRUENA 2rs10209994CACA 0.030 0.0284334240.6400.65770918FALSE\u22ef2.0e-080.005472282BMITRUEreportedrvi6Om1TRUENA 3rs10824329AGAG 0.029 0.0182171190.5100.56240335FALSE\u22ef1.7e-080.005172282BMITRUEreportedrvi6Om1TRUENA 4rs10938397GAGA 0.036 0.0445547360.2800.29915686FALSE\u22ef1.0e-100.005672282BMITRUEreportedrvi6Om1TRUENA 5rs11066132TCTC-0.053-0.0319288060.1600.24197159FALSE\u22ef1.0e-130.007172282BMITRUEreportedrvi6Om1TRUENA 6rs12522139GTGT-0.037-0.0107492430.2700.24543922FALSE\u22ef1.8e-100.005772282BMITRUEreportedrvi6Om1TRUENA 7rs12591730AGAG 0.037 0.0330428120.2200.25367536FALSE\u22ef1.5e-080.006572282BMITRUEreportedrvi6Om1TRUENA 8rs13013021TCTC 0.070 0.1040752230.9070.90195307FALSE\u22ef1.9e-150.008872282BMITRUEreportedrvi6Om1TRUENA 9rs1955337 TGTG 0.036 0.0195935030.3000.24112816FALSE\u22ef7.4e-110.005672282BMITRUEreportedrvi6Om1TRUENA 10rs2076308 CGCG 0.037 0.0413520380.3100.31562874FALSE\u22ef3.4e-110.005572282BMITRUEreportedrvi6Om1TRUENA 11rs2278557 GCGC 0.034 0.0212111960.3200.29052039FALSE\u22ef7.4e-100.005572282BMITRUEreportedrvi6Om1TRUENA 12rs2304608 ACAC 0.031 0.0466695150.4700.44287320FALSE\u22ef1.1e-090.005172282BMITRUEreportedrvi6Om1TRUENA 13rs2531995 TCTC 0.031 0.0433160150.3700.33584772FALSE\u22ef5.2e-090.005372282BMITRUEreportedrvi6Om1TRUENA 14rs261967 CACA 0.032 0.0489708280.4400.39718313FALSE\u22ef3.5e-100.005172282BMITRUEreportedrvi6Om1TRUENA 15rs35332469CTCT-0.035 0.0080755980.2200.17678428FALSE\u22ef3.6e-080.006372282BMITRUEreportedrvi6Om1TRUENA 16rs35560038TATA-0.047 0.0739350890.5900.61936434FALSE\u22ef1.4e-190.005272282BMITRUEreportedrvi6Om1TRUENA 17rs3755804 TCTC 0.043 0.0228541340.2800.30750660FALSE\u22ef1.5e-140.005672282BMITRUEreportedrvi6Om1TRUENA 18rs4470425 ACAC-0.030-0.0208441370.4500.44152032FALSE\u22ef4.9e-090.005172282BMITRUEreportedrvi6Om1TRUENA 19rs476828 CTCT 0.067 0.0786518590.2700.25309742FALSE\u22ef2.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 20rs4883723 AGAG 0.039 0.0213709100.2800.22189601FALSE\u22ef8.3e-120.005772282BMITRUEreportedrvi6Om1TRUENA 21rs509325 GTGT 0.065 0.0356917590.2800.26816326FALSE\u22ef7.8e-310.005772282BMITRUEreportedrvi6Om1TRUENA 22rs55872725TCTC 0.090 0.1215170230.1200.20355108FALSE\u22ef1.8e-310.007772282BMITRUEreportedrvi6Om1TRUENA 23rs6089309 CTCT-0.033-0.0186698330.7000.65803267FALSE\u22ef3.5e-090.005672282BMITRUEreportedrvi6Om1TRUENA 24rs6265 TCTC-0.049-0.0316426960.4600.40541994FALSE\u22ef6.1e-220.005172282BMITRUEreportedrvi6Om1TRUENA 25rs6736712 GCGC-0.053-0.0297168990.9170.93023505FALSE\u22ef2.1e-080.009572282BMITRUEreportedrvi6Om1TRUENA 26rs7560832 CACA-0.150-0.0904811950.0120.01129784FALSE\u22ef2.0e-090.025072282BMITRUEreportedrvi6Om1TRUENA 27rs825486 TCTC-0.031 0.0190735540.6900.75485104FALSE\u22ef3.1e-080.005672282BMITRUEreportedrvi6Om1TRUENA 28rs9348441 ATAT-0.036 0.1792307940.4700.42502848FALSE\u22ef1.3e-120.005172282BMITRUEreportedrvi6Om1TRUENA In\u00a0[6]: Copied!
    res <- mr(harmonized_data)\n
    res <- mr(harmonized_data)
    Analysing 'rvi6Om' on 'hff6sO'\n\n
    In\u00a0[7]: Copied!
    res\n
    res A data.frame: 5 \u00d7 9 id.exposureid.outcomeoutcomeexposuremethodnsnpbsepval <chr><chr><chr><chr><chr><int><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 281.33375800.694852606.596064e-02 rvi6Omhff6sOT2DBMIWeighted median 280.62989800.085163151.399605e-13 rvi6Omhff6sOT2DBMIInverse variance weighted280.55989560.232258061.592361e-02 rvi6Omhff6sOT2DBMISimple mode 280.60978420.133054299.340189e-05 rvi6Omhff6sOT2DBMIWeighted mode 280.59467780.126803557.011481e-05 In\u00a0[8]: Copied!
    mr_heterogeneity(harmonized_data)\n
    mr_heterogeneity(harmonized_data) A data.frame: 2 \u00d7 8 id.exposureid.outcomeoutcomeexposuremethodQQ_dfQ_pval <chr><chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMIMR Egger 670.7022261.000684e-124 rvi6Omhff6sOT2DBMIInverse variance weighted706.6579271.534239e-131 In\u00a0[9]: Copied!
    mr_pleiotropy_test(harmonized_data)\n
    mr_pleiotropy_test(harmonized_data) A data.frame: 1 \u00d7 7 id.exposureid.outcomeoutcomeexposureegger_interceptsepval <chr><chr><chr><chr><dbl><dbl><dbl> rvi6Omhff6sOT2DBMI-0.036036970.03052410.2484472 In\u00a0[10]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\n
    res_single <- mr_singlesnp(harmonized_data) In\u00a0[11]: Copied!
    res_single\n
    res_single A data.frame: 30 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs10198356 0.63231400.20828372.398742e-03 2BMIT2Drvi6Omhff6sONArs10209994 0.94778080.32258143.302164e-03 3BMIT2Drvi6Omhff6sONArs10824329 0.62817650.32462145.297739e-02 4BMIT2Drvi6Omhff6sONArs10938397 1.23763160.27758548.251150e-06 5BMIT2Drvi6Omhff6sONArs11066132 0.60243030.22324016.963693e-03 6BMIT2Drvi6Omhff6sONArs12522139 0.29052010.28902403.148119e-01 7BMIT2Drvi6Omhff6sONArs12591730 0.89304900.30766873.700413e-03 8BMIT2Drvi6Omhff6sONArs13013021 1.48678890.22077771.646925e-11 9BMIT2Drvi6Omhff6sONArs1955337 0.54426400.29941466.910079e-02 10BMIT2Drvi6Omhff6sONArs2076308 1.11762260.26579692.613132e-05 11BMIT2Drvi6Omhff6sONArs2278557 0.62385870.29681843.556906e-02 12BMIT2Drvi6Omhff6sONArs2304608 1.50546820.29689053.961740e-07 13BMIT2Drvi6Omhff6sONArs2531995 1.39729080.31301578.045689e-06 14BMIT2Drvi6Omhff6sONArs261967 1.53033840.29211921.616714e-07 15BMIT2Drvi6Omhff6sONArs35332469 -0.23073140.34792195.072217e-01 16BMIT2Drvi6Omhff6sONArs35560038 -1.57308700.20189686.619637e-15 17BMIT2Drvi6Omhff6sONArs3755804 0.53149150.23250732.225933e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.69480460.30799442.407689e-02 19BMIT2Drvi6Omhff6sONArs476828 1.17390830.15685507.207355e-14 20BMIT2Drvi6Omhff6sONArs4883723 0.54797210.28550045.494141e-02 21BMIT2Drvi6Omhff6sONArs509325 0.54910400.15981965.908641e-04 22BMIT2Drvi6Omhff6sONArs55872725 1.35018910.12597918.419325e-27 23BMIT2Drvi6Omhff6sONArs6089309 0.56575250.33470099.096620e-02 24BMIT2Drvi6Omhff6sONArs6265 0.64576930.19018716.851804e-04 25BMIT2Drvi6Omhff6sONArs6736712 0.56069620.34487841.039966e-01 26BMIT2Drvi6Omhff6sONArs7560832 0.60320800.29049723.785077e-02 27BMIT2Drvi6Omhff6sONArs825486 -0.61527590.35003347.878772e-02 28BMIT2Drvi6Omhff6sONArs9348441 -4.97863320.25727821.992909e-83 29BMIT2Drvi6Omhff6sONAAll - Inverse variance weighted 0.55989560.23225811.592361e-02 30BMIT2Drvi6Omhff6sONAAll - MR Egger 1.33375800.69485266.596064e-02 In\u00a0[12]: Copied!
    res_loo <- mr_leaveoneout(harmonized_data)\nres_loo\n
    res_loo <- mr_leaveoneout(harmonized_data) res_loo A data.frame: 29 \u00d7 9 exposureoutcomeid.exposureid.outcomesamplesizeSNPbsep <chr><chr><chr><chr><lgl><chr><dbl><dbl><dbl> 1BMIT2Drvi6Omhff6sONArs101983560.55628340.24249172.178871e-02 2BMIT2Drvi6Omhff6sONArs102099940.55205760.23881222.079526e-02 3BMIT2Drvi6Omhff6sONArs108243290.55853350.23902391.945341e-02 4BMIT2Drvi6Omhff6sONArs109383970.54126880.23887092.345460e-02 5BMIT2Drvi6Omhff6sONArs110661320.55806060.24172752.096381e-02 6BMIT2Drvi6Omhff6sONArs125221390.56671020.23950641.797373e-02 7BMIT2Drvi6Omhff6sONArs125917300.55248020.23909902.085075e-02 8BMIT2Drvi6Omhff6sONArs130130210.51897150.23868082.968017e-02 9BMIT2Drvi6Omhff6sONArs1955337 0.56026350.23945051.929468e-02 10BMIT2Drvi6Omhff6sONArs2076308 0.54313550.23944032.330758e-02 11BMIT2Drvi6Omhff6sONArs2278557 0.55836340.23949241.972992e-02 12BMIT2Drvi6Omhff6sONArs2304608 0.53725570.23773252.382639e-02 13BMIT2Drvi6Omhff6sONArs2531995 0.54190160.23797122.277590e-02 14BMIT2Drvi6Omhff6sONArs261967 0.53587610.23766862.415093e-02 15BMIT2Drvi6Omhff6sONArs353324690.57359070.23783451.587739e-02 16BMIT2Drvi6Omhff6sONArs355600380.67349060.22178042.391474e-03 17BMIT2Drvi6Omhff6sONArs3755804 0.56102150.24132492.008503e-02 18BMIT2Drvi6Omhff6sONArs4470425 0.55689930.23926321.993549e-02 19BMIT2Drvi6Omhff6sONArs476828 0.50375550.24432243.922224e-02 20BMIT2Drvi6Omhff6sONArs4883723 0.56020500.23973251.945000e-02 21BMIT2Drvi6Omhff6sONArs509325 0.56084290.24685062.308693e-02 22BMIT2Drvi6Omhff6sONArs558727250.44194460.24547717.180543e-02 23BMIT2Drvi6Omhff6sONArs6089309 0.55978590.23889021.911519e-02 24BMIT2Drvi6Omhff6sONArs6265 0.55470680.24369102.282978e-02 25BMIT2Drvi6Omhff6sONArs6736712 0.55988150.23876021.902944e-02 26BMIT2Drvi6Omhff6sONArs7560832 0.55881130.23962291.969836e-02 27BMIT2Drvi6Omhff6sONArs825486 0.58000260.23675451.429330e-02 28BMIT2Drvi6Omhff6sONArs9348441 0.73789670.13668386.717515e-08 29BMIT2Drvi6Omhff6sONAAll 0.55989560.23225811.592361e-02 In\u00a0[29]: Copied!
    harmonized_data$\"r.outcome\" <- get_r_from_lor(\n  harmonized_data$\"beta.outcome\",\n  harmonized_data$\"eaf.outcome\",\n  45383,\n  132032,\n  0.26,\n  model = \"logit\",\n  correction = FALSE\n)\n
    harmonized_data$\"r.outcome\" <- get_r_from_lor( harmonized_data$\"beta.outcome\", harmonized_data$\"eaf.outcome\", 45383, 132032, 0.26, model = \"logit\", correction = FALSE ) In\u00a0[34]: Copied!
    out <- directionality_test(harmonized_data)\nout\n
    out <- directionality_test(harmonized_data) out
    r.exposure and/or r.outcome not present.\n\nCalculating approximate SNP-exposure and/or SNP-outcome correlations, assuming all are quantitative traits. Please pre-calculate r.exposure and/or r.outcome using get_r_from_lor() for any binary traits\n\n
    A data.frame: 1 \u00d7 8 id.exposureid.outcomeexposureoutcomesnp_r2.exposuresnp_r2.outcomecorrect_causal_directionsteiger_pval <chr><chr><chr><chr><dbl><dbl><lgl><dbl> rvi6OmETcv15BMIT2D0.021254530.005496427TRUENA In\u00a0[\u00a0]: Copied!
    res <- mr(harmonized_data)\np1 <- mr_scatter_plot(res, harmonized_data)\np1[[1]]\n
    res <- mr(harmonized_data) p1 <- mr_scatter_plot(res, harmonized_data) p1[[1]] In\u00a0[\u00a0]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\np2 <- mr_forest_plot(res_single)\np2[[1]]\n
    res_single <- mr_singlesnp(harmonized_data) p2 <- mr_forest_plot(res_single) p2[[1]] In\u00a0[\u00a0]: Copied!
    res_loo <- mr_leaveoneout(harmonized_data)\np3 <- mr_leaveoneout_plot(res_loo)\np3[[1]]\n
    res_loo <- mr_leaveoneout(harmonized_data) p3 <- mr_leaveoneout_plot(res_loo) p3[[1]] In\u00a0[\u00a0]: Copied!
    res_single <- mr_singlesnp(harmonized_data)\np4 <- mr_funnel_plot(res_single)\np4[[1]]\n
    res_single <- mr_singlesnp(harmonized_data) p4 <- mr_funnel_plot(res_single) p4[[1]] In\u00a0[\u00a0]: Copied!
    \n
    In\u00a0[\u00a0]: Copied!
    \n
    "},{"location":"Visualization/","title":"Visualization by gwaslab","text":"In\u00a0[2]: Copied!
    import gwaslab as gl\n
    import gwaslab as gl In\u00a0[3]: Copied!
    sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")\n
    sumstats = gl.Sumstats(\"1kgeas.B1.glm.firth\",fmt=\"plink2\")
    Tue Dec 26 15:56:49 2023 GWASLab v3.4.22 https://cloufield.github.io/gwaslab/\nTue Dec 26 15:56:49 2023 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\nTue Dec 26 15:56:49 2023 Start to load format from formatbook....\nTue Dec 26 15:56:49 2023  -plink2 format meta info:\nTue Dec 26 15:56:49 2023   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear\nTue Dec 26 15:56:49 2023   - format_source  : https://www.cog-genomics.org/plink/2.0/formats\nTue Dec 26 15:56:49 2023   - format_version  : Alpha 3.3 final (3 Jun)\nTue Dec 26 15:56:49 2023   - last_check_date  :  20220806\nTue Dec 26 15:56:49 2023  -plink2 to gwaslab format dictionary:\nTue Dec 26 15:56:49 2023   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\nTue Dec 26 15:56:49 2023   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\nTue Dec 26 15:56:49 2023 Start to initiate from file :1kgeas.B1.glm.firth\nTue Dec 26 15:56:50 2023  -Reading columns          : REF,ID,ALT,POS,OR,LOG(OR)_SE,Z_STAT,OBS_CT,A1,#CHROM,P,A1_FREQ\nTue Dec 26 15:56:50 2023  -Renaming columns to      : REF,SNPID,ALT,POS,OR,SE,Z,N,EA,CHR,P,EAF\nTue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  12\nTue Dec 26 15:56:50 2023  -Initiating a status column: STATUS ...\nTue Dec 26 15:56:50 2023  NEA not available: assigning REF to NEA...\nTue Dec 26 15:56:50 2023  -EA,REF and ALT columns are available: assigning NEA...\nTue Dec 26 15:56:50 2023  -For variants with EA == ALT : assigning REF to NEA ...\nTue Dec 26 15:56:50 2023  -For variants with EA != ALT : assigning ALT to NEA ...\nTue Dec 26 15:56:50 2023 Start to reorder the columns...\nTue Dec 26 15:56:50 2023  -Current Dataframe shape : 1128732  x  14\nTue Dec 26 15:56:50 2023  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\nTue Dec 26 15:56:50 2023 Finished sorting columns successfully!\nTue Dec 26 15:56:50 2023  -Column: SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \nTue Dec 26 15:56:50 2023  -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\nTue Dec 26 15:56:50 2023 Finished loading data successfully!\n
    In\u00a0[4]: Copied!
    sumstats.data\n
    sumstats.data Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 0 1:15774:G:A 1 15774 A G 0.028283 NaN NaN NaN NaN 495 9999999 G A 1 1:15777:A:G 1 15777 G A 0.073737 NaN NaN NaN NaN 495 9999999 A G 2 1:57292:C:T 1 57292 T C 0.104675 NaN NaN NaN NaN 492 9999999 C T 3 1:77874:G:A 1 77874 A G 0.019153 0.462750 0.249299 0.803130 1.122280 496 9999999 G A 4 1:87360:C:T 1 87360 T C 0.023139 NaN NaN NaN NaN 497 9999999 C T ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 1128727 22:51217954:G:A 22 51217954 A G 0.033199 NaN NaN NaN NaN 497 9999999 G A 1128728 22:51218377:G:C 22 51218377 C G 0.033333 0.362212 -0.994457 0.320000 0.697534 495 9999999 G C 1128729 22:51218615:T:A 22 51218615 A T 0.033266 0.362476 -1.029230 0.303374 0.688618 496 9999999 T A 1128730 22:51222100:G:T 22 51222100 T G 0.039157 NaN NaN NaN NaN 498 9999999 G T 1128731 22:51239678:G:T 22 51239678 T G 0.034137 NaN NaN NaN NaN 498 9999999 G T

    1128732 rows \u00d7 14 columns

    In\u00a0[5]: Copied!
    sumstats.get_lead(sig_level=5e-8)\n
    sumstats.get_lead(sig_level=5e-8)
    Tue Dec 26 15:56:51 2023 Start to extract lead variants...\nTue Dec 26 15:56:51 2023  -Processing 1128732 variants...\nTue Dec 26 15:56:51 2023  -Significance threshold : 5e-08\nTue Dec 26 15:56:51 2023  -Sliding window size: 500  kb\nTue Dec 26 15:56:51 2023  -Found 43 significant variants in total...\nTue Dec 26 15:56:51 2023  -Identified 4 lead variants!\nTue Dec 26 15:56:51 2023 Finished extracting lead variants successfully!\n
    Out[5]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 54904 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9999999 G A 113179 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9999999 C T 549726 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9999999 T G 1088750 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9999999 T C In\u00a0[9]: Copied!
    sumstats.plot_mqq(skip=2,anno=True)\n
    sumstats.plot_mqq(skip=2,anno=True)
    Tue Dec 26 15:59:17 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:59:17 2023  -Genomic coordinates version: 99...\nTue Dec 26 15:59:17 2023    -WARNING!!! Genomic coordinates version is unknown...\nTue Dec 26 15:59:17 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:59:17 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:59:17 2023  -Plot layout mode is : mqq\nTue Dec 26 15:59:17 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:59:17 2023 Start conversion and sanity check:\nTue Dec 26 15:59:17 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:59:17 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:59:17 2023  -Removed 220793 variants with nan in P column ...\nTue Dec 26 15:59:17 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:59:17 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:59:17 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:59:17 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:59:17 2023 Finished data conversion and sanity check.\nTue Dec 26 15:59:17 2023 Start to create manhattan plot with 6866 variants:\nTue Dec 26 15:59:17 2023  -Found 4 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:17 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:17 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:59:17 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:17 2023 Start to create QQ plot with 6866 variants:\nTue Dec 26 15:59:17 2023 Expected range of P: (0,1.0)\nTue Dec 26 15:59:17 2023  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908\nTue Dec 26 15:59:17 2023 Finished creating QQ plot successfully!\nTue Dec 26 15:59:17 2023  -Skip saving figures!\n
    Out[9]:
    (<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
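    Note that get_lead() above is essentially window-based pruning: among genome-wide significant variants, keep the one with the smallest P within each 500 kb window. A rough pandas approximation of the same idea (a sketch only; gwaslab's actual windowing may differ at boundaries):\nsig = sumstats.data.loc[sumstats.data[\"P\"] < 5e-8].sort_values([\"CHR\", \"POS\"]).copy()\n# start a new locus whenever the chromosome changes or the gap to the previous hit exceeds 500 kb\nlocus_id = ((sig[\"CHR\"].diff() != 0) | (sig[\"POS\"].diff() > 500_000)).cumsum()\nleads = sig.loc[sig.groupby(locus_id)[\"P\"].idxmin()]\nprint(leads[[\"SNPID\", \"CHR\", \"POS\", \"P\"]])\n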
    In\u00a0[6]: Copied!
    sumstats.basic_check()\n
    sumstats.basic_check()
    Tue Dec 27 23:08:13 2022 Start to check IDs...\nTue Dec 27 23:08:13 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:13 2022  -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)\nTue Dec 27 23:08:14 2022 Finished checking IDs successfully!\nTue Dec 27 23:08:14 2022 Start to fix chromosome notation...\nTue Dec 27 23:08:14 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:17 2022  -Vairants with standardized chromosome notation: 1122299\nTue Dec 27 23:08:19 2022  -All CHR are already fixed...\nTue Dec 27 23:08:21 2022 Finished fixing chromosome notation successfully!\nTue Dec 27 23:08:21 2022 Start to fix basepair positions...\nTue Dec 27 23:08:21 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:21 2022  -Converting to Int64 data type ...\nTue Dec 27 23:08:22 2022  -Position upper_bound is: 250,000,000\nTue Dec 27 23:08:24 2022  -Remove outliers: 0\nTue Dec 27 23:08:24 2022  -Converted all position to datatype Int64.\nTue Dec 27 23:08:24 2022 Finished fixing basepair position successfully!\nTue Dec 27 23:08:24 2022 Start to fix alleles...\nTue Dec 27 23:08:24 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:25 2022  -Detected 0 variants with alleles that contain bases other than A/C/T/G .\nTue Dec 27 23:08:25 2022  -Converted all bases to string datatype and UPPERCASE.\nTue Dec 27 23:08:27 2022 Finished fixing allele successfully!\nTue Dec 27 23:08:27 2022 Start sanity check for statistics ...\nTue Dec 27 23:08:27 2022  -Current Dataframe shape : 1122299  x  11\nTue Dec 27 23:08:27 2022  -Checking if  0 <=N<= inf  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad N.\nTue Dec 27 23:08:27 2022  -Checking if  -37.5 <Z< 37.5  ...\nTue Dec 27 23:08:27 2022  -Removed 14 variants with bad Z.\nTue Dec 27 23:08:27 2022  -Checking if  5e-300 <= P <= 1  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad P.\nTue Dec 27 23:08:27 2022  -Checking if  0 <SE< inf  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad SE.\nTue Dec 27 23:08:27 2022  -Checking if  -10 <log(OR)< 10  ...\nTue Dec 27 23:08:27 2022  -Removed 0 variants with bad OR.\nTue Dec 27 23:08:27 2022  -Checking STATUS...\nTue Dec 27 23:08:28 2022  -Coverting STAUTUS to interger.\nTue Dec 27 23:08:28 2022  -Removed 14 variants with bad statistics in total.\nTue Dec 27 23:08:28 2022 Finished sanity check successfully!\nTue Dec 27 23:08:28 2022 Start to normalize variants...\nTue Dec 27 23:08:28 2022  -Current Dataframe shape : 1122285  x  11\nTue Dec 27 23:08:29 2022  -No available variants to normalize..\nTue Dec 27 23:08:29 2022 Finished normalizing variants successfully!\n
    In\u00a0[7]: Copied!
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\")\n#2:55513738\n
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54513738,56513738),region_grid=True,build=\"19\") #2:55513738
    Tue Dec 26 15:58:10 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:10 2023  -Genomic coordinates version: 19...\nTue Dec 26 15:58:10 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:10 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:58:10 2023  -Plot layout mode is : r\nTue Dec 26 15:58:10 2023  -Region to plot : chr2:54513738-56513738.\nTue Dec 26 15:58:10 2023  -Extract SNPs in region : chr2:54513738-56513738...\nTue Dec 26 15:58:10 2023  -Extract SNPs in specified regions: 865\nTue Dec 26 15:58:10 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:10 2023 Start conversion and sanity check:\nTue Dec 26 15:58:10 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:10 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:10 2023  -Removed 160 variants with nan in P column ...\nTue Dec 26 15:58:10 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:10 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:10 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:11 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:11 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:11 2023 Start to create manhattan plot with 705 variants:\nTue Dec 26 15:58:11 2023  -Extracting lead variant...\nTue Dec 26 15:58:11 2023  -Loading gtf files from:default\n
    INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
    Tue Dec 26 15:58:40 2023  -plotting gene track..\nTue Dec 26 15:58:40 2023  -Finished plotting gene track..\nTue Dec 26 15:58:40 2023  -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:58:40 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:58:40 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:58:40 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:58:40 2023  -Skip saving figures!\n
    Out[7]:
    (<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
    In\u00a0[8]: Copied!
    gl.download_ref(\"1kg_eas_hg19\")\n
    gl.download_ref(\"1kg_eas_hg19\")
    Tue Dec 27 22:44:52 2022 Start to download  1kg_eas_hg19  ...\nTue Dec 27 22:44:52 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 27 22:52:33 2022  -Updating record in config file...\nTue Dec 27 22:52:35 2022  -Updating record in config file...\nTue Dec 27 22:52:35 2022  -Downloading to: /home/he/anaconda3/envs/py38/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi\nTue Dec 27 22:52:35 2022 Downloaded  1kg_eas_hg19  successfully!\n
    In\u00a0[8]: Copied!
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")\n
    sumstats.plot_mqq(mode=\"r\",anno=True,region=(2,54531536,56731536),region_grid=True,vcf_path=gl.get_path(\"1kg_eas_hg19\"),build=\"19\")
    Tue Dec 26 15:58:41 2023 Start to plot manhattan/qq plot with the following basic settings:\nTue Dec 26 15:58:41 2023  -Genomic coordinates version: 19...\nTue Dec 26 15:58:41 2023  -Genome-wide significance level is set to 5e-08 ...\nTue Dec 26 15:58:41 2023  -Raw input contains 1128732 variants...\nTue Dec 26 15:58:41 2023  -Plot layout mode is : r\nTue Dec 26 15:58:41 2023  -Region to plot : chr2:54531536-56731536.\nTue Dec 26 15:58:41 2023  -Checking prefix for chromosomes in vcf files...\nTue Dec 26 15:58:41 2023  -No prefix for chromosomes in the VCF files.\nTue Dec 26 15:58:41 2023  -Extract SNPs in region : chr2:54531536-56731536...\nTue Dec 26 15:58:41 2023  -Extract SNPs in specified regions: 967\nTue Dec 26 15:58:41 2023 Finished loading specified columns from the sumstats.\nTue Dec 26 15:58:41 2023 Start conversion and sanity check:\nTue Dec 26 15:58:41 2023  -Removed 0 variants with nan in CHR or POS column ...\nTue Dec 26 15:58:41 2023  -Removed 0 varaints with CHR <=0...\nTue Dec 26 15:58:41 2023  -Removed 172 variants with nan in P column ...\nTue Dec 26 15:58:41 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\nTue Dec 26 15:58:41 2023  -Sumstats P values are being converted to -log10(P)...\nTue Dec 26 15:58:41 2023  -Sanity check: 0 na/inf/-inf variants will be removed...\nTue Dec 26 15:58:41 2023  -Maximum -log10(P) values is 14.772946706439042 .\nTue Dec 26 15:58:41 2023 Finished data conversion and sanity check.\nTue Dec 26 15:58:41 2023 Start to load reference genotype...\nTue Dec 26 15:58:41 2023  -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz\nTue Dec 26 15:58:43 2023  -Retrieving index...\nTue Dec 26 15:58:43 2023  -Ref variants in the region: 71908\nTue Dec 26 15:58:43 2023  -Matching variants using POS, NEA, EA ...\nTue Dec 26 15:58:43 2023  -Calculating Rsq...\nTue Dec 26 15:58:43 2023 Finished loading reference genotype successfully!\nTue Dec 26 15:58:43 2023 Start to create manhattan plot with 795 variants:\nTue Dec 26 15:58:43 2023  -Extracting lead variant...\nTue Dec 26 15:58:44 2023  -Loading gtf files from:default\n
    INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']\n
    Tue Dec 26 15:59:12 2023  -plotting gene track..\nTue Dec 26 15:59:12 2023  -Finished plotting gene track..\nTue Dec 26 15:59:13 2023  -Found 1 significant variants with a sliding window size of 500 kb...\nTue Dec 26 15:59:13 2023 Finished creating Manhattan plot successfully!\nTue Dec 26 15:59:13 2023  -Annotating using column CHR:POS...\nTue Dec 26 15:59:13 2023  -Adjusting text positions with repel_force=0.03...\nTue Dec 26 15:59:13 2023  -Skip saving figures!\n
    Out[8]:
    (<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f55daa2f400>)
    In\u00a0[\u00a0]: Copied!
    \n
    "},{"location":"Visualization/#visualization-by-gwaslab","title":"Visualization by gwaslab\u00b6","text":""},{"location":"Visualization/#import-gwaslab-package","title":"Import gwaslab package\u00b6","text":""},{"location":"Visualization/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"Visualization/#check-the-lead-variants-in-significant-loci","title":"Check the lead variants in significant loci\u00b6","text":""},{"location":"Visualization/#create-mahattan-plot","title":"Create mahattan plot\u00b6","text":""},{"location":"Visualization/#qc-check","title":"QC check\u00b6","text":""},{"location":"Visualization/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"Visualization/#create-regional-plot-with-ld-information","title":"Create regional plot with LD information\u00b6","text":""},{"location":"finemapping_susie/","title":"Finemapping using susieR","text":"In\u00a0[1]: Copied!
    import gwaslab as gl\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n
    import gwaslab as gl import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt In\u00a0[2]: Copied!
    sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")\n
    sumstats = gl.Sumstats(\"./1kgeas.B1.glm.firth.gz\",fmt=\"plink2\")
    2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/\n2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com\n2024/04/18 10:40:48 Start to load format from formatbook....\n2024/04/18 10:40:48  -plink2 format meta info:\n2024/04/18 10:40:48   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear\n2024/04/18 10:40:48   - format_source  : https://www.cog-genomics.org/plink/2.0/formats\n2024/04/18 10:40:48   - format_version  : Alpha 3.3 final (3 Jun)\n2024/04/18 10:40:48   - last_check_date  :  20220806\n2024/04/18 10:40:48  -plink2 to gwaslab format dictionary:\n2024/04/18 10:40:48   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR\n2024/04/18 10:40:48   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR\n2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz\n2024/04/18 10:40:49  -Reading columns          : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID\n2024/04/18 10:40:49  -Renaming columns to      : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID\n2024/04/18 10:40:49  -Current Dataframe shape : 1128732  x  12\n2024/04/18 10:40:49  -Initiating a status column: STATUS ...\n2024/04/18 10:40:49  #WARNING! Version of genomic coordinates is unknown...\n2024/04/18 10:40:49  NEA not available: assigning REF to NEA...\n2024/04/18 10:40:49  -EA,REF and ALT columns are available: assigning NEA...\n2024/04/18 10:40:49  -For variants with EA == ALT : assigning REF to NEA ...\n2024/04/18 10:40:49  -For variants with EA != ALT : assigning ALT to NEA ...\n2024/04/18 10:40:49 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:49  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:49  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:49 Finished reordering the columns.\n2024/04/18 10:40:49  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:40:49  -DType   : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category\n2024/04/18 10:40:49  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       \n2024/04/18 10:40:50  -Current Dataframe memory usage: 106.06 MB\n2024/04/18 10:40:50 Finished loading data successfully!\n
    In\u00a0[3]: Copied!
    sumstats.basic_check()\n
    sumstats.basic_check()
    2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43\n2024/04/18 10:40:50  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:50  -Checking SNPID data type...\n2024/04/18 10:40:50  -Converting SNPID to pd.string data type...\n2024/04/18 10:40:50  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)\n2024/04/18 10:40:51 Finished checking SNPID/rsID.\n2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43\n2024/04/18 10:40:51  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB\n2024/04/18 10:40:51  -Checking CHR data type...\n2024/04/18 10:40:51  -Variants with standardized chromosome notation: 1128732\n2024/04/18 10:40:51  -All CHR are already fixed...\n2024/04/18 10:40:52 Finished fixing chromosome notation (CHR).\n2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43\n2024/04/18 10:40:52  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB\n2024/04/18 10:40:52  -Converting to Int64 data type ...\n2024/04/18 10:40:53  -Position bound:(0 , 250,000,000)\n2024/04/18 10:40:53  -Removed outliers: 0\n2024/04/18 10:40:53 Finished fixing basepair positions (POS).\n2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43\n2024/04/18 10:40:53  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:53  -Converted all bases to string datatype and UPPERCASE.\n2024/04/18 10:40:53  -Variants with bad EA  : 0\n2024/04/18 10:40:54  -Variants with bad NEA : 0\n2024/04/18 10:40:54  -Variants with NA for EA or NEA: 0\n2024/04/18 10:40:54  -Variants with same EA and NEA: 0\n2024/04/18 10:40:54  -Detected 0 variants with alleles that contain bases other than A/C/T/G .\n2024/04/18 10:40:55 Finished fixing alleles (EA and NEA).\n2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43\n2024/04/18 10:40:55  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB\n2024/04/18 10:40:55  -Comparison tolerance for floats: 1e-07\n2024/04/18 10:40:55  -Checking if 0 <= N <= 2147483647 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na N.\n2024/04/18 10:40:55  -Checking if -1e-07 < EAF < 1.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na EAF.\n2024/04/18 10:40:55  -Checking if -9999.0000001 < Z < 9999.0000001 ...\n2024/04/18 10:40:55   -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ...\n2024/04/18 10:40:55   -Examples of invalid values (Z): NA,NA,NA,NA,NA ...\n2024/04/18 10:40:55  -Removed 220793 variants with bad/na Z.\n2024/04/18 10:40:55  -Checking if -1e-07 < P < 1.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na P.\n2024/04/18 10:40:55  -Checking if -1e-07 < SE < inf ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na SE.\n2024/04/18 10:40:55  -Checking if -100.0000001 < OR < 100.0000001 ...\n2024/04/18 10:40:55  -Removed 0 variants with bad/na OR.\n2024/04/18 10:40:55  -Checking STATUS and converting STATUS to categories....\n2024/04/18 10:40:56  -Removed 220793 variants with bad statistics in total.\n2024/04/18 10:40:56  -Data types for each column:\n2024/04/18 10:40:56  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:40:56  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:40:56  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        
T        T       \n2024/04/18 10:40:56 Finished sanity check for statistics.\n2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56  -Tolerance: 0.001 (Relative) and 0.001 (Absolute)\n2024/04/18 10:40:56  -No availalbe columns for data consistency checking...Skipping...\n2024/04/18 10:40:56 Finished checking data consistency across columns.\n2024/04/18 10:40:56 Start to normalize indels...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56  -No available variants to normalize..\n2024/04/18 10:40:56 Finished normalizing variants successfully!\n2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB\n2024/04/18 10:40:56 Finished sorting coordinates.\n2024/04/18 10:40:56 Start to reorder the columns...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:40:56 Finished reordering the columns.\n

    Note: 220793 variants were removed due to NA Z values. This is due to FIRTH_CONVERGE_FAIL when performing GWAS using PLINK2 (the Firth logistic regression did not converge for these variants, so no test statistic was reported).
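    To verify, the variants lacking a test statistic can be counted directly from the raw PLINK2 output (a minimal sketch, assuming the Z_STAT column shown in the loading log above):\nimport pandas as pd\nraw = pd.read_csv(\"./1kgeas.B1.glm.firth.gz\", sep=\"\\t\")\nprint(raw[\"Z_STAT\"].isna().sum(), \"variants with NA Z_STAT\")\n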

    In\u00a0[4]: Copied!
    sumstats.get_lead()\n
    sumstats.get_lead()
    2024/04/18 10:40:56 Start to extract lead variants...v3.4.43\n2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB\n2024/04/18 10:40:56  -Processing 907939 variants...\n2024/04/18 10:40:56  -Significance threshold : 5e-08\n2024/04/18 10:40:56  -Sliding window size: 500  kb\n2024/04/18 10:40:56  -Using P for extracting lead variants...\n2024/04/18 10:40:56  -Found 43 significant variants in total...\n2024/04/18 10:40:56  -Identified 4 lead variants!\n2024/04/18 10:40:56 Finished extracting lead variants.\n
    Out[4]: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 44298 1:167562605:G:A 1 167562605 A G 0.391481 0.159645 7.69462 1.419150e-14 3.415780 493 9960099 G A 91266 2:55513738:C:T 2 55513738 C T 0.376008 0.153159 -7.96244 1.686760e-15 0.295373 496 9960099 C T 442239 7:134368632:T:G 7 134368632 G T 0.138105 0.225526 6.89025 5.569440e-12 4.730010 496 9960099 T G 875859 20:42758834:T:C 20 42758834 T C 0.227273 0.184323 -7.76902 7.909780e-15 0.238829 495 9960099 T C In\u00a0[5]: Copied!
    sumstats.plot_mqq()\n
    sumstats.plot_mqq()
    2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43:\n2024/04/18 10:40:57  -Genomic coordinates version: 99...\n2024/04/18 10:40:57  #WARNING! Genomic coordinates version is unknown.\n2024/04/18 10:40:57  -Genome-wide significance level to plot is set to 5e-08 ...\n2024/04/18 10:40:57  -Raw input contains 907939 variants...\n2024/04/18 10:40:57  -MQQ plot layout mode is : mqq\n2024/04/18 10:40:57 Finished loading specified columns from the sumstats.\n2024/04/18 10:40:57 Start data conversion and sanity check:\n2024/04/18 10:40:57  -Removed 0 variants with nan in CHR or POS column ...\n2024/04/18 10:40:57  -Removed 0 variants with CHR <=0...\n2024/04/18 10:40:57  -Removed 0 variants with nan in P column ...\n2024/04/18 10:40:57  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...\n2024/04/18 10:40:57  -Sumstats P values are being converted to -log10(P)...\n2024/04/18 10:40:57  -Sanity check: 0 na/inf/-inf variants will be removed...\n2024/04/18 10:40:57  -Converting data above cut line...\n2024/04/18 10:40:57  -Maximum -log10(P) value is 14.772946706439042 .\n2024/04/18 10:40:57 Finished data conversion and sanity check.\n2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants...\n2024/04/18 10:40:58  -Creating background plot...\n2024/04/18 10:40:59 Finished creating MQQ plot successfully!\n2024/04/18 10:40:59 Start to extract variants for annotation...\n2024/04/18 10:40:59  -Found 4 significant variants with a sliding window size of 500 kb...\n2024/04/18 10:40:59 Finished extracting variants for annotation...\n2024/04/18 10:40:59 Start to process figure arts.\n2024/04/18 10:40:59  -Processing X ticks...\n2024/04/18 10:40:59  -Processing X labels...\n2024/04/18 10:40:59  -Processing Y labels...\n2024/04/18 10:40:59  -Processing Y tick lables...\n2024/04/18 10:40:59  -Processing Y labels...\n2024/04/18 10:40:59  -Processing lines...\n2024/04/18 10:40:59 Finished processing figure arts.\n2024/04/18 10:40:59 Start to annotate variants...\n2024/04/18 10:40:59  -Skip annotating\n2024/04/18 10:40:59 Finished annotating variants.\n2024/04/18 10:40:59 Start to create QQ plot with 907939 variants:\n2024/04/18 10:40:59  -Plotting all variants...\n2024/04/18 10:40:59  -Expected range of P: (0,1.0)\n2024/04/18 10:40:59  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908\n2024/04/18 10:40:59  -Processing Y tick lables...\n2024/04/18 10:40:59 Finished creating QQ plot successfully!\n2024/04/18 10:40:59 Start to save figure...\n2024/04/18 10:40:59  -Skip saving figure!\n2024/04/18 10:40:59 Finished saving figure...\n2024/04/18 10:40:59 Finished creating plot successfully!\n
    Out[5]:
    (<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)
    In\u00a0[6]: Copied!
    locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')\n
    locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
    2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06  -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738\n2024/04/18 10:41:06 Finished filtering values.\n
    In\u00a0[7]: Copied!
    locus.fill_data(to_fill=[\"BETA\"])\n
    locus.fill_data(to_fill=[\"BETA\"])
    2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43\n2024/04/18 10:41:06  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     \n2024/04/18 10:41:06  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category\n2024/04/18 10:41:06  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       \n2024/04/18 10:41:06  -Overwrite mode:  False\n2024/04/18 10:41:06   -Skipping columns:  []\n2024/04/18 10:41:06  -Filling columns:  ['BETA']\n2024/04/18 10:41:06   - Filling Columns iteratively...\n2024/04/18 10:41:06   - Filling BETA value using OR column...\n2024/04/18 10:41:06 Finished filling data using existing columns.\n2024/04/18 10:41:06 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:06  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:06  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:06 Finished reordering the columns.\n
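    fill_data() derived BETA from the OR column; numerically this is just the natural log of the odds ratio, which can be checked directly (a sketch):\nimport numpy as np\n# BETA filled from OR should equal log(OR)\nassert np.allclose(np.log(locus.data[\"OR\"]), locus.data[\"BETA\"], atol=1e-4)\n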
    In\u00a0[8]: Copied!
    locus.data\n
    locus.data Out[8]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 91067 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960099 A T 91068 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960099 G A 91069 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960099 G A 91070 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960099 A C 91071 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960099 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 91441 2:56004219:G:T 2 56004219 G T 0.171717 0.148489 0.169557 0.875763 0.381159 1.160080 495 9960099 G T 91442 2:56007034:T:C 2 56007034 T C 0.260121 0.073325 0.145565 0.503737 0.614446 1.076080 494 9960099 T C 91443 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960099 C G 91444 2:56009480:A:T 2 56009480 A T 0.157258 0.135667 0.177621 0.763784 0.444996 1.145300 496 9960099 A T 91445 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960099 C T

    379 rows \u00d7 15 columns

    In\u00a0[9]: Copied!
    locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")\n
    locus.harmonize(basic_check=False, ref_seq=\"/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\")
    2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43\n2024/04/18 10:41:07  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB\n2024/04/18 10:41:07  -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta\n2024/04/18 10:41:07  -Loading fasta records:2  \n2024/04/18 10:41:19  -Checking records\n2024/04/18 10:41:19    -Building numpy fasta records from dict\n2024/04/18 10:41:20    -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )\n2024/04/18 10:41:20    -Checking records for ( len(NEA) > 4 or len(EA) > 4 )\n2024/04/18 10:41:20  -Finished checking records\n2024/04/18 10:41:20  -Variants allele on given reference sequence :  264\n2024/04/18 10:41:20  -Variants flipped :  115\n2024/04/18 10:41:20   -Raw Matching rate :  100.00%\n2024/04/18 10:41:20  -Variants inferred reverse_complement :  0\n2024/04/18 10:41:20  -Variants inferred reverse_complement_flipped :  0\n2024/04/18 10:41:20  -Both allele on genome + unable to distinguish :  0\n2024/04/18 10:41:20  -Variants not on given reference sequence :  0\n2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence.\n2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43\n2024/04/18 10:41:20  -Flipping 115 variants...\n2024/04/18 10:41:20  -Swapping column: NEA <=> EA...\n2024/04/18 10:41:20  -Flipping column: BETA = - BETA...\n2024/04/18 10:41:20  -Flipping column: Z = - Z...\n2024/04/18 10:41:20  -Flipping column: EAF = 1 - EAF...\n2024/04/18 10:41:20  -Flipping column: OR = 1 / OR...\n2024/04/18 10:41:20  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x\n2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code.\n2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB\n2024/04/18 10:41:20 Finished sorting coordinates.\n2024/04/18 10:41:20 Start to reorder the columns...v3.4.43\n2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB\n2024/04/18 10:41:20  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT\n2024/04/18 10:41:20 Finished reordering the columns.\n
    Out[9]:
    <gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>
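    For reference, the allele flipping reported in the log above amounts to the following arithmetic (a sketch of the rule, not gwaslab's own code):\ndef flip_stats(beta, z, eaf, odds_ratio):\n    \"\"\"Swap EA/NEA for one variant: negate BETA and Z, complement EAF, invert OR.\"\"\"\n    return -beta, -z, 1.0 - eaf, 1.0 / odds_ratio\n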
    In\u00a0[10]: Copied!
    locus.data\n
    locus.data Out[10]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T

    379 rows \u00d7 15 columns

    In\u00a0[11]: Copied!
    locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None)\nlocus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None)\n
    locus.data.to_csv(\"sig_locus.tsv\",sep=\"\\t\",index=None) locus.data[\"SNPID\"].to_csv(\"sig_locus.snplist\",sep=\"\\t\",index=None,header=None) In\u00a0[12]: Copied!
    !plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract sig_locus.snplist \\\n  --out sig_locus_mt_r2\n
    !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r square \\ --extract sig_locus.snplist \\ --out sig_locus_mt !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract sig_locus.snplist \\ --out sig_locus_mt_r2
    PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to sig_locus_mt.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract sig_locus.snplist\n  --keep-allele-order\n  --out sig_locus_mt\n  --r square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r square to sig_locus_mt.ld ... done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to sig_locus_mt_r2.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract sig_locus.snplist\n  --keep-allele-order\n  --out sig_locus_mt_r2\n  --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to sig_locus_mt_r2.nosex .\n--extract: 379 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... done.\nTotal genotyping rate is 0.992472.\n379 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to sig_locus_mt_r2.ld ... done.\n
    In\u00a0[13]: Copied!
    import rpy2\nimport rpy2.robjects as ro\nfrom rpy2.robjects.packages import importr\nimport rpy2.robjects.numpy2ri as numpy2ri\nnumpy2ri.activate()\n
    import rpy2 import rpy2.robjects as ro from rpy2.robjects.packages import importr import rpy2.robjects.numpy2ri as numpy2ri numpy2ri.activate()
    INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY\nINFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R\nINFO:rpy2.situation:R library path: \nINFO:rpy2.situation:LD_LIBRARY_PATH: \nINFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save\nINFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.\n
    In\u00a0[14]: Copied!
    df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\")\ndf\n
    df = pd.read_csv(\"sig_locus.tsv\",sep=\"\\t\") df Out[14]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT 0 2:55015281:A:T 2 55015281 T A 0.126263 -0.048075 0.193967 -0.247856 0.804246 0.953062 495 9960009 A T 1 2:55015604:G:A 2 55015604 A G 0.119192 -0.047357 0.195199 -0.242606 0.808311 0.953747 495 9960009 G A 2 2:55015764:G:A 2 55015764 A G 0.339394 0.028986 0.135064 0.214575 0.830098 1.029410 495 9960009 G A 3 2:55016143:A:C 2 55016143 C A 0.126263 0.004659 0.195728 0.023784 0.981025 1.004670 495 9960009 A C 4 2:55017199:T:C 2 55017199 C T 0.093306 0.268767 0.219657 1.223580 0.221112 1.308350 493 9960009 T C ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 374 2:56004219:G:T 2 56004219 T G 0.828283 -0.148489 0.169557 -0.875763 0.381159 0.862010 495 9960019 G T 375 2:56007034:T:C 2 56007034 C T 0.739879 -0.073325 0.145565 -0.503737 0.614446 0.929299 494 9960019 T C 376 2:56008984:C:G 2 56008984 G C 0.013185 0.205883 0.547226 0.376227 0.706748 1.228610 493 9960009 C G 377 2:56009480:A:T 2 56009480 T A 0.842742 -0.135667 0.177621 -0.763784 0.444996 0.873134 496 9960019 A T 378 2:56010434:C:T 2 56010434 T C 0.017172 0.300305 0.491815 0.610604 0.541462 1.350270 495 9960009 C T

    379 rows \u00d7 15 columns

    In\u00a0[15]: Copied!
# import the susieR R package as a Python object via rpy2\nsusieR = importr('susieR')\n
# import the susieR R package as a Python object via rpy2 susieR = importr('susieR') In\u00a0[16]: Copied!
    # convert pd.DataFrame to numpy\nld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None)\nR_df = ld.values\nld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None)\nR_df2 = ld2.values\n
    # convert pd.DataFrame to numpy ld = pd.read_csv(\"sig_locus_mt.ld\",sep=\"\\t\",header=None) R_df = ld.values ld2 = pd.read_csv(\"sig_locus_mt_r2.ld\",sep=\"\\t\",header=None) R_df2 = ld2.values In\u00a0[17]: Copied!
    R_df\n
    R_df Out[17]:
    array([[ 1.00000e+00,  9.58562e-01, -3.08678e-01, ...,  1.96204e-02,\n        -3.54602e-04, -7.14868e-03],\n       [ 9.58562e-01,  1.00000e+00, -2.97617e-01, ...,  2.47755e-02,\n        -1.49234e-02, -7.00509e-03],\n       [-3.08678e-01, -2.97617e-01,  1.00000e+00, ..., -3.49335e-02,\n        -1.37163e-02, -2.12828e-02],\n       ...,\n       [ 1.96204e-02,  2.47755e-02, -3.49335e-02, ...,  1.00000e+00,\n         5.26193e-02, -3.09069e-02],\n       [-3.54602e-04, -1.49234e-02, -1.37163e-02, ...,  5.26193e-02,\n         1.00000e+00, -3.01142e-01],\n       [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,\n        -3.01142e-01,  1.00000e+00]])
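susie_rss assumes that row i of the summary statistics corresponds to row/column i of R. PLINK writes the square matrix in the order the extracted variants appear in the .bim file (genomic position order here), which matches the position-sorted sumstats; a minimal guard is a dimension check (sketch):

```python
# the LD matrix and the locus sumstats must describe the same variants, in the same order
assert R_df.shape == (len(df), len(df))
```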
    In\u00a0[18]: Copied!
# draw the r and r2 heatmaps side by side; plt.subplots creates the figure itself\nfig, ax = plt.subplots(ncols=2,figsize=(20,10))\nsns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0])\nsns.heatmap(data=R_df2,ax=ax[1])\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\n
fig, ax = plt.subplots(ncols=2,figsize=(20,10)) sns.heatmap(data=R_df,cmap=\"Spectral\",ax=ax[0]) sns.heatmap(data=R_df2,ax=ax[1]) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[18]:
    Text(0.5, 1.0, 'LD r2 matrix')

    https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#fine-mapping-with-susier-using-summary-statistics
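As described in the susieR documentation linked above, susie_rss essentially operates on z-scores: passing bhat and shat (together with n) is equivalent to passing z = bhat/shat. A quick check that this matches the Z column already in our sumstats (sketch):

```python
import numpy as np

# z-scores as susie_rss derives them from effect sizes and standard errors
z = df["BETA"].values / df["SE"].values
assert np.allclose(z, df["Z"].values, atol=1e-4)
```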

    In\u00a0[19]: Copied!
ro.r('set.seed(123)')\nfit = susieR.susie_rss(\n    bhat = df[\"BETA\"].values.reshape((len(R_df),1)),\n    shat = df[\"SE\"].values.reshape((len(R_df),1)),\n    R = R_df,  # signed LD (r) matrix in the same variant order as the sumstats\n    L = 10,    # maximum number of causal signals to search for\n    n = 503    # GWAS sample size\n)\n
    ro.r('set.seed(123)') fit = susieR.susie_rss( bhat = df[\"BETA\"].values.reshape((len(R_df),1)), shat = df[\"SE\"].values.reshape((len(R_df),1)), R = R_df, L = 10, n = 503 ) In\u00a0[20]: Copied!
    # show the results of susie_get_cs\nprint(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\n
    # show the results of susie_get_cs print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
    $L1\n[1] 200 218 221 224\n\n\n

We found one credible set (L1) here, containing four variants.
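Besides the member indices, susie_get_cs returns further components that can be inspected from Python through rpy2 (a sketch; component names follow the susieR documentation and may vary by version):

```python
# the return value is an R list; look at its components
res = susieR.susie_get_cs(fit, coverage=0.95, min_abs_corr=0.5, Xcorr=R_df)
print(list(res.names))    # typically includes 'cs', 'purity', 'cs_index', ...
print(res.rx2('purity'))  # within-set min/mean/median absolute correlation
```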

    In\u00a0[21]: Copied!
# add the credible-set membership and PIP to the dataframe for plotting\ndf[\"cs\"] = 0\nn_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])\nfor i in range(n_cs):\n    cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]\n    # R returns 1-based indices; shift by -1 for the 0-based pandas index\n    df.loc[np.array(cs_index)-1,\"cs\"] = i + 1\ndf[\"pip\"] = np.array(susieR.susie_get_pip(fit))\n
    # add the information to dataframe for plotting df[\"cs\"] = 0 n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0]) for i in range(n_cs): cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i] df.loc[np.array(cs_index)-1,\"cs\"] = i + 1 df[\"pip\"] = np.array(susieR.susie_get_pip(fit)) In\u00a0[22]: Copied!
    fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))\ndf[\"MLOG10P\"] = -np.log10(df[\"P\"])\ncol_to_plot = \"MLOG10P\"\np=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot],\n           marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\naxes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\n           marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[0].set_xlabel(\"position\")\naxes[0].set_xlim((55400000, 55800000))\naxes[0].set_ylabel(col_to_plot)\naxes[0].legend()\n\np=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2)\n\naxes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"],\n           marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\")\n\nplt.colorbar( p , label=\"Rsq with the lead variant\")\naxes[1].set_xlabel(\"position\")\naxes[1].set_xlim((55400000, 55800000))\naxes[1].set_ylabel(\"PIP\")\naxes[1].legend()\n
    fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1)) df[\"MLOG10P\"] = -np.log10(df[\"P\"]) col_to_plot = \"MLOG10P\" p=axes[0].scatter(df[\"POS\"],df[col_to_plot],c=ld[df[\"P\"].idxmin()]**2) axes[0].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,col_to_plot], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot], marker='x',s=40,c=\"red\",edgecolors='black',label=\"Causal\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[0].set_xlabel(\"position\") axes[0].set_xlim((55400000, 55800000)) axes[0].set_ylabel(col_to_plot) axes[0].legend() p=axes[1].scatter(df[\"POS\"],df[\"pip\"],c=ld[df[\"P\"].idxmin()]**2) axes[1].scatter(df.loc[df[\"cs\"]==1,\"POS\"],df.loc[df[\"cs\"]==1,\"pip\"], marker='o',s=40,c=\"None\",edgecolors='black',label=\"Variants in credible set 1\") plt.colorbar( p , label=\"Rsq with the lead variant\") axes[1].set_xlabel(\"position\") axes[1].set_xlim((55400000, 55800000)) axes[1].set_ylabel(\"PIP\") axes[1].legend()
    /tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x').  Matplotlib is ignoring the edgecolor in favor of the facecolor.  This behavior may change in the future.\n  axes[0].scatter(df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),\"POS\"],df.loc[(df[\"CHR\"]==2)&(df[\"POS\"]==55620927),col_to_plot],\n
    Out[22]:
    <matplotlib.legend.Legend at 0x7fa6a330d5e0>

The causal variant used in the simulation is actually 2:55620927:G:A, but it was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. As a result, the credible set we identified does not include the bona fide causal variant.
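We can confirm directly that the simulated causal variant is absent from the sumstats used for fine-mapping:

```python
# 2:55620927:G:A was dropped before fine-mapping, so this lookup returns an empty DataFrame
df[(df["CHR"] == 2) & (df["POS"] == 55620927)]
```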

Let's now check the variants in the credible set.

    In\u00a0[23]: Copied!
    df.loc[np.array(cs_index)-1,:]\n
    df.loc[np.array(cs_index)-1,:] Out[23]: SNPID CHR POS EA NEA EAF BETA SE Z P OR N STATUS REF ALT cs pip MLOG10P 199 2:55513738:C:T 2 55513738 T C 0.623992 1.219516 0.153159 7.96244 1.686760e-15 3.385550 496 9960019 C T 1 0.325435 14.772947 217 2:55605943:A:G 2 55605943 G A 0.685484 1.321987 0.166688 7.93089 2.175840e-15 3.750867 496 9960019 A G 1 0.267953 14.662373 220 2:55612986:G:C 2 55612986 C G 0.685223 1.302133 0.166154 7.83691 4.617840e-15 3.677133 494 9960019 G C 1 0.150449 14.335561 223 2:55622624:G:A 2 55622624 A G 0.688508 1.324109 0.167119 7.92315 2.315640e-15 3.758833 496 9960019 G A 1 0.255449 14.635329 In\u00a0[24]: Copied!
    !echo \"2:55513738:C:T\" > credible.snplist\n!echo \"2:55605943:A:G\" >> credible.snplist\n!echo \"2:55612986:G:C\" >> credible.snplist\n!echo \"2:55620927:G:A\" >> credible.snplist\n!echo \"2:55622624:G:A\" >> credible.snplist\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract credible.snplist \\\n  --out credible_r\n\n!plink \\\n  --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\\n  --keep-allele-order \\\n  --r2 square \\\n  --extract credible.snplist \\\n  --out credible_r2\n
    !echo \"2:55513738:C:T\" > credible.snplist !echo \"2:55605943:A:G\" >> credible.snplist !echo \"2:55612986:G:C\" >> credible.snplist !echo \"2:55620927:G:A\" >> credible.snplist !echo \"2:55622624:G:A\" >> credible.snplist !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r !plink \\ --bfile \"../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\" \\ --keep-allele-order \\ --r2 square \\ --extract credible.snplist \\ --out credible_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to credible_r.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract credible.snplist\n  --keep-allele-order\n  --out credible_r\n  --r square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r square to credible_r.ld ... done.\nPLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/\n(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3\nLogging to credible_r2.log.\nOptions in effect:\n  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing\n  --extract credible.snplist\n  --keep-allele-order\n  --out credible_r2\n  --r2 square\n\n31934 MB RAM detected; reserving 15967 MB for main workspace.\n1235116 variants loaded from .bim file.\n504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.\nAmbiguous sex IDs written to credible_r2.nosex .\n--extract: 5 variants remaining.\nUsing up to 19 threads (change this with --threads).\nBefore main variant filters, 504 founders and 0 nonfounders present.\nCalculating allele frequencies... done.\nTotal genotyping rate is 0.995635.\n5 variants and 504 people pass filters and QC.\nNote: No phenotypes present.\n--r2 square to credible_r2.ld ... done.\n
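These .ld files again carry no labels; rows and columns follow the order of the five variants in the .bim file (genomic position order), which is why the next cell attaches the SNP IDs manually. A minimal check (sketch):

```python
import numpy as np

m = np.loadtxt("credible_r2.ld")
assert m.shape == (5, 5)              # five variants in credible.snplist
assert np.allclose(np.diag(m), 1.0)   # unit diagonal
```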
    In\u00a0[25]: Copied!
    credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"]\nld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None)\nld.columns=credible_snplist\nld.index=credible_snplist\nld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None)\nld2.columns=credible_snplist\nld2.index=credible_snplist\n
    credible_snplist=[\"2:55513738:C:T\",\"2:55605943:A:G\", \"2:55612986:G:C\", \"2:55620927:G:A\", \"2:55622624:G:A\"] ld = pd.read_csv(\"credible_r.ld\",sep=\"\\t\",header=None) ld.columns=credible_snplist ld.index=credible_snplist ld2 = pd.read_csv(\"credible_r2.ld\",sep=\"\\t\",header=None) ld2.columns=credible_snplist ld2.index=credible_snplist In\u00a0[26]: Copied!
# labelled heatmaps: signed r (diverging palette centered at 0) and r2 (0 to 1)\nfig, ax = plt.subplots(ncols=2,figsize=(20,10))\nsns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0)\nsns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1)\nax[0].set_title(\"LD r matrix\")\nax[1].set_title(\"LD r2 matrix\")\n
fig, ax = plt.subplots(ncols=2,figsize=(20,10)) sns.heatmap(data=ld, cmap=\"Spectral_r\",ax=ax[0],center=0) sns.heatmap(data=ld2,cmap=\"Spectral_r\",ax=ax[1],vmin=0,vmax=1) ax[0].set_title(\"LD r matrix\") ax[1].set_title(\"LD r2 matrix\") Out[26]:
    Text(0.5, 1.0, 'LD r2 matrix')

    Variants in the credible set are in strong LD with the bona fide causal variant.
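For example, the r2 between the causal variant and each credible-set member can be read straight off the labelled matrix:

```python
# r2 of the true causal variant with the four credible-set members
causal = "2:55620927:G:A"
print(ld2.loc[causal].drop(causal).sort_values(ascending=False))
```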

This can also happen in real-world analyses: when the true causal variant is missing from the summary statistics or the LD reference, the credible set can only contain its LD proxies. Always be cautious when interpreting fine-mapping results.

    "},{"location":"finemapping_susie/#finemapping-using-susier","title":"Finemapping using susieR\u00b6","text":""},{"location":"finemapping_susie/#data-preparation","title":"Data preparation\u00b6","text":""},{"location":"finemapping_susie/#load-sumstats","title":"Load sumstats\u00b6","text":""},{"location":"finemapping_susie/#data-standardization-and-sanity-check","title":"Data standardization and sanity check\u00b6","text":""},{"location":"finemapping_susie/#extract-lead-variants","title":"Extract lead variants\u00b6","text":""},{"location":"finemapping_susie/#create-manhattan-plot-for-checking","title":"Create manhattan plot for checking\u00b6","text":""},{"location":"finemapping_susie/#extract-the-variants-around-255513738ct-for-finemapping","title":"Extract the variants around 2:55513738:C:T for finemapping\u00b6","text":""},{"location":"finemapping_susie/#convert-or-to-beta","title":"Convert OR to BETA\u00b6","text":""},{"location":"finemapping_susie/#align-nea-with-reference-sequence","title":"Align NEA with reference sequence\u00b6","text":""},{"location":"finemapping_susie/#output-the-sumstats-of-this-locus","title":"Output the sumstats of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-plink-to-get-ld-matrix-for-this-locus","title":"Run PLINK to get LD matrix for this locus\u00b6","text":""},{"location":"finemapping_susie/#finemapping","title":"Finemapping\u00b6","text":""},{"location":"finemapping_susie/#load-locus-sumstats","title":"Load locus sumstats\u00b6","text":""},{"location":"finemapping_susie/#import-sumsier","title":"Import sumsieR\u00b6","text":""},{"location":"finemapping_susie/#load-ld-matrix","title":"Load LD matrix\u00b6","text":""},{"location":"finemapping_susie/#visualize-the-ld-structure-of-this-locus","title":"Visualize the LD structure of this locus\u00b6","text":""},{"location":"finemapping_susie/#run-finemapping-use-susier","title":"Run finemapping use susieR\u00b6","text":""},{"location":"finemapping_susie/#extract-credible-sets-and-pip","title":"Extract credible sets and PIP\u00b6","text":""},{"location":"finemapping_susie/#create-regional-plot","title":"Create regional plot\u00b6","text":""},{"location":"finemapping_susie/#pitfalls","title":"Pitfalls\u00b6","text":""},{"location":"finemapping_susie/#check-ld-of-the-causal-variant-and-variants-in-the-credible-set","title":"Check LD of the causal variant and variants in the credible set\u00b6","text":""},{"location":"finemapping_susie/#load-ld-and-plot","title":"Load LD and plot\u00b6","text":""},{"location":"plot_PCA/","title":"Plotting PCA","text":"In\u00a0[1]: Copied!
    import pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n
    import pandas as pd import matplotlib.pyplot as plt import seaborn as sns In\u00a0[2]: Copied!
    pca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\")\npca\n
    pca = pd.read_table(\"../05_PCA/plink_results_projected.sscore\",sep=\"\\t\") pca Out[2]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752

    500 rows \u00d7 14 columns

    In\u00a0[6]: Copied!
    ped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\")\nped\n
    ped = pd.read_table(\"../01_Dataset/integrated_call_samples_v3.20130502.ALL.panel\",sep=\"\\t\") ped Out[6]: sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00096 GBR EUR male NaN NaN 1 HG00097 GBR EUR female NaN NaN 2 HG00099 GBR EUR female NaN NaN 3 HG00100 GBR EUR female NaN NaN 4 HG00101 GBR EUR male NaN NaN ... ... ... ... ... ... ... 2499 NA21137 GIH SAS female NaN NaN 2500 NA21141 GIH SAS female NaN NaN 2501 NA21142 GIH SAS female NaN NaN 2502 NA21143 GIH SAS female NaN NaN 2503 NA21144 GIH SAS female NaN NaN

    2504 rows \u00d7 6 columns

    In\u00a0[7]: Copied!
    pcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\")\npcaped\n
    pcaped=pd.merge(pca,ped,right_on=\"sample\",left_on=\"IID\",how=\"inner\") pcaped Out[7]: #FID IID ALLELE_CT NAMED_ALLELE_DOSAGE_SUM PC1_AVG PC2_AVG PC3_AVG PC4_AVG PC5_AVG PC6_AVG PC7_AVG PC8_AVG PC9_AVG PC10_AVG sample pop super_pop gender Unnamed: 4 Unnamed: 5 0 HG00403 HG00403 390256 390256 0.002903 -0.024865 0.010041 0.009576 0.006943 -0.002223 0.008223 -0.001149 0.003352 0.004375 HG00403 CHS EAS male NaN NaN 1 HG00404 HG00404 390696 390696 -0.000141 -0.027965 0.025389 -0.005825 -0.002747 0.006585 0.011380 0.007777 0.015998 0.017893 HG00404 CHS EAS female NaN NaN 2 HG00406 HG00406 388524 388524 0.007074 -0.031545 -0.004370 -0.001262 -0.011493 -0.005395 -0.006202 0.004524 -0.000871 -0.002280 HG00406 CHS EAS male NaN NaN 3 HG00407 HG00407 388808 388808 0.006840 -0.025073 -0.006527 0.006797 -0.011600 -0.010233 0.013957 0.006187 0.013806 0.008253 HG00407 CHS EAS female NaN NaN 4 HG00409 HG00409 391646 391646 0.000399 -0.029033 -0.018935 -0.001360 0.029044 0.009428 -0.017119 -0.012964 0.025360 0.022907 HG00409 CHS EAS male NaN NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 495 NA19087 NA19087 390232 390232 -0.082261 0.033163 0.045499 -0.011398 0.000027 -0.006525 0.012446 -0.006743 -0.016312 0.023022 NA19087 JPT EAS female NaN NaN 496 NA19088 NA19088 391510 391510 -0.087183 0.043433 0.040188 0.003610 -0.000165 0.002317 0.000117 0.007430 -0.011886 0.007730 NA19088 JPT EAS male NaN NaN 497 NA19089 NA19089 391462 391462 -0.084082 0.036118 -0.036355 0.008738 -0.037523 0.004110 0.008653 -0.000563 -0.001599 0.015941 NA19089 JPT EAS male NaN NaN 498 NA19090 NA19090 392880 392880 -0.073580 0.026163 -0.032193 0.006599 -0.039060 0.000687 0.012213 -0.000485 -0.000336 -0.031283 NA19090 JPT EAS female NaN NaN 499 NA19091 NA19091 389664 389664 -0.081632 0.041455 -0.032200 0.003717 -0.046712 0.015191 0.003119 -0.004906 -0.001811 -0.020752 NA19091 JPT EAS male NaN NaN

    500 rows \u00d7 20 columns

    In\u00a0[8]: Copied!
    plt.figure(figsize=(10,10))\nsns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50)\n
    plt.figure(figsize=(10,10)) sns.scatterplot(data=pcaped,x=\"PC1_AVG\",y=\"PC2_AVG\",hue=\"pop\",s=50) Out[8]:
    <Axes: xlabel='PC1_AVG', ylabel='PC2_AVG'>
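The same merged table can be used to inspect higher PCs, which often capture finer-scale structure (a sketch reusing the columns above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# compare the first two PC pairs side by side
fig, axes = plt.subplots(ncols=2, figsize=(20, 10))
sns.scatterplot(data=pcaped, x="PC1_AVG", y="PC2_AVG", hue="pop", s=50, ax=axes[0])
sns.scatterplot(data=pcaped, x="PC3_AVG", y="PC4_AVG", hue="pop", s=50, ax=axes[1])
```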
    "},{"location":"plot_PCA/#plotting-pca","title":"Plotting PCA\u00b6","text":""},{"location":"plot_PCA/#loading-files","title":"loading files\u00b6","text":""},{"location":"plot_PCA/#merge-pca-and-population-information","title":"Merge PCA and population information\u00b6","text":""},{"location":"plot_PCA/#plotting","title":"Plotting\u00b6","text":""},{"location":"prs_tutorial/","title":"PRS Tutorial","text":"In\u00a0[1]: Copied!
    import sys\nsys.path.insert(0,\"/Users/he/work/PRSlink/src\")\nimport prslink as pl\n
    import sys sys.path.insert(0,\"/Users/he/work/PRSlink/src\") import prslink as pl In\u00a0[2]: Copied!
    a= pl.PRS()\n
    a= pl.PRS() In\u00a0[3]: Copied!
    a.add_score(\"./1kgeas.0.1.profile\",  \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.2.profile\",  \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.3.profile\",  \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.4.profile\",  \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.5.profile\",  \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\")\na.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")\n
    a.add_score(\"./1kgeas.0.1.profile\", \"IID\",[\"SCORE\"],[\"0.1\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.05.profile\", \"IID\",[\"SCORE\"],[\"0.05\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.2.profile\", \"IID\",[\"SCORE\"],[\"0.2\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.3.profile\", \"IID\",[\"SCORE\"],[\"0.3\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.4.profile\", \"IID\",[\"SCORE\"],[\"0.4\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.5.profile\", \"IID\",[\"SCORE\"],[\"0.5\"],sep=\"\\s+\") a.add_score(\"./1kgeas.0.001.profile\",\"IID\",[\"SCORE\"],[\"0.01\"],sep=\"\\s+\")
    - Dataset shape before loading : (0, 1)\n- Loading score data from file: ./1kgeas.0.1.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.1\n  - Overlapping IDs:0\n- Loading finished successfully!\n- Dataset shape after loading : (504, 2)\n- Dataset shape before loading : (504, 2)\n- Loading score data from file: ./1kgeas.0.05.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.05\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 3)\n- Dataset shape before loading : (504, 3)\n- Loading score data from file: ./1kgeas.0.2.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.2\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 4)\n- Dataset shape before loading : (504, 4)\n- Loading score data from file: ./1kgeas.0.3.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.3\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 5)\n- Dataset shape before loading : (504, 5)\n- Loading score data from file: ./1kgeas.0.4.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.4\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 6)\n- Dataset shape before loading : (504, 6)\n- Loading score data from file: ./1kgeas.0.5.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.5\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 7)\n- Dataset shape before loading : (504, 7)\n- Loading score data from file: ./1kgeas.0.001.profile\n  - Setting ID:IID\n  - Loading score:SCORE\n  - Loaded columns: 0.01\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 8)\n
    In\u00a0[4]: Copied!
    a.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")\n
    a.add_pheno(\"../01_Dataset/t2d/1kgeas_t2d.txt\",\"IID\",[\"T2D\"],types=\"B\",sep=\"\\s+\")
    - Dataset shape before loading : (504, 8)\n- Loading pheno data from file: ../01_Dataset/t2d/1kgeas_t2d.txt\n  - Setting ID:IID\n  - Loading pheno:T2D\n  - Loaded columns: T2D\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 9)\n
    In\u00a0[5]: Copied!
    a.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")\n
    a.add_covar(\"./1kgeas.eigenvec\",\"IID\",[\"PC1\",\"PC2\",\"PC3\",\"PC4\",\"PC5\"],sep=\"\\s+\")
    - Dataset shape before loading : (504, 9)\n- Loading covar data from file: ./1kgeas.eigenvec\n  - Setting ID:IID\n  - Loading covar:PC1 PC2 PC3 PC4 PC5\n  - Loaded columns: PC1 PC2 PC3 PC4 PC5\n  - Overlapping IDs:504\n- Loading finished successfully!\n- Dataset shape after loading : (504, 14)\n
    In\u00a0[6]: Copied!
    a.data[\"T2D\"] = a.data[\"T2D\"]-1\n
    a.data[\"T2D\"] = a.data[\"T2D\"]-1 In\u00a0[7]: Copied!
    a.data\n
    a.data Out[7]: IID 0.1 0.05 0.2 0.3 0.4 0.5 0.01 T2D PC1 PC2 PC3 PC4 PC5 0 HG00403 -0.000061 -2.812450e-05 -0.000019 -2.131690e-05 -0.000024 -0.000022 0.000073 0 0.000107 0.039080 0.021048 0.016633 0.063373 1 HG00404 0.000025 4.460810e-07 0.000041 4.370760e-05 0.000024 0.000018 0.000156 1 -0.001216 0.045148 0.009013 0.028122 0.041474 2 HG00406 0.000011 2.369040e-05 -0.000009 2.928090e-07 -0.000010 -0.000008 -0.000188 0 0.005020 0.044668 0.016583 0.020077 -0.031782 3 HG00407 -0.000133 -1.326670e-04 -0.000069 -5.677710e-05 -0.000062 -0.000057 -0.000744 1 0.005408 0.034132 0.014955 0.003872 0.009794 4 HG00409 0.000010 -3.120730e-07 -0.000012 -1.873660e-05 -0.000025 -0.000023 -0.000367 1 -0.002121 0.031752 -0.048352 -0.043185 0.064674 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 499 NA19087 -0.000042 -6.215880e-05 -0.000038 -1.116230e-05 -0.000019 -0.000018 -0.000397 0 -0.067583 -0.040340 0.015038 0.039039 -0.010774 500 NA19088 0.000085 9.058670e-05 0.000047 2.666260e-05 0.000016 0.000014 0.000723 0 -0.069752 -0.047710 0.028578 0.036714 -0.000906 501 NA19089 -0.000067 -4.767610e-05 -0.000011 -1.393760e-05 -0.000019 -0.000016 -0.000126 0 -0.073989 -0.046706 0.040089 -0.034719 -0.062692 502 NA19090 0.000064 3.989030e-05 0.000022 7.445850e-06 0.000010 0.000003 -0.000149 0 -0.061156 -0.034606 0.032674 -0.016363 -0.065390 503 NA19091 0.000051 4.469220e-05 0.000043 3.089720e-05 0.000019 0.000016 0.000028 0 -0.067749 -0.052950 0.036908 -0.023856 -0.058515

    504 rows \u00d7 14 columns
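The next step, set_k, supplies the assumed population prevalence K of the trait (here 0.2 for T2D), which is needed to convert the observed-scale R2 of a binary trait to the liability scale. Below is a sketch of one widely used conversion (Lee et al. 2012), shown for reference only; prslink computes its r2_lia internally and may use a different estimator, so values need not match its output exactly:

```python
from scipy.stats import norm

def r2_liability(r2_obs, K, P):
    """Convert observed-scale R2 to liability-scale R2 (Lee et al. 2012).

    K: assumed population prevalence; P: case proportion in the sample.
    """
    t = norm.ppf(1 - K)   # liability threshold
    z = norm.pdf(t)       # standard normal density at the threshold
    return r2_obs * K**2 * (1 - K)**2 / (z**2 * P * (1 - P))

# e.g. r2_liability(0.04, K=0.2, P=200/502)
```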

    In\u00a0[13]: Copied!
    a.set_k({\"T2D\":0.2})\n
    a.set_k({\"T2D\":0.2}) In\u00a0[14]: Copied!
    a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)\n
    a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols,r2_lia=True)
     - Binary trait: fitting logistic regression...\n - Binary trait: using records with phenotype being 0 or 1...\nOptimization terminated successfully.\n         Current function value: 0.668348\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.653338\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.657903\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654492\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654413\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.653085\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.654681\n         Iterations 5\nOptimization terminated successfully.\n         Current function value: 0.661290\n         Iterations 5\n
    Out[14]: PHENO TYPE PRS N_CASE N BETA CI_L CI_U P R2_null R2_full Delta_R2 AUC_null AUC_full Delta_AUC R2_lia_null R2_lia_full Delta_R2_lia SE 0 T2D B 0.01 200 502 0.250643 0.064512 0.436773 0.008308 0.010809 0.029616 0.018808 0.536921 0.586821 0.049901 0.010729 0.029826 0.019096 NaN 1 T2D B 0.05 200 502 0.310895 0.119814 0.501976 0.001428 0.010809 0.038545 0.027736 0.536921 0.601987 0.065066 0.010729 0.038925 0.028196 NaN 2 T2D B 0.5 200 502 0.367803 0.169184 0.566421 0.000284 0.010809 0.046985 0.036176 0.536921 0.605397 0.068477 0.010729 0.047553 0.036824 NaN 3 T2D B 0.2 200 502 0.365641 0.169678 0.561604 0.000255 0.010809 0.047479 0.036670 0.536921 0.607318 0.070397 0.010729 0.048079 0.037349 NaN 4 T2D B 0.3 200 502 0.367788 0.171062 0.564515 0.000248 0.010809 0.047686 0.036877 0.536921 0.608493 0.071573 0.010729 0.048315 0.037585 NaN 5 T2D B 0.1 200 502 0.374750 0.181520 0.567979 0.000144 0.010809 0.050488 0.039679 0.536921 0.613957 0.077036 0.010729 0.051270 0.040540 NaN 6 T2D B 0.4 200 502 0.389232 0.189866 0.588597 0.000130 0.010809 0.051145 0.040336 0.536921 0.609238 0.072318 0.010729 0.051845 0.041116 NaN In\u00a0[15]: Copied!
    a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)\n
    a.plot_roc(a.pheno_cols, a.score_cols, a.covar_cols)
    Optimization terminated successfully.\n         Current function value: 0.668348\n         Iterations 5\n
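Comparing the rows of the evaluate output is how a p-value threshold is usually chosen. A one-liner to pull the threshold with the largest incremental liability-scale R2 (a sketch reusing the DataFrame returned by evaluate):

```python
results = a.evaluate(a.pheno_cols, a.score_cols, a.covar_cols, r2_lia=True)
best = results.sort_values("Delta_R2_lia", ascending=False).iloc[0]
print(best["PRS"], best["Delta_R2_lia"])  # the 0.4 threshold gives the largest increment here
```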
    In\u00a0[16]: Copied!
    a.plot_prs(a.score_cols)\n
    a.plot_prs(a.score_cols) In\u00a0[\u00a0]: Copied!
    \n
    "}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 849c93ba..e31759a0 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ