Skip to content

the miscellaneous analysis scripts used in the research of pan-Zea genome and genetics

License

Notifications You must be signed in to change notification settings

songtaogui/pan-Zea_utilities

Repository files navigation

pan-Zea-utilities

DOI

Introduction

This repository collected the miscellaneous analysis scripts used in the research of pan-Zea genome and genetics.

Class Script_name Descriptions
Genetic PANZ_SVflankSNP_LD.sh Investigate the LD level of given query variants to nearby SNPs
Genetic SV_LD_type_draw.r Companion of PANZ_SVflankSNP_LD.sh. Can also be used for visualization.
Genetic PANZ_QTL_FineMap.sh Identify QTLs from GWAS results and perform statistical fine mapping
Genetic PANZ_part_h2.sh Partition genetic variants and calculate h2 for each part.
Genetic PANZ_Rand_Variant_feature.sh Partition genetic variants and calculate h2 for each part.
Genetic PANZ_MAGMA.sh Perform regional association analysis of genic regions
Genomic PANZ_determine_core_dispensable.sh Determine if a gene is core or dispensable based on gene presence and absence matrix
Genomic PANZ_SubG_PAV.sh Identify subgroup unblanced PAV genes
Genomic PANZ_gene_age.sh Calculate gene age based on sequence similarity within the species tree
Genomic PANZ_SV_Annotation.sh Annotate SVs relative to nearby genes and TE classes
Statistic PANZ_freq_enrich.sh Enrichment analysis given frequency stat query and ref
Statistic PANZ_rankINT.sh Normalizing input with RankINT
Statistic PANZ_regional_enrich.sh Genome regional enrichment analysis

Prerequisites

Runtime environment:

  • Linux, tested with version 3.10.0-862.el7.x86_64 (Red Hat 4.8.5-28)
  • bash, tested with version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
  • perl 5, tested with v5.30.1

Most of the pipelines were wrote with bash, R, and Perl, and tested on Linux platform. However, other unix-like platform could also work. The only thing to note is that some of the pipelines used the bash redirection features that not supported in basic sh:

# in bash or zsh it works
bash -c 'cat <(echo "hello")'
# hello

# in sh it failed
sh -c 'cat <(echo "hello")'
# sh: 1: Syntax error: "(" unexpected

Thus, try not to run the pipelines through sh PANZ_some_pipe.sh.

Dependencies

Most of the pipelines would require the presence of other third-party softwares, such as GNU parallel. The detailed dependencies would be listed in the usage of certain pipeline.

Detailed usage

Genetic

PANZ_SVflankSNP_LD.sh and SV_LD_type_draw.r

These script could do the similar analysis on the representation of the query genetic variant with nearby SNPs, as described in:

Stuart T, Eichten S R, Cahn J, et al. Population scale mapping of transposable element diversity reveals links to gene regulation and epigenomic variation[J]. elife, 2016, 5: e20777.

  • PANZ_SVflankSNP_LD.sh
------------------------------------------------------------
Calculate SV LDs with nearby SNP/InDels use method in
(Stuart et al. eLife 2016;5:e20777. DOI: 10.7554/eLife.20777)
------------------------------------------------------------
Dependency: bcftools vcftools perl Rscript csvtk GNU-parallel
------------------------------------------------------------
USAGE:
    bash PANZ_SVflankSNP_LD.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                       show help and exit.
    -t, --threads    <num>    [O]    set threads (default 2)
    -q, --query      <str>    [R]    Input query vcf file, bcf or gz and indexed (SV)
    -r, --ref        <str>    [R]    Input ref vcf file, bcf or gz and indexed (SNP/Indel)
    -s, --size       <num>    [O]    Number of flanking ref records for each query (Default: 150)
    -p, --prefix     <str>    [O]    Prefix for outputs (Default: Calc_LD_OUT)
    -R, --rscript    <str>    [O]    PATH to the 'SV_LD_type_draw.r' script (Default: ./SV_LD_type_draw.r)
    -m, --mode       <str>    [O]
                    Analysis mode:{ 0; 1; 2 } (Default: 0)
                        0 ---> only out put LD type
                        1 ---> only draw heatmap and linesfig
                        2 ---> out LD type and draw figs
                    use 1,2 on all LD files is SLOW, you can run the stand-alone rscript on each LD file later.
    --clean                   [O]    Clean each variant LD result, just keep the main summary output, useful when there
                                    are too many variants (if disk file number limit is an issue), and you donot need
                                    each result file to draw the ld heatmap figs.
Note:
    needs these software in PATH to run the pipeline:
    GNU parallel; bcftools; Rscript; vcftools; csvtk;
------------------------------------------------------------


  • SV_LD_type_draw.r
--------------------------------------------------
Calculate sv_snp-snp_mid num and draw.
--------------------------------------------------
Usage:
        Rscript <Program> <mode> <vcftools.LD> <size> <outdir>
--------------------------------------------------
OPTIONs:
        mode:
                0 ---> only out put LD type
                1 ---> only draw heatmap and linesfig
                2 ---> out LD type and draw figs
        haploview.LD:
                LD file generated from the main program
        size:
                flanking ref size used in the main program
    outdir:
        path of the outputs
--------------------------------------------------

Outputs:

If you ran PANZ_SVflankSNP_LD.sh with --mode 2, you would get 4 outputs:

  1. output LD file records the pairwise LD of the variants flanking the query. The position of the variants were modified as the order of the query and flanking variants:

        # e.g. if you set the flanking of 150 SNPs, that means the resulting LD files would be the pairwise LD values of 301 items (left 150 + query + right 150):
        CHR   POS1   POS2   N_INDV   R^2
        1     1      2      662      0.000105541
        1     1      3      661      5.96288e-05
        1     1      4      663      6.86625e-06
        1     1      5      662      6.88707e-06
        1     1      6      664      6.84552e-06
        ...
        ...
  2. the LD rank summary output of the query:

    # format <query_ID>   <# query-SNP ranks over nearby SNP-SNP>   <# Total nearby SNPs>  <LD-leve tag>
    Zm00001d027244_PAV  293     300     high
  1. and 4. visualisations in PDF format of the detailed LD heatmap and the LD rank comparison line plots

NOTE: if you have ran the pipeline with --mode 0 or --mode 1, and you would like to draw plots for specific query, you could just take the output LD file of that query to SV_LD_type_draw.r :

SV_LD_type_draw.r 2 myQuery_flank150_hplv.geno.ld 150 .
# Outputs:
# myQuery  293     300     high
# heatmap Fig: ./myQuery_heatmap.pdf
# lines Fig: ./myQuery_lines.pdf

PANZ_QTL_FineMap.sh

------------------------------------------------------------
Identify QTLs and perform statistical fine mapping for PANZ 
------------------------------------------------------------
Dependence:csvtk perl HARVESTER DAP-G
------------------------------------------------------------
USAGE:
    

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                          show help and exit.
    -t, --threads       <num>   [O]     set threads (default: 2)

# I/O:
    -o, --out           <str>   [O]     Output prefix, will create a dir with this string in the current path,
                                    so make sure the dir does not exist (default: PANZ_FineMap_Out)
    -v, --gvcf          <str>   [R]     input gvcf file (should be indexed).
    --trait             <str>   [R]     input trait file name, make sure the file was named as "trait_name.prefix",
                                    File Format (TSV format, the key word "<trait>" is needed):
                                        <trait>    trait_name
    --covar             <str>   [R]     Input quantitative covariates from a plain text file in format (TSV):
                                        <covar>    covar_ID
                                        sample1        1.0
                                        sample2        2.5
    --gwas              <str>   [R]     gwas file.tsv[.gz] with header( header could be any string),
                                    should contain columns of:
                                        variant_ID, chr, start, end, p-value, effect size
    --info_cols         <str>   [R]     a set of numbers seperated by comma to indicate the column of
                                    [variant_ID, chr, start, end, p-value, effect size] in the gwas file, for example,
                                    if your gwas file looks like this:
                                        Chr    start0    End    variant_ID    p-value    beta
                                        1       1234    1235    SNP01         1E-7      0.22
                                    you should set this option as "4,1,2,3,5,6"

# gwas peak calling
    --min_peak_logP     <int>   [O]     the minimum -log(P-value) of the peak to keep. (default: 5)
    --inlimit           <int>   [O]     the minimum p-value of the variant to include in a peak region. (default: 0.001)
    --flank             <int>   [O]     flanking length (in bp) of peak to include as final peak region. (default: 10000)
    --min_count         <int>   [O]     the minimum No. of variants that passed the --inlimit cutoff within a peak. (default: 3)

# fine mapping (Bayes)
    --ld_control        <0-1>   [O]     r^2 threshold within a signal cluster (default: 0.25)
    --credible_prob     <0-1>   [O]     Prob of credible set (default: 0.95)

# Miscellaneous
    --tissue            <str>   [O]     tissue id of the trait that used in the final annotation vcf file.(default: kernel)
    --tar                       [O]     tar intermediate dirs if set. (default: do not tar intermediate dirs)

------------------------------------------------------------

PANZ_part_h2.sh

------------------------------------------------------------
Partition genetic variants and calculate variant heritability
using GWAS Summary Statistic.
This is a wrapper of LDAK pipelines.
------------------------------------------------------------
Dependence: LDAK plink csvtk parallel bedtools bgzip perl5
------------------------------------------------------------
USAGE:
    bash PANZ_part_h2.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                          show help and exit.
    -t, --threads       <num>   [O]     set threads (default: 2)
# I/O
    -v, --gvcf          <str>   [R]     bgzipped and index vcf file. SHOULD BE ABSOLUTE PATH.
    -p, --pheno         <str>   [R]     plink format pheno file. SHOULD BE ABSOLUTE PATH.
    --pheno_tag         <str>   [R]     A string tagging the pheno file (e.g. "Cob_color"), used to generate different reml outputs, in order to multiple-run of several phenotypes based on same gvcf files.
    -c, --covar         <str>   [R]     covar file in plink format. SHOULD BE ABSOLUTE PATH.
    -o, --out           <str>   [O]     output prefix (default: PANZ)
# mode
    --part_tsv          <str>   [O]     ABSOLUTE path of the files included Variant IDs and patition rules, will create a dir based on the prefix of this file.
                                    format (With header, header of Col1 must be VID, header of other cols would be used as feature tag in output. Mark excluded data with "NA"):
                                        VID        Annotations     MAF_Bins
                                        SNP1          CDS_SNPs     MAF0-0d05
                                        SNP2   INTERGENIC_SNPS     MAF0d05-0d1
                                        SNP3          CDS_SNPS     NA
    --part_bed          <str>   [O]     ABSOLUTE path of the bed files if you would like to estimatie the patition h2 by regions, will create a dir based on the prefix of this file.
                                    format (UCSC bed format, no header):
                                        <chr>   <start0>    <end>   <Feature_ID>
                                        1       12345       13456   TE_Rich_Region1
                                        2       22345       33456   Gene_Rich_Region1
                                        3       32345       43456   TE_Rich_Region1
# prune & keep
    --keep                      [O]     Keep only unrelated individuals for h2 estimation. (default: off, that is use all individuals provided)
    --window_kb         <int>   [O]     Window size to compute allelic correlations to thin variants.(default:1000)
    --window_prune      <0-1>   [O]     Max correlation squared cut-off to thin variants. (default: 0.2)
# PCA
    --PCA                       [O]     Use PCs as additional covar, that is, if --covar was provided, will combine --covar with PCs as a final covar file.
    --axes              <int>   [O]    Along with --PCA, set number of PCs to keep in the PCA of kept individuals by pruned variants, and add PCs as covar. (default: 20)
# OTHERs
    --constrain                 [O]     By default, LDAK will not restrict heritability estimates to be within [0,1]. This is generally our preference, as to obtain unbiased estimates of heritabilities near zero, it is necessary to accept the possibility of negative estimates. Set this option will force all heritability estimates within [0,1].
------------------------------------------------------------

PANZ_Rand_Variant_feature.sh

------------------------------------------------------------
Rand feature for PANZ h2 estimation
------------------------------------------------------------
Dependence:
------------------------------------------------------------
USAGE:
    bash PANZ_Rand_Variant_feature.sh <in_feature> <rand_seed> <rand_prob>

in_feature: Variant feature file in the same format as used in "PANZ_part_h2.sh":
    FORMAT: (With header, header of Col1 must be VID, header of other cols would be used as feature tag in output. Mark excluded data with "NA"):
            VID        Annotations     MAF_Bins
            SNP1          CDS_SNPs     MAF0-0d05
            SNP2   INTERGENIC_SNPS     MAF0d05-0d1
            SNP3          CDS_SNPS     NA

rand_seed: should be numeric, will output final result of "Rand[rand_seed].tsv"
rand_prob: int from 1-10, percentage of min item number to be used as rand number, for example, if the min class has 100 items, and set rand_prob to 9 will get 100 * 9 / 10 = 90 items for each class.
------------------------------------------------------------

PANZ_MAGMA.sh

------------------------------------------------------------
Perform regional association analysis of genic regions for PANZ.
A wrapper of MAGMA
------------------------------------------------------------
Dependence: MAGMA plink
------------------------------------------------------------
USAGE:
    bash PANZ_MAGMA.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                          show help and exit.
    -r, --ref           <str>   [R]     Plink bed prefix of all variants
    --annote            <str>   [R]     Gene-Variant Annotation file in format:
                                        <Gene_ID>    <Variants_sep_by_space>
                                        Gene1        SNP1 SV2 INDEL3
                                        ...          ...
    --gene_sets         <str>   [R]     Gene set file in format:
                                        <Gene_SET_ID>    <Genes_sep_by_space>
                                        Gene_family_1    Gene1 Gene2 Gene3
                                        ...              ...
    --pheno             <str>   [R]     Phenotype in plink format
                                    *** NOTE:   Phenotype should not contain duplicate samples,
                                                and the first Phenotype should not be "NA"
    --covar             <str>   [R]     covariant in plink format
    --region            <str>   [O]     Physical region included for the analysis,
                                        in format "CHR:START-END", use all variants
                                        if not provided.
    -o, --out           <str>   [O]     Output prefix, will create a dir accordingly for the outputs (default: PANZ_Gene_GWAS_out)

------------------------------------------------------------

Genomic

PANZ_determine_core_dispensable.sh

------------------------------------------------------------
Determine core and dispensable genes/gene families
using pvalue of binomial test of gene loss rate.
------------------------------------------------------------
Dependency: parallel perl Rscript
------------------------------------------------------------
USAGE:
    bash PANZ_determine_core_dispensable.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                       show help and exit.
    -m, --matrix     <str>    [R]    path of PAV matrix file.
                            format (tab-separated):
                            ID     Sample1    Sample2    Sample3
                            Gene1  1          0          1
                            Gene2  1          1          1
                            Gene3  0          0          1
    -r, --lossrate   <0-1>    [O]    min loss rate to determine a gene dispensable (Default: 0.01)
    -p,--pvalue      <0-1>    [O]    max pvalue cutoff to determine a gene dispensable (Default: 0.05)
    -o,--output      <str>    [O]    output file (default: core_disp_out.tsv)
    --prefix         <str>    [O]    prefix tag of output gene types (default: PANZ)
    -t, --threads    <num>    [O]    set threads (default: 2)
------------------------------------------------------------

PANZ_SubG_PAV.sh

------------------------------------------------------------
Determine subgroup unbalanced genes/gene families
using adjusted-pvalue (q-value, Storey's Method) of two-sided
Fisher's exact test of gene PAV matrix among different subgroups.
------------------------------------------------------------
Dependencies: GNU-parallel perl Rscript csvtk (All should be in PATH)
------------------------------------------------------------
USAGE:
    bash PANZ_SubG_PAV.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                       show help and exit.
    -m, --matrix     <str>    [R]    path of PAV matrix file.
                            format (tab-separated):
                            ID     Sample1    Sample2    Sample3
                            Gene1  1          0          1
                            Gene2  1          1          1
                            Gene3  0          0          1
    -s,--subgroup    <file>   [R]    a list of samples in subgroup, one per line.
                            Note: make sure that all samples were in the provided matrix. The comparasion
                            will be performed between the whole matrix provided and the subgroup matrix
                            extracted based on this subgroup name list.
    -p,--prefix      <str>    [O]    prefix tag of outputs, you may use subgroup name (default: PANZ)
    -f,--FDR         <0-1>    [O]    FDR cutoff to determine a gene unbalance (Default: 0.05)
    --localfdr                [O]    Use local FDR values (default: Global FDR, aka qvalue)
    -o,--output      <str>    [O]    output file (default: PAV_subgroup_unbalanced_out.tsv)
    -t, --threads    <num>    [O]    set threads (default: 2)
------------------------------------------------------------


PANZ_gene_age.sh

------------------------------------------------------------
Infering Gene ages using methods discribed in:
    Zhang, Y. E. et al. Chromosomal redistribution of male-biased
    genes in mammalian evolution with two bursts of gene gain on
    the X chromosome. PLoS Biol. 8, e1000494 (2010).
------------------------------------------------------------
Dependency in PATH:
    diamond, csvtk, taxonkit, perl, GNU-parallel, blastdbcmd
------------------------------------------------------------
USAGE:
    bash PANZ_gene_age.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                       show help and exit.
    -d, --database   <str>    [O]    Path to NCBI-NR protein database (Default: /mnt/d/Works/NnRAD/mirror_1111/ref/NR/nr/nr)
    -i, --input      <str>    [R]    Input protein query fasta sequences
    -o,--output      <str>    [O]    Output prefix (Default: PANZ_Gene_ages)
    --taxnomy        <str>    [R]    Tab-seperated Hierarchy of NCBI Taxnomy IDs. Format:
                            <Label> <Taxnomy ID> <Hierarchy Order>
                            Example:
                                For taxnomy like:
                                        |- cellular organisms (131567)
                                            |--- Eukaryota (2759)
                                                |--- Alveolata (33630)
                                The taxnomy order file would be:
                                        P1_Cellular    131567    1
                                        P2_Eukaryota   2759      2
                                        P3_Alveolata   33630     3
                            ** 1. Taxnomy with same Label would be merged in the output.
                            ** 2. Please make sure all the taxnomy IDs are in SAME taxnomy tree branch
                            ** 3.  Hierarchy order should be increased by taxnomy age from old to new
    --evalue         <num>    [O]    Evalue cut-off of the blastp output (Default 1e-5)
    --identity       <0-1>    [O]    Identity cut-off of the blastp output (Default 0.3)
    -t, --threads    <num>    [O]    set threads (default: 2)
------------------------------------------------------------

PANZ_SV_Annotation.sh

------------------------------------------------------------
PANZ annotate SV with TE and Gene gff
------------------------------------------------------------
Dependence in PATH: parallel, RepeatMasker, bedtools, bcftools
------------------------------------------------------------
USAGE:
    bash PANZ_SV_Annotation.sh [OPTIIONS]
OPTIONS:
    1. SV.vcf
    2. gene.gff
    3. TE.bed: chr start end ID class strand
    4. TE.fa: NR TE sequences
    5. Flanking_length(bp)
    6. Threads
    7. output_name
------------------------------------------------------------

Statistic

PANZ_freq_enrich.sh

------------------------------------------------------------
PANZ freq enrichment analysis: input query and ref frequency file, and do enrichment analysis.

frequency file format:(no header)
    <Item_ID>  <Freq>
    ITEM1      123
    ITME2      888
    ...        ...

------------------------------------------------------------
USAGE:
    bash PANZ_freq_enrich.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                       show help and exit.
    -q, --query      <str>    [R]    Query frequency file
    -r, --ref        <str>    [R]    Referance frequency filen
    -c, --cutoff     <0-1>    [O]    Qvaule cutoff (default: 0.05)
    -o, --output     <str>    [O]    Output file name (default: PANZ_KEGG_Enrich.tsv)
    -t, --threads    <num>    [O]    set threads (default: 2)

Output format:
    <Item>  <Type>  <ratio_in_query>  <ratio_in_ref>  <raw_Pvalue>  <Qvalue>  <Lfdr>
------------------------------------------------------------

PANZ_rankINT.sh

------------------------------------------------------------
PANZ Rank-Based Inverse Normal Transformation
------------------------------------------------------------
Dependence: Rscript
------------------------------------------------------------
USAGE:
    bash PANZ_rankINT.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                          show help and exit.
    -t, --trait         <str>   [R]     trait file in tsv format
    -c, --col           <int>   [O]     trait value column (default: 2)
    --na                <str>   [O]     NA represent string (default: NA)
    -o, --out           <str>   [O]     Output prefix (default: header of trait column)
    --SW_test                   [O]     Out put a Shapiro-Wilk norm test result to <out>.sw_test
    --adj               <0-1>   [O]     Adjust parameter for Rank-Based Inverse Normal Transformation (default: 0.5)
------------------------------------------------------------


PANZ_regional_enrich.sh

------------------------------------------------------------
PANZ_Regional_enrich: a wrapper of R/regioneR function, with
some DIY options.
------------------------------------------------------------
Dependence:
    R/regioneR package (http://bioconductor.org/packages/release/bioc/html/regioneR.html)
------------------------------------------------------------
USAGE:
    bash PANZ_regional_enrich.sh [OPTIONS]

OPTIONS: ([R]:required  [O]:optional)
    -h, --help                          show help and exit.
    -o, --out <str>            [R]      Output prefix.
    -g, --genome <str>         [R]      Genome range in bed format. Used as the bondaries of region manipulating. Eg:
                                            chr1    0   1234567
                                            chr2    0   1345678

    -a, --query <str>          [R]      Query region file in bed format, will do permutation based on this file.

    -b, --feature <str>        [R]      Feature region file in bed format, will count overlaps of each permutation of query region with this file to evaluate the enrichment.

    -r, --ref <str>            [O]      Set the range of query region permutation. Three type of parameters are supported:
                1. [bed:your_bed_file.bed] --> Use a region file as the permutation boundary, so make sure the query region file is subset of the input region file.
                2. [flank:<int>] --> use a flanking of <int> bp length of query region (both side) as the permutation boundary, eg: "flank:10000" will flank left 10kb and right 10kb of query region.
                3. [time:<int>] --> use a flanking of <int> * <length of each query region> bp of query region as the permutation boundary, eg:
                    if your query region is:
                    #    chr1    1000    1100
                    #    chr2    5010    5020
                    and you set ref as "--ref time:5", the permutation boundary would be:
                    #    chr1    500     1600    (query length=100,flanking=100*5)
                    #    chr2    4960    5070    (query length=10,flanking=10*5)
                The default behavior of --ref (if unset) is to use "--genome" file as boundary, that is permutation along the whole genome.

    -n, --ntimes <int>         [O]      Number of permutation times.A large number of permutations will produce more accurate results and a nicer-looking plot but a permutation test can be computationally expensive.(default: 100)

    --cutoff <0-1>             [O]      P-value cutoff to call the result as significantly non-random. (default: 0.05)

    --seed <int>               [O]      Set random seeds to create reproducible results. (default: 1234)

    --plot                     [O]      Plot the permutation results.

    --force_save               [O]      Force save the permutation result rdata. The default behavior is to save only the result that passed the P-value cutoff.
------------------------------------------------------------

Citations

If you use these scripts in your work, or you would like to know more details, please refer to:

Gui, S., Wei, W., Jiang, C. et al. A pan-Zea genome map for enhancing maize improvement. Genome Biol 23, 178 (2022). https://doi.org/10.1186/s13059-022-02742-7

About

the miscellaneous analysis scripts used in the research of pan-Zea genome and genetics

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published