eliorav/Population-Genotype-Frequency

Population Genotype Frequency

A Python script that generates a population genotype frequency file from 1000G data.

Built with

The script uses the following platform to run:

To run Linux-only tools in every environment, the script uses the following Docker wrappers:

Script workflow

  • step 1 - split each population's samples into a separate file. For example:
grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list
  • step 2 - add more information about the SNPs: add the position for each given rsID.
  • step 3 - get the genotype data of 1000G phase 3.
    • download the pgen, pvar and psam files of the merged dataset from cog-genomics (the bold links).
    • decompress the pgen and the pvar files. For example:
    plink2 --zst-decompress all_phase3.pgen.zst > all_phase3.pgen
    • create the 1000G merged dataset VCF file by using this command:
    plink2 --pfile all_phase3 vzs --extract [your list of rsIDs] --export vcf
  • step 4 - split the merged VCF file into multiple VCF files by population, as described here.
    • use the following command:
    vcf-subset -c [population sample list file] [merged VCF file] | fill-an-ac > [VCF by population output file]
  • step 5 - create frequency files.
    • create a frequency file for every population by using this command:
    plink2 --vcf [VCF by population file] --freq --out [frequency by population output file]
    • merge the frequency files into a single file and add the rsID and the position to the final file.
  • step 6 - clean up the temp files.

Note that steps 1-3 can run in parallel.
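
Step 1 of the workflow above can also be done without grep and cut. The following is only a minimal sketch, not the script's actual implementation: the function name `split_samples_by_population` is hypothetical, and it assumes the panel file is tab-separated with the sample ID in the first column and the population code in the second. Unlike the grep example, it matches the population column exactly rather than anywhere on the line.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_samples_by_population(panel_path, out_dir):
    """Write one <POP>.samples.list file per population found in the panel.

    Assumption: tab-separated panel, sample ID in column 1,
    population code in column 2.
    """
    samples = defaultdict(list)
    with open(panel_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank or short lines and a header row, if one is present.
            if len(row) < 2 or row[0] == "sample":
                continue
            samples[row[1]].append(row[0])

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for population, ids in samples.items():
        path = out_dir / f"{population}.samples.list"
        path.write_text("\n".join(ids) + "\n")
        written[population] = path
    return written
```

Calling `split_samples_by_population("integrated_call_samples.20101123.ALL.panel", "samples")` would then produce files such as `samples/CEU.samples.list`, one per population, in a single pass over the panel.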

Getting Started

Prerequisites

Install the required libraries:

pip install --user -r requirements.txt

Usage

You can run the script with the -h flag to see the supported arguments:

python main.py -h

This returns the following:

usage: Create allele frequency by population file from 1000G data
       [-h] [--out_folder OUT_FOLDER] [--out_filename OUT_FILENAME]
       [--snps_file_path SNPS_FILE_PATH] [--no_parallel NO_PARALLEL]

optional arguments:
  -h, --help            show this help message and exit
  --out_folder OUT_FOLDER
                        The output folder of the result file. The default is "output".
  --out_filename OUT_FILENAME
                        the output file name of the result file (without suffix). The default is "allele_frequency".
  --snps_file_path SNPS_FILE_PATH
                        The path to the list of SNPs - The default is "snps.tsv".
  --no_parallel NO_PARALLEL
                        If the value is true, The script will not run in parallel mode.

Make sure to create an rsID list file and pass its path to the script with --snps_file_path.
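
For readers who want to see how such a command line could be declared, here is a hedged reconstruction of an argparse parser that accepts the options and defaults listed in the help text above. It is an illustration, not the code in main.py: `build_parser` and `str_to_bool` are hypothetical names, and the way `--no_parallel` turns its value into a boolean is an assumption.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: treat 'true', '1' or 'yes' (any case) as True."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def build_parser():
    # The program name mirrors the "usage:" line shown in the help output.
    parser = argparse.ArgumentParser(
        prog="Create allele frequency by population file from 1000G data"
    )
    parser.add_argument(
        "--out_folder", default="output",
        help='The output folder of the result file. The default is "output".',
    )
    parser.add_argument(
        "--out_filename", default="allele_frequency",
        help="The output file name of the result file (without suffix).",
    )
    parser.add_argument(
        "--snps_file_path", default="snps.tsv",
        help='The path to the list of SNPs. The default is "snps.tsv".',
    )
    parser.add_argument(
        "--no_parallel", type=str_to_bool, default=False,
        help="If the value is true, the script will not run in parallel mode.",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--snps_file_path", "my_snps.tsv", "--no_parallel", "true"])` yields a namespace whose `no_parallel` attribute is `True`, while the other options keep their defaults.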

Output

The output file is a TSV file with the following header:

  • #chrom - the chromosome number.
  • position - the position of the SNP.
  • rsid - the RSID number.
  • A1 - the REF allele.
  • A2 - the ALT allele.
  • POPULATION_NAME (e.g., ACB) - the frequency for the SNP in the given population.

The file is sorted by chromosome and position.
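
The merge that produces this layout can be sketched as follows. This is an illustration under stated assumptions rather than the script's own code: `merge_frequency_files` is a hypothetical name, the per-population inputs are assumed to be tab-separated plink2 frequency reports with `#CHROM`, `ID`, `REF`, `ALT` and `ALT_FREQS` columns, the reported frequency is assumed to be the ALT allele frequency, and the rsID-to-position mapping is assumed to come from step 2.

```python
import csv


def _chrom_key(chrom):
    # Numeric chromosomes sort numerically; others (e.g. X, Y) come after.
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)


def merge_frequency_files(afreq_paths, positions, out_path):
    """Merge per-population frequency files into one TSV.

    afreq_paths: {population name: path to its frequency file}
    positions:   {rsid: position}; every rsid in the inputs must be present
    """
    rows = {}  # rsid -> output row
    for population, path in afreq_paths.items():
        with open(path, newline="") as handle:
            for rec in csv.DictReader(handle, delimiter="\t"):
                row = rows.setdefault(rec["ID"], {
                    "#chrom": rec["#CHROM"],
                    "position": positions[rec["ID"]],
                    "rsid": rec["ID"],
                    "A1": rec["REF"],
                    "A2": rec["ALT"],
                })
                row[population] = rec["ALT_FREQS"]

    header = ["#chrom", "position", "rsid", "A1", "A2", *afreq_paths]
    ordered = sorted(
        rows.values(),
        key=lambda r: (_chrom_key(r["#chrom"]), int(r["position"])),
    )
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=header, delimiter="\t",
            restval="NA",  # SNP missing from a population's file
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(ordered)
```

Keying the rows by rsID keeps one line per SNP no matter how many populations are merged, and a SNP missing from one population's file is written as `NA` in that column.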

Contact

Elior Avraham – elior.av@gmail.com
