This repository contains input files, example output files, and a description of a problem designed to help evaluate the computational skills of job candidates and prospective MSc/Ph.D. students.
Given sequencing data for 10 study individuals in CRAM file format (one file per individual) in the `input/` directory, implement an automated workflow which, for each individual:
1. Estimates DNA contamination and genetic ancestry. To estimate DNA contamination levels and genetic ancestry, use the VerifyBamID tool and the 1000 Genomes Project (1000g) GRCh38 reference panel, which is provided together with the tool. Specify 4 principal components (PCs) with the `--NumPC` option, e.g.: `VerifyBamID.Linux.x86-64 --SVDPrefix ${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.100k.b38.vcf.gz.dat --Reference GRCh38_full_analysis_set_plus_decoy_hla.fa --BamFile input/HGDP00082.GRCh38.low_coverage.cram --NumPC 4 ...`

   The human genome reference file and its index can be downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa and ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai, respectively.

   This step generates `*.Ancestry` and `*.selfSM` output files for each study individual. The DNA contamination values are stored in the FREEMIX column of the `*.selfSM` files. Create a new output file, `Contamination.txt`, which stores a table with two tab-separated columns: SAMPLE, the sample name (e.g. HGDP00082, HGDP00450, ...), and FREEMIX, the DNA contamination value from the FREEMIX column of the corresponding `*.selfSM` file. You can find an example of the output file in the `example/` folder.

2. Visualizes estimated ancestry PCs. The `${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.100k.b38.vcf.gz.dat.V` file stores 4 pre-computed PCs for each individual in the 1000g reference panel, the `input/1000G_reference_populations.txt` file stores population labels (EUR, EAS, AMR, SAS, and AFR) for each individual in the 1000g reference panel, and the `*.Ancestry` files from step 1 store the estimated PCs for the 10 study individuals (IntendedSample column). Create the following 4 scatter plots showing the reference individuals and the 10 study individuals in the same space: PC1 vs PC2, PC2 vs PC3, PC3 vs PC4, and PC1 vs PC2 vs PC3 (i.e. a 3-dimensional plot). Color the reference individuals by their population labels (see the plots in the `example/` folder).

3. Assigns the most likely super population. Using the population labels of the 1000g reference panel individuals and their PC coordinates (stored in `input/1000G_reference_populations.txt` and `${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.100k.b38.vcf.gz.dat.V`, respectively), assign the most likely population label to each study individual. Save the assigned population labels in the `Populations.txt` file with the following 6 tab-separated columns: SAMPLE, the sample name (e.g. HGDP00082, HGDP00450, ...); PC1, PC2, PC3, and PC4, the 1st, 2nd, 3rd, and 4th PCs from the IntendedSample column of the corresponding `*.Ancestry` file; and POPULATION, the assigned population label (one of EUR, EAS, AMR, SAS, or AFR). You can find an example of the output file in the `example/` folder.
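The task does not prescribe how the population label should be assigned. One simple option, shown here only as a sketch, is a k-nearest-neighbours vote in PC space: find the `k` reference individuals closest to a study individual and take their most common label. The function name `assign_population`, its arguments, and the default `k=5` are illustrative assumptions, and loading the coordinates from the `.V`, `*.Ancestry`, and population label files is left out because their exact layouts are not described here.

```python
import math
from collections import Counter


def assign_population(sample_pcs, reference_pcs, reference_labels, k=5):
    """Return the most common label among the k nearest reference individuals.

    sample_pcs       -- PC coordinates of one study individual, e.g. (PC1..PC4)
    reference_pcs    -- list of PC coordinate tuples, one per reference individual
    reference_labels -- population labels in the same order as reference_pcs
    k                -- number of neighbours that vote (illustrative default)
    """
    # Pair every reference individual's Euclidean distance with its label,
    # then sort so that the closest individuals come first.
    neighbours = sorted(
        (math.dist(sample_pcs, ref), label)
        for ref, label in zip(reference_pcs, reference_labels)
    )
    # Majority vote over the k closest reference individuals.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Other classifiers (for example, the nearest population centroid) would fit the task equally well; whichever method is used, it is worth stating it in the README.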
To implement this workflow you may:
- use any workflow system of your choice that is compatible with HPC (e.g. Nextflow, Snakemake, Luigi, Cromwell)
- use any existing published open-source software tools and libraries (e.g. R and Python libraries for plotting such as ggplot2 and matplotlib)
- use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
Please send us a single compressed archive which includes:
- README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) compilation instructions, if any; (d) a detailed step-by-step description of how to run the workflow.
- Source of your scripts/code. Please include detailed comments in your source code. This will help us better understand your code.
- Contamination.txt. The file generated by the 1st step in the workflow, which stores the estimated DNA contamination values for the 10 study individuals.
- Plots. The 4 plots (`.png` or `.jpeg` format) generated by the 2nd step in the workflow (i.e. PC1 vs PC2, PC2 vs PC3, PC3 vs PC4, and PC1 vs PC2 vs PC3).
- Populations.txt. The file generated by the 3rd step in the workflow, which stores the estimated PC coordinates and assigned population labels for the 10 study individuals.
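As a rough illustration of how the `Contamination.txt` deliverable might be assembled from the per-sample `*.selfSM` files, the sketch below looks up the FREEMIX column by name and writes the two-column table. It assumes that each `*.selfSM` file is tab-delimited with one header line (whose column names may start with `#`) followed by one data row, and that the sample name is the part of the file name before the first dot (e.g. `HGDP00082` in `HGDP00082.GRCh38.low_coverage.selfSM`). The function names are hypothetical.

```python
import csv
from pathlib import Path


def read_freemix(selfsm_path):
    """Return the FREEMIX value (as text) from one *.selfSM file."""
    with open(selfsm_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    # Assumption: first row is the header, second row holds the values.
    header = [name.lstrip("#") for name in rows[0]]
    return rows[1][header.index("FREEMIX")]


def write_contamination_table(selfsm_dir, out_path):
    """Collect FREEMIX from every *.selfSM file into a SAMPLE/FREEMIX table."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["SAMPLE", "FREEMIX"])
        for path in sorted(Path(selfsm_dir).glob("*.selfSM")):
            # Assumption: the sample name is the file name up to the first dot.
            sample = path.name.split(".")[0]
            writer.writerow([sample, read_freemix(path)])
```

Calling `write_contamination_table("results", "Contamination.txt")` would then scan a (hypothetical) `results/` directory holding the VerifyBamID outputs.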
The following will be evaluated:
- Workflow can be easily installed and run.
- `Contamination.txt` and `Populations.txt` files have all requested fields and their values are correct.
- Plots are readable (e.g. have adequate axis labels and are scaled properly, colors are distinguishable, and the 10 study individuals can be clearly distinguished from the reference individuals).
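Before submitting, a quick structural self-check of the two tables can catch missing fields early. The sketch below is one possible check, not part of the evaluation itself: it assumes each file starts with a header row carrying the column names listed in the task and contains one row per study individual. The names `EXPECTED_COLUMNS` and `check_table` are hypothetical.

```python
import csv

# Column names as listed in the task description (header row is an assumption).
EXPECTED_COLUMNS = {
    "Contamination.txt": ["SAMPLE", "FREEMIX"],
    "Populations.txt": ["SAMPLE", "PC1", "PC2", "PC3", "PC4", "POPULATION"],
}


def check_table(path, expected_columns, expected_rows=10):
    """Return a list of problems found in a tab-separated output table."""
    problems = []
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or rows[0] != expected_columns:
        problems.append(f"header should be {expected_columns}")
    data = rows[1:]
    if len(data) != expected_rows:
        problems.append(f"expected {expected_rows} data rows, found {len(data)}")
    # Data lines are numbered from 2 because line 1 is the header.
    for number, row in enumerate(data, start=2):
        if len(row) != len(expected_columns) or "" in row:
            problems.append(f"line {number}: missing or extra fields")
    return problems
```

An empty list from `check_table("Populations.txt", EXPECTED_COLUMNS["Populations.txt"])` means the structure looks as expected; it says nothing about whether the values themselves are correct.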