This repository stores input files, example output files, and description of the problem which is designed to assist in the evaluation of computational skills of job candidates and prospective MSc/Ph.D. students.
Given a list of genetic variants in compressed Variant Call Format (VCF) files (one file per chromosome) and gene coordinates in compressed GENCODE GTF file, develop a command-line tool which for each variant finds:
- List of overlapping genes
- List of genes within +/-200000 base pairs from the variant
- Nearest gene
- Genetic variants are saved in compressed VCF files with the
.vcf.gz
suffix in theinput/
directory. The description of the VCF format can be found at https://samtools.github.io/hts-specs/. If you will find it useful for your solution, the corresponding index files for fast random access (.tbi
suffix) are located in the same directory. - Gene coordinates are saved in a compressed GENCODE GTF file
input/gencode.v38.annotation.gtf.gz
. The description of the GENCODE GTF format can be found at https://www.gencodegenes.org/pages/data_format.html.
The command-line tool must write results to the compressed VCF file. Specifically, for each genetic variant the following key-value pairs must be written into the INFO
field:
GENES_IN
- comma-separated list of identifiers (gene_id
key from GENCODE GTF) of overlapping genes. If there are no overlapping genes, the empty value must be denoted with.
symbol i.e.GENES_IN=.
.GENES_200KB
- comma-separated list of identifiers (gene_id
key from GENCODE GTF) of genes which are within +/-200000 base pairs from the variant. If there are no such genes, the empty value must be denoted with.
symbol i.e.GENES_200KB=.
.GENES_200KB
is a superset ofGENES_IN
.GENE_NEAREST
- identifier (gene_id
key from GENCODE GTF) of the nearest gene. IfGENE_IN
is not empty, thenGENE_NEAREST
must be empty. Empty value must be denoted with.
symbol i.e.GENE_NEAREST=.
. An example of the output file can be found inoutput/example.vcf.gz
.
To implement this command-line tool you may:
- use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
- use any open-source library (e.g. for VCF reading/writing, relevant data structures and etc)
Please, send us a single compressed archive which includes:
- README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) any compilation instructions if such exists (d) detailed step-by-step description on how to run the tool.
- Source of your scripts/code. Please include detailed comments in your source code. This will help us better understand your code.
Don't send VCF output files!
The following will be evaluated:
- The tool can be easily installed and run.
- The tool uses data structures and algorithms adequate for the problem i.e. ensure reasonable running time and memory usage.
- All requested INFO fields are present in the final output VCF files.
- Values in the requested INFO fields in the final output VCF files are correct.