Ideal human genome reference sequences estimated from population and evolutionary data

dekoning-lab/MajorHumans

##Major-allele and ancestral reference genomes for Homo sapiens
####John Hall, Nathaniel Bryans and Jason de Koning
#####June 2016

de Koning Lab, University of Calgary
and Bachelor of Health Sciences Bioinformatics Program
http://lab.jasondk.io

Prior to publication please cite: Hall JS, Bryans N and APJ de Koning (2016). MajorHumans: Major-allele reference genomes and exomes for Homo sapiens. University of Calgary. http://lab.jasondk.io


###Installation

You can download and unpack the current release using:

    git clone https://github.com/dekoning-lab/MajorHumans
    cd MajorHumans; make

###README

This folder contains the v0.2 beta release of the MajorHumans haploid reference genome and exome sequences. The data are encoded as VCF files relative to the hg19 assembly. Variation such as CNVs is ignored in these reconstructions, so position numbers always refer to positions in the hg19 assembly. This release also contains a script, NewRefConverter.pl, that re-encodes an hg19 VCF relative to one of the new reference sequences. In most cases, this should reduce the file size by 33% to 50%.

Usage: perl NewRefConverter.pl <newRef>.vcf <vcfToConvert>.vcf
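To illustrate what re-encoding relative to a new reference involves, here is a minimal Python sketch. It is a conceptual illustration only and is not the logic of NewRefConverter.pl, which is not reproduced here. It assumes haploid, biallelic SNV calls and ignores genotypes, indels and INFO fields; the function name `reencode_sites` and the dictionary layout are hypothetical.

```python
def reencode_sites(new_ref, sample):
    """Re-express SNV calls made against hg19 relative to a new reference.

    Both arguments map (chrom, pos) -> (hg19_allele, alt_allele).
    Assumption: haploid, biallelic SNVs only (illustrative, not the
    behaviour of NewRefConverter.pl).
    """
    out = {}
    for site in sorted(set(new_ref) | set(sample)):
        in_ref = site in new_ref
        in_sample = site in sample
        if in_sample and not in_ref:
            # New reference equals hg19 here, so the call is unchanged.
            out[site] = sample[site]
        elif in_ref and not in_sample:
            # Sample carries the hg19 allele, which is now non-reference.
            hg19_allele, major = new_ref[site]
            out[site] = (major, hg19_allele)
        else:
            _, major = new_ref[site]
            _, alt = sample[site]
            if alt != major:
                # Sample differs from both hg19 and the new reference.
                out[site] = (major, alt)
            # Otherwise the sample matches the new reference: no record.
    return out


# Hypothetical toy data, not taken from any MajorHumans file.
new_ref = {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "T")}
sample = {("chr1", 100): ("A", "G"), ("chr1", 300): ("G", "A")}
converted = reencode_sites(new_ref, sample)
```

In this toy case the call at position 100 disappears because the sample already carries the major allele. Sites where a sample shares the major allele need no record after conversion.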

We include reconstructed major allele reference genomes based on several data sources including the 1000 Genomes Project (whole genome, phase 3 release), the NHLBI 6500 Exomes dataset, and the initial release of the Exome Aggregation Consortium's (ExAC) 65K Exomes.
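Because each reference is distributed as a VCF relative to hg19 rather than as a sequence, using it as a sequence means applying its records to hg19. The Python sketch below shows the idea for single-nucleotide records on an in-memory string. The function name `apply_snvs` and the tuple layout are hypothetical, and skipping non-SNV records is an assumption made to keep hg19 coordinates unchanged, not a statement about what the files contain.

```python
def apply_snvs(hg19_seq, records):
    """Return hg19_seq with single-nucleotide records applied.

    records: iterable of (pos, ref, alt) with 1-based positions,
    as in the POS, REF and ALT columns of a VCF.
    """
    seq = list(hg19_seq)
    for pos, ref, alt in records:
        if len(ref) != 1 or len(alt) != 1:
            # Assumption: only SNVs are applied, so positions never shift.
            continue
        if seq[pos - 1].upper() != ref.upper():
            raise ValueError(
                f"REF {ref!r} at position {pos} does not match the sequence"
            )
        seq[pos - 1] = alt
    return "".join(seq)


# Hypothetical toy sequence and records.
major = apply_snvs("ACGTAC", [(2, "C", "T"), (5, "A", "G")])
```

Checking REF against the sequence before substituting is a cheap way to catch a VCF that was built against a different assembly.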

####Resource naming

The 1000 Genomes references are named 1000GPOPMajorAllele.vcf, where POP is the 1000 Genomes population that the reference is for. The 6500 exomes references are named 6500EPOPMajorAllele.vcf, where POP is the 6500 exomes population that the reference is for. The 65K exomes references are named 65KPOPMajorAllele.vcf, where POP is the 65K exomes population that the reference is for.

| Population | tag | 1000 genomes data | 65K exomes data | 6500 exomes data |
|---|---|---|---|---|
| African | AFR | 1000GAFRMajorAllele.vcf | 65KAFRMajorAllele.vcf | 6500AFRMajorAllele.vcf |
| European | EUR/NFE | 1000GEURMajorAllele.vcf | 65KNFEMajorAllele.vcf | 6500EURMajorAllele.vcf |
| American | AMR | 1000GAMRMajorAllele.vcf | 65KAMRMajorAllele.vcf | |
| South Asian | SAS | 1000GSASMajorAllele.vcf | 65KSASMajorAllele.vcf | |
| East Asian | EAS | 1000GEASRMajorAllele.vcf | 65KEASMajorAllele.vcf | |
| Finnish | FIN | | 65KFINMajorAllele.vcf | |
| | ALL | 1000GALLMajorAllele.vcf | | |
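For scripted use, the table can be captured as a small lookup. The Python sketch below copies the file names verbatim from the table, including `1000GEASRMajorAllele.vcf` and the 6500 exomes names as listed there (the naming paragraph describes them with an extra `E`). The dataset keys (`"1000G"`, `"65K"`, `"6500"`) and the function name `reference_file` are hypothetical, so check the names against the files in your checkout.

```python
# File names copied verbatim from the resource table; dataset keys are illustrative.
REFERENCE_FILES = {
    "1000G": {
        "AFR": "1000GAFRMajorAllele.vcf",
        "EUR": "1000GEURMajorAllele.vcf",
        "AMR": "1000GAMRMajorAllele.vcf",
        "SAS": "1000GSASMajorAllele.vcf",
        "EAS": "1000GEASRMajorAllele.vcf",
        "ALL": "1000GALLMajorAllele.vcf",
    },
    "65K": {
        "AFR": "65KAFRMajorAllele.vcf",
        "NFE": "65KNFEMajorAllele.vcf",
        "AMR": "65KAMRMajorAllele.vcf",
        "SAS": "65KSASMajorAllele.vcf",
        "EAS": "65KEASMajorAllele.vcf",
        "FIN": "65KFINMajorAllele.vcf",
    },
    "6500": {
        "AFR": "6500AFRMajorAllele.vcf",
        "EUR": "6500EURMajorAllele.vcf",
    },
}


def reference_file(dataset, tag):
    """Return the listed file name for a dataset key and population tag."""
    try:
        return REFERENCE_FILES[dataset][tag]
    except KeyError:
        raise KeyError(f"no reference listed for {dataset!r} / {tag!r}") from None
```

For example, `reference_file("65K", "NFE")` returns the European reference built from the 65K exomes data, while `reference_file("6500", "FIN")` raises `KeyError` because the table lists no such file.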

####Ancestral Human Genome

In this release, we also include a preliminary attempt at reconstructing ancestral human genome sequences. It was reconstructed from 1000 Genomes data, using the 30x Neanderthal and Denisovan genomes to help root the tree. The reconstruction was made via a pseudo-phylogenetic analysis and maximum likelihood ancestral reconstruction with RAxML. It is best considered a 'weighted average' sequence, and it should perform well as a reference sequence. It is called hg00_humanAncestral.vcf.


Note: Any Info fields in the reference genome VCF files reflect the original data source's annotations and have not been recomputed based on the new reference genome.

Please report any errors or suggestions with this initial release to Jason or open an issue.
