This is a small example data set of well-characterized ezRAD data that runs very rapidly, intended for tutorials and for testing ezRAD analysis pipelines. It provides real ezRAD data from 6 pooled populations of Achatinella sowerbyana, a well-studied endangered Hawaiian tree snail with high inbreeding and very strong population structure (Price and Forsman et al. in prep; contact pricemel@hawaii.edu for further details on the biology of the snails, or zac@hawaii.edu for further details on the data and analysis).

The data set consists of raw Illumina MiSeq reads that map to the mitochondrial genome, for each of the 6 populations of Achatinella sowerbyana. The files were generated with the program bbsplit (part of the bbmap package) by mapping whole-genome libraries to the mitochondrial reference (Achatinella_sowerbyana.fasta). Also included are a FASTA alignment of the consensus mitochondrial genomes for each population (ASO_mt_genomes.fasta) and a neighbor-joining (NJ) tree of the whole genome alignment (tree.jpg).

The data set has been tested with the dDocent 2.2.16 pipeline (installed via conda; see ddocent.com for further information), using the SE setting and the default parameters. It has also been run successfully with ipyrad [v.0.6.15] (see https://ipyrad.readthedocs.io/ for more information). Example input and output for each program are provided. These pipelines should produce a variety of output files, such as a .vcf file that can be converted to many other formats (with the program PGDSpider, for example) and analyzed in a wide variety of programs, depending on the underlying questions.
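As a quick sanity check before running a pipeline, the consensus alignment can be inspected with a few lines of Python. This is only a minimal sketch, assuming ASO_mt_genomes.fasta is a plain-text FASTA file in the current directory; the function names `read_fasta` and `is_aligned` are illustrative and not part of this repository or of either pipeline.

```python
import os
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record name: sequence} dict."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header text without the '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may wrap over several lines
    return records


def is_aligned(records: Dict[str, str]) -> bool:
    """True if there is at least one record and all share one length."""
    return len({len(seq) for seq in records.values()}) == 1


if __name__ == "__main__":
    path = "ASO_mt_genomes.fasta"  # assumed location: current directory
    if os.path.exists(path):
        with open(path) as handle:
            genomes = read_fasta(handle)
        for record_name, seq in genomes.items():
            print(record_name, len(seq))
        print("all sequences same length:", is_aligned(genomes))
```

If the file is a proper alignment, every record should report the same length, with one record per population.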

Although this is real ezRAD data, note that it is pooled mitochondrial data: it is not truly diploid and may not accurately represent more complex loci. Nevertheless, this idealized data set can be used for learning pipelines and for rapidly comparing results across different settings against a fairly well-known pattern of genetic structure.
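One simple way to compare runs with different settings is to count the variant records in the .vcf file each run produces. The sketch below assumes uncompressed VCF files whose paths are passed on the command line; the script name and the helper `count_variants` are hypothetical, and the actual output file names depend on the pipeline and its settings.

```python
import sys
from typing import Iterable


def count_variants(lines: Iterable[str]) -> int:
    """Count data records in a VCF: non-blank lines not starting with '#'."""
    total = 0
    for line in lines:
        if line.strip() and not line.startswith("#"):
            total += 1
    return total


if __name__ == "__main__":
    # Usage (hypothetical script name): python count_vcf.py run1.vcf run2.vcf
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, count_variants(handle))
```

A record count is only a coarse summary, but large differences between runs are a quick signal that a setting has changed how many sites are being called.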