This repository contains the datasets and scripts used in the following paper:
- Y. Tabatabaee, C. Zhang, S. Mirarab (2024). Species tree branch length estimation despite incomplete lineage sorting, duplication, and loss
For the experiments in this study, we analyzed three sets of simulated datasets and nine biological datasets with different sources of gene tree discordance (details below). All datasets can be accessed from this Google Drive link. In all simulated datasets, the true species trees have branch lengths in substitution units.
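As a quick sanity check on the substitution-unit branch lengths, a species tree file can be inspected with a few lines of Python. The sketch below is illustrative only: the `branch_lengths` helper and the toy tree are hypothetical and not part of this repository, and the regular expression assumes plain Newick without quoted labels or bracketed comments.

```python
import re


def branch_lengths(newick: str) -> list[float]:
    """Return every branch length found in a Newick string, in order of appearance."""
    # A branch length follows a colon, e.g. ":0.05" or ":1.2e-3".
    # Assumption: no quoted labels or comments that contain colons.
    pattern = r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
    return [float(x) for x in re.findall(pattern, newick)]


# Hypothetical toy tree, not taken from the datasets.
toy_tree = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.15);"
lengths = branch_lengths(toy_tree)
total = sum(lengths)  # total tree length in the tree's own units
print(f"{len(lengths)} branches, total length {total:.3f}")
```

To apply this to a real tree, read the Newick string from the downloaded file and pass it to `branch_lengths`; a dedicated phylogenetics library is a better choice for anything beyond a quick check.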
ILS-only simulations
- We reused the 101-taxon ILS-only dataset from Tabatabaee et al. (2023), available at https://github.com/ytabatabaee/CASTLES-paper/. Results and intermediate data from the experiments in the paper are available here.
HGT+ILS simulations
- We generated a new 51-taxon HGT+ILS dataset based on the parameters from the Davidson et al. (2015) study. The original dataset is available at https://databank.illinois.edu/datasets/IDB-6670066. Our new dataset, the intermediate data, and the outputs of all methods are available here.
GDL+ILS simulations
- We updated the GDL+ILS dataset from Willson et al. (2022, 2023) so that the species trees have substitution-unit branch lengths. The original datasets are available at https://databank.illinois.edu/datasets/IDB-4050038 and https://databank.illinois.edu/datasets/IDB-5748609. Results and intermediate data from the experiments in the paper are available here.
Biological datasets
- Birds: 363-taxon dataset from Stiller et al. (2024) with 63,430 single-copy genes.
- Bees: 32-taxon dataset from Bossert et al. (2021) with 853 single-copy genes.
- Mammals: 37-taxon dataset from Song et al. (2012) with 424 single-copy genes.
- Fungi: 16-taxon dataset from Butler et al. (2009) with 706 single-copy genes and 7,180 multi-copy genes.
- Plants: 80-taxon dataset from Wickett et al. (2014) with 424 single-copy genes and 9,610 multi-copy genes.
- Eudicots: 40-taxon dataset from Chanderbali et al. (2022) with 345 single-copy genes and 2,573 multi-copy genes.
- Bacterial (core genes): 72-taxon dataset from Williams et al. (2020) with 49 single-copy genes.
- Bacterial (non-ribosomal genes): 108-taxon dataset from Petitjean et al. (2015) with 38 single-copy genes.
- Bacterial (WoL): 10,575-taxon dataset from Zhu et al. (2019) with 381 single-copy genes.