-
Notifications
You must be signed in to change notification settings - Fork 44
cuny2020
layout: workshop title: "IQ-TREE Workshop Tutorial" author: AUTHOR date: DATE docid: 100 tags:
- workshop
Table of Contents generated with DocToc
- 1) Input data
- 2) Inferring the first phylogeny
- 3) Applying partition model
- 4) Choosing the best partitioning scheme
- 5) Tree topology tests
- 6) Concordance factors
- 7) Resampling partitions and sites
- 8) Identifying most influential genes
- 9) Wrapping up
- Before you start
To do this tutorial you'll need to install a few programs and make sure they are working:
It's important that you have these programs working on whatever machine you will use before the tutorial starts - there won't be time in the tutorial itself to troubleshoot installation issues.
For the rest fo the tutorial, the folder containing your iqtree
executable should be added to your PATH environment variable so that IQ-TREE can be invoked by simply entering iqtree
at the command-line. Alternatively, you can also copy iqtree
binary into your system search.
You will know if IQ-TREE is installed properly if when you run the command
iqtree
you see something like this on the screen:
IQ-TREE multicore version 2.0.3 for Linux 64-bit built Apr 26 2020``
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,``
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.``
- Input data
For this tutorial we'll use a subset of a dataset that was put together to figure out the relationships among Turtles, Crocodiles, and Birds. This relationship has been surprisingly tough to pin down, even with really big datasets. So it's perfect for applying some more advanced tools than simply running a tree.
Please download the following input files:
- turtle.fa: The DNA alignment (in FASTA format), which is a subset of the original Turtle data set used to assess the phylogenetic position of Turtle relative to Crocodile and Bird (Chiari et al., 2012).
- turtle.nex: The partition file (in NEXUS format) defining 29 genes, which are a subset of the published 248 genes (Chiari et al., 2012).
One of the most important things in ANY phylogenetic analysis is the alignment. You should never do a phylgenetic analysis without looking at all of your alignments first. So let's start there with a couple of basic sanity checks.
QUESTIONS:
View the alignment in Jalview or your favourite alignment viewer.
Can you identify the gene boundaries from the viewer? Does they roughly match the partition file?
Is there missing data? Which taxa seem to have most missing data?
Do you think missing data might be problematic? {: .tip}
- Inferring a basic phylogeny
Now we'll reconstruct a tree with basic settings in IQ-TREE. This will reconstruct a concatenated Maximum-Likelihood tree for the Turtle data set. When you run IQ-TREE you'll see that it does all sorts of work to choose a good model, pick the right number of CPUs to run on, etc.
Note that the commandline assumes you are working from the same folder as your alignment. So if you are not already in that folder, you should cd
to it now.
iqtree -s turtle.fa -bb 1000 -nt AUTO
Options explained:
-
-s turtle.fa
to specify the input alignment asturtle.fa
. -
-bb 1000
to specify 1000 replicates for the ultrafast bootstrap (Minh et al., 2013). -
-nt AUTO
to determine the best number of CPU cores to speed up the analysis.
This simple command will perform three important steps in one go:
- Select best-fit model using ModelFinder (Kalyaanamoorthy et al., 2017).
- Reconstruct the ML tree using the IQ-TREE search algorithm (Nguyen et al., 2015).
- Assess branch supports using the ultrafast bootstrap - UFBoot (Minh et al., 2013).
Once the run is done, IQ-TREE will write several output files including:
-
turtle.fa.iqtree
: the main report file that is self-readable. You should look at this file to see the computational results. It also contains a textual representation of the final tree. -
turtle.fa.treefile
: the ML tree in NEWICK format, which can be visualized in FigTree or any other tree viewer program. -
turtle.fa.log
: log file of the entire run (also printed on the screen). -
turtle.fa.ckp.gz
: checkpoint file used to resume an interrupted analysis. - And a few other files.
QUESTIONS:
Look at the report file
turtle.fa.iqtree
. What is the best-fit model? What do you know about this model?Visualise the tree
turtle.fa.treefile
in FigTree.Compare the tree with the published tree (Chiari et al., 2012). Are they the same or different?
If different, where are the difference(s)?
Look at the boostrap supports. Which branch(es) have a low support? {: .tip}
- Using a partitioned model
We now perform a partitioned model analysis (Chernomor et al., 2016), where one allows each partition to have its own model. This can be useful - often genes in an alignment come from different very different parts of the genome and evolve in very different ways. It usually makes sense to try and allow for this in our models of molecular evolution:
iqtree -s turtle.fa -spp turtle.nex -bb 1000 -nt AUTO
Options explained:
-
-spp turtle.nex
to specify an edge-linked proportional partition model (Chernomor et al., 2016). That means, there is one set of branch lengths. But each partition is allowed to evolve at its own rate. This is usually a sensible option for most datasets (Duchene et al., 2019).
QUESTIONS:
Look at the report file
turtle.nex.iqtree
. What are the slowest- and fastest-evolving genes?Compare the AIC/AICc/BIC score of partitioned model versus the un-partitioned model above. Which model is better?
Visualise the tree
turtle.nex.treefile
in Figtree and compare it with the tree from the un-partitioned model. Are they the same or different? If different, where is the difference? Which tree agrees with the published tree (Chiari et al., 2012)?Look at the boostrap supports. Which branch(es) have a low support? {: .tip}
- Choosing the best partitioning scheme
The partitioned model is great. But one thing we risk by giving EVERY gene its own model is that we might be overparameterising our model. That is, we might be trying to infer more parameters than is justified by the limited information in our dataset. To try and address this, we now use the PartitionFinder algorithm (Lanfear et al., 2012) that tries to merge partitions to reduce the potential overparameterisation.
iqtree -s turtle.fa -spp turtle.nex -bb 1000 -nt AUTO -m MFP+MERGE -rcluster 10 -pre turtle.merge
Options explained:
-
-m MFP+MERGE
to perform PartitionFinder followed by tree reconstruction. -
-rcluster 10
to reduce computations by only examining the top 10% partitioning schemes using the relaxed clustering algorithm (Lanfear et al., 2014). -
-pre turtle.merge
to set the prefix for all output files asturtle.merge.*
. This is to avoid overwriting outputs from the previous analysis.
QUESTIONS:
Look at the report file
turtle.merge.iqtree
. How many partitions do we have now?Look at the AIC/AICc/BIC scores. Is it better or worse than those of the un-partition and partition models done previously?
How does the tree look now? How high/low are the bootstrap supports? {: .tip}
- Concordance factors
So far we have assumed that there is just one tree, i.e. that all of the gene trees are the same, and that they are all equal to the species tree (the thing we are really interested in). However, it is well known that gene trees might be can differ from each other, and from the species tree. Maybe there are errors in the gene trees, maybe there is incomplete lineage sorting, hybridisation, contamination in our data, etc. etc. One way to map this kind of variation onto a species tree is to use concordance factors. These will describe the underlying variation in the data with respect to each branch in the tree at the level of the gene (the gene concordance factor or gCF) and the site (the site concordance factor or sCF) (Minh et al., 2018).
You can read a lot more about concordance factors in Minh et al., 2018 and on Rob's blog here. It's really important to note that bootstraps and concordance factors measure very different things. The bootstrap is like the standard error of a point estimate (a branch in the tree). Concordance factors are more like the standard deviation of that estimate. So, just as it's possible to have a very precise measurement of the mean from a very spread-out distribution of numbers (e.g. if we've sampled a lot of numbers from that distribution), it's possible to have a very high bootstrap support for a branch (e.g. 100%) but a very low concordance factor (i.e. lots of disagreement among genes or sites).
To calculate gCFs, we first need to calculate the tree for every gene separately. That's easy in IQ-TREE:
iqtree -s turtle.fa -S turtle.nex -pre turtle.loci -nt 2
Options explained:
-
-S turtle.nex
to tell IQ-TREE to infer separate trees for every partition inturtle.nex
. All output files are similar to a partition analysis, except that the treeturtle.loci.treefile
now contains a set of gene trees.
Definitions:
Gene concordance factor (gCF) is the percentage of decisive gene trees concordant with a particular branch of the species tree (0% <= gCF(b) <= 100%). gCF=0% means that branch b does not occur in any gene trees, whereas gCF=100% means that branch b occurs in every gene tree.
Site concordance factor (sCF) is the percentage of decisive (parsimony informative) alignment sites supporting a particular branch of the species tree (~33% <= sCF(b) <= 100%). sCF<33% means that another discordant branch b' is more supported, whereas sCF=100% means that branch b is supported by all sites.
You can now compute the gCF and sCF values for the tree inferred under the partition model:
iqtree-beta -t turtle.nex.treefile --gcf turtle.loci.treefile -s turtle.fa --scf 100
Options explained:
-
-t turtle.nex.treefile
to specify a species tree. -
--gcf turtle.loci.treefile
to specify a gene-trees file. -
--scf 100
to draw 100 random quartets when computing sCF.
Once finished this run will write several files:
-
turtle.nex.treefile.cf.tree
: tree file where branches are annotated with bootstrap/gCF/sCF values. -
turtle.nex.treefile.cf.stat
: a table file with various statistics for every branch of the tree.
Similarly, you can compute gCF and sCF for the tree under unpartitioned model:
iqtree-beta -t turtle.fa.treefile --gcf turtle.loci.treefile -s turtle.fa --scf 100
QUESTIONS:
Visualise
turtle.nex.treefile.cf.tree
in FigTree.How do gCF and sCF values look compared with bootstrap supports?
Visualise
turtle.fa.treefile.cf.tree
. How do these values look like now on the contradicting branch? {: .tip}
Here are a couple of things to try if you have finished the tutorial but wish to understand more.
- Tree topology tests
We now want to know whether the trees inferred for the Turtle data set have significantly different log-likelihoods or not. Specifically, we may want to ask whether we can reject other trees in favour of the best tree we estimated (the one using the best model). This can be conducted with the SH test (Shimodaira and Hasegawa, 1999), or expected likelihood weights (Strimmer and Rambaut, 2002). Note that tree topology tests make a fair few assumptions about the data. The best place to go to really understand what you are doing when using these tests is Goldman et al's excellent paper (Goldman et. al, 2000)
First, concatenate the trees constructed by single and partition models into one file:
For Linux/MacOS:
cat turtle.fa.treefile turtle.nex.treefile >turtle.trees
For Windows:
type turtle.fa.treefile turtle.nex.treefile >turtle.trees
Now pass this file into IQ-TREE via -z
option:
iqtree -s turtle.fa -spp turtle.nex.best_scheme.nex -z turtle.trees -zb 1000 -n 0 -wpl -pre turtle.test
Options explained:
-
-spp turtle.nex.best_scheme.nex
to provide the partition model found previously to avoid running ModelFinder again. -
-z turtle.trees
to input a set of trees. -
-zb 1000
to specify 1000 replicates for approximate boostrap for tree topology tests. -
-n 0
to avoid tree search and just perform tree topology tests. -
-wpl
to print partition-wise log likelihoods for both trees. This will be used later in the next section. -
-pre turtle.test
to set the prefix for all output files asturtle.test.*
.
QUESTIONS:
Look at the report file
turtle.test.iqtree
. There is a new section calledUSER TREES
.Do the two trees have significantly different log-likelihoods? {: .tip}
HINTS:
- The KH and SH tests return p-values, thus a tree is rejected if its p-value < 0.05 (marked with a
-
sign). - bp-RELL and c-ELW return posterior weights which are not p-value. The weights sum up to 1 across the trees tested.
- Resampling partitions and sites
Instead of bootstrap resampling sites, it is recommended to resample partitions and then sites within resampled partitions (Hoang et al., 2018). This may help to reduce over-confident branch supports.
iqtree -s turtle.fa -spp turtle.nex -bb 1000 -nt AUTO -bsam GENESITE -pre turtle.bsam
Options explained:
-
-bsam GENESITE
to turn on resampling partition and sites strategy. -
-pre turtle.bsam
to set the prefix for all output files asturtle.bsam.*
. This is to avoid overwriting outputs from the previous analysis.
QUESTIONS:
Is there any change in the tree topology?
Do the bootstrap support values get smaller or larger? {: .tip}
- Identifying most influential genes
Now we want to investigate the cause for such topological difference between trees inferred by single and partition model. One way is to identify genes contributing most phylogenetic signal towards one tree but not the other.
How can one do this? Well, we can look at the gene-wise log-likelihood (logL) differences between the two given trees T1 and T2. Those genes having the largest logL(T1)-logL(T2) will be in favor of T1. Whereas genes showing the largest logL(T2)-logL(T1) are favoring T2.
With the -wpl
option done above, IQ-TREE will write partition-wise log-likelihoods into turtle.test.partlh
file.
QUESTIONS:
Import this file into MS Excel. Compute the partition wise log-likelihood differences between two trees.
What are the two genes that most favor the tree inferred by single model?
Have a look at the paper by (Brown and Thomson, 2016). Compare the two genes you found with those from this paper. What is special about these two genes? {: .tip}
- Wrapping up
FINAL QUESTION:
- Given all analyses done in this tutorial, which tree do you think is the true tree?
Copyright (c) 2010-2016 IQ-TREE development team.
- First example
- Model selection
- New model selection
- Codon models
- Binary, Morphological, SNPs
- Ultrafast bootstrap
- Nonparametric bootstrap
- Single branch tests
- Partitioned analysis
- Partitioning with mixed data
- Partition scheme selection
- Bootstrapping partition model
- Utilizing multi-core CPUs
- Tree topology tests
- User-defined models
- Consensus construction and bootstrap value assignment
- Computing Robinson-Foulds distance
- Generating random trees
- DNA models
- Protein models
- Codon models
- Binary, morphological models
- Ascertainment bias correction
- Rate heterogeneity
- Counts files
- First running example
- Substitution models
- Virtual population size
- Sampling method
- Bootstrap branch support
- Interpretation of branch lengths