-
Notifications
You must be signed in to change notification settings - Fork 10
/
discoaldoc.tex
242 lines (191 loc) · 20.4 KB
/
discoaldoc.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
\documentclass[12pt]{article}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{setspace}
\usepackage{indentfirst}
\usepackage{listings}
\usepackage{url}
\doublespacing
\textwidth = 6.5 in
\textheight = 9 in
\oddsidemargin = 0.0 in
\evensidemargin = 0.0 in
\topmargin = 0.0 in
\headheight = 0.0 in
\headsep = 0.0 in
\parskip = 0.0in
\parindent = 0.2in
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}{Definition}
\bibliographystyle{plain}
\begin{document}
%\bibliographystyle{plainnat}
\title{\textbf{\texttt{discoal}}-- a coalescent simulator with selection}
\author{Andrew D. Kern \\
\texttt{kern@biology.rutgers.edu}\\
}
\maketitle
This file serves as the documentation for \texttt{discoal}, a program aimed at generating samples
under a coalescent model with recombination, step-wise changes in population size, and selection.
For users familiar with Richard Hudson's \texttt{ms}, the usage of \texttt{discoal} will be quite
familiar, and indeed was meant to ``play nice'' with programs the user, or others, may have written
for analyzing \texttt{ms} style output. \texttt{discoal} is not meant to take the place of \texttt{ms},
indeed in comparison it is quite limited in what it can do, but instead is meant to add a few models
not covered by other simulation programs. In particular, \texttt{discoal} can quickly generate samples
from models with selective sweeps in the history of the sample along with stepwise population size
changes.
\texttt{discoal} gets its name from the contraction of ``discrete'' and ``coalescent'', because it
handles recombination along the chromosome, and its associated programatic bookkeeping, by modeling
a discrete number of ancestral sites. This leads to a trade off favoring speed over memory usage. For larger samples
and sites modeled, do not be surprised if \texttt{discoal} eats up a lot of available RAM, but it will
do so to increase the speed at which samples can be generated. This is particularly noticeable for larger
recombination rates.
\section*{Download and Compile}
The source code for \texttt{discoal} is available for download from \url{https://github.com/kern-lab}. Unpack the archive as you would any zip file and to compile you can use the following commands at a terminal:
\begin{verbatim}
$ cd <PATH TO DISCOAL>
$ make discoal
\end{verbatim}
\texttt{discoal} should build on most unix-like systems without a problem. We have built it on numerous linux systems as
well as OS X machines.
\section*{Basic usage}
As we said above, using \texttt{discoal} is very familiar for people used to \texttt{ms} style commands. At its
most basic a \texttt{discoal} command line looks like the following
\begin{verbatim}
$ ./discoal sampleSize numReplicates nSites -t theta
\end{verbatim}
Here there are four arguments we are passing to \texttt{discoal}: sampleSize -- the size of each sample, numReplicates -- the
number of independent samples to generate, nSites-- the number of sites in the sequence to be modeled, and then following the \texttt{-t}
flag $\theta = 4N_0u$, the population mutation rate where $N_0$ is the population size currently and $u$ is the mutation rate per generation for the entire locus. This is identical to
the command lines handed to \texttt{ms} with the addition of the nSites parameter. A representative run with its associated output might look like the following:
\begin{verbatim}
$ ./discoal 3 2 100 -t 2
./discoal 3 2 100 -t 2
1665047201 686400060
//
segsites: 4
positions: 0.01835 0.09557 0.46556 0.72880
1000
0110
0111
//
segsites: 1
positions: 0.07594
0
1
0
\end{verbatim}
Again, we are following Hudson's lead here with output formatted just as \texttt{ms} does. This should mean that any secondary analysis software
used for \texttt{ms} output should work for \texttt{discoal} output as well. One note about the seeding of the random number generator (seeds are given
on the second line of the output)-- seeds are taken from \texttt{/dev/urandom}, a special file on most $\star$nix systems that creates pseudorandom numbers
from collecting thermal noise from the devices on the machine. This makes \texttt{discoal} suitable out of the box for large scale cluster computing
where multiple jobs will be launched simultaneously without having to worry about random number generator seeds.
\section*{Recombination and Gene Conversion}
Recombination (crossing over) is handled by \texttt{discoal} by adding the \texttt{-r} flag. \texttt{-r} takes one parameter, $\rho=4Nr$ the population recombination rate where $r$ is the probability of a cross over per basepair of sequence being modeled and $N$ is the current population size. Recombination can occur between any of the discrete nSites being modeled. A representative call with recombination would be:
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4
\end{verbatim}
Gene conversion (recombination without exchange of flanking markers; sometimes called a non-crossover) within the modeled chromosome segment is also simulated by \texttt{discoal} by using the \texttt{-g} flag. The \texttt{-g} option takes two parameters, $\gamma=4Ng$ the population gene conversion rate where $g$ is the probability of initiating an NCO event per basepair. The second parameter needed is the mean gene conversion tract length, as once a gene conversion (NCO) is initiated it is extended for a geometrically distributed length. For instance:
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -g 2.4 10
\end{verbatim}
would specify $\gamma=2.4$ with a mean gene conversion tract length of 10bp.
\section*{Population size changes}
\texttt{discoal} can simulate an arbitrary number of step-wise, instantaneous population size changes. Each population size
change in the sample history is set with a \texttt{-en} flag which specifies the time of the population size change, the ID of the population to change (zero indexed), and the ratio of the new population size to the current population size, $N_0$. \texttt{discoal} measures time in units of $4N_0$ generations, as in \texttt{ms}, thus existing \texttt{ms} commands specifying instantaneous population size change can used directly in \texttt{discoal}. Multiple changes in population size can be specified by adding additional \texttt{-en} statements to the command line. By way of example, the following command line specifies a bottleneck population history where at time 0.5 the population crashes to 10\% of its initial size and then at time 1.2 it rebounds to 80\% of its initial size.
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -en 0.5 0 0.1 -en 1.2 0 0.8
\end{verbatim}
\section*{Multiple populations}
\texttt{discoal} can simulate the coalescent for multiple populations under models with both continuous gene flow and pulse admixture events. The \texttt{-p} flag establishes the number of populations and the respective sample sizes from each population. Note that the sample size of a population can be zero, indicating no sampled individuals came from that population. The \texttt{-M} flag sets all migration rates between populations to a specific value (scaled in units of $4N_0$). The \texttt{-m} flag allows individual migration rates between populations to be set separately. Here is an example of a 3 population island model where 2 alleles have been sampled from each population and where there is symmetric migration at rate $4N_0m = 0.05$
\begin{verbatim}
$ ./discoal 6 2 100 -t 2 -r 2.4 -p 3 2 2 2 -M 0.05
\end{verbatim}
Population splitting events (forward in time) can be modeled using the \texttt{-ed} flag. As above, it is important to note that \texttt{discoal} uses zero indexing for populations. Let's imagine we have three populations that diverge from one another such that two are sister taxa and one is an outgroup. Call $p_0$ the $0th$ population, then our tree might be (($p_0$,$p_1$),$p_2$). We can specify this topology and its associated divergence times (1.0 and 5.0) as
\begin{verbatim}
$ ./discoal 6 2 100 -t 2 -r 2.4 -p 3 2 2 2 -ed 1.0 0 1 -ed 5.0 1 2
\end{verbatim}
Note the command above doesn't have migration and is thus pure isolation. You can add that back in using the \texttt{-m} or \texttt{-M} flag.
We might want to imagine that at there was an admixture between $p_0$ and $p_2$ recently, say $0.02$ time units in the past, and that $15\%$ of individuals $p_0$ came from $p_2$ at that time. We could add this to our model using the \texttt{-ea} flag
\begin{verbatim}
$ ./discoal 6 2 100 -t 2 -r 2.4 -p 3 2 2 2 -ed 1.0 0 1 -ed 5.0 1 2 -ea 0.02 0 1 0 0.15
\end{verbatim}
The \texttt{-ea} flag takes as its arguments a time for the event, the ID of the admixed population, the IDs of the two founding populations (one can be the same as the admixed population), and the proportion of ancestry that comes from the first founding population ID listed.
One important note about using multiple population together with selection: \texttt{discoal} only models selective sweeps in population zero ($p_0$). When the simulation enters into a sweep phase we do not allow for migration. If we were to, we would need to model the allele trajectory in two or more populations concurrently, and we have not implemented this possibility. Also, we have not yet implemented an event that would allow migration rates to change over time, this is planned and should be available soon.
\section*{Selection}
\texttt{discoal} can simulate samples with single selective sweeps in the history of a sample, requiring the site under selection to be within the bounds of the modeled locus. This is done using the now conventional technique of altering the genealogy of a sample to be conditional upon the trajectory of an allele moving through the population to eventual fixation (forward in time) \cite{Bravermanetal1995,Kim:2002wd}. \texttt{discoal} can simulate both deterministic sweep trajectories \cite{Bravermanetal1995,Kim:2002wd} and stochastic trajectories \cite{Coop:2004pj,Przeworskietal2005}. In addition coalescent simulations can be generated conditional upon the fixation of a neutral mutation in the population \cite{Tajima:1990fx}. All of these cases are easily handled by \texttt{discoal} using the group of \texttt{-w} flags. \texttt{-wd} generates deterministic selective sweeps, \texttt{-ws} stochastic selective sweeps, and \texttt{-wn} performs neutral fixations. Each \texttt{-w} flag takes one parameter, $\tau$ the time of fixation looking backward in time. Thus if $\tau = 0$ the fixation of the focal site has occurred immediately prior to sampling; if $\tau > 0$ the fixation has occured tau units of $4N_0$ generations in the past. Two other parameters are necessary when specifying sweeps. 1) the location of the site that has fixed using the \texttt{-x} flag. \texttt{-x} for convenience takes a floating point number between 0 and 1, thus translating the discrete sites modeled to the real line (defaults to $0.5$). 2) The strength of selection given as $\alpha=2Ns$ specified via the \texttt{-a} flag.
Building on our previous example then, let's generate a sample with a single selective sweep at position 50 (the middle) in our locus with $\alpha=1000$ and $\tau=0.05$ using stochastic sweep trajectories
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -ws 0.05 -a 1000 -x 0.5
\end{verbatim}
This can also be combined with populations size changes to generate samples with stepwise constant population size and selective sweeps
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -ws 0.05 -a 1000 -x 0.5 -en 0.5 0 0.1 -en 1.2 0 0.8
\end{verbatim}
Another feature of \texttt{discoal} is its ability to generate what are now being called soft selective sweeps \cite{HermissonPennings2005,Przeworskietal2005}. To generate sweeps from standing variation the user can specify $f_0$, the frequency at which a previously neutral allele became beneficial. This is done using \texttt{-f} flag. For such situations, conditional allele frequency trajectories for the sweep site are generated as per \cite{Przeworskietal2005}. If one is instead interested in a soft sweep model with recurrent mutation towards a beneficial allele the \texttt{-uA} flag generates sweeps where one of the possible moves is a change in selective class through mutation for alleles not originally linked to the beneficial mutation \cite{PenningsHermisson2006}.
As an example, we can take our previous hard sweep command line and adjust it so that the mutation drifts until frequency 0.1 where it then becomes beneficial
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -ws 0.05 -a 1000 -x 0.5 -f 0.1
\end{verbatim}
\subsection*{Simulating selective sweeps, beyond the sampled locus}
Often we are interested in examining the effects of linked selection on a neutral locus, rather than examining the site of selection itself. To facilitate this \texttt{discoal} will perform simulations with sweeps where the sweep occurs at a specified genetic distance to the left of the locus sampled as well as the time of the sweep. This occurs if the user specifies the \texttt{-ls}, \texttt{-ld}, or \texttt{-ln} options corresponding to a stochastic selective sweep, a deterministic selective sweep, or a neutral fixation respectively. In this case the coalescent simulation is carried out according to the Braverman \emph{et al.} (1995) model, in which distance between the selected and neutral locus is scaled in units of $4Nr$.
\subsection*{Simulating recurrent selective sweep models}
\texttt{discoal} can generate coalescent samples that have undergone multiple selective sweeps in their history under a model of recurrent hitch hiking \cite{Bravermanetal1995}. The use can specify either the \texttt{-R} or \texttt{-L} flags which simulate recurrent hitch hiking either at the locus (\texttt{-R}) or to the side of the locus (\texttt{-L}) at a rate specified per 2N individuals per generation.
\subsection*{Partial sweeps}
\texttt{discoal} can also generate partial sweeps, both partial hard and partial soft, using the \texttt{-c} flag. Partial sweeps are parameterized by an ending sweep frequency ($< 1-\frac{1}{2N}$) which is assumed to be the frequency at which the selection ended. If used in combination with the \texttt{-f} flag a partial soft sweep will be simulated. For instance a single partial soft sweep that occurred at $\tau=0.05$ time units in the past and reached a frequency of $0.8$ could be specified with
\begin{verbatim}
$ ./discoal 3 2 100 -t 2 -r 2.4 -ws 0.05 -a 1000 -x 0.5 -f 0.1 -c 0.8
\end{verbatim}
The recurrent sweep flags from above can be used in conjunction with partial sweeps as well, simply by appending a \texttt{-c} flag with the appropriate final frequency. This case is modeled by assigning active gene copies to either selective background (beneficial vs. not) with probability determined by the final frequency of the partial sweep at the beginning of each sweep phase.
Upon output \texttt{discoal} by default will also output the selected SNP. Be careful about this behavior, because it will mean that estimates of $\theta$ will be incorrect. Moreover simulations with $\theta=0$ will contain one SNP- the selected SNP which is going through the population. Outputting this SNP can be depressed using the \texttt{-h} flag.
\subsection*{More options with selection}
Because of the way the structured coalescent proceeds during a sweep, time is tracking in small increments ($\delta t$) during a sweep. Throughout the literature on coalescent simulations with sweeps the actual value of $\delta t$ has varied considerably. We have given the user the option to change $\delta t$ as a function of a scalar of population size. By default $\delta t = 1/\sigma N$ with $\sigma=40$. The user can specify another value for $\sigma$ using the \texttt{-i} flag. Higher values of sigma lead to longer run times of the simulations. In our hands we have found very little difference between simulations run with $\sigma$ in the range $(4,400)$. We recommend proofing simulation parameter spaces with fast runs using \texttt{-i 4} and then for production using high values or the default.
Again due to scaling of time during the sweep phase of the simulation, there is a notion of an effective population size during the sweep that is separate from the other dynamics of the coalescent simulation (i.e. the $N$ in the $\delta t$ calculations above). We have chosen to allow the user to varying this sweep effective population size using the \texttt{-N} flag whose default value is $10^6$. Again in practice we have seen minimal differences between runs with different sweep effective population sizes.
\section*{Priors on parameters}
Sometimes when generating simulations one is interested in simulating over a range of parameter values. \texttt{discoal} allows uniform priors to be set on all of its parameters with the \texttt{-P} family of flags. For each \texttt{-P} option the user specifies a lower bound and an upper bound for the parameter of interest. The exception to this are the \texttt{-Pe1} and \texttt{-Pe2} flags which specify a prior on population size changes and their associated times. Table 1. shown below gives the usage for each
\begin{table}[hb]
\centering
\caption{Command line options for specifying parameter prior distributions}
\label{priortable}
\begin{tabular}{|l|l|l|l|}
\hline
-Pt & low & high & uniform distribution of $\theta$ drawn from the range [low, high] \\ \hline
-Pr & low & high & uniform distribution of $\rho$ (recombination rate for the entire locus) \\ \hline
-Pre & mean & upperBound & exponential distribution of $\rho$, truncated at upperbound \\ \hline
-Pa & low & high & uniform distribution of $\alpha$ (selection coefficient) \\ \hline
-Pu & low & high & uniform distribution of $\tau$ (time since fixation) \\ \hline
-PuA & low & high & uniform distribution of uA- recurrent adaptive mutation rate \\ \hline
-Px & low & high & uniform distribution of sweep position \\ \hline
-Pf & low & high & uniform distribution of $f_0$ (initial selected frequency) \\ \hline
\end{tabular}
\end{table}
\section*{Outputting trees}
Rather than output haplotype samples from the coalescent, \texttt{discoal} can also output the marginal trees associated with the ARG that has been simulated along the sequence. This can be done using the \texttt{-T} option. Each marginal tree is Newick formatted on an individual line with a number in brackets specifying the number of sites that have that tree. For instance:
\begin{verbatim}
$ ./discoal 3 1 10 -t 1 -r 5 -T
./discoal 3 1 10 -t 1 -r 5 -T
1491027497 1579365069
//
[1](1:2.437097,(2:1.727308,0:1.727308):0.709789);
[1]((0:0.633478,1:0.633478):1.191596,2:1.825074);
[1](2:1.953495,(0:0.761898,1:0.761898):1.191596);
[2](2:1.889839,(0:0.698243,1:0.698243):1.191596);
[1]((0:1.692594,1:1.692594):1.191596,2:2.884190);
[1]((0:2.655055,1:2.655055):1.191596,2:3.846651);
[3](0:0.210377,(2:0.177475,1:0.177475):0.032902);
\end{verbatim}
Here there were seven marginal trees produced in the ARG. Each of the first three marginal trees are only for a single site, where as the the fourth tree pertains to two sites. Individual samples are proceeded by `//' as before.
\section*{Ancient samples}
We have added a feature into \texttt{discoal} that allows for a subset of the sampled chromosomes to be `ancient', i.e. sampled before the present time. This is done using the \texttt{-A} option which takes as arguments the number of ancient samples (must be smaller than the total sample size), the population they were taken from (must be population 0 unless the \texttt{-p} flag has been used to setup more than one population), and a time of sampling. More than one \texttt{-A} flag can be used to indicate multiple ancient samples at different times or from different populations.
\section*{Some caveats}
\begin{itemize}
\item Currently \texttt{discoal} uses static memory allocations for vectors that encode ancestry information for nodes during simulation. This means that we have hard coded a limit on the number of sites allowed to be simulated to 220020. If the user wishes to simulate larger regions they will need to change the \texttt{MAXSITES} define in \texttt{discoal.h} and recompile.
\item \texttt{discoal} models sweep trajectories during population size changes with a stochastic jump process. If you ask for deterministic sweeps with population size changes it will throw an error.
\end{itemize}
\section*{Further support}
If you are having trouble compiling or running the software don't hesitate to contact me via email (kern@biology.rutgers.edu)
\bibliography{texrefs}
\end{document}