Skip to content

Commit

Permalink
Merge pull request #1650 from jeromekelleher/fix-verification-again
Browse files Browse the repository at this point in the history
Batch of docs updates
  • Loading branch information
jeromekelleher authored Apr 14, 2021
2 parents 1bf1847 + a45d7e6 commit a721867
Show file tree
Hide file tree
Showing 11 changed files with 560 additions and 289 deletions.
34 changes: 18 additions & 16 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Changelog

## [1.0.0b1] - 2021-03-31
## [1.0.0] - 2021-04-14

Msprime 1.0 is a major update, recommended for all users. It introduces
new APIs and many new features, which still retaining compatibility
Expand All @@ -19,8 +19,6 @@ with nearly all existing code.
coordinates. This fixes some long standing bugs and makes it far simpler
to add features such as gene conversion. ({user}`daniel-goldstein`)
- Support discrete or continuous genomes in simulations directly.
- Add an modern array oriented RateMap class as to replace the RecombinationMap
class ({user}`jeromekelleher`, {user}`grahamgower`, {user}`hyanwong`).
- Gene conversion ({user}`fbaumdicker`)
- Selective sweeps and low-level infrastructure for the structured coalescent
({user}`andrewkern`, {user}`gbinux`)
Expand All @@ -29,17 +27,11 @@ with nearly all existing code.
- Different ploidy levels for coalescent models ({user}`jerekoskela`)
- Functions to parse species trees and set up simulation models according
to the species tree. ({user}`mmatschiner` {issue}`893` {issue}`929` {issue}`931`)
- Complete provenance recording of all arguments to simulate and mutate.
Adds argument record_provenance to simulate, which allows recording of
provenances to be disabled, for example when they are large.
({user}`benjeffery` {pr}`914`).
- Add replicate_index to simulate, allowing output of a single tree sequence
from a set of replicates. ({user}`benjeffery` {pr}`914`).
- Details of the simulation are written to the DEBUG log periodically.
This can help debug long-running simulations. ({user}`jeromekelleher`,
{pr}`1080`).
- Add an modern array oriented RateMap class as to replace the RecombinationMap
class ({user}`jeromekelleher`, {user}`grahamgower`, {user}`hyanwong`).
- Functions to compute the likelihood of a particular ARG realisation
under the coalescent with recombination ({user}`jerekoskela`).
- Mutations are assigned times ({user}`petrelharp`)
- Finite sites mutations ({user}`petrelharp`)
- Several instances of matrix mutation models: JC69, GTR, HKY, F84, BLOSUM62,
PAM. ({user}`GertjanBisschop`, {user}`petrelharp`).
Expand All @@ -48,9 +40,19 @@ with nearly all existing code.
- Methods to track the possible location of lineages, and compute the
coalescence rates over time ({user}`apragsdale`, {user}`petrelharp`
{user}`grahamgower`).
- Improved command line interface features ({user}`winni2k`)
- Two new CLI commands, `msp ancestry` and `msp mutations` which correspond
to the `sim_ancestry` and `sim_mutations` functions. These can be pipelined
using stdin/stdout. ({user}`winni2k`, {user}`jeromekelleher`)
- Complete provenance recording of all arguments to simulate and mutate.
Adds argument record_provenance to simulate, which allows recording of
provenances to be disabled, for example when they are large.
({user}`benjeffery` {pr}`914`).
- Add replicate_index to simulate, allowing output of a single tree sequence
from a set of replicates. ({user}`benjeffery` {pr}`914`).
- Details of the simulation are written to the DEBUG log periodically.
This can help debug long-running simulations. ({user}`jeromekelleher`,
{pr}`1080`).
- Binary wheels for PyPI ({user}`benjeffery`)
- Mutations are assigned times ({user}`petrelharp`)

**Breaking changes**:

Expand All @@ -60,8 +62,8 @@ with nearly all existing code.
{pr}`1028`).
- The `simulate` function only takes one positional argument, and all other
arguments are keyword-only.
- The `msp` CLI has been stripped of all sub-commands except for
`simulate` and `mutate`. These sub-commands are provided by the `tskit`
- The `msp` CLI has been stripped of all existing sub-commands except for
`simulate`. The old sub-commands are provided by the `tskit`
CLI or the `TreeSequence` API in `tskit`.
- The `end_time` option now allows events up to and including the specified
max time. Previously, events occurred strictly before the max time.
Expand Down
1 change: 1 addition & 0 deletions docs/_toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
- file: ancestry
- file: mutations
- file: demography
- file: replication

- part: Interfaces
chapters:
Expand Down
117 changes: 5 additions & 112 deletions docs/ancestry.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,10 @@ kernelspec:
```{code-cell}
:tags: [remove-cell]
import msprime
from IPython.display import SVG
from matplotlib import pyplot as plt
# Doing this to make the notebook outputs deterministic.
# DO NOT DO THIS IN YOUR CODE
msprime.core.set_seed_rng_seed(42)
import msprime
from IPython.display import SVG
from matplotlib import pyplot as plt
import numpy as np
```

(sec_ancestry)=
Expand Down Expand Up @@ -799,105 +797,6 @@ chrom_ts
This gives us a list of tree sequences, one for each chromosome, in the order
that they were stitched together in the initial recombination map.

(sec_ancestry_controlling_randomness)=

## Controlling randomness

:::{important}
There is usually not much reason for setting a random seed when
running simulations for statistical analysis. Msprime will generate
high-quality random seeds by default. If you are concerned about
reproducibility, the tree sequence provenance contains all the
information required to reproduce any simulation.
:::

(sec_ancestry_random_seed)=

### Random seeds

Stochastic simulations depend on a source of randomness, provided
by a [psuedorandom number generator](<https://en.wikipedia.org/wiki/Pseudorandom_number_generator>).
Msprime uses the [GNU Scientific Library](<https://www.gnu.org/software/gsl/doc/html/rng.html>) to generate high-quality
random numbers. The particular trajectory produced by a pseudorandom number
generator is controlled by the "seed" it is provided.

By default, msprime generates random seeds using a private instance of
{class}`python:random.Random`, which should guarantee unique seeds are produced
even if (e.g.) many simulations are started at the same time in different
processes. In particular, simulations run concurrently in subprocesses using
{mod}`concurrent.futures` or {mod}`multiprocessing` will be assigned unique
seeds by default.

Thus, if we run two simulations with the same parameters, we will get different
results, as we can see from the different times of the last node:


```{code-cell}
:tags: [remove-cell]
# Make this deterministic. DON'T DO THIS IN YOUR CODE.
msprime.core.set_seed_rng_seed(412343)
```

```{code-cell}
ts = msprime.sim_ancestry(2)
SVG(ts.draw_svg(y_axis=True))
```

```{code-cell}
ts = msprime.sim_ancestry(2)
SVG(ts.draw_svg(y_axis=True))
```

The `random_seed` argument to {func}`.sim_ancestry` allows us specify
seeds explicitly, making the output of the simulation fully deterministic:


```{code-cell}
ts = msprime.sim_ancestry(2, random_seed=42)
SVG(ts.draw_svg(y_axis=True))
```

```{code-cell}
ts = msprime.sim_ancestry(2, random_seed=42)
SVG(ts.draw_svg(y_axis=True))
```

(sec_ancestry_replication)=

### Running replicate simulations

Simulations are random, and we will therefore usually want to have
many independent replicates for a particular set of parameters.
The `num_replicates` parameter provides a convenient and efficient
way to iterate over a number of replicate simulations. For example,
this is a good way to compute the mean and variance of the time to the most
recent common ancestor in a set of simulations:

```{code-cell}
import numpy as np
num_replicates = 100
tmrca = np.zeros(num_replicates)
replicates = msprime.sim_ancestry(10, num_replicates=num_replicates, random_seed=42)
for replicate_index, ts in enumerate(replicates):
tree = ts.first()
tmrca[replicate_index] = tree.time(tree.root)
np.mean(tmrca), np.var(tmrca)
```

It's important to note that the replicate simulations are generated
lazily here on demand - the `replicates` variable is a Python iterator.
This means we use much less memory that we would if we stored each
of the replicate simulations in a list.

:::{note}
The return type of {func}`.sim_ancestry` changes when we use the
`num_replicates` argument. If `num_replicates` is not specified
or `None`, we return an instance of {class}`tskit.TreeSequence`.
If it is specified, we return an *iterator* over
a set of {class}`tskit.TreeSequence` instances.
:::

## Recording more information

Expand Down Expand Up @@ -1902,12 +1801,6 @@ but only one merger event can take place at a given time. For {math}`p`-ploids,
up to {math}`2p` simultaneous mergers can take place, corresponding to the
{math}`2p` available parental chromosome copies.

```{code-cell}
:tags: [remove-cell]
msprime.core.set_seed_rng_seed(42)
```

The diploid Beta-Xi-coalescent can be simulated as follows:

```{code-cell}
Expand Down Expand Up @@ -2142,7 +2035,7 @@ more discussion on this important point.

Once we've set up the replicate simulations we can compute the
windows for plotting, run the actual simulations
(see the {ref}`sec_ancestry_replication` section for more details)
(see the {ref}`sec_randomness_replication` section for more details)
and compute the
{meth}`pairwise diversity<tskit.TreeSequence.diversity>`.

Expand Down
15 changes: 2 additions & 13 deletions docs/legacy.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,19 +46,8 @@ to be supported indefinitely**.

One major change is that {ref}`ancestry<sec_ancestry>` and
{ref}`mutations<sec_mutations>` must be simulated **separately**.
A good pattern for this when performing
{ref}`replicate simulations<sec_ancestry_replication>` is

```{code-cell}
import msprime
ancestry_replicates = msprime.sim_ancestry(10, num_replicates=10)
for ts in ancestry_replicates:
ts = msprime.sim_mutations(ts, rate=0.1)
# Do something with your mutated tree sequence now.
```

This approach is memory efficient and uses high-quality random seeds.
See the {ref}`sec_randomness_replication_mutations` section for
idiomatic examples of how to do this efficiently.

### Ancestry

Expand Down
32 changes: 3 additions & 29 deletions docs/mutations.md
Original file line number Diff line number Diff line change
Expand Up @@ -154,34 +154,6 @@ position of the {class}`.RateMap` is the same as the
:::


(sec_mutations_randomness)=

## Controlling randomness

As described in {ref}`sec_ancestry_random_seed`, ``msprime`` uses a
[psuedorandom number generator](<https://en.wikipedia.org/wiki/Pseudorandom_number_generator>)
from the [GNU Scientific Library](<https://www.gnu.org/software/gsl/doc/html/rng.html>) for
random number generation. Passing an integer to the ``random_seed`` parameter
defines a trajectory for a call of {func}`.sim_mutations`, making output deterministic.

```{code-cell}
ts = msprime.sim_ancestry(10, random_seed=1)
mts_1 = msprime.sim_mutations(ts, rate=1, random_seed=7)
mts_2 = msprime.sim_mutations(ts, rate=1, random_seed=8)
mts_1.tables.equals(mts_2.tables, ignore_timestamps=True)
```

Here, we check for equality between the TableCollections of the `mts_1` and `mts_2` tree
sequences, excluding the differing timestamps in Provenance tables. The result is
`False` because we used different seeds.

```{code-cell}
mts_3 = msprime.sim_mutations(ts, rate=1, random_seed=7)
mts_1.tables.equals(mts_3.tables, ignore_timestamps=True)
```

When we use the same seed, the resulting tree sequence is identical.


(sec_mutations_discrete)=

Expand Down Expand Up @@ -277,7 +249,9 @@ def count_transitions(ts, alleles):
print(f"{a}\t", "\t".join(map(str, counts[j])))
model = msprime.HKY(kappa=0.75, equilibrium_frequencies=[0.2, 0.3, 0.3, 0.2])
ts = msprime.sim_ancestry(5, random_seed=1, sequence_length=1e7, recombination_rate=1e-8, population_size=1000)
ts = msprime.sim_ancestry(
5, random_seed=1, sequence_length=1e7, recombination_rate=1e-8,
population_size=1000)
mts = msprime.sim_mutations(ts, rate=1e-8, model=model, random_seed=1)
count_transitions(mts, model.alleles)
```
Expand Down
Loading

0 comments on commit a721867

Please sign in to comment.