Merge pull request #1650 from jeromekelleher/fix-verification-again

Batch of docs updates
tskit-dev · Apr 14, 2021 · a721867 · a721867
2 parents 1bf1847 + a45d7e6
commit a721867
Show file tree

Hide file tree

Showing 11 changed files with 560 additions and 289 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog
 
-## [1.0.0b1] - 2021-03-31
+## [1.0.0] - 2021-04-14
 
 Msprime 1.0 is a major update, recommended for all users. It introduces
 new APIs and many new features, which still retaining compatibility
@@ -19,8 +19,6 @@ with nearly all existing code.
   coordinates. This fixes some long standing bugs and makes it far simpler
   to add features such as gene conversion. ({user}`daniel-goldstein`)
 - Support discrete or continuous genomes in simulations directly.
-- Add an modern array oriented RateMap class as to replace the RecombinationMap
-  class ({user}`jeromekelleher`, {user}`grahamgower`, {user}`hyanwong`).
 - Gene conversion ({user}`fbaumdicker`)
 - Selective sweeps and low-level infrastructure for the structured coalescent
   ({user}`andrewkern`, {user}`gbinux`)
@@ -29,17 +27,11 @@ with nearly all existing code.
 - Different ploidy levels for coalescent models ({user}`jerekoskela`)
 - Functions to parse species trees and set up simulation models according
   to the species tree. ({user}`mmatschiner` {issue}`893` {issue}`929` {issue}`931`)
-- Complete provenance recording of all arguments to simulate and mutate.
-  Adds argument record_provenance to simulate, which allows recording of
-  provenances to be disabled, for example when they are large.
-  ({user}`benjeffery` {pr}`914`).
-- Add replicate_index to simulate, allowing output of a single tree sequence
-  from a set of replicates. ({user}`benjeffery` {pr}`914`).
-- Details of the simulation are written to the DEBUG log periodically.
-  This can help debug long-running simulations. ({user}`jeromekelleher`,
-  {pr}`1080`).
+- Add an modern array oriented RateMap class as to replace the RecombinationMap
+  class ({user}`jeromekelleher`, {user}`grahamgower`, {user}`hyanwong`).
 - Functions to compute the likelihood of a particular ARG realisation
   under the coalescent with recombination ({user}`jerekoskela`).
+- Mutations are assigned times ({user}`petrelharp`)
 - Finite sites mutations ({user}`petrelharp`)
 - Several instances of matrix mutation models: JC69, GTR, HKY, F84, BLOSUM62,
   PAM. ({user}`GertjanBisschop`, {user}`petrelharp`).
@@ -48,9 +40,19 @@ with nearly all existing code.
 - Methods to track the possible location of lineages, and compute the
   coalescence rates over time ({user}`apragsdale`, {user}`petrelharp`
   {user}`grahamgower`).
-- Improved command line interface features ({user}`winni2k`)
+- Two new CLI commands, `msp ancestry` and `msp mutations` which correspond
+  to the `sim_ancestry` and `sim_mutations` functions. These can be pipelined
+  using stdin/stdout. ({user}`winni2k`, {user}`jeromekelleher`)
+- Complete provenance recording of all arguments to simulate and mutate.
+  Adds argument record_provenance to simulate, which allows recording of
+  provenances to be disabled, for example when they are large.
+  ({user}`benjeffery` {pr}`914`).
+- Add replicate_index to simulate, allowing output of a single tree sequence
+  from a set of replicates. ({user}`benjeffery` {pr}`914`).
+- Details of the simulation are written to the DEBUG log periodically.
+  This can help debug long-running simulations. ({user}`jeromekelleher`,
+  {pr}`1080`).
 - Binary wheels for PyPI ({user}`benjeffery`)
-- Mutations are assigned times ({user}`petrelharp`)
 
 **Breaking changes**:
 
@@ -60,8 +62,8 @@ with nearly all existing code.
   {pr}`1028`).
 - The `simulate` function only takes one positional argument, and all other
   arguments are keyword-only.
-- The `msp` CLI has been stripped of all sub-commands except for
-  `simulate` and `mutate`. These sub-commands are provided by the `tskit`
+- The `msp` CLI has been stripped of all existing sub-commands except for
+  `simulate`. The old sub-commands are provided by the `tskit`
   CLI or the `TreeSequence` API in `tskit`.
 - The `end_time` option now allows events up to and including the specified
   max time. Previously, events occurred strictly before the max time.

diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -9,6 +9,7 @@
   - file: ancestry
   - file: mutations
   - file: demography
+  - file: replication
 
 - part: Interfaces
   chapters:

diff --git a/docs/ancestry.md b/docs/ancestry.md
@@ -15,12 +15,10 @@ kernelspec:
 ```{code-cell}
 :tags: [remove-cell]
 
-    import msprime
-    from IPython.display import SVG
-    from matplotlib import pyplot as plt
-    # Doing this to make the notebook outputs deterministic.
-    # DO NOT DO THIS IN YOUR CODE
-    msprime.core.set_seed_rng_seed(42)
+import msprime
+from IPython.display import SVG
+from matplotlib import pyplot as plt
+import numpy as np
 ```
 
 (sec_ancestry)=
@@ -799,105 +797,6 @@ chrom_ts
 This gives us a list of tree sequences, one for each chromosome, in the order
 that they were stitched together in the initial recombination map.
 
-(sec_ancestry_controlling_randomness)=
-
-## Controlling randomness
-
-:::{important}
-There is usually not much reason for setting a random seed when
-running simulations for statistical analysis. Msprime will generate
-high-quality random seeds by default. If you are concerned about
-reproducibility, the tree sequence provenance contains all the
-information required to reproduce any simulation.
-:::
-
-(sec_ancestry_random_seed)=
-
-### Random seeds
-
-Stochastic simulations depend on a source of randomness, provided
-by a [psuedorandom number generator](<https://en.wikipedia.org/wiki/Pseudorandom_number_generator>).
-Msprime uses the [GNU Scientific Library](<https://www.gnu.org/software/gsl/doc/html/rng.html>) to generate high-quality
-random numbers. The particular trajectory produced by a pseudorandom number
-generator is controlled by the "seed" it is provided.
-
-By default, msprime generates random seeds using a private instance of
-{class}`python:random.Random`, which should guarantee unique seeds are produced
-even if (e.g.) many simulations are started at the same time in different
-processes. In particular, simulations run concurrently in subprocesses using
-{mod}`concurrent.futures` or {mod}`multiprocessing` will be assigned unique
-seeds by default.
-
-Thus, if we run two simulations with the same parameters, we will get different
-results, as we can see from the different times of the last node:
-
-
-```{code-cell}
-:tags: [remove-cell]
-
-# Make this deterministic. DON'T DO THIS IN YOUR CODE.
-msprime.core.set_seed_rng_seed(412343)
-```
-
-```{code-cell}
-ts = msprime.sim_ancestry(2)
-SVG(ts.draw_svg(y_axis=True))
-```
-
-```{code-cell}
-ts = msprime.sim_ancestry(2)
-SVG(ts.draw_svg(y_axis=True))
-```
-
-The `random_seed` argument to {func}`.sim_ancestry` allows us specify
-seeds explicitly, making the output of the simulation fully deterministic:
-
-
-```{code-cell}
-ts = msprime.sim_ancestry(2, random_seed=42)
-SVG(ts.draw_svg(y_axis=True))
-```
-
-```{code-cell}
-ts = msprime.sim_ancestry(2, random_seed=42)
-SVG(ts.draw_svg(y_axis=True))
-```
-
-(sec_ancestry_replication)=
-
-### Running replicate simulations
-
-Simulations are random, and we will therefore usually want to have
-many independent replicates for a particular set of parameters.
-The `num_replicates` parameter provides a convenient and efficient
-way to iterate over a number of replicate simulations. For example,
-this is a good way to compute the mean and variance of the time to the most
-recent common ancestor in a set of simulations:
-
-```{code-cell}
-import numpy as np
-
-num_replicates = 100
-tmrca = np.zeros(num_replicates)
-replicates = msprime.sim_ancestry(10, num_replicates=num_replicates, random_seed=42)
-for replicate_index, ts in enumerate(replicates):
-    tree = ts.first()
-    tmrca[replicate_index] = tree.time(tree.root)
-np.mean(tmrca), np.var(tmrca)
-```
-
-It's important to note that the replicate simulations are generated
-lazily here on demand - the `replicates` variable is a Python iterator.
-This means we use much less memory that we would if we stored each
-of the replicate simulations in a list.
-
-:::{note}
-The return type of {func}`.sim_ancestry` changes when we use the
-`num_replicates` argument. If `num_replicates` is not specified
-or `None`, we return an instance of {class}`tskit.TreeSequence`.
-If it is specified, we return an *iterator* over
-a set of {class}`tskit.TreeSequence` instances.
-:::
 
 ## Recording more information
 
@@ -1902,12 +1801,6 @@ but only one merger event can take place at a given time. For {math}`p`-ploids,
 up to {math}`2p` simultaneous mergers can take place, corresponding to the
 {math}`2p` available parental chromosome copies.
 
-```{code-cell}
-:tags: [remove-cell]
-
-msprime.core.set_seed_rng_seed(42)
-```
-
 The diploid Beta-Xi-coalescent can be simulated as follows:
 
 ```{code-cell}
@@ -2142,7 +2035,7 @@ more discussion on this important point.
 
 Once we've set up the replicate simulations we can compute the
 windows for plotting, run the actual simulations
-(see the {ref}`sec_ancestry_replication` section for more details)
+(see the {ref}`sec_randomness_replication` section for more details)
 and compute the
 {meth}`pairwise diversity<tskit.TreeSequence.diversity>`.
 

diff --git a/docs/legacy.md b/docs/legacy.md
@@ -46,19 +46,8 @@ to be supported indefinitely**.
 
 One major change is that {ref}`ancestry<sec_ancestry>` and
 {ref}`mutations<sec_mutations>` must be simulated **separately**.
-A good pattern for this when performing
-{ref}`replicate simulations<sec_ancestry_replication>` is
-
-```{code-cell}
-import msprime
-
-ancestry_replicates = msprime.sim_ancestry(10, num_replicates=10)
-for ts in ancestry_replicates:
-    ts = msprime.sim_mutations(ts, rate=0.1)
-    # Do something with your mutated tree sequence now.
-```
-
-This approach is memory efficient and uses high-quality random seeds.
+See the {ref}`sec_randomness_replication_mutations` section for
+idiomatic examples of how to do this efficiently.
 
 ### Ancestry
 

diff --git a/docs/mutations.md b/docs/mutations.md
@@ -154,34 +154,6 @@ position of the {class}`.RateMap` is the same as the
 :::
 
 
-(sec_mutations_randomness)=
-
-## Controlling randomness
-
-As described in {ref}`sec_ancestry_random_seed`, ``msprime`` uses a
-[psuedorandom number generator](<https://en.wikipedia.org/wiki/Pseudorandom_number_generator>)
-from the [GNU Scientific Library](<https://www.gnu.org/software/gsl/doc/html/rng.html>) for
-random number generation. Passing an integer to the ``random_seed`` parameter
-defines a trajectory for a call of {func}`.sim_mutations`, making output deterministic.
-
-```{code-cell}
-ts = msprime.sim_ancestry(10, random_seed=1)
-mts_1 = msprime.sim_mutations(ts, rate=1, random_seed=7)
-mts_2 = msprime.sim_mutations(ts, rate=1, random_seed=8)
-mts_1.tables.equals(mts_2.tables, ignore_timestamps=True)
-```
-
-Here, we check for equality between the TableCollections of the `mts_1` and `mts_2` tree
-sequences, excluding the differing timestamps in Provenance tables. The result is
-`False` because we used different seeds.
-
-```{code-cell}
-mts_3 = msprime.sim_mutations(ts, rate=1, random_seed=7)
-mts_1.tables.equals(mts_3.tables, ignore_timestamps=True)
-```
-
-When we use the same seed, the resulting tree sequence is identical.
-
 
 (sec_mutations_discrete)=
 
@@ -277,7 +249,9 @@ def count_transitions(ts, alleles):
         print(f"{a}\t", "\t".join(map(str, counts[j])))
 
 model = msprime.HKY(kappa=0.75, equilibrium_frequencies=[0.2, 0.3, 0.3, 0.2])
-ts = msprime.sim_ancestry(5, random_seed=1, sequence_length=1e7, recombination_rate=1e-8, population_size=1000)
+ts = msprime.sim_ancestry(
+    5, random_seed=1, sequence_length=1e7, recombination_rate=1e-8,
+    population_size=1000)
 mts = msprime.sim_mutations(ts, rate=1e-8, model=model, random_seed=1)
 count_transitions(mts, model.alleles)
 ```