diff --git a/args.md b/args.md index df81293..2aec930 100644 --- a/args.md +++ b/args.md @@ -24,9 +24,9 @@ parent to child nodes. Therefore a succinct tree sequence is equivalent to a [directed graph](https://en.wikipedia.org/wiki/Directed_graph), which is additionally annotated with genomic positions such that at each position, a path through the edges exists which defines a tree. This graph -interpretation of a tree sequence is tightly connected to the concept of +interpretation of a tree sequence maps very closely to the concept of an "ancestral recombination graph" (or ARG). See -[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v1) for further details. +[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v2) for further details. ## Full ARGs @@ -39,12 +39,16 @@ graph structure defined by that process, see e.g. The term "ARG" is [often used](https://doi.org/10.1086%2F508901) to refer to a structure consisting of nodes and edges that describe the genetic genealogy of a set -of sampled chromosomes which have evolved via a process of genetic inheritance combined -with recombination. ARGs may contain not just nodes corresponding to genetic -coalescence, but also additional nodes that correspond e.g. to recombination events. -These "full ARGs" can be stored and analysed in +of sampled chromosomes which have evolved via a process of inheritance combined +with recombination. We use the term "full ARG" for a commonly described type of +ARG that contains not just nodes that correspond to +coalescence of ancestral material, but also additional nodes that correspond to +recombination events and common ancestor events that are not associated with +coalescence in any of the local trees. Full ARGs can be stored and analysed in [tskit](https://tskit.dev) like any other tree sequence.
A full ARG can be generated using -{func}`msprime:msprime.sim_ancestry` with the `record_full_arg=True` option, as described +{func}`msprime:msprime.sim_ancestry` by specifying `coalescing_segments_only=False` along with +`additional_nodes = msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT` +(or the equivalent `record_full_arg=True`) as described {ref}`in the msprime docs`: ```{code-cell} @@ -58,8 +62,12 @@ parameters = { "random_seed": 333, } -ts_arg = msprime.sim_ancestry(**parameters, record_full_arg=True, discrete_genome=False) -# NB: the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome) +ts_arg = msprime.sim_ancestry( + **parameters, + discrete_genome=False, # the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome) + coalescing_segments_only=False, # setting record_full_arg=True is equivalent to these last 2 parameters + additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT, +) print('Simulated a "full ARG" under the Hudson model:') print( @@ -282,7 +290,12 @@ its simplified version: ```{code-cell} large_sim_parameters = parameters.copy() large_sim_parameters["sequence_length"] *= 1000 -large_ts_arg = msprime.sim_ancestry(**large_sim_parameters, record_full_arg=True) +large_ts_arg = msprime.sim_ancestry( + **large_sim_parameters, + discrete_genome=False, + coalescing_segments_only=False, + additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT, +) large_ts = large_ts_arg.simplify() print( @@ -312,7 +325,7 @@ difference between some classical ARG formulations, and the ARG formulation used in `tskit`. Classically, nodes in an ARG are taken to represent _events_ (specifically, "common ancestor", "recombination", and "sampling" events), and genomic regions of inheritance are encoded by storing a specific breakpoint location on -each recombination node. 
In contrast, [nodes](tskit:sec_data_model_definitions_node) in a `tskit` +each recombination node. In contrast, {ref}`nodes` in a `tskit` ARG correspond to _genomes_. More crucially, inherited regions are defined by intervals stored on *edges* (via the {attr}`~Edge.left` and {attr}`~Edge.right` properties), rather than on nodes. Here, for example, is the edge table from our ARG: diff --git a/terminology_and_concepts.md b/terminology_and_concepts.md index b0138de..360b587 100644 --- a/terminology_and_concepts.md +++ b/terminology_and_concepts.md @@ -416,28 +416,38 @@ there are multiple, overlaid ancestral recombination events. ### Tree sequences and ARGs -Much of the literature on ancestral inference concentrates on the Ancestral Recombination -Graph, or ARG, in which details of the position and potentially the timing of -recombination events are explictly stored. Although a tree sequence *can* represent such -an ARG, by incorporating nodes that represent recombination events (see the -{ref}`sec_args` tutorial), this is not normally done for two reasons: +::::{margin} +:::{note} +There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree. +::: +:::: + +The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic +genealogy. In particular, many (but not all) authors use it to mean a genetic +genealogy in which details of the position and potentially the timing of all +recombination and common ancestor events are explicitly stored. For clarity +we refer to this sort of genetic genealogy as a "full ARG". Succinct tree sequences can +represent many different sorts of ARGs, including "full ARGs", by incorporating extra +non-coalescent nodes (see the {ref}`sec_args` tutorial).
However, tree sequences are +often shown and stored in {ref}`fully simplified` form, +which omits these extra nodes. This is for two main reasons: 1. Many recombination events are undetectable from sequence data, and even if they are detectable, they can be logically impossible to place in the genealogy (as in the second SPR example above). -2. The number of recombination events in the genealogy can grow to dominate the total - number of nodes in the total tree sequence, without actually contributing to the - realised sequences in the samples. In other words, recombination nodes are redundant - to the storing of genome data. +2. The number of recombination and non-coalescing common ancestor events in the genealogy + quickly grows to dominate the total number of nodes in the tree sequence, + without actually contributing to the mutations inherited by the samples. + In other words, these nodes are redundant to the storing of genome data. -Therefore, compared to an ARG, you can think of a standard tree sequence as simply +Therefore, compared to a full ARG, you can think of a simplified tree sequence as storing the trees *created by* recombination events, rather than attempting to record the recombination events themselves. The actual recombination events can be sometimes be inferred from these trees but, as we have seen, it's not always possible. 
Here's another way to put it: > "an ARG encodes the events that occurred in the history of a sample, -> whereas a tree sequence encodes the outcome of those events" +> whereas a [simplified] tree sequence encodes the outcome of those events" > ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253)) diff --git a/what_is.md b/what_is.md index b65ff4d..2c2bba6 100644 --- a/what_is.md +++ b/what_is.md @@ -307,8 +307,8 @@ plt.show() ::::{margin} :::{note} The genetic genealogy is sometimes referred to as an ancestral recombination graph, -or ARG, and there are {ref}`close similarities` between ARGs -and tree sequences (see the {ref}`ARG tutorial`) +or ARG, and one way to think of a tskit tree sequence is as a way +to store various different sorts of ARGs (see the {ref}`ARG tutorial`) ::: ::::