Merge pull request #471 from jeromekelleher/final-submission-tweaks
Update "probabilistic sampling" bit
jeromekelleher authored Jun 4, 2024
2 parents 0cb62aa + e1329f2 commit d93747b
Showing 5 changed files with 144 additions and 40 deletions.
14 changes: 13 additions & 1 deletion Makefile
@@ -15,7 +15,7 @@ ILLUSTRATIONS=\
illustrations/cell-lines.pdf \
illustrations/simplification-with-edges.pdf \

all: paper.pdf response-to-reviewers.pdf
all: paper.pdf response-to-reviewers.pdf response-to-reviewers-2.pdf

paper.pdf: paper.tex paper.bib ${DATA} ${FIGURES} ${ILLUSTRATIONS}
pdflatex -shell-escape paper.tex
@@ -102,3 +102,15 @@ review-diff.pdf: review-diff.tex

response-to-reviewers.pdf: response-to-reviewers.tex
pdflatex $<

review-diff-2.tex: paper.tex
latexdiff reviewed-paper-2.tex paper.tex > review-diff-2.tex

review-diff-2.pdf: review-diff-2.tex
pdflatex review-diff-2.tex
pdflatex review-diff-2.tex
bibtex review-diff-2
pdflatex review-diff-2.tex

response-to-reviewers-2.pdf: response-to-reviewers-2.tex
pdflatex $<
5 changes: 4 additions & 1 deletion cover-letter/Makefile
@@ -1,7 +1,10 @@
all: cover-letter.pdf cover-letter-resubmit.pdf
all: cover-letter.pdf cover-letter-resubmit.pdf cover-letter-resubmit-2.pdf

cover-letter.pdf: cover-letter.tex
pdflatex cover-letter.tex

cover-letter-resubmit.pdf: cover-letter-resubmit.tex
pdflatex cover-letter-resubmit.tex

cover-letter-resubmit-2.pdf: cover-letter-resubmit-2.tex
pdflatex cover-letter-resubmit-2.tex
29 changes: 29 additions & 0 deletions cover-letter/cover-letter-resubmit-2.tex
@@ -0,0 +1,29 @@
\documentclass{letter}

\signature{Jerome Kelleher}

\address{Big Data Institute\\University of Oxford}
\begin{document}

\begin{letter}{GENETICS}

\opening{Dear Graham,}

I am writing on behalf of my coauthors to resubmit our
manuscript entitled
\emph{A general and efficient representation of ancestral recombination
graphs}.

We are delighted that it is potentially suitable for publication in GENETICS,
and have endeavored to address the points you have raised.

We have attached a detailed point-by-point response in the
\texttt{response-to-reviewers-2.pdf} file, along with a
latex-diff of the differences between the current and previous submissions.

Thank you again for your careful and helpful input throughout this process.

\closing{Sincerely,}

\end{letter}
\end{document}
21 changes: 10 additions & 11 deletions paper.tex
@@ -68,7 +68,7 @@
% This rapid progress has led to a diversity of ARG definitions and representations.
Classical formalisms have focused on mapping
coalescence and recombination events to the nodes in an ARG.
This approach is out of step with many modern developments, however,
This approach is out of step with some modern developments, however,
which do not represent genetic inheritance in terms of these events
or explicitly infer them.
We present a simple formalism that defines an ARG in terms
@@ -507,18 +507,17 @@ \section{Event ARGs}
Aside from these practical challenges, there is also a deeper
issue with the implicit strategy of basing an ARG data structure on
recording events and their properties (e.g.\ the crossover breakpoint
for a recombination event). This approach
for a recombination event).
This approach
requires all events to be recorded explicitly, and does not
provide an obvious mechanism for either aggregating multiple events
or expressing uncertainty about them. This is not a
problem when describing the results of simulations, where all details
are perfectly known. However, it can be an issue when we wish to
formally describe the output of various inference methods, particularly
those that avoid inferring events that are not \emph{knowable} from the data:
a useful approach as datasets approach the population scale~\citep[e.g.][]{
provide an obvious mechanism for aggregating multiple, potentially
unresolvable, events.
As datasets approach the population scale~\citep[e.g.][]{
turnbull2018hundred, bycroft2018genome,hayes20191000,
Ros-Freixedes2020,karczewski2020mutational,tanjo2021practical,
halldorsson2022sequences}.
halldorsson2022sequences} representing such uncertainty
directly through the data structure is a useful alternative to
classical methods based on probabilistic sampling.

% There is also a certain clarity gained by explicitly modelling nodes
% in the inheritance graph as genomes.
@@ -1129,7 +1128,7 @@ \section{Discussion}
The emerging ARG software ecosystem could similarly benefit
from the adoption of such shared community infrastructure
to handle the mundane and time-consuming details of data interchange.
The \texttt{tskit} library (Section~\ref{sec-efficiency})
The \texttt{tskit} library
is a high-quality open-source gARG implementation,
with proven efficiency and
scalability~\citep[e.g.][]{anderson2022genes,zhan2023towards},
115 changes: 88 additions & 27 deletions response-to-reviewers-2.tex
@@ -57,11 +57,23 @@ \section*{Response to the editor}
\section*{Associate Editor's comments}

\begin{point}
My remaining broad concern is that the paper is still in places somewhat narrow about the goals of future ARG development. I certainly see the practical utility of dropping inference down to some minimum "knowable" structure that can be reconstructed using deterministic algorithms for very large datasets. However, probabilistic reconstructions of some form of ARG with more explicit events is also a reasonable goal moving forwards (e.g. for some applications we may want a subset of the recombination events explicitly included). There are a few places where the paper still comes across as overly dogmatic about the minimum "knowable" ARG being the only goal (although the discussion casts a broader view).
My remaining broad concern is that the paper is still in places somewhat narrow
about the goals of future ARG development. I certainly see the practical
utility of dropping inference down to some minimum ``knowable'' structure that
can be reconstructed using deterministic algorithms for very large datasets.
However, probabilistic reconstructions of some form of ARG with more explicit
events is also a reasonable goal moving forwards (e.g. for some applications we
may want a subset of the recombination events explicitly included). There are a
few places where the paper still comes across as overly dogmatic about the
minimum ``knowable'' ARG being the only goal (although the discussion casts a
broader view).
\end{point}
\begin{reply}
We have gone through the article and, in addition to the suggestions made below, have rephrased
parts to make it clear that a gARG can be used to encode a \emph{variety} of ARG structures, whether events are or are not explicitly inferred by the reconstruction method. We specifically state at the end of \emph{A diversity of structures} that
We have gone through the article and, in addition to the suggestions made
below, have rephrased parts to make it clear that a gARG can be used to encode
a \emph{variety} of ARG structures, whether events are or are not explicitly
inferred by the reconstruction method. We specifically state at the end of
\emph{A diversity of structures} that
\begin{quote}
A gARG can encode a diversity of ARG structures, including
those where events \emph{are} recorded explicitly, and those where
@@ -70,94 +82,143 @@ \section*{Associate Editor's comments}
\end{reply}

\begin{point}
Abstract: "This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them." So this is on the places where I feel like the authors state things too strongly. The authors, and some others, approaches have taken this path, but folks can agree that the gARG is a good idea and yet think that explicitly inferring details of recombination events is a `modern' goal.
Abstract: ``This approach is out of step with modern developments, which do not
represent genetic inheritance in terms of these events or explicitly infer
them.'' So this is on the places where I feel like the authors state things too
strongly. The authors, and some others, approaches have taken this path, but
folks can agree that the gARG is a good idea and yet think that explicitly
inferring details of recombination events is a `modern' goal.
\end{point}
\begin{reply}
We have changed this to "This approach is out of step with many modern developments, however,..."
We have changed this to ``This approach is out of step with some modern developments,
however,...''
\end{reply}

\begin{point}
"Broadly speaking, an ARG describes the different paths of genetic inheritance caused by recombination, encapsulating the resulting complex web of genetic ancestry " - add "of a set of samples". Also I'd say "genetic ancestors", as ancestry is tied up with genetic ancestry groups in peoples' minds.
``Broadly speaking, an ARG describes the different paths of genetic inheritance
caused by recombination, encapsulating the resulting complex web of genetic
ancestry'' - add ``of a set of samples''. Also I'd say ``genetic ancestors'', as
ancestry is tied up with genetic ancestry groups in peoples' minds.
\end{point}
\begin{reply}
Amended as suggested.
\end{reply}

\begin{point}
"We define a genome as the complete set of genetic material that a child inherits from one parent. A diploid individual therefore carries two genomes, one inherited from each parent (we assume diploids here for clarity, but the definitions apply to organisms of arbitrary ploidy). " -Excludes Y, mtDNA, and X as written, please revise, e.g. talk about autosomal genome.
``We define a genome as the complete set of genetic material that a child
inherits from one parent. A diploid individual therefore carries two genomes,
one inherited from each parent (we assume diploids here for clarity, but the
definitions apply to organisms of arbitrary ploidy). '' -Excludes Y, mtDNA, and
X as written, please revise, e.g. talk about autosomal genome.
\end{point}
\begin{reply}
Amended as suggested.
\end{reply}

\begin{point}
"The topology of a gARG specifies that genetic inheritance occurred between particular ancestors and descendants, " -struggle slightly with word "particular" here as the identity of the ancestors is not known. Deleting "particular" is likely sufficient.
``The topology of a gARG specifies that genetic inheritance occurred between
particular ancestors and descendants, '' -struggle slightly with word
``particular'' here as the identity of the ancestors is not known. Deleting
``particular'' is likely sufficient.
\end{point}
\begin{reply}
Amended as suggested.
\end{reply}

\begin{point}
"This is sufficient to describe the effects of inheritance under any form of homologous recombination (such as multiple crossovers,..." -do you mean multiple crossovers during a single round of meiosis.
``This is sufficient to describe the effects of inheritance under any form of
homologous recombination (such as multiple crossovers,...'' -do you mean
multiple crossovers during a single round of meiosis.
\end{point}
\begin{reply}
Yes - amended to clarify this.
\end{reply}

\begin{point}
"In this encoding there are two types of internal node in the graph, representing the common ancestor and recombination events in the history of a sample. " stipulate that these are most recent common ancestor events.
``In this encoding there are two types of internal node in the graph,
representing the common ancestor and recombination events in the history of a
sample. '' stipulate that these are most recent common ancestor events.
\end{point}
\begin{reply}
Amended as suggested.
\end{reply}

\begin{point}
"This approach assumes all events are knowable, and does not provide an obvious mechanism for either aggregating multiple events or expressing uncertainty about them. While this is not a problem when describing the results of simulations". -Maybe one way to flip this around would be to say that because it arose from tracking a particular stochastic process it has these properties. Also I don't think it assumes that all events are knowable, eg we could construct some parsimonious ARG or probabilistic ARG. If we wish to express uncertainty about events we usually give draws from the posterior etc. I agree that might be computational prohibitive with large samples etc, but it seems like place to take a broad view. This seems like a place to acknowledge that for some applications we might want to explicitly reconstruct the events.
``This approach assumes all events are knowable, and does not provide an obvious
mechanism for either aggregating multiple events or expressing uncertainty
about them. While this is not a problem when describing the results of
simulations''. -Maybe one way to flip this around would be to say that because
it arose from tracking a particular stochastic process it has these properties.
Also I don't think it assumes that all events are knowable, eg we could
construct some parsimonious ARG or probabilistic ARG. If we wish to express
uncertainty about events we usually give draws from the posterior etc. I agree
that might be computational prohibitive with large samples etc, but it seems
like place to take a broad view. This seems like a place to acknowledge that
for some applications we might want to explicitly reconstruct the events.
\end{point}
\begin{reply}
We have rephrased this part to read
\begin{quote}
This approach necessitates that all events are recorded explicitly, and does not
provide an obvious mechanism for either aggregating multiple events
or expressing uncertainty about them. While this is not a
problem when describing the results of simulations, for instance (where all details
are perfectly known), it is an issue when we wish to
formally describe the output of inference methods which do not
necessarily attempt to infer events that are not \emph{knowable} from the data,
particularly as datasets approach the population scale...
\end{quote}
This approach
requires all events to be recorded explicitly, and does not
provide an obvious mechanism for aggregating multiple, potentially
unresolvable, events.
As datasets approach the population scale [citations]
representing such uncertainty
directly through the data structure is a useful alternative to
classical methods based on probabilistic sampling.
\end{reply}

\begin{point}
"A key feature of the gARG encoding is that it enables these varying levels of precision to be represented, and brings these nuanced features to light." -the word nuanced feels strange here.
``A key feature of the gARG encoding is that it enables these varying levels of
precision to be represented, and brings these nuanced features to light.'' -the
word nuanced feels strange here.
\end{point}
\begin{reply}
We have deleted the second part of this sentence.
\end{reply}

\begin{point}
"Simpler representations can be formed by removing "unknowable" nodes (Fig. 5B)" -unknowable is vague here, do you mean bubbles along a single lineage?
``Simpler representations can be formed by removing ``unknowable'' nodes (Fig.
5B)'' -unknowable is vague here, do you mean bubbles along a single lineage?
\end{point}
\begin{reply}
We've added a clarification that this refers to nodes such as those in singly-connected graph components.
We've added a clarification that this refers to nodes such as those
in singly-connected graph components.
\end{reply}

\begin{point}
"The gARG encoding leads to highly efficient storage and processing of ARG data, "-As gARG has various levels of precision, perhaps this needs to state that the "gARG encoding can lead to..." or be more precise that this is a reduced precision level.
``The gARG encoding leads to highly efficient storage and processing of ARG
data, ''-As gARG has various levels of precision, perhaps this needs to state
that the ``gARG encoding can lead to...'' or be more precise that this is a
reduced precision level.
\end{point}
\begin{reply}
Amended as suggested to add "can lead to".
Amended as suggested to add ``can lead to''.
\end{reply}

\begin{point}
"The succinct tree sequence data structure (usually known as a "tree sequence" for brevity) is a practical gARG implementation focused on efficiency." - If the tree sequence is focused at a particular level of gARG simplification be precise about this.
``The succinct tree sequence data structure (usually known as a ``tree sequence''
for brevity) is a practical gARG implementation focused on efficiency.'' - If
the tree sequence is focused at a particular level of gARG simplification be
precise about this.
\end{point}
\begin{reply}
We have left this sentence as is, since the tree sequence structure can record gARGs at various levels of simplification.
We have left this sentence as is, since the tree sequence structure
can record gARGs at various levels of simplification.
\end{reply}

\begin{point}
"Methods targeting large-scale datasets tend to simplify the inference problem by making a single, deterministic best-guess " --I think this is the best guess of the topology, and the uncertainty in times given the ARG is downstream of this. If so please clarify. Also I'd perhaps explicitly acknowledge Deng et al (SINGER), e.g. "deterministic best-guess of the topology (see Deng et al for parallel developments addressing uncertainty with somewhat small sample sizes)" or something like that. While these deterministic approaches are a strong way forward for human biobank scale data, it's good to be highlight parallel developments that might be key to other applications.
``Methods targeting large-scale datasets tend to simplify the inference problem
by making a single, deterministic best-guess '' --I think this is the best guess
of the topology, and the uncertainty in times given the ARG is downstream of
this. If so please clarify. Also I'd perhaps explicitly acknowledge Deng et al
(SINGER), e.g. ``deterministic best-guess of the topology (see Deng et al for
parallel developments addressing uncertainty with somewhat small sample sizes)''
or something like that. While these deterministic approaches are a strong way
forward for human biobank scale data, it's good to be highlight parallel
developments that might be key to other applications.
\end{point}
\begin{reply}
We have mentioned this as suggested.