Merge pull request #471 from jeromekelleher/final-submission-tweaks

Update "probabilistic sampling" bit
tskit-dev · Jun 4, 2024 · d93747b
2 parents 0cb62aa + e1329f2
commit d93747b

Showing 5 changed files with 144 additions and 40 deletions.
diff --git a/Makefile b/Makefile
@@ -15,7 +15,7 @@ ILLUSTRATIONS=\
 	illustrations/cell-lines.pdf \
 	illustrations/simplification-with-edges.pdf \
 
-all: paper.pdf response-to-reviewers.pdf
+all: paper.pdf response-to-reviewers.pdf response-to-reviewers-2.pdf
 
 paper.pdf: paper.tex paper.bib ${DATA} ${FIGURES} ${ILLUSTRATIONS}
 	pdflatex -shell-escape paper.tex
@@ -102,3 +102,15 @@ review-diff.pdf: review-diff.tex
 
 response-to-reviewers.pdf: response-to-reviewers.tex
 	pdflatex $<
+
+review-diff-2.tex: paper.tex
+	latexdiff reviewed-paper-2.tex paper.tex > review-diff-2.tex
+
+review-diff-2.pdf: review-diff-2.tex
+	pdflatex review-diff-2.tex
+	bibtex review-diff-2
+	pdflatex review-diff-2.tex
+	pdflatex review-diff-2.tex
+
+response-to-reviewers-2.pdf: response-to-reviewers-2.tex
+	pdflatex $<
diff --git a/cover-letter/Makefile b/cover-letter/Makefile
@@ -1,7 +1,10 @@
-all: cover-letter.pdf cover-letter-resubmit.pdf
+all: cover-letter.pdf cover-letter-resubmit.pdf cover-letter-resubmit-2.pdf
 
 cover-letter.pdf: cover-letter.tex
 	pdflatex cover-letter.tex
 
 cover-letter-resubmit.pdf: cover-letter-resubmit.tex
 	pdflatex cover-letter-resubmit.tex
+
+cover-letter-resubmit-2.pdf: cover-letter-resubmit-2.tex
+	pdflatex cover-letter-resubmit-2.tex
diff --git a/cover-letter/cover-letter-resubmit-2.tex b/cover-letter/cover-letter-resubmit-2.tex
@@ -0,0 +1,29 @@
+\documentclass{letter}
+
+\signature{Jerome Kelleher}
+
+\address{Big Data Institute\\University of Oxford}
+\begin{document}
+
+\begin{letter}{GENETICS}
+
+\opening{Dear Graham,}
+
+I am writing on behalf of my coauthors to resubmit our 
+manuscript entitled
+\emph{A general and efficient representation of ancestral recombination
+graphs}. 
+
+We are delighted that it is potentially suitable for publication in GENETICS,
+and have endeavored to address the points you have raised.
+
+We have attached a detailed point-by-point response in the 
+\texttt{response-to-reviewers-2.pdf} file, along with a 
+latex-diff of the differences between the current and previous submissions.
+
+Thank you again for your careful and helpful input throughout this process.
+
+\closing{Sincerely,}
+
+\end{letter}
+\end{document}
diff --git a/paper.tex b/paper.tex
@@ -68,7 +68,7 @@
 % This rapid progress has led to a diversity of ARG definitions and representations.
 Classical formalisms have focused on mapping 
 coalescence and recombination events to the nodes in an ARG.
-This approach is out of step with many modern developments, however,
+This approach is out of step with some modern developments, however,
 which do not represent genetic inheritance in terms of these events 
 or explicitly infer them.
 We present a simple formalism that defines an ARG in terms 
@@ -507,18 +507,17 @@ \section{Event ARGs}
 Aside from these practical challenges, there is also a deeper
 issue with the implicit strategy of basing an ARG data structure on
 recording events and their properties (e.g.\ the crossover breakpoint
-for a recombination event). This approach 
+for a recombination event). 
+This approach 
 requires all events to be recorded explicitly, and does not 
-provide an obvious mechanism for either aggregating multiple events
-or expressing uncertainty about them. This is not a
-problem when describing the results of simulations, where all details
-are perfectly known. However, it can be an issue when we wish to
-formally describe the output of various inference methods, particularly
-those that avoid inferring events that are not \emph{knowable} from the data:
-a useful approach as datasets approach the population scale~\citep[e.g.][]{
+provide an obvious mechanism for aggregating multiple, potentially
+unresolvable, events.
+As datasets approach the population scale~\citep[e.g.][]{
 turnbull2018hundred, bycroft2018genome,hayes20191000,
 Ros-Freixedes2020,karczewski2020mutational,tanjo2021practical,
-halldorsson2022sequences}.
+halldorsson2022sequences}, representing such uncertainty 
+directly through the data structure is a useful alternative to 
+classical methods based on probabilistic sampling.
 
 % There is also a certain clarity gained by explicitly modelling nodes
 % in the inheritance graph as genomes.
@@ -1129,7 +1128,7 @@ \section{Discussion}
 The emerging ARG software ecosystem could similarly benefit
 from the adoption of such shared community infrastructure
 to handle the mundane and time-consuming details of data interchange.
-The \texttt{tskit} library (Section~\ref{sec-efficiency})
+The \texttt{tskit} library 
 is a high-quality open-source gARG implementation,
 with proven efficiency and
 scalability~\citep[e.g.][]{anderson2022genes,zhan2023towards},

diff --git a/response-to-reviewers-2.tex b/response-to-reviewers-2.tex
@@ -57,11 +57,23 @@ \section*{Response to the editor}
 \section*{Associate Editor's comments}
 
 \begin{point}
-My remaining broad concern is that the paper is still in places somewhat narrow about the goals of future ARG development. I certainly see the practical utility of dropping inference down to some minimum "knowable" structure that can be reconstructed using deterministic algorithms for very large datasets. However, probabilistic reconstructions of some form of ARG with more explicit events is also a reasonable goal moving forwards (e.g. for some applications we may want a subset of the recombination events explicitly included). There are a few places where the paper still comes across as overly dogmatic about the minimum "knowable" ARG being the only goal (although the discussion casts a broader view).
+My remaining broad concern is that the paper is still in places somewhat narrow
+about the goals of future ARG development. I certainly see the practical
+utility of dropping inference down to some minimum ``knowable'' structure that
+can be reconstructed using deterministic algorithms for very large datasets.
+However, probabilistic reconstructions of some form of ARG with more explicit
+events is also a reasonable goal moving forwards (e.g. for some applications we
+may want a subset of the recombination events explicitly included). There are a
+few places where the paper still comes across as overly dogmatic about the
+minimum ``knowable'' ARG being the only goal (although the discussion casts a
+broader view).
 \end{point}
 \begin{reply}
-We have gone through the article and, in addition to the suggestions made below, have rephrased 
-parts to make it clear that a gARG can be used to encode a \emph{variety} of ARG structures, whether events are or are not explicitly inferred by the reconstruction method. We specifically state at the end of \emph{A diversity of structures} that
+We have gone through the article and, in addition to the suggestions made
+below, have rephrased parts to make it clear that a gARG can be used to encode
+a \emph{variety} of ARG structures, whether events are or are not explicitly
+inferred by the reconstruction method. We specifically state at the end of
+\emph{A diversity of structures} that
 \begin{quote}
 A gARG can encode a diversity of ARG structures, including 
 those where events \emph{are} recorded explicitly, and those where
@@ -70,94 +82,143 @@ \section*{Associate Editor's comments}
 \end{reply}
 
 \begin{point}
-Abstract: "This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them." So this is on the places where I feel like the authors state things too strongly. The authors, and some others, approaches have taken this path, but folks can agree that the gARG is a good idea and yet think that explicitly inferring details of recombination events is a `modern' goal.
+Abstract: ``This approach is out of step with modern developments, which do not
+represent genetic inheritance in terms of these events or explicitly infer
+them.'' So this is on the places where I feel like the authors state things too
+strongly. The authors, and some others, approaches have taken this path, but
+folks can agree that the gARG is a good idea and yet think that explicitly
+inferring details of recombination events is a `modern' goal.
 \end{point}
 \begin{reply}
-We have changed this to "This approach is out of step with many modern developments, however,..."
+We have changed this to ``This approach is out of step with some modern developments, 
+however,...''
 \end{reply}
 
 \begin{point}
-"Broadly speaking, an ARG describes the different paths of genetic inheritance caused by recombination, encapsulating the resulting complex web of genetic ancestry " - add "of a set of samples". Also I'd say "genetic ancestors", as ancestry is tied up with genetic ancestry groups in peoples' minds.
+``Broadly speaking, an ARG describes the different paths of genetic inheritance
+caused by recombination, encapsulating the resulting complex web of genetic
+ancestry'' - add ``of a set of samples''. Also I'd say ``genetic ancestors'', as
+ancestry is tied up with genetic ancestry groups in peoples' minds.
 \end{point}
 \begin{reply}
 Amended as suggested.
 \end{reply}
 
 \begin{point}
-"We define a genome as the complete set of genetic material that a child inherits from one parent. A diploid individual therefore carries two genomes, one inherited from each parent (we assume diploids here for clarity, but the definitions apply to organisms of arbitrary ploidy). " -Excludes Y, mtDNA, and X as written, please revise, e.g. talk about autosomal genome.
+``We define a genome as the complete set of genetic material that a child
+inherits from one parent. A diploid individual therefore carries two genomes,
+one inherited from each parent (we assume diploids here for clarity, but the
+definitions apply to organisms of arbitrary ploidy). '' -Excludes Y, mtDNA, and
+X as written, please revise, e.g. talk about autosomal genome.
 \end{point}
 \begin{reply}
 Amended as suggested.
 \end{reply}
 
 \begin{point}
-"The topology of a gARG specifies that genetic inheritance occurred between particular ancestors and descendants, " -struggle slightly with word "particular" here as the identity of the ancestors is not known. Deleting "particular" is likely sufficient.
+``The topology of a gARG specifies that genetic inheritance occurred between
+particular ancestors and descendants, '' -struggle slightly with word
+``particular'' here as the identity of the ancestors is not known. Deleting
+``particular'' is likely sufficient.
 \end{point}
 \begin{reply}
 Amended as suggested.
 \end{reply}
 
 \begin{point}
-"This is sufficient to describe the effects of inheritance under any form of homologous recombination (such as multiple crossovers,..." -do you mean multiple crossovers during a single round of meiosis.
+``This is sufficient to describe the effects of inheritance under any form of
+homologous recombination (such as multiple crossovers,...'' -do you mean
+multiple crossovers during a single round of meiosis.
 \end{point}
 \begin{reply}
 Yes - amended to clarify this.
 \end{reply}
 
 \begin{point}
-"In this encoding there are two types of internal node in the graph, representing the common ancestor and recombination events in the history of a sample. " stipulate that these are most recent common ancestor events.
+``In this encoding there are two types of internal node in the graph,
+representing the common ancestor and recombination events in the history of a
+sample. '' stipulate that these are most recent common ancestor events.
 \end{point}
 \begin{reply}
 Amended as suggested.
 \end{reply}
 
 \begin{point}
-"This approach assumes all events are knowable, and does not provide an obvious mechanism for either aggregating multiple events or expressing uncertainty about them. While this is not a problem when describing the results of simulations". -Maybe one way to flip this around would be to say that because it arose from tracking a particular stochastic process it has these properties. Also I don't think it assumes that all events are knowable, eg we could construct some parsimonious ARG or probabilistic ARG. If we wish to express uncertainty about events we usually give draws from the posterior etc. I agree that might be computational prohibitive with large samples etc, but it seems like place to take a broad view. This seems like a place to acknowledge that for some applications we might want to explicitly reconstruct the events.
+``This approach assumes all events are knowable, and does not provide an obvious
+mechanism for either aggregating multiple events or expressing uncertainty
+about them. While this is not a problem when describing the results of
+simulations''. -Maybe one way to flip this around would be to say that because
+it arose from tracking a particular stochastic process it has these properties.
+Also I don't think it assumes that all events are knowable, eg we could
+construct some parsimonious ARG or probabilistic ARG. If we wish to express
+uncertainty about events we usually give draws from the posterior etc. I agree
+that might be computational prohibitive with large samples etc, but it seems
+like place to take a broad view. This seems like a place to acknowledge that
+for some applications we might want to explicitly reconstruct the events.
 \end{point}
 \begin{reply}
 We have rephrased this part to read
 \begin{quote}
-This approach necessitates that all events are recorded explicitly, and does not 
-provide an obvious mechanism for either aggregating multiple events
-or expressing uncertainty about them. While this is not a
-problem when describing the results of simulations, for instance (where all details
-are perfectly known), it is an issue when we wish to
-formally describe the output of inference methods which do not
-necessarily attempt to infer events that are not \emph{knowable} from the data, 
-particularly as datasets approach the population scale...
+This approach 
+requires all events to be recorded explicitly, and does not 
+provide an obvious mechanism for aggregating multiple, potentially
+unresolvable, events.
+As datasets approach the population scale [citations],
+representing such uncertainty 
+directly through the data structure is a useful alternative to 
+classical methods based on probabilistic sampling.
 \end{quote}
 \end{reply}
 
 \begin{point}
-"A key feature of the gARG encoding is that it enables these varying levels of precision to be represented, and brings these nuanced features to light." -the word nuanced feels strange here.
+``A key feature of the gARG encoding is that it enables these varying levels of
+precision to be represented, and brings these nuanced features to light.'' -the
+word nuanced feels strange here.
 \end{point}
 \begin{reply}
 We have deleted the second part of this sentence.
 \end{reply}
 
 \begin{point}
-"Simpler representations can be formed by removing "unknowable" nodes (Fig. 5B)" -unknowable is vague here, do you mean bubbles along a single lineage?
+``Simpler representations can be formed by removing ``unknowable'' nodes (Fig.
+5B)'' -unknowable is vague here, do you mean bubbles along a single lineage?
 \end{point}
 \begin{reply}
-We've added a clarification that this refers to nodes such as those in singly-connected graph components.
+We've added a clarification that this refers to nodes such as those 
+in singly-connected graph components.
 \end{reply}
 
 \begin{point}
-"The gARG encoding leads to highly efficient storage and processing of ARG data, "-As gARG has various levels of precision, perhaps this needs to state that the "gARG encoding can lead to..." or be more precise that this is a reduced precision level.
+``The gARG encoding leads to highly efficient storage and processing of ARG
+data, ''-As gARG has various levels of precision, perhaps this needs to state
+that the ``gARG encoding can lead to...'' or be more precise that this is a
+reduced precision level.
 \end{point}
 \begin{reply}
-Amended as suggested to add "can lead to".
+Amended as suggested to add ``can lead to''.
 \end{reply}
 
 \begin{point}
-"The succinct tree sequence data structure (usually known as a "tree sequence" for brevity) is a practical gARG implementation focused on efficiency." - If the tree sequence is focused at a particular level of gARG simplification be precise about this.
+``The succinct tree sequence data structure (usually known as a ``tree sequence''
+for brevity) is a practical gARG implementation focused on efficiency.'' - If
+the tree sequence is focused at a particular level of gARG simplification be
+precise about this.
 \end{point}
 \begin{reply}
-We have left this sentence as is, since the tree sequence structure can record gARGs at various levels of simplification.
+We have left this sentence as is, since the tree sequence structure 
+can record gARGs at various levels of simplification.
 \end{reply}
 
 \begin{point}
-"Methods targeting large-scale datasets tend to simplify the inference problem by making a single, deterministic best-guess " --I think this is the best guess of the topology, and the uncertainty in times given the ARG is downstream of this. If so please clarify. Also I'd perhaps explicitly acknowledge Deng et al (SINGER), e.g. "deterministic best-guess of the topology (see Deng et al for parallel developments addressing uncertainty with somewhat small sample sizes)" or something like that. While these deterministic approaches are a strong way forward for human biobank scale data, it's good to be highlight parallel developments that might be key to other applications.
+``Methods targeting large-scale datasets tend to simplify the inference problem
+by making a single, deterministic best-guess '' --I think this is the best guess
+of the topology, and the uncertainty in times given the ARG is downstream of
+this. If so please clarify. Also I'd perhaps explicitly acknowledge Deng et al
+(SINGER), e.g. ``deterministic best-guess of the topology (see Deng et al for
+parallel developments addressing uncertainty with somewhat small sample sizes)''
+or something like that. While these deterministic approaches are a strong way
+forward for human biobank scale data, it's good to be highlight parallel
+developments that might be key to other applications.
 \end{point}
 \begin{reply}
 We have mentioned this as suggested.