diff --git a/abundance-measurement.Rmd b/abundance-measurement.Rmd index f8a3359..aa9c587 100644 --- a/abundance-measurement.Rmd +++ b/abundance-measurement.Rmd @@ -1,7 +1,7 @@ # How bias affects abundance measurements {#abundance-measurement} -This section extends the theoretical results of @mclaren2019cons to describe how taxonomic bias in an MGS experiment affects the relative and absolute abundances measured for various microbial species. -We show that some approaches to quantifying species abundance yield constant fold errors (FEs), while others yield FEs that depend on overall community composition and thus can vary across samples. +This section extends the theoretical results of @mclaren2019cons to describe how taxonomic bias in MGS experiments leads to errors in the relative and absolute abundances measured for various microbial species. +All approaches to abundance quantification have systematic errors driven by taxonomic bias; however, some yield constant fold errors (FEs), while others yield FEs that depend on overall community composition and thus can vary across samples. ## A model of MGS measurements diff --git a/case-studies.Rmd b/case-studies.Rmd index e304ba3..e941711 100644 --- a/case-studies.Rmd +++ b/case-studies.Rmd @@ -4,7 +4,7 @@ To better understand the potential impact of taxonomic bias on DA analysis in pr ## Foliar fungi experiment -The taxonomic bias species present in 'mock communities' of known composition can be directly measured, and the measured bias can then be used to correct downstream DA analysis (@mclaren2019cons). +The taxonomic bias of species in 'mock communities' of known composition can be directly measured, and the measured bias can then be used to correct downstream DA analysis (@mclaren2019cons). In practice it is difficult to construct control communities that span the many species present in complex natural communities. However, gnotobiotic community experiments are well suited to this form of _calibration via community controls_ since it is possible to assemble mock communities containing all species in known relative abundances. @@ -38,9 +38,9 @@ However, failure to account for bias caused the decrease to be overestimated, by For commensal-host pairs with relatively small observed decreases, bias correction greatly reduced the magnitude of the negative LFDs or, in several cases, resulted in LFDs that were near 0 or slightly positive. Although @leopold2020host did not include absolute-abundance measurements, we can consider the impact taxonomic bias would have on an absolute DA analysis in a simple scenario in which total genome concentration of each pre- and post-infection sample is perfectly known. -The total genomic concentrations of each sample, combined with the species proportions measured by MGS, can be combined into a measurement of species concentrations using the total-abundance normalization method (Equation \@ref(eq:density-prop-meas)). +We consider the approach of multiplying the total genomic concentrations of each sample with the species proportions measured by MGS as described by Equation \@ref(eq:density-prop-meas). In this case, the bias in the MGS measurements will create absolute errors in the estimated LFDs of genome concentration of equal magnitude to the errors in the LFD estimates for proportions. -The scientific error, however, might become worse. +If total abundances also differed after pathogen growth, however, we may make directional errors in determining the FD. Suppose that total abundance increased by 2-fold due to pathogen growth; in this case, species whose proportions remained approximately constant would have increased in absolute abundance by around 2-fold. Bias shifts FD estimates downwards by 2.5-fold to 5.2-fold (depending on host genotype). Hence, without bias correction, we would instead conclude that these species decreased. @@ -126,7 +126,7 @@ hmp_species_diff <- hmp_species_stats %>% To investigate this question, we analyzed bacterial species profiles of vaginal and stool samples derived from shotgun sequencing in the Human Microbiome Project (@huttenhower2012stru, [SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-30-hmp-stool-vagina-comparison/)). -On average, stool profiles had substantially greater order-2 alpha diversity than vaginal profiles (SI Figure \@ref(fig:gut)A; geometric mean (GM) $\md$ geometric standard deviation (GSD) of `r hmp_div_stats['stool', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['stool', 'gsd'] %>% round(1)` in stool samples and `r hmp_div_stats['vagina', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['vagina', 'gsd'] %>% round(1)` in vaginal samples). +On average, stool profiles had substantially greater order-2 alpha diversity (Inverse Simpson index) than vaginal profiles (SI Figure \@ref(fig:gut)A; geometric mean (GM) $\md$ geometric standard deviation (GSD) of `r hmp_div_stats['stool', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['stool', 'gsd'] %>% round(1)` in stool samples and `r hmp_div_stats['vagina', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['vagina', 'gsd'] %>% round(1)` in vaginal samples). To assess the potential importance of bias for proportion-based DA analyses in the two ecosystems, we quantified the multiplicative variation in the mean efficiency across samples for a large number of possible taxonomic biases under the assumption that the measured profiles reflected the truth. Across simulation replicates, the GSD in the mean efficiency was typically lower in stool than in vaginal samples (SI Figure \@ref(fig:gut)B; ratio of GSD in vaginal samples to GSD in stool samples of `r hmp_me_stats['iid', 'gm'] %>% round(1)` $\md$ `r hmp_me_stats['iid', 'gsd'] %>% round(1)` across 1000 simulations). Notably, however, the multiplicative variation in species also tended to be lower (SI Figure \@ref(fig:gut)C); the GSD in the proportion of a gut species across gut samples was an average of `r hmp_species_diff %>% round(2)`-fold lower than that of a vaginal species across vaginal samples. @@ -188,28 +188,27 @@ The second soil core included qPCR measurements of _Bathyarchaeota_ and a second In this core, FRAxC and qPCR growth rates differed more substantially, with growth rates from FRAxC being larger by `r round(bathy_core32_fraxc - bathy_core32_qpcr, 3)`/yr for _Bathyarchaeota_ (`r round(bathy_core32_fraxc, 3)`/yr by FRAxC and `r round(bathy_core32_qpcr, 3)`/yr by qPCR) and by `r round(mbgd_core32_fraxc - mbgd_core32_qpcr, 3)`/yr for _Thermoprofundales_/MBG-D (`r round(mbgd_core32_fraxc, 3)`/yr by FRAxC and `r round(mbgd_core32_qpcr, 3)`/yr by qPCR). Uncertainty in the FRAxC- and qPCR-derived growth rates is not reported and is likely substantial; however, the fact that the FRAxC-derived rates are larger than qPCR-derived rates in all three cases is consistent with our hypothesis that mean efficiency decreased with depth in a manner that systematically biased FRAxC-derived rates to higher values. The differences in growth rate are small in absolute terms; however, the maximum observed difference of 0.086/yr suggests an error large enough to impact results for some taxa classified as positive growers, whose FRAxC growth rates ranged between 0.04/yr and 0.5/yr. -In summary, the comparison between FRAxC and qPCR measurements supports the overall study conclusions, but also suggests that species at the low end of positive FRAxC-derived rates may be merely persisting or even slowly declining in abundance. - +In summary, our comparison between FRAxC and qPCR measurements supports the overall study conclusion that many taxa did grow following sediment burial; however, we should remain uncertain as to whether species with small, positive FRAxC-derived growth rates were in fact growing or rather were slowly declining in abundance. ## Summary and discussion The effect that consistent taxonomic bias has on proportion-based DA analyses depends on the MGS protocol, the biological system, and the sample comparisons under investigation. -Our case studies explore a limited range of possibilities; however, we can begin to form some general conclusions by placing them within the context of the three scenarios described in Section \@ref(differential-abundance). +Although our case studies explore a limited range of possibilities, some general patterns stand out. -We saw several cases where the mean efficiency is stable across samples, so that bias does not affect LFD estimates (Scenario 1). +In several cases, the mean efficiency remained stable across samples, so that LFD esimates were unaffected by bias. In the pre-infection fungal microbiome samples of @leopold2020host, we observed that the mean efficiency remained stable despite substantial bias and large multiplicative variation in species proportions among samples. Because the variation in the mean efficiency was much less than that of the individual species, the impact of bias on DA analysis was negligible. In the vaginal case study, we also observed that the mean efficiency was relative stable across vaginal microbiomes that were dominated by the same species, despite substantial bias and large LFDs in non-dominant species. In both cases, the stability of the mean efficiency can be understood by the fact that it is an additive average over species and so is primarily determined by the most abundant species in the sample. Yet we also observed cases where the mean efficiency varied substantially. -In several cases, the (log) mean efficiency co-varied sufficiently strongly with the condition of interest to cause substantial systematic errors in LFD estimates (Scenario 3). +In several cases, the (log) mean efficiency co-varied sufficiently strongly with the condition of interest to cause substantial systematic errors in LFD estimates. Examples include the comparison of foliar fungal microbiomes pre- and post-infection, for which the mean efficiency substantially increased post-infection, and the comparison of vaginal microbiomes with low versus high diversity in the MOMS-PI study, for which the mean efficiency was typically lower in high-diversity samples. Our analysis of marine sediment communities is consistent with a systematic decline of the mean efficiency with burial time that is sufficient to substantially inflate the estimated growth rates of slowly changing species. -The mean efficiency cannot differ by more than the species proportions constituting it. +The FD in the mean efficiency is bounded by the largest FD of any species' proportion. Therefore, the error in the estimated LFDs tend to be practically significant only for species whose LFDs are substantially smaller in magnitude than the largest LFD. -Variation in the mean efficiency may also be substantial, yet unassociated with the condition of interest (Scenario 2). +Variation in the mean efficiency may also be substantial, yet unassociated with the condition of interest. Although we did not directly observe this scenario directly, we have reason to think that it may be common in real microbiome studies. Our simulation analysis of gut microbiomes suggest that even in diverse ecosystems, bias will often cause multiplicative variation in the mean efficiency that is comparable to that of individual species. This variation is much less problematic for DA results when it not associated with the condition under study, since any loss in precision can in principle be offset by an increase in sample size. @@ -227,7 +226,7 @@ They are also expected to share traits related to the condition of interest. If related species tend to have similar efficiencies and similar associations with the condition, then they may drive an association of the mean efficiency with the condition. These observations provide reasons to both worry and hope. -It seems likely that in many studies, the mean efficiency is consistent (first scenario) or is at least not associated with the condition (second scenario), so that DA inferences remain valid. +It seems likely that in many studies, the mean efficiency is consistent (first scenario) or is at least not associated with the condition (second scenario), so LFD estimates remain accurate (or at least not overconfident). Yet it is not obvious which scenario any study falls into. There are plausible mechanisms leading to systematic variation of the mean efficiency even in ecosystems with high species diversity. Moreover, our explorations suggest that random variation in the mean efficiency is common and distorts comparisons between individual samples. diff --git a/differential-abundance.Rmd b/differential-abundance.Rmd index 242666b..62800fd 100644 --- a/differential-abundance.Rmd +++ b/differential-abundance.Rmd @@ -51,7 +51,7 @@ In particular, there are three distinct types of error: an increase in the magni We can see each type of error in Figure \@ref(fig:error-proportions). For Species 1, which increases in abundance and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FD (actual FD: 2.3, measured FD: 6.5). For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see a decrease in magnitude of the measured FD (actual FD: 0.15, measured FD: 0.44). -For Species 2, which decreases but by a smaller factor than the mean efficiency, the species actually appears to increase---a reversal in direction of the measured FD (actual FD: 0.6, measured FD: 1.8). +For Species 3, which decreases but by a smaller factor than the mean efficiency, the species actually appears to increase---a reversal in direction of the measured FD (actual FD: 0.6, measured FD: 1.8). In contrast, the FDs in the ratios among species are identical in the actual and measured taxonomic profiles. For example, the ratio of Species 2 to Species 3 shows the same 4-fold decrease in the actual profiles ($1$ to $1/4$) and in the measured profiles ($3$ to $3/4$). @@ -77,27 +77,35 @@ The simplest such regression relates a microbial abundance variable $y$ to a cov (\#eq:regression) y^{(a)} = \alpha + \beta x^{(a)} + \varepsilon^{(a)}. \end{align} +A multiplicative DA analysis can be conducted by setting $y$ equal to the logarithm of the untransformed abundance measurement ($\tilde P_i$, $\tilde R_{i/j}$, or $\tilde A_i$). For example, $y$ may be the log absolute abundance of species $i$ ($\log A_i^{(a)}$) and $x$ may be a continuous variable (such as pH) or a binary variable (such as $x=1$ for treated patients and $x=0$ for untreated controls). The regression coefficients $\alpha$ and $\beta$ are parameters to be estimated from the data. A DA regression analysis is primarily interested in the slope coefficient $\beta$, which determines how the average or expected value of $y$ increases with $x$. For a binary covariate, $\beta$ equals the average difference in $y$ between samples with $x=1$ versus those with $x=0$. -Taking $y$ to be the logarithm of a species' relative or absolute abundance makes the DA analysis multiplicative, with the coefficient $\beta$ corresponding to the average log fold difference (LFD) in abundance between conditions or for a unit increase in $x$. - +When $y$ equals the logarithm of a species' relative or absolute abundance, the coefficient $\beta$ equals the average log fold difference (LFD) in abundance between conditions corresponding to a unit increase in $x$. Although researchers often use more sophisticated models, the simple linear regression model in Equation \@ref(eq:regression) provides an intuitive basis for understanding the effect of taxonomic bias on regression-based DA analyses. -Appendix \@ref(appendix-regression) describes a general framework for finding the error in the estimated slope coefficient $\beta$ caused by taxonomic bias; here, we summarize and illustrate these results for the various abundance measures described in Section \@ref(abundance-measurement). -In each case, a multiplicative DA analysis can be conducted by setting $y$ equal to the logarithm of the untransformed abundance measurement ($\tilde P_i$, $\tilde R_{i/j}$, or $\tilde A_i$). -Consider an abundance measure such as the ratio between species $i$ and $j$, $\tilde R_{i/j}$. -Taxonomic bias creates a constant FE in $\tilde R_{i/j}$ that translates to a constant additive shift in $\log \tilde R_{i/j}$ and therefore does not affect the estimate of the slope $\hat \beta$. -Now consider the absolute abundance of a species $i$ obtained by normalizing its proportion to an non-MGS measurement of total community abundance, which for illustration we assume to be completely accurate. -The measured abundance $\tilde A_i$ thus has a FE that varies inversely with the mean efficiency $\bar B$. +Appendix \@ref(appendix-regression) describes a general framework for finding the error in the estimated slope coefficient $\hat \beta$ caused by taxonomic bias; here, we summarize and illustrate these results for the various abundance measures described in Section \@ref(abundance-measurement). + +The effect that taxonomic bias has on the estimate $\hat \beta$ mirrors its effect on the measured FDs between a pair of individual samples: +For abundance measures with a FE that is constant across samples, bias does not affect $\hat \beta$. +For abundance measures where the FE varies, bias can affect $\hat \beta$, with its exact effect depending on how the mean efficiency varies across samples. + +We illustrate the constant-FE case by considering a regression analysis of the log ratio between species $i$ and $j$, $\log \tilde R_{i/j}$. +Taxonomic bias creates a constant FE in $\tilde R_{i/j}$ that translates to a constant additive error in $\log \tilde R_{i/j}$. +A constant (additive) shift in a response variable only affects the estimated intercept $\hat \alpha$ and not the estimated slope $\hat \beta$. +Thus, in this case, taxonomic bias does not affect $\hat \beta$. + +We illustrate the varying-FE case by considering the absolute abundance of a species $i$ obtained by multiplying the species' MGS proportion by a non-MGS measure of total-abundance, using Equation \@ref(eq:density-prop-meas). +For simplicity, suppose that total-community abundance has been measured completely accurately. +In this case, the measured abundance $\tilde A_i$ has a FE that varies inversely with the mean efficiency $\bar B$. If $\bar B$ is stable across samples, then the additive error in $\log \tilde A_i$ will be approximately constant and $\hat \beta$ is again unaffected---bias has no effect on the DA result. If, on the other hand, $\log \bar B$ tends to linearly vary with the covariate $x$, then the additive error in $\log \tilde A_i$ will linearly vary in the opposite direction and cause a systematic error in $\hat \beta$ that is equal in magnitude but opposite in sign to the slope of $\log \bar B$ with $x$. A third possibility is that $\bar B$ varies in an essentially random manner that is independent of the covariate $x$. In this scenario, the error from taxonomic bias in the measurement $\log \tilde A_{i}$ acts to increase the noise in $\hat \beta$ (i.e., increase its standard error), but does not cause a systematic error. -Figure \@ref(fig:regression-example) shows a simulated example of a regression analysis of the absolute abundances of 10 species across two conditions in which the mean efficiency is systematically greater in the condition $x=1$ than the in the condition $x=0$. -The increase in the log mean efficiency from $x=0$ to $x=1$ (Figure \@ref(fig:regression-example) A and B) causes an artificial decrease in the estimated LFD between conditions for each species (Figure \@ref(fig:regression-example) C and D). +Figure \@ref(fig:regression-example) shows a simulated example of a regression analysis of the absolute abundances of 10 species across two conditions in which the mean efficiency is systematically greater in the condition $x=1$ than in the condition $x=0$. +The increase in the log mean efficiency from $x=0$ to $x=1$ (Figure \@ref(fig:regression-example) A and B) corresponds to an artificial decrease in the estimated LFD between conditions for each species (Figure \@ref(fig:regression-example) C and D). The absolute error created by bias on the LFD estimate is the same for each species; however, its scientific impact varies. For the three species with large positive LFDs (Species 9, 10, and 1), bias decreases the magnitude of the LFD estimate. In contrast, for the three species with large negative LFDs (Species 8, 2, and 3), bias increases the magnitude of the estimate. @@ -110,12 +118,12 @@ The three remaining species (Species 7, 6, and 4) have LFDs near zero, but bias here::here( "notebook/_posts/2021-08-03-simulate-regression-example/", "simulate-regression-example_files/figure-html5", - "main-figure-1.svg" + "main-figure-2-1.svg" ) %>% include_svg ``` -(ref:regression-example) **Taxonomic bias distorts multi-sample differential abundance inference when the mean efficiency of samples is associated with the covariate of interest.** This figure shows the results of a regression analysis of simulated microbiomes consisting of 50 samples and 10 species from two environmental conditions indexed by $x=0$ and $x=1$. In this simulation, the species with the largest efficiency (Species 9) also has the largest positive LFD, which drives a positive association of the log mean efficiency with the condition (shown in Panels A and B). This positive LFD in the log mean efficiency induces a systematic negative shift in the estimated LFDs of all species (Panels C and D). Panel D shows the mean LFD (points) with 95% confidence intervals (CIs), for each species estimated from either the actual or measured abundances. The error (difference in LFD estimates on measured and actual) equals the negative LFD of the mean efficiency (shown in Panel B). +(ref:regression-example) **Taxonomic bias distorts multi-sample differential abundance analysis when the mean efficiency of samples is associated with the covariate of interest.** This figure shows the results of a regression analysis of simulated microbiomes consisting of 10 species and 50 samples from two environmental conditions, which are indexed by $x=0$ and $x=1$. Panel A shows the (log) measurement efficiencies of the 10 species against their differential-abundance parameter values (expected LFD in absolute abundance between a sample from condition $x=1$ to a sample from condition $x=0$). The simulation was created so that the species with the largest efficiency (Sp. 9) also has the largest positive expected LFD. The increased relative abundance of Sp. 9 in condition $x=1$ drives an increase in the (log) mean efficiency (Panel B), which induces a systematic and consistent negative shift in the estimates of the mean LFD for each species (Panels C and D). Panel D shows the estimated mean LFD with 95% confidence interval for each species, when either estimated the actual abundances or the inaccurate measured abundances. The error (estimate from measured abundances minus estimate from actual abundances; red arrow in Panel D) equals the negative mean LFD the mean efficiency (red arrow in Panel B).