From b3fa9b029b05eb99fd980e56f610b6afcd104956 Mon Sep 17 00:00:00 2001
From: Aki Vehtari
Date: Thu, 27 Jun 2024 14:03:17 +0300
Subject: [PATCH] new Pareto-k diagnostics vignette

---
 vignettes/pareto_diagnostics.Rmd | 226 +++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 vignettes/pareto_diagnostics.Rmd

diff --git a/vignettes/pareto_diagnostics.Rmd b/vignettes/pareto_diagnostics.Rmd
new file mode 100644
index 0000000..5c418e8
--- /dev/null
+++ b/vignettes/pareto_diagnostics.Rmd
@@ -0,0 +1,226 @@
+---
+title: "Pareto-khat diagnostics"
+author: "Aki Vehtari"
+output:
+  rmarkdown::html_vignette:
+    toc: true
+    toc_depth: 3
+vignette: >
+  %\VignetteIndexEntry{Pareto-khat diagnostics}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Introduction
+
+The paper
+
+* Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and
+  Jonah Gabry (2024). Pareto smoothed importance sampling.
+  *Journal of Machine Learning Research*, 25(72):1-58.
+
+presents Pareto smoothed importance sampling (PSIS), but also the
+Pareto-$\hat{k}$ diagnostic, which can be used when estimating any
+expectation based on a finite sample. This vignette illustrates the
+use of these diagnostics.
+
+## Example
+
+```{r setup}
+library(posterior)
+options(pillar.neg = FALSE, pillar.subtle = FALSE, pillar.sigfig = 2)
+```
+
+### Simulated data
+
+Generate `xn`, a simulated MCMC sample with 4 chains of 1000
+iterations each, using an AR(1) process with a normal(0, 1) marginal
+distribution.
+
+```{r simulate-data-1}
+N <- 1000
+phi <- 0.3
+set.seed(6534)
+# four independent AR(1) chains, scaled so that the marginal
+# distribution is normal(0, 1)
+dr <- array(data = replicate(4, as.numeric(arima.sim(n = N,
+                                                     list(ar = c(phi)),
+                                                     sd = sqrt(1 - phi^2)))),
+            dim = c(N, 4, 1)) |>
+  as_draws_df() |>
+  set_variables('xn')
+```
+
+Transform `xn` via CDF-inverse-CDF so that we get variables whose
+marginal distributions are $t_3$, $t_{2.5}$, $t_2$, $t_{1.5}$, and
+$t_1$. These all have thick tails. In addition, $t_2$, $t_{1.5}$,
+and $t_1$ have infinite variance, and $t_1$ (aka Cauchy) has an
+infinite mean.
+
+```{r simulate-data-2}
+drt <- dr |>
+  mutate_variables(xt3 = qt(pnorm(xn), df = 3),
+                   xt2_5 = qt(pnorm(xn), df = 2.5),
+                   xt2 = qt(pnorm(xn), df = 2),
+                   xt1_5 = qt(pnorm(xn), df = 1.5),
+                   xt1 = qt(pnorm(xn), df = 1))
+```
+
+### MCMC convergence diagnostics
+
+We examine the draws with the default `summarise_draws()`.
+
+```{r summarise_draws}
+drt |> summarise_draws()
+```
+
+All the usual convergence diagnostics ($\widehat{R}$, Bulk-ESS, and
+Tail-ESS) look good, which is expected, as they have been designed to
+work also in the case of infinite variance (Vehtari et al., 2021).
+
+If these were quantities of interest for which we would like to
+estimate means, we would also be interested in the Monte Carlo
+standard error (MCSE; see the case study [How many iterations to run
+and how many digits to
+report](https://users.aalto.fi/~ave/casestudies/Digits/digits.html)).
+
+```{r summarise_draws-mcse}
+drt |>
+  summarise_draws(mean, sd, mcse_mean, ess_bulk, ess_basic)
+```
+
+Here the MCSE of the mean is based on the standard deviation and
+Basic-ESS, both of which assume finite variance. We did sample also
+from distributions with infinite variance, but given a finite sample
+size the empirical variance estimates are always finite, and thus we
+get overoptimistic MCSE values.
+
+## Pareto-$\hat{k}$
+
+To diagnose whether our quantities of interest may have infinite
+variance, or even an infinite mean, we can use the Pareto-$\hat{k}$
+diagnostic.
+
+```{r summarise_draws-pareto_khat}
+drt |>
+  summarise_draws(mean, sd, mcse_mean, ess_basic, pareto_khat)
+```
+
+$\hat{k} \leq 0$ indicates that all moments exist, and the inverse of
+a positive $\hat{k}$ estimates the number of finite (fractional)
+moments. Thus, $\hat{k} \geq 1/2$ indicates infinite variance, and
+$\hat{k} \geq 1$ indicates an infinite mean. Very thick tails can
+sometimes also affect sampling, but assuming sampling went well and we
+are interested only in quantiles, infinite variance and mean are not a
+problem. If we are interested in the mean, however, we need to care
+about the number of (fractional) moments. Here we see
+$\hat{k} \geq 1/2$ for $t_2$, $t_{1.5}$, and $t_1$, so we should not
+trust their `mcse_mean` values.
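+
+As a quick single-variable check, `pareto_khat()` can also be called
+directly on a matrix of draws; a minimal sketch (assuming the `drt`
+object from above, and using posterior's `extract_variable_matrix()`):
+
+```{r pareto_khat-single}
+# Pareto-khat for one variable: extract the iterations x chains
+# matrix of draws and pass it directly to pareto_khat().
+# For the Cauchy variable xt1 we expect a value close to 1.
+drt |>
+  extract_variable_matrix("xt1") |>
+  pareto_khat()
+```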
+
+## Pareto smoothing
+
+If we really do need those mean estimates, we can improve their
+trustworthiness with Pareto smoothing, which replaces the extreme tail
+draws with the expected ordered statistics of a generalized Pareto
+distribution fitted to the tail of the distribution. The Pareto
+smoothed mean estimate (computed from the Pareto smoothed draws) has
+finite variance, at the cost of some bias, and we know when that bias
+is negligible. As a rule of thumb, the bias is negligible when
+$\hat{k} < 0.7$.
+
+We Pareto smooth all the variables.
+
+```{r pareto_smooth}
+drts <- drt |>
+  mutate_variables(xt3_s = pareto_smooth(xt3),
+                   xt2_5_s = pareto_smooth(xt2_5),
+                   xt2_s = pareto_smooth(xt2),
+                   xt1_5_s = pareto_smooth(xt1_5),
+                   xt1_s = pareto_smooth(xt1)) |>
+  subset_draws(variable = "_s", regex = TRUE)
+```
+
+Now the `mcse_mean` values are more trustworthy when $\hat{k} < 0.7$.
+When $\hat{k} > 0.7$, both bias and variance grow so fast that Pareto
+smoothing rarely helps (see the paper for details).
+
+```{r summarise_draws-pareto_khat-2}
+drts |> summarise_draws(mean, mcse_mean, ess_basic, pareto_khat)
+```
+
+## Minimum sample size required
+
+The bias and variance depend on the sample size, so we can use the
+additional diagnostic `min_ss`, which tells the minimum sample size
+needed for `mcse_mean` to be trustworthy.
+
+```{r summarise_draws-min_ss}
+drt |> summarise_draws(mean, mcse_mean, ess_basic,
+                       pareto_khat, min_ss = pareto_min_ss)
+```
+
+Here the required `min_ss` is smaller than `ess_basic` for all
+variables except $t_1$, for which there is no hope.
+
+## Convergence rate
+
+Given finite variance, the central limit theorem states that to halve
+the MCSE we need a four times larger sample size. With Pareto
+smoothing we can go further, but the convergence rate decreases as
+$\hat{k}$ increases.
+
+```{r summarise_draws-conv_rate}
+drt |> summarise_draws(mean, mcse_mean, ess_basic,
+                       pareto_khat, min_ss = pareto_min_ss,
+                       conv_rate = pareto_convergence_rate)
+```
+
+We see that for $t_2$, $t_{1.5}$, and $t_1$ we would need
+$4^{1/0.86} \approx 5$, $4^{1/0.60} \approx 10$, and $4^{1/0} = \infty$
+times larger sample sizes to halve the MCSE of the mean.
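+
+To spell out that arithmetic: with relative convergence rate $r$,
+halving the MCSE requires a $4^{1/r}$ times larger sample size. A
+minimal sketch (the helper `halving_factor` is illustrative, not a
+posterior function; the rates are those reported above):
+
+```{r halving-factor}
+# Sample size multiplier needed to halve MCSE at convergence rate r;
+# r = 1 is the finite-variance CLT case, and r = 0 means the MCSE
+# cannot be reduced by adding draws.
+halving_factor <- function(r) 4^(1 / r)
+halving_factor(c(1, 0.86, 0.60, 0))
+# approximately 4, 5, 10, Inf
+```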
+
+## Pareto-$\hat{k}$-threshold
+
+The final Pareto diagnostic, the $\hat{k}$-threshold, is useful for
+smaller sample sizes. Here we select only 100 iterations per chain,
+for a total of 400 draws.
+
+```{r summarise_draws-khat_threshold}
+drt |> subset_draws(iteration = 1:100) |>
+  summarise_draws(mean, mcse_mean, ess_basic,
+                  pareto_khat, min_ss = pareto_min_ss,
+                  khat_thres = pareto_khat_threshold,
+                  conv_rate = pareto_convergence_rate)
+```
+
+With only 400 draws, we can trust the Pareto smoothed result only when
+$\hat{k} < 0.62$. For $t_{1.5}$, $\hat{k} \approx 0.64$, and `min_ss`
+reveals that we would probably need more than 560 draws to be on the
+safe side.
+
+## Pareto diagnostics
+
+We can get all of these diagnostics at once with `pareto_diags()`, and
+it's easy to use for derived quantities, too.
+
+```{r summarise_draws-pareto_diags}
+drt |>
+  mutate_variables(xt2_5_sq = xt2_5^2) |>
+  subset_draws(variable = "xt2_5_sq") |>
+  summarise_draws(mean, mcse_mean,
+                  pareto_diags)
+```
+
+## Discussion
+
+All these diagnostics are presented in Section 3 and summarized in
+Table 1 of the PSIS paper (Vehtari et al., 2024).
+
+If you don't need to estimate the means of thick-tailed
+distributions, and there are no sampling issues due to thick tails,
+then you don't need to check for the existence of finite variance, and
+thus there is no need to check Pareto-$\hat{k}$ for all the parameters
+and derived quantities.
+
+It is possible for a distribution to have finite variance while,
+pre-asymptotically, given a finite sample size, its behavior is
+similar to that of an infinite-variance distribution. Thus the
+diagnostic is useful even in cases where theory guarantees finite
+variance.
+
+## References
+
+Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah
+Gabry (2024). Pareto smoothed importance sampling.
+*Journal of Machine Learning Research*, 25(72):1-58.
+
+Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and
+Paul-Christian Bürkner (2021). Rank-normalization, folding, and
+localization: An improved $\widehat{R}$ for assessing convergence of
+MCMC (with discussion). *Bayesian Analysis*, 16(2):667-718.