From b3fa9b029b05eb99fd980e56f610b6afcd104956 Mon Sep 17 00:00:00 2001
From: Aki Vehtari
Date: Thu, 27 Jun 2024 14:03:17 +0300
Subject: [PATCH] new Pareto-k diagnostics vignette

---
 vignettes/pareto_diagnostics.Rmd | 226 +++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 vignettes/pareto_diagnostics.Rmd

diff --git a/vignettes/pareto_diagnostics.Rmd b/vignettes/pareto_diagnostics.Rmd
new file mode 100644
index 0000000..5c418e8
--- /dev/null
+++ b/vignettes/pareto_diagnostics.Rmd
@@ -0,0 +1,226 @@
+---
+title: "Pareto-khat diagnostics"
+author: "Aki Vehtari"
+output:
+  rmarkdown::html_vignette:
+    toc: true
+    toc_depth: 3
+vignette: >
+  %\VignetteIndexEntry{Pareto-khat diagnostics}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Introduction
+
+The paper
+
+* Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and
+  Jonah Gabry (2024). Pareto smoothed importance sampling.
+  *Journal of Machine Learning Research*, 25(72):1-58.
+
+presents Pareto smoothed importance sampling (PSIS), but also the
+Pareto-$\hat{k}$ diagnostic, which can be used when estimating any
+expectation based on a finite sample. This vignette illustrates the
+use of these diagnostics.
+
+## Example
+
+```{r setup}
+library(posterior)
+options(pillar.neg = FALSE, pillar.subtle = FALSE, pillar.sigfig = 2)
+```
+
+### Simulated data
+
+Generate `xn`, a simulated MCMC sample with 4 chains of 1000
+iterations each, using an AR(1) process with a normal(0, 1) marginal
+distribution.
+
+```{r simulate-data-1}
+N <- 1000
+phi <- 0.3
+set.seed(6534)
+# four independent AR(1) chains, scaled so that the marginal
+# distribution is normal(0, 1)
+dr <- array(data = replicate(4, as.numeric(arima.sim(n = N,
+                                                     list(ar = c(phi)),
+                                                     sd = sqrt(1 - phi^2)))),
+            dim = c(N, 4, 1)) |>
+  as_draws_df() |>
+  set_variables('xn')
+```
+
+Transform `xn` via CDF-inverse-CDF so that we get variables whose
+marginal distributions are $t_3$, $t_{2.5}$, $t_2$, $t_{1.5}$, and
+$t_1$. These all have thick tails. In addition, $t_2$, $t_{1.5}$,
+and $t_1$ have infinite variance, and $t_1$ (aka Cauchy) has an
+infinite mean.
+
+```{r simulate-data-2}
+drt <- dr |>
+  mutate_variables(xt3 = qt(pnorm(xn), df = 3),
+                   xt2_5 = qt(pnorm(xn), df = 2.5),
+                   xt2 = qt(pnorm(xn), df = 2),
+                   xt1_5 = qt(pnorm(xn), df = 1.5),
+                   xt1 = qt(pnorm(xn), df = 1))
+```
+
+### MCMC convergence diagnostics
+
+We examine the draws with the default `summarise_draws()`.
+
+```{r summarise_draws}
+drt |> summarise_draws()
+```
+
+All the usual convergence diagnostics ($\widehat{R}$, Bulk-ESS, and
+Tail-ESS) look good, which is expected, as they have been designed to
+work also in the case of infinite variance (Vehtari et al., 2021).
+
+If these were quantities of interest for which we would like to
+estimate means, we would also be interested in the Monte Carlo
+standard error (MCSE; see the case study [How many iterations to run
+and how many digits to
+report](https://users.aalto.fi/~ave/casestudies/Digits/digits.html)).
+
+```{r summarise_draws-mcse}
+drt |>
+  summarise_draws(mean, sd, mcse_mean, ess_bulk, ess_basic)
+```
+
+Here the MCSE of the mean is based on the standard deviation and
+Basic-ESS, both of which assume finite variance. We did sample also
+from distributions with infinite variance, but given a finite sample
+size the empirical variance estimates are always finite, and thus we
+get overoptimistic MCSE values.
+
+## Pareto-$\hat{k}$
+
+To diagnose whether our quantities of interest may have infinite
+variance, or even an infinite mean, we can use the Pareto-$\hat{k}$
+diagnostic.
+
+```{r summarise_draws-pareto_khat}
+drt |>
+  summarise_draws(mean, sd, mcse_mean, ess_basic, pareto_khat)
+```
+
+$\hat{k} \leq 0$ indicates that all moments exist, and the inverse of
+a positive $\hat{k}$ estimates the number of finite (fractional)
+moments. Thus, $\hat{k} \geq 1/2$ indicates infinite variance, and
+$\hat{k} \geq 1$ indicates an infinite mean. Very thick tails can
+sometimes also affect sampling, but assuming sampling went well and we
+are interested only in quantiles, infinite variance and mean are not a
+problem. If we are interested in the mean, however, we need to care
+about the number of (fractional) moments. Here we see
+$\hat{k} \geq 1/2$ for $t_2$, $t_{1.5}$, and $t_1$, so we should not
+trust their `mcse_mean` values.
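+
+As a quick single-variable check, `pareto_khat()` can also be called
+directly on a matrix of draws; a minimal sketch (assuming the `drt`
+object from above, and using posterior's `extract_variable_matrix()`):
+
+```{r pareto_khat-single}
+# Pareto-khat for one variable: extract the iterations x chains
+# matrix of draws and pass it directly to pareto_khat().
+# For the Cauchy variable xt1 we expect a value close to 1.
+drt |>
+  extract_variable_matrix("xt1") |>
+  pareto_khat()
+```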
+
+## Pareto smoothing
+
+If we really do need those mean estimates, we can improve their
+trustworthiness with Pareto smoothing, which replaces the extreme tail
+draws with the expected ordered statistics of a generalized Pareto
+distribution fitted to the tail of the distribution. The Pareto
+smoothed mean estimate (computed from the Pareto smoothed draws) has
+finite variance, at the cost of some bias, and we know when that bias
+is negligible. As a rule of thumb, the bias is negligible when
+$\hat{k} < 0.7$.
+
+We Pareto smooth all the variables.
+
+```{r pareto_smooth}
+drts <- drt |>
+  mutate_variables(xt3_s = pareto_smooth(xt3),
+                   xt2_5_s = pareto_smooth(xt2_5),
+                   xt2_s = pareto_smooth(xt2),
+                   xt1_5_s = pareto_smooth(xt1_5),
+                   xt1_s = pareto_smooth(xt1)) |>
+  subset_draws(variable = "_s", regex = TRUE)
+```
+
+Now the `mcse_mean` values are more trustworthy when $\hat{k} < 0.7$.
+When $\hat{k} > 0.7$, both bias and variance grow so fast that Pareto
+smoothing rarely helps (see the paper for details).
+
+```{r summarise_draws-pareto_khat-2}
+drts |> summarise_draws(mean, mcse_mean, ess_basic, pareto_khat)
+```
+
+## Minimum sample size required
+
+The bias and variance depend on the sample size, so we can use the
+additional diagnostic `min_ss`, which tells the minimum sample size
+needed for `mcse_mean` to be trustworthy.
+
+```{r summarise_draws-min_ss}
+drt |> summarise_draws(mean, mcse_mean, ess_basic,
+                       pareto_khat, min_ss = pareto_min_ss)
+```
+
+Here the required `min_ss` is smaller than `ess_basic` for all
+variables except $t_1$, for which there is no hope.
+
+## Convergence rate
+
+Given finite variance, the central limit theorem states that to halve
+the MCSE we need a four times larger sample size. With Pareto
+smoothing we can go further, but the convergence rate decreases as
+$\hat{k}$ increases.
+
+```{r summarise_draws-conv_rate}
+drt |> summarise_draws(mean, mcse_mean, ess_basic,
+                       pareto_khat, min_ss = pareto_min_ss,
+                       conv_rate = pareto_convergence_rate)
+```
+
+We see that for $t_2$, $t_{1.5}$, and $t_1$ we would need
+$4^{1/0.86} \approx 5$, $4^{1/0.60} \approx 10$, and $4^{1/0} = \infty$
+times larger sample sizes to halve the MCSE of the mean.
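+
+To spell out that arithmetic: with relative convergence rate $r$,
+halving the MCSE requires a $4^{1/r}$ times larger sample size. A
+minimal sketch (the helper `halving_factor` is illustrative, not a
+posterior function; the rates are those reported above):
+
+```{r halving-factor}
+# Sample size multiplier needed to halve MCSE at convergence rate r;
+# r = 1 is the finite-variance CLT case, and r = 0 means the MCSE
+# cannot be reduced by adding draws.
+halving_factor <- function(r) 4^(1 / r)
+halving_factor(c(1, 0.86, 0.60, 0))
+# approximately 4, 5, 10, Inf
+```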
+
+## Pareto-$\hat{k}$-threshold
+
+The final Pareto diagnostic, the $\hat{k}$-threshold, is useful for
+smaller sample sizes. Here we select only 100 iterations per chain,
+for a total of 400 draws.
+
+```{r summarise_draws-khat_threshold}
+drt |> subset_draws(iteration = 1:100) |>
+  summarise_draws(mean, mcse_mean, ess_basic,
+                  pareto_khat, min_ss = pareto_min_ss,
+                  khat_thres = pareto_khat_threshold,
+                  conv_rate = pareto_convergence_rate)
+```
+
+With only 400 draws, we can trust the Pareto smoothed result only when
+$\hat{k} < 0.62$. For $t_{1.5}$, $\hat{k} \approx 0.64$, and `min_ss`
+reveals that we would probably need more than 560 draws to be on the
+safe side.
+
+## Pareto diagnostics
+
+We can get all of these diagnostics at once with `pareto_diags()`, and
+it's easy to use for derived quantities, too.
+
+```{r summarise_draws-pareto_diags}
+drt |>
+  mutate_variables(xt2_5_sq = xt2_5^2) |>
+  subset_draws(variable = "xt2_5_sq") |>
+  summarise_draws(mean, mcse_mean,
+                  pareto_diags)
+```
+
+## Discussion
+
+All these diagnostics are presented in Section 3 and summarized in
+Table 1 of the PSIS paper (Vehtari et al., 2024).
+
+If you don't need to estimate the means of thick-tailed
+distributions, and there are no sampling issues due to thick tails,
+then you don't need to check for the existence of finite variance, and
+thus there is no need to check Pareto-$\hat{k}$ for all the parameters
+and derived quantities.
+
+It is possible for a distribution to have finite variance while,
+pre-asymptotically, given a finite sample size, its behavior is
+similar to that of an infinite-variance distribution. Thus the
+diagnostic is useful even in cases where theory guarantees finite
+variance.
+
+## References
+
+Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah
+Gabry (2024). Pareto smoothed importance sampling.
+*Journal of Machine Learning Research*, 25(72):1-58.
+
+Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and
+Paul-Christian Bürkner (2021). Rank-normalization, folding, and
+localization: An improved $\widehat{R}$ for assessing convergence of
+MCMC (with discussion). *Bayesian Analysis*, 16(2):667-718.