ArviZ 1.0 ideas
This document is an attempt at writing a wishlist for ArviZ 1.0 and at proposing a layout/structure for it. I am pasting some ideas below, but the goal is not to implement them yet; rather, I think we should dedicate this year to gathering "wishes" and to proposing and testing design ideas, so that we can dedicate CZI resources next year to making the chosen ideas a reality.
I think we should follow https://github.com/arviz-devs/arviz/wiki/Plot-hierarchy, and even take it a bit further.
I would have base plots be plot types defined by visual style only: line, kde, bar (used for both rank and hist)... We define an API for them based on our needs and implement it for all backends (or with no backend, printing/returning the data). These base plots therefore get all their inputs already pre-processed; they are mostly a way to support multiple backends with minimal effort and duplication. Above that level, the backend becomes nothing more than an argument that is passed down unmodified.
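A tiny sketch of what such a base plot could look like. All names, the backend registry, and the "none" backend here are my own illustration under the assumptions above, not an existing ArviZ API:

```python
# Hypothetical backend-agnostic base plot. The base plot does no statistical
# processing at all; inputs are assumed fully pre-processed by the caller.

def _plot_line_matplotlib(ax, x, y, **kwargs):
    # the matplotlib implementation would live here (imported lazily)
    ...

def _plot_line_none(ax, x, y, **kwargs):
    # "no backend": just return the pre-processed data unchanged
    return {"x": x, "y": y, **kwargs}

_LINE_BACKENDS = {"matplotlib": _plot_line_matplotlib, "none": _plot_line_none}

def plot_line(x, y, backend="none", ax=None, **kwargs):
    """Base plot: a line in a given visual style, dispatched by backend."""
    return _LINE_BACKENDS[backend](ax, x, y, **kwargs)
```

With this shape, atomic plots only ever pass `backend` down unmodified, which is the point of the layer.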
We would then have the atomic plots, which are atomic in the statistical sense. These provide the building blocks we can then combine:
- plot_posterior = plot_dist + plot_interval + plot_point
  - plot_dist basically only takes care of choosing between a kde, hist, ecdf? or quantile plot:
    - kde takes the data, computes the kde and calls `plot_line`, a base plot
    - hist takes the data, computes the histogram and calls `plot_bars`
    - quantile ... (though given its nature and the relation between the y position of the dots and the figure size, this might need to be a base plot)
  - plot_interval takes the data, computes the hdi or eti and calls `plot_line` and `plot_text`. `plot_line` is the same base plot used by kde; we only need to set the y manually and make it constant.
  - plot_point computes the mean/median/mode and calls `plot_scatter` and `plot_text`
- ...
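To make the composition concrete, here is a minimal numpy-only sketch of the hierarchy above. The signatures are hypothetical and the functions return plain dicts naming the base plot they would call, instead of actually dispatching to a backend:

```python
# Illustrative sketch only: real atomic plots would call base plots with a
# backend argument; here they just return the computed plotting inputs.
import numpy as np

def plot_dist(data, kind="hist", bins=30):
    # chooses the representation of the distribution
    if kind == "hist":
        counts, edges = np.histogram(data, bins=bins)
        return {"base": "plot_bars", "edges": edges, "counts": counts}
    raise NotImplementedError(f"kind={kind!r}")

def plot_interval(data, prob=0.94):
    # eti for simplicity; hdi would slot in the same way
    lo, hi = np.percentile(data, [100 * (1 - prob) / 2, 100 * (1 + prob) / 2])
    return {"base": "plot_line + plot_text", "interval": (lo, hi)}

def plot_point(data, point_estimate="mean"):
    value = {"mean": np.mean, "median": np.median}[point_estimate](data)
    return {"base": "plot_scatter + plot_text", "value": value}

def plot_posterior(data):
    # plot_posterior = plot_dist + plot_interval + plot_point
    return {
        "dist": plot_dist(data),
        "interval": plot_interval(data),
        "point": plot_point(data),
    }
```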
Note that all of these "take the data". For easy facetting, the facet functions need to call `xarray_var_iter` and similar helpers, and then call the individual plotting functions on the same data (but with different kwargs if requested). Base plots take exactly the input needed for plotting, so that no processing needs to be duplicated.
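A hedged sketch of that facetting pattern, using plain dicts of arrays as a stand-in for an xarray Dataset and a simplified `var_iter` in place of `xarray_var_iter` (which also iterates over coords); all names are illustrative:

```python
# Facetting sketch: iterate the variables once, call the same atomic plot on
# each subset, optionally with per-variable kwargs.
import numpy as np

def var_iter(dataset):
    # simplified stand-in for xarray_var_iter: yield (name, values) pairs
    yield from dataset.items()

def facet(dataset, plot_func, per_var_kwargs=None):
    per_var_kwargs = per_var_kwargs or {}
    return {
        name: plot_func(values, **per_var_kwargs.get(name, {}))
        for name, values in var_iter(dataset)
    }

posterior = {"mu": np.zeros(100), "sigma": np.ones(100)}
facets = facet(
    posterior,
    lambda x, **kw: {"mean": float(np.mean(x)), **kw},
    per_var_kwargs={"sigma": {"color": "red"}},
)
```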
We would also need to discuss how to implement the plots, especially the base ones. Do we want to keep the (optionally) axes in -> axes out approach? Do we want to use Artists (or a non-matplotlib equivalent)? Or do we want them to always return an `ArviZLine` object, a la artist, "hiding" the actual plot object from the backend inside a common interface/type?
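For reference, the `ArviZLine` option could look something like the following; this is entirely hypothetical, and only a "none" backend branch is sketched:

```python
# A thin wrapper "a la artist" that hides the backend-native object behind a
# common type. Method names and the "none" backend are assumptions.
from dataclasses import dataclass
from typing import Any

@dataclass
class ArviZLine:
    backend: str
    native: Any  # the matplotlib Line2D, bokeh glyph, or raw data

    def set_color(self, color):
        # each backend branch would translate to the native API;
        # only the "none" backend is implemented in this sketch
        if self.backend == "none":
            self.native["color"] = color
        else:
            raise NotImplementedError(self.backend)
        return self

line = ArviZLine(backend="none", native={"x": [0, 1], "y": [0, 1]})
line.set_color("C0")
```

The appeal of this design is that user code never touches a backend object directly, so switching backends cannot break downstream customization.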
For better speed and integration with dask, we need to write our functions as ufuncs as much as possible. Here I am not yet sure what to do.
One option would be to make numba a hard requirement of the stats module and substitute `wrap_xarray_ufunc` with slightly different functions plus numba decorators. Another option is to keep it but use the numba alternative whenever possible, which might end up with some duplication, though hopefully not too much.
Background: Our current approach of using `xarray.apply_ufunc` and `wrap_xarray_ufunc` everywhere is great because we always preserve coords and dims, and we can also use dask as the computational backend instead of numpy. However, the use we can make of dask is quite limited. `xarray.apply_ufunc` supports 3 modes:
- `dask="forbidden"` to enforce computation with numpy
- `dask="allowed"` to use dask as a numpy substitute
- `dask="parallelized"` to use numpy for computation but have xarray use dask to parallelize the calls to numpy
`wrap_xarray_ufunc` can only support `dask="parallelized"`, which is much less performant. The reason for that is that it takes functions that are not ufuncs (and thus can't be used directly with `xarray.apply_ufunc`) and cleverly loops over the dimensions of the input array to call the function on all the relevant subsets. From what I understand, this use of loops prevents dask from being able to completely take over.
- Proof of concept of using pure xarray and xarray-einstats to implement an equivalent to `az.rhat`: https://github.com/OriolAbril/calaix_de_sastre/blob/main/arviz_lab/stats_diagnostics_refactoring.ipynb
I think that, more and more, inference will be run on clusters, cloud services... Having matplotlib as a dependency of ArviZ can make installing it in those minimal environments complicated, defeating one of the goals of InferenceData.
I think we should make matplotlib an optional dependency like bokeh, which might require some refactoring of the plotting utils (or alternatively supporting matplotlib alone or matplotlib+bokeh for bokeh plots). We could then have `arviz[all]` be the recommended? full installation, but also provide `arviz` for a raw data+stats only install, `arviz[mpl]` for matplotlib only...
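Making matplotlib optional would likely mean lazy backend imports with a helpful error message. A sketch, where the error text and the `arviz[...]` extras names are assumptions rather than decided conventions:

```python
# Lazy/optional backend import: the heavy plotting dependency is only
# imported when a plot is actually requested, so the data+stats install
# stays lightweight.
import importlib

def get_backend(name="matplotlib"):
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"Plotting with backend {name!r} requires the optional "
            f"dependency {name!r}; install it with `pip install arviz[{name}]`."
        ) from err
```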
Note: If it were possible to easily automate releases and version-number updates to coordinate among multiple packages, it might also be worth having `arviz` be a "placeholder" that installs `arviz-data`, `arviz-stats` and `arviz-plots` (all always released together and with their dependencies pinned). It is great that we provide all the converters ourselves, but I am more and more getting the impression that people believe arviz functions can only be used on arviz-generated InferenceData, and that InferenceData's limitations extend to stats and plots. In fact, one can write their own InferenceData converter and use `*, draw, chain` or `chain, *, draw` dimension order; stats and plots only care about the dimension names. Yet I would bet that if we did a poll, basically everybody would answer that this is not supported.
This also has the nice "property" that doing `pip install arviz` will do the same thing in <1 and >=1.