
ArviZ 1.0 ideas

Oriol Abril-Pla edited this page Apr 5, 2022 · 2 revisions

This document is an attempt at writing a wishlist for ArviZ 1.0 and at proposing a layout/structure for it. I am pasting some ideas below, but the goal is not to implement them yet. Instead, I think we should dedicate this year to gathering "wishes" and to proposing and testing design ideas, so that we can dedicate CZI resources next year to making the chosen ideas a reality.


Oriol Ideas

Plotting modularity

I think we should follow https://github.com/arviz-devs/arviz/wiki/Plot-hierarchy, and even take it a bit further.

I would have the base plots be plot types defined by visual style only: line, kde, bar (used for both rank and hist)... We would define an API for them based on our needs and implement them with all backends (or with no backend, printing/returning data). Base plots would therefore get all their inputs already pre-processed; they are mostly a way to support multiple backends with minimal effort and duplication. Above that level, the backend becomes nothing more than an argument that is passed down unmodified.

We would then have the atomic plots, which are atomic in the statistical sense. They provide the building blocks that we can then combine:

  • plot_posterior = plot_dist + plot_interval + plot_point
    • plot_dist basically takes care only of choosing between kde, hist, ecdf? or quantile plot
      • kde takes the data, computes the kde and calls plot_line, a base plot
      • hist takes the data, computes the histogram and calls plot_bars
      • quantile ... (though given its nature and relation between y position of dots and figure size this might need to be a base plot)
    • plot_interval takes the data, computes the hdi or eti and calls plot_line and plot_text. plot_line is the same base plot used by kde; we only need to set the y manually and make it constant.
    • plot_point computes the mean/median/mode and calls plot_scatter and plot_text
  • ...

Note all the "takes the data" above. For easy faceting, the facet functions need to call xarray_var_iter and similar helpers, and call the multiple functions on the same data (but with different kwargs if requested). Base plots take the input needed for plotting so that no processing needs to be duplicated.
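To make the layering concrete, here is a minimal sketch of how plot_posterior could be composed. It assumes only a "none" backend that returns the data a real backend would draw, uses the standard library instead of ArviZ stats (an equal-tailed interval from `statistics.quantiles`, a hand-rolled histogram), and every signature and the `_element` helper are hypothetical.

```python
import statistics


def _element(kind, backend, **data):
    # Stand-in for a real draw call: the "none" backend just returns the data.
    return {"kind": kind, "backend": backend, **data}


# Base plots: purely visual, they receive inputs that are already processed.
def plot_line(x, y, backend="none"):
    return _element("line", backend, x=list(x), y=list(y))


def plot_bars(edges, heights, backend="none"):
    return _element("bars", backend, edges=list(edges), heights=list(heights))


def plot_scatter(x, y, backend="none"):
    return _element("scatter", backend, x=list(x), y=list(y))


def plot_text(x, y, text, backend="none"):
    return _element("text", backend, x=x, y=y, text=text)


# Atomic plots: each one takes the data, computes, then calls base plots.
def hist(data, bins=10, backend="none"):
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    heights = [0] * bins
    for value in data:
        heights[min(int((value - lo) / width), bins - 1)] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return plot_bars(edges, heights, backend=backend)


def plot_dist(data, kind="hist", backend="none"):
    if kind != "hist":
        raise NotImplementedError("only 'hist' is sketched here")
    return [hist(data, backend=backend)]


def plot_interval(data, y=0.0, backend="none"):
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    low, high = cuts[2], cuts[96]  # 3% and 97% cut points (94% ETI)
    return [
        plot_line([low, high], [y, y], backend=backend),  # constant y
        plot_text(low, y, f"{low:.2f}", backend=backend),
        plot_text(high, y, f"{high:.2f}", backend=backend),
    ]


def plot_point(data, y=0.0, backend="none"):
    mean = statistics.fmean(data)
    return [
        plot_scatter([mean], [y], backend=backend),
        plot_text(mean, y, f"mean={mean:.2f}", backend=backend),
    ]


def plot_posterior(data, backend="none"):
    # The backend is passed down unmodified through every layer.
    return (
        plot_dist(data, backend=backend)
        + plot_interval(data, backend=backend)
        + plot_point(data, backend=backend)
    )
```

A facet function would then loop over variables and coordinates (for example with xarray_var_iter) and call plot_posterior, or any other combination of atomic plots, on each subset.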

We would also need to discuss how to implement the plots, especially the base ones. Do we want to keep the (optional) axes in -> axes out approach? Do we want to use Artists (or another non-matplotlib equivalent)? Or do we want them to always return an ArviZLine object, à la Artist, that "hides" the backend's actual plot object inside a common interface/type?
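As one possible reading of the last option, the sketch below wraps whatever the backend returns in a common type. The fields, the `set_color` method and the `_native` attribute are all hypothetical; the only backend call assumed is matplotlib's `Line2D.set_color`.

```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class ArviZLine:
    """Hypothetical backend-agnostic handle returned by a base line plot."""

    x: Sequence[float]
    y: Sequence[float]
    backend: str = "none"
    style: dict = field(default_factory=dict)
    # Backend-specific object (for example a matplotlib Line2D), kept private.
    _native: Any = field(default=None, repr=False)

    def set_color(self, color):
        """Record the color and forward it to the native object if there is one."""
        self.style["color"] = color
        if self.backend == "matplotlib" and self._native is not None:
            self._native.set_color(color)
        return self
```

Callers would only ever touch the common interface, so the same code could restyle a line regardless of which backend drew it.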

Use proper ufuncs/gufuncs

For better speed and integration with dask, we need to write our functions as ufuncs as much as possible. Here I am not yet sure about what to do.

One option would be to make numba a hard requirement of the stats module and replace wrap_xarray_ufunc with slightly different functions and numba decorators. Another option is to keep the current approach but use the numba alternative whenever possible, which might lead to some duplication, though hopefully not too much.

Background: Our current approach of using xarray.apply_ufunc and wrap_xarray_ufunc everywhere is great as we always preserve coords and dims, and we can also use dask as computational backend instead of numpy. However, the use we can make of dask is quite limited. xarray.apply_ufunc supports 3 modes:

  • dask="forbidden" to enforce computation with numpy
  • dask="allowed" to use dask as numpy substitute
  • dask="parallelized" to use numpy for computation but to have xarray use dask to parallelize the calls to numpy.

wrap_xarray_ufunc can only support dask="parallelized" which is much less performant. The reason for that is that it takes functions that are not ufuncs (and thus can't be used directly with xarray.apply_ufunc) and cleverly loops over the dimensions of the input array to call the function on all the relevant subsets. From what I understand, this use of loops prevents dask from being able to completely take over.
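The difference between the two styles can be sketched with plain numpy and xarray. The `looped` helper below is a simplified stand-in for the looping strategy described above, not the actual wrap_xarray_ufunc code, and the median absolute deviation is just an arbitrary example statistic.

```python
import numpy as np
import xarray as xr


def mad_1d(x):
    """Not a ufunc: accepts a single 1D array and returns a scalar."""
    if x.ndim != 1:
        raise ValueError("mad_1d only supports 1D input")
    return np.median(np.abs(x - np.median(x)))


def looped(func, ary):
    """Call func on every 1D slice along the last axis (explicit Python loop)."""
    out = np.empty(ary.shape[:-1])
    for idx in np.ndindex(*ary.shape[:-1]):
        out[idx] = func(ary[idx])
    return out


def mad_gufunc(x):
    """gufunc-style, signature (n)->(): reduces the last axis, broadcasts the rest."""
    med = np.median(x, axis=-1, keepdims=True)
    return np.median(np.abs(x - med), axis=-1)


rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(4, 100, 3)),
    dims=("chain", "draw", "school"),
    coords={"school": ["a", "b", "c"]},
)

# apply_ufunc moves the core dimension "draw" to the last axis before calling.
loop_result = xr.apply_ufunc(
    lambda a: looped(mad_1d, a), da, input_core_dims=[["draw"]]
)
vec_result = xr.apply_ufunc(mad_gufunc, da, input_core_dims=[["draw"]])

xr.testing.assert_allclose(loop_result, vec_result)
```

Both calls keep the chain and school dims and coords, but only the second hands the whole array to array operations in one go. Whether such a function can actually run with dask="allowed" still depends on every operation inside it having a dask implementation.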

Experiments and proposals

Dependency handling (significantly less important)

I think that more and more, inference will be run on clusters, cloud services... Having matplotlib as a dependency of ArviZ can make installing it in those minimal environments complicated, defeating one of the goals of InferenceData.

I think we should make matplotlib an optional dependency like bokeh, which might require some refactoring of the plotting utils (or alternatively supporting matplotlib alone or matplotlib+bokeh for bokeh plots). We could then have arviz[all] be the recommended(?) full installation, but also provide arviz for a raw data+stats only install, arviz[mpl] for matplotlib only...
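On the code side, an optional backend usually means importing it lazily and failing with an install hint. A minimal sketch, where the `require_backend` helper is hypothetical and the extras names are the ones proposed above:

```python
import importlib


def require_backend(name, extra):
    """Import an optional plotting backend or fail with an install hint."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"{name} is needed for this plot; "
            f"install it with `pip install arviz[{extra}]`"
        ) from err
```

A plotting function would call `require_backend("matplotlib", "mpl")` only when it is about to draw, so importing the data and stats parts never touches matplotlib.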

Note: If it were possible to easily automate releases and version number updates so they are coordinated among multiple packages, it might also be worth having arviz be a "placeholder" that installs arviz-data, arviz-stats and arviz-plots (all always released together and with their dependencies pinned). It is great to provide all the converters ourselves, but I am increasingly getting the impression that users believe arviz functions can only be used on arviz-generated InferenceData and that InferenceData limitations extend to stats and plots. In reality, one can write their own InferenceData converter and use a *, draw, chain or chain, *, draw dimension order; stats and plots only care about the dimension names. Yet I would bet that if we did a poll, basically everybody would answer that this is not supported.

This also has the nice "property" that doing pip install arviz will do the same in <1 and >=1.
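The point in the note about dimension names can be illustrated with plain xarray. This is a sketch with a made-up variable and a stand-in `posterior_mean` function rather than an actual ArviZ call: as long as a reduction refers to "chain" and "draw" by name, the order in which a custom converter stores them does not matter.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
samples = rng.normal(size=(4, 500, 3))

# Conventional layout: chain, draw, *
standard = xr.Dataset({"theta": (("chain", "draw", "school"), samples)})
# Layout a custom converter might produce: *, draw, chain
custom = standard.transpose("school", "draw", "chain")


def posterior_mean(ds):
    # Refers to the sample dimensions by name, never by position.
    return ds.mean(dim=["chain", "draw"])


xr.testing.assert_allclose(posterior_mean(standard), posterior_mean(custom))
```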