WIP: bootstrapping submodule, parameter for estimators #94
Conversation
Should work on both dHdl and u_nk standard forms. Still requires:
- tests
- usage example(s)

We additionally want to add a bootstrapping keyword to each estimator's constructor, which changes how uncertainties are estimated at the expense of more compute. This will be the approach we will push users toward using for simplicity, but we still want to expose building blocks by way of the `bootstrap` function to allow for scaling out if needed (e.g. with dask).
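To make the shape of such a building block concrete, here is a minimal sketch using plain numpy/pandas on a toy dHdl-like frame. The `bootstrap` name, signature, and behavior shown here are illustrative assumptions for discussion, not the actual alchemlyb API:

```python
import numpy as np
import pandas as pd

def bootstrap(df, n_resamples=100, seed=None):
    """Yield resampled copies of ``df``, drawing rows with replacement.

    Illustrative sketch only -- not the actual alchemlyb function.
    """
    rng = np.random.default_rng(seed)
    n = len(df)
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        yield df.iloc[idx]

# usage: estimate the uncertainty of the mean of a toy dHdl-like column
df = pd.DataFrame({"dHdl": np.random.default_rng(0).normal(size=1000)})
means = [s["dHdl"].mean() for s in bootstrap(df, n_resamples=200, seed=42)]
err = np.std(means)  # bootstrap estimate of the standard error of the mean
```

Because each resample is an independent unit of work, a loop like this is straightforward to scale out with dask by mapping the statistic over delayed resamples.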
Codecov Report
```
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   97.27%   95.21%   -2.06%
==========================================
  Files          12       13       +1
  Lines         696      711      +15
  Branches      141      143       +2
==========================================
  Hits          677      677
- Misses          5       20      +15
  Partials       14       14
```
Added a gist on how this works currently. Comments welcome!
I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data? For the "blocks" approach: do I have to think about a trajectory of blocks, where each block is the slice information needed to get the original data?
For vanilla bootstrap, yes. For block bootstrap, no -- though to determine the length of the blocks, you do need to know the correlation length.
If I read the gist correctly, then "vanilla bootstrap" is performed there, so if the gist is turned into documentation then data decorrelation should probably be included so that copy&paste will at least produce conservative estimates. From the description in the issue above I still don't understand how block bootstraps will be used. A simple example might be helpful.
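To make the decorrelate-then-bootstrap order concrete, here is a toy sketch in plain numpy (not alchemlyb code): a correlated AR(1) series is thinned by its statistical inefficiency before the vanilla bootstrap. The `decorrelate` helper and the closed-form inefficiency are assumptions for illustration:

```python
import numpy as np

def decorrelate(series, g):
    """Keep every g-th sample; g is the statistical inefficiency
    (assumed known here -- in practice it is measured from the data)."""
    return series[::max(int(np.ceil(g)), 1)]

# correlated toy data: an AR(1) process with phi = 0.9
rng = np.random.default_rng(1)
x = np.empty(5000)
x[0] = rng.normal()
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + rng.normal()

g = (1 + 0.9) / (1 - 0.9)  # statistical inefficiency of AR(1), ~19
indep = decorrelate(x, g)

# vanilla bootstrap on the (approximately) independent samples
means = [rng.choice(indep, size=len(indep)).mean() for _ in range(200)]
err = np.std(means)
```

Bootstrapping the raw correlated series instead would treat every frame as independent and underestimate `err`, which is exactly the pitfall a copy&paste reader should be steered away from.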
@orbeckst thanks! My intent is for the gist to become our actual example for bootstrap usage in the docs, so I am working on adding use of our subsampling functions there. However, it's looking like this isn't straightforward with the current implementation limitations of our subsamplers. I think it may be due to a combination of bitrot and changes in how they work.

I believe you have the right idea on how block bootstrapping works. From @mrshirts, there are a few variations on it, however. Here is a reference I am working from.
The gist has been updated; it requires components of #98, which can be played with on this branch. This addresses @orbeckst's comments on decorrelation for when this notebook becomes the first example doc for bootstrapping.
GREAT question. If you are bootstrapping a single free energy calculation, or a single expectation, pymbar 4 handles it all, and faster than alchemlyb could by wrapping around MBAR.

If you were doing something like a heat capacity calculation with reweighting across multiple temperatures, or where you do some manipulation of the results of multiple free energy calculations, then you really should bootstrap on the outside of that. Entropy + enthalpy are done internally. But if for whatever reason you were calculating k_1<O_1> - k_2<O_2>, where <O_1> and <O_2> are expectations of different observables from the same simulation and k_1 and k_2 are constants, then you can't bootstrap them separately and add the uncertainties together -- you should bootstrap the entire calculation.

If all the functions are just computing a vector f_k from a single set of files that have been subsampled, then the internal bootstrapping should be fine.
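The "bootstrap the entire calculation" point can be shown with a toy numpy sketch. `O1`, `O2`, `k1`, and `k2` are made-up stand-ins for correlated expectations from the same simulation; resampling the combined expression frame-by-frame gives a very different uncertainty than bootstrapping each expectation separately and adding in quadrature:

```python
import numpy as np

# Toy stand-ins for two observables measured on the same frames
# (hence strongly correlated); all names here are illustrative.
rng = np.random.default_rng(2)
n = 2000
base = rng.normal(size=n)            # shared fluctuation
O1 = base + 0.1 * rng.normal(size=n)
O2 = base + 0.1 * rng.normal(size=n)
k1, k2 = 2.0, 1.5

# Correct: resample frames jointly, recompute the whole expression.
joint = []
for _ in range(300):
    idx = rng.integers(0, n, size=n)
    joint.append(k1 * O1[idx].mean() - k2 * O2[idx].mean())
err_joint = np.std(joint)

# Wrong: bootstrap each expectation separately, add in quadrature.
# This ignores the O1/O2 correlation, which here cancels in the
# difference, so the separate estimate badly overestimates the error.
e1 = np.std([k1 * O1[rng.integers(0, n, size=n)].mean() for _ in range(300)])
e2 = np.std([k2 * O2[rng.integers(0, n, size=n)].mean() for _ in range(300)])
err_separate = np.hypot(e1, e2)
```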
This PR addresses #80.

We have added an `alchemlyb.preprocessing.bootstrapping` submodule. The functions exposed by this module should work on both `dHdl` and `u_nk` standard forms, with a similar usage feel to the `alchemlyb.preprocessing.subsampling` functions. The following are needed to complete this PR:

- tests for the `bootstrapping` submodule
- usage example(s)

In addition to the function(s), we want to add a bootstrapping keyword (or set of keywords) to each estimator's constructor, which changes how uncertainties are estimated internally to the estimator at the expense of more compute. This will be the approach we will push users toward using for simplicity, but we still want to expose the building blocks by way of the `bootstrapping` module functions to allow for scaling out if needed (e.g. with dask).

As for block bootstrapping, we want to include an approach that bootstraps whole blocks of a timeseries instead of individual samples from that timeseries. Blocks should be defined based on the correlation time determined from the timeseries, or alternatively specified by the user. There are at least a few ways to do this, so this may require more debate and development. The advantage of block bootstrapping is that the resulting bootstrapped samples maintain correlation within their blocks, which may be desirable in some cases.
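As a rough illustration of the block-bootstrap idea described above (a sketch for discussion, not the eventual alchemlyb implementation), whole contiguous blocks are drawn with replacement so that correlation within each block is preserved:

```python
import numpy as np

def block_bootstrap(x, block_length, n_resamples=100, seed=None):
    """Yield resamples built from whole contiguous blocks of ``x``
    drawn with replacement. ``block_length`` would normally come
    from the measured correlation time; here it is user-supplied.
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n_blocks = len(x) // block_length
    starts = np.arange(n_blocks) * block_length
    for _ in range(n_resamples):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        idx = np.concatenate(
            [np.arange(s, s + block_length) for s in chosen]
        )
        yield x[idx]

# usage on a mildly correlated toy series
rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(size=1000)) * 0.01 + rng.normal(size=1000)
errs = np.std(
    [b.mean() for b in block_bootstrap(x, block_length=50,
                                       n_resamples=200, seed=7)]
)
```

This is the simplest fixed-length variant; the variations mentioned above (e.g. overlapping or randomized block lengths) differ mainly in how `starts` and `block_length` are chosen.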