
WIP: bootstrapping submodule, parameter for estimators #94

Closed
wants to merge 2 commits

Conversation

@dotsdl (Member) commented Jan 20, 2020

This PR addresses #80.

We have added an alchemlyb.preprocessing.bootstrapping submodule. The functions exposed by this module should work on both dHdl and u_nk standard forms, with a similar usage feel to the alchemlyb.preprocessing.subsampling functions. The following are needed to complete this PR:

  • tests
  • usage example(s)
  • preprocessing documentation for bootstrapping submodule
  • addition to base estimator class
  • at least one block bootstrapping implementation

In addition to the function(s), we want to add a bootstrapping keyword (or set of keywords) to each estimator's constructor, which changes how uncertainties are estimated internally to the estimator at the expense of more compute. This will be the approach we push users toward for simplicity, but we still want to expose the building blocks by way of the bootstrapping module functions to allow for scaling out if needed (e.g. with dask).
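As a rough illustration of what such a keyword would do internally (this is a hypothetical sketch, not the proposed alchemlyb API: `bootstrap_error` and its signature are invented here), a vanilla bootstrap of an estimator's uncertainty looks like:

```python
import numpy as np

def bootstrap_error(data, estimator, n_boot=200, seed=0):
    """Vanilla bootstrap: resample rows with replacement and return the
    standard deviation of the estimator over the resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        stats[i] = estimator(data[idx])
    return stats.std(ddof=1)

# toy data: for the mean of 500 unit-variance samples, the bootstrap
# error should land near the analytic standard error 1/sqrt(500)
samples = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=500)
err = bootstrap_error(samples, np.mean)
```

An estimator constructor accepting a bootstrapping keyword would run a loop like this over its input data instead of (or alongside) its analytic error estimate.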

As for block bootstrapping, we want to include an approach that bootstraps whole blocks of a timeseries instead of individual samples from that timeseries. Blocks should be defined based on the correlation time determined from the timeseries, or alternatively specified by the user. There are at least a few ways to do this, so this may require more debate and development. The advantage of block bootstrapping is that the resulting bootstrapped samples maintain correlation within their blocks, which may be desirable in some cases.
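A minimal sketch of the block idea (the block size is fixed by hand here; in practice it would be derived from the measured correlation time, or user-specified):

```python
import numpy as np

def block_bootstrap_sample(data, block_size, rng):
    """One block-bootstrap resample: concatenate randomly chosen
    contiguous blocks until the original length is reached."""
    n = len(data)
    n_blocks = int(np.ceil(n / block_size))
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    return np.concatenate([data[s:s + block_size] for s in starts])[:n]

rng = np.random.default_rng(42)
ts = np.cumsum(rng.normal(size=1000))  # a correlated series (random walk)
resample = block_bootstrap_sample(ts, block_size=50, rng=rng)
```

Because each block is a contiguous slice of the original timeseries, correlation within blocks survives the resampling.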

@codecov bot commented Jan 20, 2020

Codecov Report

Merging #94 into master will decrease coverage by 2.05%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   97.27%   95.21%   -2.06%     
==========================================
  Files          12       13       +1     
  Lines         696      711      +15     
  Branches      141      143       +2     
==========================================
  Hits          677      677              
- Misses          5       20      +15     
  Partials       14       14
Impacted Files Coverage Δ
src/alchemlyb/preprocessing/bootstrapping.py 0% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed6cb72...e8db134.

@dotsdl (Member, Author) commented Jan 21, 2020

Added a gist on how this works currently. Comments welcome!

@orbeckst (Member)

I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?

For the "blocks" approach: Do I have to think about a trajectory of blocks where each block is the slice information to get the original data, something like slices = [[start0, stop0, step0], [start1, stop1, step1], ..., [startN_1, stopN_1, stepN_1]] so that data[slices] produces N blocks. Then you bootstrap the indices = arange(len(slices)) of slices to reassemble a subset of the original data as data[slices[bootstrapped_indices]]?
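Reading that literally, the slice-based reassembly would be something like the following toy sketch (hand-picked blocks on fake data, just to make the indexing concrete):

```python
import numpy as np

data = np.arange(20.0)
# each block is a (start, stop, step) triple into the original data
slices = [(0, 5, 1), (5, 10, 1), (10, 15, 1), (15, 20, 1)]

rng = np.random.default_rng(0)
# bootstrap the *block indices*, not the individual samples
bootstrapped_indices = rng.choice(np.arange(len(slices)),
                                  size=len(slices), replace=True)

# reassemble a resampled dataset from the chosen blocks
resampled = np.concatenate([data[slice(*slices[i])]
                            for i in bootstrapped_indices])
```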

@mrshirts

I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?

For vanilla bootstrap, yes.

For block bootstrap, no -- though to determine the length of the blocks, you do need to know the correlation length.

@orbeckst (Member)

If I read the gist correctly, "vanilla bootstrap" is performed there, so if the gist is turned into documentation then data decorrelation should probably be included so that copy&paste will at least produce conservative estimates.
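The decorrelate-then-bootstrap pattern could be sketched like this (a hypothetical helper; `g` stands for a given statistical inefficiency, which in alchemlyb would come from the subsampling machinery rather than being passed in by hand):

```python
import numpy as np

def decorrelate_then_bootstrap(series, g, estimator, n_boot=200, seed=0):
    """Subsample every ceil(g) frames (g: statistical inefficiency),
    then vanilla-bootstrap the decorrelated samples."""
    rng = np.random.default_rng(seed)
    decorrelated = series[::int(np.ceil(g))]
    n = len(decorrelated)
    stats = [estimator(decorrelated[rng.integers(0, n, size=n)])
             for _ in range(n_boot)]
    return np.std(stats, ddof=1)

# crude correlated series: each value repeated 5 times, so g ~ 5
series = np.repeat(np.random.default_rng(1).normal(size=200), 5)
err = decorrelate_then_bootstrap(series, 5.0, np.mean)
```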

From the description in the issue above I still don't understand how block bootstraps will be used. A simple example might be helpful.

@dotsdl (Member, Author) commented Jan 24, 2020

@orbeckst thanks! My intent is for the gist to become our actual example for bootstrap usage in the docs, so I am working on adding use of statistical_inefficiency in the loop for the Bootstrapping + estimation section onward.

However, it's looking like this isn't straightforward given the current implementation limitations of our subsamplers. I think it may be due to a combination of bitrot from changes in how pandas handles multi-index resetting and some rather limiting assumptions I made when I wrote the subsamplers way back when. I'm going to proceed by making the subsampling functions work properly with full DataFrames like those we get from alchemtest.gmx.load_expanded_ensemble_case_1; we need this since working around it requires too much painful pandas-fu. That will be a separate PR.

I believe you have the right idea on how block bootstrapping works. From @mrshirts, there are a few variations on it, however. Here is a reference I am working from.

@dotsdl (Member, Author) commented Jan 29, 2020

The gist has been updated; it requires components of #98, which can be played with on this branch. This addresses @orbeckst's comments on decorrelation for when this notebook becomes the first example doc for bootstrapping.

@dotsdl mentioned this pull request Mar 11, 2020
@orbeckst (Member) commented Jul 26, 2022

I think pymbar 4 now contains bootstrapping natively (did I understand this correctly, @mrshirts ?). Is this PR still needed when we move to using pymbar 4.x #207 ?

@mrshirts

GREAT question. If you are bootstrapping a single free energy calculation, or a single expectation, pymbar 4 handles it all, and faster than alchemlyb could by wrapping around MBAR.

If you were doing something like a heat capacity calculation with reweighting to multiple temperatures, or where you do some manipulation of the results of multiple free energy calculations, then you really should bootstrap on the outside of that. Entropy + enthalpy are done internally. But if for whatever reason you were calculating k_1<O_1> - k_2<O_2>, where <O_1> and <O_2> are expectations of different observables from the same simulation and k_1 and k_2 are constants, then you can't bootstrap them separately and add the uncertainties together - you should bootstrap the entire calculation.
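A toy sketch of why this matters: resampling two correlated observables jointly (same bootstrap indices) versus independently gives different uncertainty estimates for k_1<O_1> - k_2<O_2>, and only the joint version respects the correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
O1 = rng.normal(1.0, 0.5, size=2000)
O2 = O1 + rng.normal(0.0, 0.1, size=2000)  # strongly correlated with O1
k1, k2 = 2.0, 1.0

n_boot, n = 500, len(O1)
whole, separate = [], []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    # right way: one resample of the whole calculation
    whole.append(k1 * O1[idx].mean() - k2 * O2[idx].mean())
    # wrong way: independent resamples break the O1/O2 correlation
    idx2 = rng.integers(0, n, size=n)
    separate.append(k1 * O1[idx].mean() - k2 * O2[idx2].mean())
```

With positively correlated observables the separate bootstrap overestimates the spread of the difference, since the cancellation between the two terms is lost.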

If all the functions are just computing a vector f_k from a single set of files that have been subsampled, then the internal bootstrapping should be fine.

@orbeckst (Member) commented Jun 6, 2023

PR #322 addressed #320 so we can now use MBAR-native bootstrap.

I feel that this is sufficient for the moment, especially as TI can get error estimates by other means.

Therefore, I am closing this PR, although of course anyone is welcome to resurrect it and take up work on it again.

@orbeckst orbeckst closed this Jun 6, 2023