WIP: bootstrapping submodule, parameter for estimators #94
Conversation
Should work on both dHdl and u_nk standard forms. Still requires:
- tests
- usage example(s)

We additionally want to add a bootstrapping keyword to each estimator's constructor, which changes how uncertainties are estimated at the expense of more compute. This will be the approach we will push users toward using for simplicity, but we still want to expose building blocks by way of the `bootstrap` function to allow for scaling out if needed (e.g. with dask).
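To make the shape of such a building block concrete, here is a minimal sketch using plain numpy/pandas on a toy dHdl-like frame. The `bootstrap` name, signature, and behavior shown here are illustrative assumptions for discussion, not the actual alchemlyb API:

```python
import numpy as np
import pandas as pd

def bootstrap(df, n_resamples=100, seed=None):
    """Yield resampled copies of ``df``, drawing rows with replacement.

    Illustrative sketch only -- not the actual alchemlyb function.
    """
    rng = np.random.default_rng(seed)
    n = len(df)
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        yield df.iloc[idx]

# usage: estimate the uncertainty of the mean of a toy dHdl-like column
df = pd.DataFrame({"dHdl": np.random.default_rng(0).normal(size=1000)})
means = [s["dHdl"].mean() for s in bootstrap(df, n_resamples=200, seed=42)]
err = np.std(means)  # bootstrap estimate of the standard error of the mean
```

Because each resample is an independent unit of work, a loop like this is straightforward to scale out with dask by mapping the statistic over delayed resamples.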
Codecov Report
```
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   97.27%   95.21%   -2.06%
==========================================
  Files          12       13       +1
  Lines         696      711      +15
  Branches      141      143       +2
==========================================
  Hits          677      677
- Misses          5       20      +15
  Partials       14       14
```
Added a gist on how this works currently. Comments welcome!
I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data? For the "blocks" approach: do I have to think about a trajectory of blocks, where each block is the slice information needed to get the original data?
For vanilla bootstrap, yes. For block bootstrap, no -- though to determine the length of the blocks, you do need to know the correlation length.
If I read the gist correctly, then "vanilla bootstrap" is performed there, so if the gist is turned into documentation then data decorrelation should probably be included so that copy&paste will at least produce conservative estimates. From the description in the issue above I still don't understand how block bootstraps will be used. A simple example might be helpful.
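To make the decorrelate-then-bootstrap order concrete, here is a toy sketch in plain numpy (not alchemlyb code): a correlated AR(1) series is thinned by its statistical inefficiency before the vanilla bootstrap. The `decorrelate` helper and the closed-form inefficiency are assumptions for illustration:

```python
import numpy as np

def decorrelate(series, g):
    """Keep every g-th sample; g is the statistical inefficiency
    (assumed known here -- in practice it is measured from the data)."""
    return series[::max(int(np.ceil(g)), 1)]

# correlated toy data: an AR(1) process with phi = 0.9
rng = np.random.default_rng(1)
x = np.empty(5000)
x[0] = rng.normal()
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + rng.normal()

g = (1 + 0.9) / (1 - 0.9)  # statistical inefficiency of AR(1), ~19
indep = decorrelate(x, g)

# vanilla bootstrap on the (approximately) independent samples
means = [rng.choice(indep, size=len(indep)).mean() for _ in range(200)]
err = np.std(means)
```

Bootstrapping the raw correlated series instead would treat every frame as independent and underestimate `err`, which is exactly the pitfall a copy&paste reader should be steered away from.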
@orbeckst thanks! My intent is for the gist to become our actual example for bootstrap usage in the docs, so I am working on adding use of our subsampling functions there. However, it's looking like this isn't straightforward with the current implementation limitations of our subsamplers. I think it may be due to a combination of bitrot and changes in how they work.

I believe you have the right idea on how block bootstrapping works. From @mrshirts, there are a few variations on it, however. Here is a reference I am working from.
The gist has been updated; it requires components of #98, which can be played with on this branch. This addresses @orbeckst's comments on decorrelation for when this notebook becomes the first example doc for bootstrapping.
GREAT question. If you are bootstrapping a single free energy calculation, or a single expectation, pymbar 4 handles it all, and faster than alchemlyb could by wrapping around MBAR.

If you were doing something like a heat capacity calculation with reweighting across multiple temperatures, or where you do some manipulation of the results of multiple free energy calculations, then you really should bootstrap on the outside of that. Entropy + enthalpy are done internally. But if for whatever reason you were calculating k_1<O_1> - k_2<O_2>, where <O_1> and <O_2> are expectations of different observables from the same simulation and k_1 and k_2 are constants, then you can't bootstrap them separately and add the uncertainties together -- you should bootstrap the entire calculation.

If all the functions are just computing a vector f_k from a single set of files that have been subsampled, then the internal bootstrapping should be fine.
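The "bootstrap the entire calculation" point can be shown with a toy numpy sketch. `O1`, `O2`, `k1`, and `k2` are made-up stand-ins for correlated expectations from the same simulation; resampling the combined expression frame-by-frame gives a very different uncertainty than bootstrapping each expectation separately and adding in quadrature:

```python
import numpy as np

# Toy stand-ins for two observables measured on the same frames
# (hence strongly correlated); all names here are illustrative.
rng = np.random.default_rng(2)
n = 2000
base = rng.normal(size=n)            # shared fluctuation
O1 = base + 0.1 * rng.normal(size=n)
O2 = base + 0.1 * rng.normal(size=n)
k1, k2 = 2.0, 1.5

# Correct: resample frames jointly, recompute the whole expression.
joint = []
for _ in range(300):
    idx = rng.integers(0, n, size=n)
    joint.append(k1 * O1[idx].mean() - k2 * O2[idx].mean())
err_joint = np.std(joint)

# Wrong: bootstrap each expectation separately, add in quadrature.
# This ignores the O1/O2 correlation, which here cancels in the
# difference, so the separate estimate badly overestimates the error.
e1 = np.std([k1 * O1[rng.integers(0, n, size=n)].mean() for _ in range(300)])
e2 = np.std([k2 * O2[rng.integers(0, n, size=n)].mean() for _ in range(300)])
err_separate = np.hypot(e1, e2)
```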
This PR addresses #80.

We have added an `alchemlyb.preprocessing.bootstrapping` submodule. The functions exposed by this module should work on both `dHdl` and `u_nk` standard forms, with a similar usage feel to the `alchemlyb.preprocessing.subsampling` functions. The following are needed to complete this PR:

- tests for the `bootstrapping` submodule
- usage example(s)

In addition to the function(s), we want to add a bootstrapping keyword (or set of keywords) to each estimator's constructor, which changes how uncertainties are estimated internally to the estimator at the expense of more compute. This will be the approach we will push users toward using for simplicity, but we still want to expose the building blocks by way of the `bootstrapping` module functions to allow for scaling out if needed (e.g. with dask).

As for block bootstrapping, we want to include an approach that bootstraps whole blocks of a timeseries instead of individual samples from that timeseries. Blocks should be defined based on the correlation time determined from the timeseries, or alternatively specified by the user. There are at least a few ways to do this, so this may require more debate and development. The advantage of block bootstrapping is that the resulting bootstrapped samples maintain correlation within their blocks, which may be desirable in some cases.
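As a rough illustration of the block-bootstrap idea described above (a sketch for discussion, not the eventual alchemlyb implementation), whole contiguous blocks are drawn with replacement so that correlation within each block is preserved:

```python
import numpy as np

def block_bootstrap(x, block_length, n_resamples=100, seed=None):
    """Yield resamples built from whole contiguous blocks of ``x``
    drawn with replacement. ``block_length`` would normally come
    from the measured correlation time; here it is user-supplied.
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n_blocks = len(x) // block_length
    starts = np.arange(n_blocks) * block_length
    for _ in range(n_resamples):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        idx = np.concatenate(
            [np.arange(s, s + block_length) for s in chosen]
        )
        yield x[idx]

# usage on a mildly correlated toy series
rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(size=1000)) * 0.01 + rng.normal(size=1000)
errs = np.std(
    [b.mean() for b in block_bootstrap(x, block_length=50,
                                       n_resamples=200, seed=7)]
)
```

This is the simplest fixed-length variant; the variations mentioned above (e.g. overlapping or randomized block lengths) differ mainly in how `starts` and `block_length` are chosen.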