Adding seed for reproducibility and sampling methods #344
base: 1.7.0
Conversation
Merge remote-tracking branch 'upstream/1.7.0' into 1.7.0
Before merging this I'd like to add some reproducibility tests for graphconv models and the other automatically added deepchem models. I'd also like to set seeds for models that are tested so we can avoid tests failing because a model didn't do as well as expected.
…r and updated the test_split to use a fixed seed
…ds in each run is the same
…a and KFoldClassificationPerfData. Updated the parameters and documentation to match behavior. New tests for SimpleClassificationPerfData and SimpleRegressionPerfData
… to see that there is more of the major class
Ok, I think that this PR is ready to be reviewed again.
I think it all makes sense, but I have several questions in-line that might be worth addressing before merging.
Overall, this was a huge PR, so in the future we should try to break things up, such as seeds in one PR and sampling methods in another, maybe k-fold handling as a third. `git cherry-pick` is really useful for moving specific commits from one branch to another.
The testing suite looks extensive (huge). I guess this is good, although it might be redundant. Is there a way we can reduce the integrative tests and increase the unit tests to make it simpler?
@@ -655,11 +661,31 @@ def combined_training_data(self):
    # All of the splits have the same combined train/valid data, regardless of whether we're using
    # k-fold or train/valid/test splitting.
    if self.combined_train_valid_data is None:
        # normally combining one fold is sufficient, but if SMOTE or undersampling is being used
        # just combining the first fold isn't enough
For undersampling, it looks like it assumes that K-fold undersampling would sample the entire non-test dataset. What if this isn't the case? Is this assumption ensured elsewhere?
I don't think a compound can ever be wiped entirely out of existence due to undersampling. Undersampling is only applied to the training set of each fold.
And since every compound has a 'turn' in the validation set, that compound must appear at least once.
This isn't tested, but do we need to test it anywhere? I think it's ok if a compound is dropped entirely, since that's what happens when using undersampling without k-fold validation.
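If we did want a test, a minimal sketch of the kind of check described here could look like the following, assuming `train_valid_dsets` is a list of `(train, valid)` dataset pairs that expose an `ids` attribute (names are illustrative, not the actual AMPL test code):

```python
def all_compounds_present(original_ids, train_valid_dsets):
    """Check that every original compound id still appears in at least one
    fold's (possibly undersampled) training set or its validation set."""
    seen = set()
    for train, valid in train_valid_dsets:
        seen.update(train.ids)
        seen.update(valid.ids)
    return set(original_ids).issubset(seen)
```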
self.data.save_split_dataset()
# write split metadata
self.create_split_metadata()
Does the random seed get saved into model metadata? If you wanted to retrain a model 10 times to see how variable the predictions are, would you end up training a model with the same random seed or 10 different ones?
Good point, I will change that.
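A hedged sketch of the suggested behavior: record whatever seed was actually used in the saved metadata, but draw a fresh one on each new training run unless the user pins it explicitly (function and key names here are hypothetical, not AMPL APIs):

```python
import uuid

def resolve_seed(user_seed=None):
    """Reuse an explicitly supplied seed; otherwise generate a new one so
    repeated retraining runs are independent unless deliberately pinned."""
    if user_seed is not None:
        return user_seed
    return uuid.uuid4().int % (2**32)

# The resolved seed would then be recorded, e.g.:
# model_metadata['seed'] = resolve_seed(params.seed)   # hypothetical keys
```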
@@ -697,7 +723,8 @@ def get_subset_responses_and_weights(self, subset, transformers):
    """
    if subset not in self.subset_response_dict:
        if subset in ('train', 'valid', 'train_valid'):
            dataset = self.combined_training_data()
            for fold, (train, valid) in enumerate(self.train_valid_dsets):
If you are just looking for new compounds in each fold, can you just concatenate all train_valid_dsets and then call set(ids) or drop_duplicates() or something? Might make the code more efficient than multiple for loops, but I'm not sure if it is actually easier based on the way the datasets are stored.
These datasets are NumpyDatasets and contain an X matrix, y matrix, w matrix, and ids. I'd have to put them into a data frame, call drop_duplicates, and then put it back into a NumpyDataset.
However, I think I can get rid of that loop on 726. That's not necessary.
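For illustration, one way the concatenate-and-deduplicate idea could look with DeepChem `NumpyDataset` objects (a sketch, not the actual AMPL implementation):

```python
import numpy as np
from deepchem.data import NumpyDataset

def merge_unique(datasets):
    """Concatenate NumpyDatasets and keep the first occurrence of each id."""
    X = np.concatenate([d.X for d in datasets])
    y = np.concatenate([d.y for d in datasets])
    w = np.concatenate([d.w for d in datasets])
    ids = np.concatenate([d.ids for d in datasets])
    # np.unique returns the index of the first occurrence of each id
    _, first_idx = np.unique(ids, return_index=True)
    first_idx = np.sort(first_idx)
    return NumpyDataset(X[first_idx], y[first_idx], w[first_idx], ids[first_idx])
```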
# adjust weights and ids
resampled_indices = undersampler.sample_indices_
resampled_weights = train.w[resampled_indices]
Shouldn't the weights all be 1 if undersampling evens out the class counts?
# set the new weights equal to 1
average_weight = 1  # np.mean(train.w)
synthetic_weights = np.full((num_synthetic, 1), average_weight, dtype=np.float64)
resampled_weights = np.concatenate([train.w, synthetic_weights])
I think the weights need to be fully recalculated according to the weighting strategy specified. If you have 10% class 1 and 90% class 0, then SMOTE the minority class up to a 30%/70% split, and you're using balanced class weights, you need to recompute all of the weights from the new 0.3/0.7 distribution instead of keeping the old 0.1/0.9 weights and giving the synthetic samples a weight of 1.
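A sketch of the recalculation being suggested, using scikit-learn's `compute_sample_weight` with the `'balanced'` strategy after SMOTE (illustrative only; AMPL's own weight transformer may handle this differently):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = np.array([0] * 90 + [1] * 10)              # 90% class 0, 10% class 1

# Oversample the minority class to roughly a 70/30 split, as in the example above.
X_res, y_res = SMOTE(sampling_strategy=0.43, random_state=0).fit_resample(X, y)

# Recompute every weight from the post-SMOTE class frequencies rather than
# keeping the old 0.1/0.9 weights and assigning 1 to the synthetic rows.
w_res = compute_sample_weight('balanced', y_res)
```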
@@ -0,0 +1,15 @@
{"verbose": "True",
Are the seed_test JSON files supposed to pass a seed in?
I believe these tests train a model using a JSON file without a seed, then retrain it with the same seed and compare the outcomes.
Yeah, the model is trained without a seed using a JSON file and then the seed is saved to the file. It is then retrained with the same parameters and seed to ensure the output is identical.
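A schematic of that flow, with hypothetical helper names standing in for the real pipeline calls in `seed_test_models.py`:

```python
import json

def check_reproducibility(params_file, train_model, read_seed_from_metadata):
    """Train without a seed, recover the generated seed from the saved
    metadata, retrain with it, and compare predictions.
    train_model and read_seed_from_metadata are stand-ins, not real AMPL APIs."""
    with open(params_file) as f:
        params = json.load(f)                  # JSON contains no 'seed' key

    preds_first, metadata_path = train_model(params)
    params['seed'] = read_seed_from_metadata(metadata_path)

    preds_second, _ = train_model(params)      # same parameters, pinned seed
    assert (preds_first == preds_second).all(), "model is not reproducible"
```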
… Currently default is to discard the seed
…fault value is False
Hi, this pull request introduces two new features to AMPL:

Users now have the option to use SMOTE or RandomUnderSampler from imbalanced-learn on their classification datasets. The module that implements sampling was created under `atomsci/ddm/pipeline/sampling.py`. It takes the input training dataset, applies the sampling strategy, updates the IDs and weights, and returns the resampled dataset. The module is integrated into `model_pipeline.py` under `load_featurized_data` and only runs if the model is a classification model and a sampling method is defined in the parameters dictionary (default is None).
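For readers unfamiliar with imbalanced-learn, a minimal sketch of what applying it to a DeepChem `NumpyDataset` could look like (function name and weight handling are illustrative, not the exact `sampling.py` API):

```python
import numpy as np
from deepchem.data import NumpyDataset
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def apply_sampling(train, method='SMOTE', random_state=None):
    """Resample a single-task classification training set and rebuild ids/weights."""
    sampler = (SMOTE(random_state=random_state) if method == 'SMOTE'
               else RandomUnderSampler(random_state=random_state))
    X_res, y_res = sampler.fit_resample(train.X, train.y.ravel())

    if method == 'SMOTE':
        # imbalanced-learn appends synthetic rows after the originals,
        # so new ids are needed for the generated samples.
        n_new = len(y_res) - len(train.y)
        ids = np.concatenate([train.ids,
                              [f'synthetic_{i}' for i in range(n_new)]])
        w = np.ones((len(y_res), 1))   # weight handling simplified; see discussion above
    else:
        idx = sampler.sample_indices_  # surviving rows keep their ids and weights
        ids = train.ids[idx]
        w = train.w[idx]
    return NumpyDataset(X_res, y_res.reshape(-1, 1), w, ids)
```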
Because the sampling strategy is applied to each fold in k-fold CV splitting, Stewart updated `combined_training_data` in `model_datasets.py` to incorporate the data from every fold (rather than relying on the prior assumption that the same data is shared across each fold's training and validation datasets).
Tests were created under `atomsci/ddm/test/integrative/sampling_test`. The suite is rather exhaustive, testing each model type (RF, NN, XGBoost) with each sampling method. It is based on the `test_kfold_split.py` test.
Each model trained now has a seed associated with it, so it can be reproduced given the seed and the exact same hyperparameters and data. The seed is created by a new script, `random_seed.py`, under `atomsci/ddm/pipeline`, which sets a random seed and creates a random number generator when called. The script parses the seed from the parameter parser Namespace object and, if it is None, generates a seed using the uuid library. It sets the seed for NumPy, TensorFlow, random, and PyTorch. The module is imported into `model_pipeline.py` and a seed is initialized in `__init__`. It also accepts a seed and random state as input, in case a user initializes ModelPipeline from a different function and wants to reuse an existing seed and random state. The seed is passed into the DeepChem models and the splitting modules.
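A rough sketch of the seed-setting logic as described (paraphrased, not copied from `random_seed.py`):

```python
import random
import uuid

import numpy as np
import tensorflow as tf
import torch

def init_random_state(seed=None):
    """Use the given seed or derive one from a UUID, seed all libraries,
    and return the seed along with a NumPy random generator."""
    if seed is None:
        seed = uuid.uuid4().int % (2**32)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    return seed, np.random.default_rng(seed)
```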
In `model_pipeline.py` I also added two new functions: `create_split_metadata` and `save_split_metadata`. These functions create a split_metadata.json for each split dataset and save it under the output directory. It records the seed and the parameters so that the exact split is reproducible.
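A sketch of what writing such a metadata file might look like (the field names are illustrative; the real `create_split_metadata`/`save_split_metadata` functions may record more):

```python
import json
import os

def save_split_metadata(output_dir, seed, split_params):
    """Write the seed and the splitting parameters so the split can be redone."""
    metadata = {'seed': seed, 'split_params': split_params}
    with open(os.path.join(output_dir, 'split_metadata.json'), 'w') as f:
        json.dump(metadata, f, indent=2)
```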
There are two types of tests saved under `atomsci/ddm/test/integrative/seed_test/`. The first type tests the reproducibility of split datasets and is executed via `seed_test_splitting.py`; it covers each split type (scaffold, random, fingerprint) and split strategy (train_valid_test, k_fold_cv). The second type tests the reproducibility of trained models and is executed via `seed_test_models.py`; it covers the recreation of each model type (RF, NN, XGBoost) and prediction type (regression, classification).

I've run all the tests and used the ruff linter to check for problems, and everything currently works on my end.