-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validation data generation, removal of loss function overrides #234
base: master
Are you sure you want to change the base?
Conversation
…t target instead.
Hi @ppdebreuck I have a question regarding this comment: Do you think it would make sense to enable the passing of a loss function for classification tasks in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kaueltzen ! Looks good, there just might be something off in the the stratified split, please let me know if you agree.
modnet/models/vanilla.py
Outdated
@@ -1534,3 +1549,46 @@ def validate_model( | |||
|
|||
def map_validate_model(kwargs): | |||
return validate_model(**kwargs) | |||
|
|||
|
|||
def generate_shuffled_and_stratified_val_split( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to have this function under modnet.utils
and import it as from modnet.utils import generate_shuffled_and_stratified_val_split
. This will avoid having it in __all__
modnet/models/vanilla.py
Outdated
if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray): | ||
ycv = np.argmax(y[0], axis=1) | ||
else: | ||
ycv = y[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is most likely wrong. It's picking the first row, but we need the first column: ycv = y[:,0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ppdebreuck thanks for pointing that out!
It was written for MODNetModel.fit()
to handle a list of props from targets_groups
, but outside of vanilla.py
/ generate_shuffled_and_stratified_val_data
the targets are indeed not correctly handled, I will change it.
modnet/models/vanilla.py
Outdated
else: | ||
ycv = y[0] | ||
return train_test_split( | ||
range(len(y[0])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can simply become range(len(ycv))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: let's do range(len(y))
, so it's compatible with my last comment.
modnet/models/vanilla.py
Outdated
""" | ||
if classification: | ||
if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray): | ||
ycv = np.argmax(y[0], axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stratifying a multilabel case is a bit tricky, and probably you don't need it ? So we can skip it: ycv=None
) | ||
return ( | ||
x[train_idx], | ||
[t[train_idx] for t in y], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem correct ? y[train_idx]
should work, right?. It would pick the wrong columns now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The y
of MODNetModel.fit()
and of generate_shuffled_and_stratified_val_data
has the first two dimensions (n_target_groups, n_samples) while the y
of generate_shuffled_and_stratified_val_split
that is used elsewhere (data.df_targets.values
) has the first 2 dimensions (n_samples, n_targets).
No problem with me, as you can put ROC_AUC as default metric to keep current behavior, while adding flexibility if you need to change it :) (I would put it in a separate PR) |
…rected axes of y in generate_shuffled_and_stratified_val_split and generate_shuffled_and_stratified_val_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the modifications @kaueltzen, I think this can be merged ?
Resolves #228 .
Overview
train_test_split
with explicit generation of validation data that is reproducible, shuffled and stratified (if classification tasks are present)fit
ofMODNetModel