
Validation data generation, removal of loss function overrides #234

Open · wants to merge 10 commits into base: master
Conversation

@kaueltzen (Contributor) commented on Oct 28, 2024

Resolves #228.

Overview

  • Replaces `train_test_split` with explicit generation of validation data that is reproducible, shuffled, and stratified when classification tasks are present (see the sketch below)
  • Removes the override of the loss function for classification tasks in `MODNetModel.fit`
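
A minimal sketch of the approach (illustrative name and signature, not the PR's exact code), assuming scikit-learn's `train_test_split` underneath:

```python
from sklearn.model_selection import train_test_split

def make_val_split(y, val_fraction=0.1, classification=False, random_state=42):
    """Return (train_idx, val_idx) for targets y of shape (n_samples, n_targets).

    The split is shuffled and reproducible; when classification targets are
    present, it is stratified on the first target column.
    """
    stratify = y[:, 0] if classification else None
    return train_test_split(
        range(len(y)),
        test_size=val_fraction,
        shuffle=True,
        stratify=stratify,
        random_state=random_state,
    )
```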

@kaueltzen kaueltzen changed the title [WIP] Validation data generation, removal of loss function overrides Validation data generation, removal of loss function overrides Nov 19, 2024
@kaueltzen kaueltzen marked this pull request as ready for review November 19, 2024 13:52
@kaueltzen (Contributor, Author) commented:
Hi @ppdebreuck I have a question regarding this comment:
#228 (comment)

Do you think it would make sense to enable passing a loss function for classification tasks in `MODNetModel.evaluate`?
Or would you prefer keeping -ROC-AUC (e.g., because it is easier to interpret than cross-entropy losses)?

@ppdebreuck (Owner) left a comment:


Thanks @kaueltzen! Looks good; there just might be something off in the stratified split, please let me know if you agree.

```diff
@@ -1534,3 +1549,46 @@ def validate_model(

 def map_validate_model(kwargs):
     return validate_model(**kwargs)


+def generate_shuffled_and_stratified_val_split(
```
@ppdebreuck (Owner) commented:


I prefer to have this function under `modnet.utils` and import it as `from modnet.utils import generate_shuffled_and_stratified_val_split`. This will avoid having it in `__all__`.

```python
if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray):
    ycv = np.argmax(y[0], axis=1)
else:
    ycv = y[0]
```
@ppdebreuck (Owner) commented:


This is most likely wrong. It's picking the first row, but we need the first column: `ycv = y[:, 0]`
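
A toy illustration of the distinction (not code from the PR):

```python
import numpy as np

# y in the (n_samples, n_targets) convention: rows are samples, columns are targets.
y = np.array([[0, 1.2],
              [1, 3.4],
              [0, 5.6]])

print(y[0])     # first row: all targets of one sample  -> [0.  1.2]
print(y[:, 0])  # first column: first target of every sample -> [0. 1. 0.]
```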

@kaueltzen (Contributor, Author) replied:

Hi @ppdebreuck, thanks for pointing that out!
It was written for `MODNetModel.fit()` to handle a list of props from `targets_groups`, but outside of `vanilla.py` / `generate_shuffled_and_stratified_val_data` the targets are indeed not handled correctly; I will change it.

```python
    else:
        ycv = y[0]
    return train_test_split(
        range(len(y[0])),
```
@ppdebreuck (Owner) commented:

This can simply become `range(len(ycv))`.

@ppdebreuck (Owner) commented:

Edit: let's do `range(len(y))`, so it's compatible with my last comment.

"""
if classification:
if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray):
ycv = np.argmax(y[0], axis=1)
@ppdebreuck (Owner) commented:

Stratifying a multilabel case is a bit tricky, and you probably don't need it? So we can skip it: `ycv = None`
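
For context, a toy illustration (not code from the PR) of why stratification is skipped for multilabel targets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy one-hot / multilabel targets, shape (n_samples, n_classes).
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 1]])

# Stratifying on full label rows would treat every distinct combination as its
# own class, which easily leaves a "class" with a single member ([1, 1] above),
# so stratification is simply skipped:
ycv = None
train_idx, val_idx = train_test_split(
    range(len(y)), test_size=0.2, shuffle=True, stratify=ycv, random_state=42
)
```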

```python
    )
    return (
        x[train_idx],
        [t[train_idx] for t in y],
```
@ppdebreuck (Owner) commented:

This doesn't seem correct? `y[train_idx]` should work, right? As written, it would pick the wrong columns.

@kaueltzen (Contributor, Author) replied:

The `y` of `MODNetModel.fit()` and of `generate_shuffled_and_stratified_val_data` has its first two dimensions as (n_target_groups, n_samples), while the `y` that `generate_shuffled_and_stratified_val_split` receives elsewhere (`data.df_targets.values`) has its first two dimensions as (n_samples, n_targets). A toy contrast of the two conventions is sketched below.
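
```python
import numpy as np

train_idx = np.array([0, 2])

# Convention in MODNetModel.fit() / generate_shuffled_and_stratified_val_data:
# a list of per-target-group arrays, i.e. first dimensions (n_target_groups, n_samples).
y_grouped = [np.array([0, 1, 0, 1]), np.array([1.2, 3.4, 5.6, 7.8])]
y_grouped_train = [t[train_idx] for t in y_grouped]  # index samples within each group

# Convention elsewhere (data.df_targets.values): shape (n_samples, n_targets),
# so plain row indexing selects the samples directly.
y_table = np.array([[0, 1.2], [1, 3.4], [0, 5.6], [1, 7.8]])
y_table_train = y_table[train_idx]
```

(Illustrative data only; the arrays are not from the PR.)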

@ppdebreuck (Owner) commented:
> Hi @ppdebreuck I have a question regarding this comment: #228 (comment)
>
> Do you think it would make sense to enable passing a loss function for classification tasks in `MODNetModel.evaluate`? Or would you prefer keeping -ROC-AUC (e.g., because it is easier to interpret than cross-entropy losses)?

No problem with me: you can keep ROC-AUC as the default metric to preserve the current behavior, while adding the flexibility to change it :) (I would put it in a separate PR.)
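
A minimal sketch of that flexibility (hypothetical function name and signature, not code from this PR):

```python
from sklearn.metrics import log_loss, roc_auc_score

def evaluate_classification(y_true, y_prob, loss_fn=None):
    """Default to -ROC-AUC (current behavior); allow opting into another loss."""
    if loss_fn is None:
        return -roc_auc_score(y_true, y_prob)  # negated so that lower is better
    return loss_fn(y_true, y_prob)

# evaluate_classification(y_true, y_prob)                     # default: -ROC-AUC
# evaluate_classification(y_true, y_prob, loss_fn=log_loss)   # cross entropy instead
```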

kaueltzen and others added 2 commits December 18, 2024 17:34
…rected axes of y in generate_shuffled_and_stratified_val_split and generate_shuffled_and_stratified_val_data
@ppdebreuck (Owner) left a comment:


Thanks for the modifications @kaueltzen, I think this can be merged?

Successfully merging this pull request may close issue #228: Non-stratified splitting and overwriting of loss function in classification tasks.