Can we exclude certain data and labels based on a condition? #28

Open
katerinakarampasi opened this issue May 16, 2018 · 9 comments
Labels: answered (the question has been answered), question (further information is requested)

Comments

@katerinakarampasi

Based on the instructions, my understanding is that we have to provide two basic classes, FeatureExtractor and Classifier. I would like to access the whole data set and exclude some samples, which means I would also have to exclude their corresponding labels. I can exclude the data based on a condition each time the FeatureExtractor is called, but I can't do the same for the labels through it. So my question is whether we can execute all of these commands before FeatureExtractor is called (that would solve my problem) or not.

@kegl
Contributor

kegl commented May 16, 2018

You can remove data at training time (in fit) but not at transform/predict time. Providing labels to the FeatureExtractor at transform time would leak the labels of the test data. If you want to leave some points out of the training, you can do it in the fit function of the classifier.
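As a rough illustration of the point above, here is a minimal sketch of a classifier that drops rows only inside fit. The class name `FilteringClassifier`, the synthetic data, and the convention that the last column holds a 0/1 quality flag are all assumptions made for this example; they are not part of the starting kit.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression


class FilteringClassifier(BaseEstimator):
    """Hypothetical classifier that ignores flagged rows during training only."""

    def __init__(self):
        self.clf = LogisticRegression()

    def fit(self, X, y):
        # Assumption for this sketch: the last column of X is a 0/1 quality flag.
        keep = X[:, -1] == 1
        # The same boolean mask is applied to X and y, so they stay aligned.
        self.clf.fit(X[keep], y[keep])
        return self

    def predict(self, X):
        # No filtering here: every input row must receive a prediction.
        return self.clf.predict(X)


# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
features = rng.randn(40, 3)
flag = (rng.rand(40) > 0.25).astype(float)
X = np.column_stack([features, flag])
y = (features[:, 0] > 0).astype(int)

model = FilteringClassifier().fit(X, y)
predictions = model.predict(X)
```

Because the mask is built and applied inside fit, where y is available, no label information is needed at predict time, and predict still returns one value per input row.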

@katerinakarampasi
Author

Ok thank you.
I don't know if I should open a new topic, but what exactly is the quality check that we are provided with for the fMRI and the anatomy data?

@glemaitre
Contributor

The quality check was done manually. Basically, it consisted of a visual inspection of the pre-processing steps (registration, segmentation) and an inspection of the motion parameters.

@glemaitre added the "question" and "answered" labels on May 17, 2018
@katerinakarampasi
Author

Ok thank you.

@zh1peng

zh1peng commented May 22, 2018

Hi,
How to remove bad data in the FeatureExtractor or the Classifier still confuses me. Sorry, this may be a very basic question, but it has confused me for a few days.
I tried to impute bad data during feature extraction, but that seemed to make the model worse. If I understand correctly, the FeatureExtractor is supposed to return only new_X rather than both new_X and new_y, so it is hard to remove bad samples at this stage.

But when I put this step in the fit method of the Classifier, I used

def fit(self, X, y):
    # some_good_idx is a placeholder for the selection of good samples
    X_new = X[some_good_idx]
    y_new = y[some_good_idx]
    self.clf.fit(X_new, y_new)
    return self

def predict(self, X):
    return self.clf.predict(X)

def predict_proba(self, X):
    return self.clf.predict_proba(X)

It crashed when running the CV evaluation with the error `X has a different shape than during fitting`.

@kegl
Contributor

kegl commented May 22, 2018

Can you submit it? I can look at the trace there.

@glemaitre
Contributor

Modifying the starting kit, it should look something like this.

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # get only the anatomical information
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        return X 


from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

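    def fit(self, X, y):
        # boolean mask: keep only the rows where the QC column equals 1
        X_select = X['anatomy_select'] == 1
        # apply the same mask to the features and to the labels
        self.clf.fit(X[X_select], y[X_select.values])
        return self
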
    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

@glemaitre
Contributor

I tried it and it works locally with both cross_validate and ramp_test_submission.
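For readers who want to try this without the challenge data, the following sketch runs the two classes from the comment above through a plain cross-validation loop on synthetic data. It is only a stand-in and not the RAMP harness: the column names other than `anatomy_select`, the QC value 2 used for excluded rows, and the manual `StratifiedKFold` loop are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # keep only the columns whose name starts with 'anatomy'
        return X_df[[col for col in X_df.columns if col.startswith('anatomy')]]


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        # rows are filtered at training time only
        X_select = X['anatomy_select'] == 1
        self.clf.fit(X[X_select], y[X_select.values])
        return self

    def predict_proba(self, X):
        return self.clf.predict_proba(X)


# Synthetic stand-in for the challenge data (hypothetical column names).
n = 60
rng = np.random.RandomState(0)
X_df = pd.DataFrame({
    'anatomy_a': rng.randn(n),
    'anatomy_b': rng.randn(n),
    # every fifth row gets a QC value other than 1 and is skipped in fit
    'anatomy_select': np.where(np.arange(n) % 5 == 0, 2, 1),
    'fmri_x': rng.randn(n),
})
y = np.arange(n) % 2

# Out-of-fold probabilities: one row per sample, including filtered ones.
oof_proba = np.zeros((n, 2))
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_df, y):
    fe = FeatureExtractor().fit(X_df.iloc[train_idx], y[train_idx])
    clf = Classifier().fit(fe.transform(X_df.iloc[train_idx]), y[train_idx])
    oof_proba[test_idx] = clf.predict_proba(fe.transform(X_df.iloc[test_idx]))
```

Note that the test folds are never filtered, so rows that failed the quality check still receive a prediction; only the training step skips them.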

@zh1peng

zh1peng commented May 22, 2018

Thank you, guys. I have tested the modified anatomy code and it works.
So I will double-check my own code to see if I can figure out what went wrong.

I think the error was caused by my trying to exclude the QC columns (i.e. anatomy_select) in the fit. It should be fine to keep that column, since after filtering its values will all be ones and it will be removed by feature selection.
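The original failing code is not shown in the thread, so the following is only one plausible way such a shape error can arise, together with a possible alternative to keeping the column: dropping it identically in every method. The class name `ConsistentClassifier`, the helper `_features`, and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

QC_COLUMN = 'anatomy_select'


class ConsistentClassifier(BaseEstimator):
    """Hypothetical classifier that drops the QC column in every method."""

    def __init__(self):
        self.clf = LogisticRegression()

    def _features(self, X):
        # Same column handling for fit, predict and predict_proba.
        return X.drop(columns=[QC_COLUMN])

    def fit(self, X, y):
        # Rows are filtered at training time only.
        keep = (X[QC_COLUMN] == 1).values
        self.clf.fit(self._features(X)[keep], y[keep])
        return self

    def predict(self, X):
        return self.clf.predict(self._features(X))

    def predict_proba(self, X):
        return self.clf.predict_proba(self._features(X))


# Synthetic data, for illustration only.
n = 40
rng = np.random.RandomState(0)
X = pd.DataFrame({
    'anatomy_a': rng.randn(n),
    'anatomy_b': rng.randn(n),
    QC_COLUMN: np.where(np.arange(n) % 5 == 0, 2, 1),
})
y = np.arange(n) % 2

# Dropping the column for fit but not for predict changes the number of
# features between the two calls, which scikit-learn rejects.
try:
    naive = LogisticRegression().fit(X.drop(columns=[QC_COLUMN]), y)
    naive.predict(X)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

model = ConsistentClassifier().fit(X, y)
predictions = model.predict(X)
```

Either approach keeps the feature set identical between fit and predict; keeping the column, as suggested above, is the smaller change to the starting kit.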

4 participants