fix for the zeros / NA discrepancy #275
Fix normalization from 'per protein' to 'per sample'.
LGTM, some probably dumb questions asked (I don't know this code base at all ;-))
""" | ||
    square_sum_per_row = array.pow(2).sum(axis=1, skipna=True)

    l2_norms = np.sqrt(square_sum_per_row)
You could check how https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html works with NaNs.
Couldn't get it to work properly; I don't think it works well with NaNs.
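For reference, a minimal sketch of the difference discussed here, on a made-up two-row frame (the data and variable names are illustrative, not taken from the code base). `np.linalg.norm` propagates NaN, whereas the pandas approach from the diff skips missing entries:

```python
import numpy as np
import pandas as pd

# Toy data: the second row has one missing value.
df = pd.DataFrame({"a": [3.0, 1.0], "b": [4.0, np.nan]})

# np.linalg.norm has no NaN handling: any NaN in a row makes that row's norm NaN.
norm_linalg = np.linalg.norm(df.to_numpy(), axis=1)

# The approach from the diff: square, sum per row while skipping NaN, then sqrt.
l2_norms = np.sqrt(df.pow(2).sum(axis=1, skipna=True))

# A NumPy-only equivalent that also ignores NaN.
norm_nansum = np.sqrt(np.nansum(df.to_numpy() ** 2, axis=1))
```

Note that with `skipna=True` a row consisting only of NaN sums to 0, so its norm is 0 rather than NaN; whether that is desirable depends on how the norms are used afterwards.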
n_jobs=2,
random_state=0,
verbose=0,  # random forest takes a while print progress
imp = sklearn.ensemble.HistGradientBoostingRegressor(
Is this the same as RandomForest? Maybe introduce this as another method?
It's also based on multiple trees, so I don't know which name would be right. I could call it gradient boosting, which would be more correct, but that name may be less familiar to non-technical users.
@@ -30,45 +31,58 @@ def preprocess_print_info(self):
        print(pd.DataFrame(self.preprocessing_info.items()))

    def _remove_na_values(self, cut_off):
        if (
            self.preprocessing_info.get("Missing values were removed")
I don't know much about this preprocessing_info, but it seems to store information under human-readable keys.
This could lead to problems, say, if I access something as self.preprocessing_info.get("Missing values were removed.") (did you spot the trailing dot? ;-)).
I would suggest introducing a set of string constants somewhere, e.g. MISSING_VALUES_REMOVED = "Missing values were removed", and accessing this store exclusively through them (not now, just for the future).
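A minimal sketch of that suggestion, using a plain dict as a stand-in for preprocessing_info (the class name `PreprocessingKeys` is hypothetical, not something that exists in the code base):

```python
class PreprocessingKeys:
    """Hypothetical container for the keys of the preprocessing_info store."""

    MISSING_VALUES_REMOVED = "Missing values were removed"


# Stand-in for self.preprocessing_info.
preprocessing_info = {PreprocessingKeys.MISSING_VALUES_REMOVED: True}

# A literal key with a stray trailing dot fails silently: .get() just returns None.
typo_lookup = preprocessing_info.get("Missing values were removed.")

# Going through the constant cannot drift from the stored key; a misspelled
# attribute name would raise AttributeError instead of returning None.
constant_lookup = preprocessing_info.get(PreprocessingKeys.MISSING_VALUES_REMOVED)
```

The point is that a typo in a constant name fails loudly, while a typo in a string literal silently changes the branch taken.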
Some of the tools whose output is used in alphapeptstats encode missing data as zero, while others use NA. Previously these were not unified, so the preprocessing in _remove_na_values was wrong. The other changes let the downstream code work with NAs, plus style fixes and tests.
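As an illustration of why the unification matters, here is a sketch on a toy intensity matrix (the data and the use of `DataFrame.replace` are assumptions for illustration, not necessarily how the PR implements it). Counting missing values before converting zeros to NaN undercounts them, which would skew any cut-off based filtering:

```python
import numpy as np
import pandas as pd

# Toy matrix: rows are samples, columns are proteins; it mixes 0 and NaN as "missing".
mat = pd.DataFrame(
    {"P1": [0.0, 2.0, 3.0], "P2": [np.nan, 0.0, 1.5]},
    index=["s1", "s2", "s3"],
)

# Without unification, zeros are not counted as missing.
missing_before = mat.isna().sum()

# Treat zeros as missing so that a single representation (NaN) is used throughout.
unified = mat.replace(0, np.nan)
missing_after = unified.isna().sum()

# Fraction of missing values per protein, the kind of quantity a cut-off would act on.
missing_fraction = unified.isna().mean(axis=0)
```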