fix for the zeros / NA discrepancy #275

Merged

boopthesnoot merged 6 commits into main from na_bugfix

May 22, 2024

Collaborator

boopthesnoot commented May 17, 2024

Some of the tools whose output alphapeptstats consumes use zero to mean missing data, while others use NA. Previously the two conventions were not unified, so the preprocessing in _remove_na_values was wrong. The other changes let the downstream code work with NAs, and cover style fixes and tests.
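To illustrate the discrepancy described above, here is a minimal sketch, not the code from this PR. The helper names `unify_missing` and `drop_sparse_columns`, the toy intensity table, and the exact meaning given to `cut_off` (maximum allowed fraction of missing values per column) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def unify_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: treat zeros as missing so every input uses NaN."""
    return df.replace(0, np.nan)


def drop_sparse_columns(df: pd.DataFrame, cut_off: float) -> pd.DataFrame:
    """Hypothetical helper: keep columns whose missing fraction is <= cut_off."""
    missing_fraction = df.isna().mean(axis=0)
    return df.loc[:, missing_fraction <= cut_off]


# Toy data: P1 encodes missing values as 0, P2 encodes them as NaN.
intensities = pd.DataFrame(
    {
        "P1": [10.0, 0.0, 12.0, 0.0],
        "P2": [5.0, np.nan, 6.0, 7.0],
        "P3": [1.0, 2.0, 3.0, 4.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Without unification the zeros in P1 are counted as observed values,
# so P1 survives the filter even though half of it is missing.
filtered_raw = drop_sparse_columns(intensities, cut_off=0.3)

# After unification P1 is recognised as 50% missing and is dropped.
unified = unify_missing(intensities)
filtered = drop_sparse_columns(unified, cut_off=0.3)
```

In this sketch the filter only gives the intended result once zeros and NaN are mapped to a single representation, which is the kind of mismatch the description refers to.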

Mikhail Lebedev and others added 4 commits

April 12, 2024 13:48


          NA bug fix + consequential logic changes

d503c25


          Fix normalization from 'per protein' to 'per sample'.

1bdfbde


          Merge pull request #274 from ibludau/na_bugfix

18ff81b

Fix normalization from 'per protein' to 'per sample'.


          CHORE: ruff

1c64535

boopthesnoot requested a review from mschwoer

May 17, 2024 11:19

boopthesnoot self-assigned this


          FIX: rewritten problematic randomforest imputer that couldn't work with NAs xd

387867e

mschwoer approved these changes

Contributor

mschwoer left a comment

LGTM, some probably dumb questions asked (I don't know this code base at all ;-))

alphastats/DataSet_Preprocess.py (outdated, resolved)

alphastats/DataSet_Preprocess.py

+                      """
+                      square_sum_per_row = array.pow(2).sum(axis=1, skipna=True)
+                      l2_norms = np.sqrt(square_sum_per_row)

Contributor

mschwoer May 17, 2024

you could check how https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html works with NaNs

Collaborator Author

boopthesnoot May 22, 2024

Couldn't get it to work properly, I don't think it works well with NaNs
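A small sketch of the behaviour discussed in this thread, using made-up data: `numpy.linalg.norm` propagates NaN, whereas the pandas expression from the diff skips missing entries. The toy table and the `np.nansum` variant are additions for illustration and are not part of the PR.

```python
import numpy as np
import pandas as pd

# Toy data: rows are samples, columns are proteins; one value is missing.
data = pd.DataFrame(
    [[3.0, np.nan, 4.0], [1.0, 2.0, 2.0]],
    index=["sample_a", "sample_b"],
    columns=["P1", "P2", "P3"],
)

# np.linalg.norm has no NaN handling: any NaN in a row makes that norm NaN.
naive_norms = np.linalg.norm(data.to_numpy(), axis=1)

# The approach from the diff: square, sum per row while skipping NaN, sqrt.
square_sum_per_row = data.pow(2).sum(axis=1, skipna=True)
l2_norms = np.sqrt(square_sum_per_row)

# A NumPy-only equivalent (an alternative, not what the PR uses).
nan_aware_numpy = np.sqrt(np.nansum(data.to_numpy() ** 2, axis=1))

# Row-wise normalization; a zero norm (for example an all-NaN row, whose
# skipna sum is 0) is turned into NaN to avoid dividing by zero.
normalized = data.div(l2_norms.replace(0, np.nan), axis=0)
```

The missing entry stays NaN after normalization, while the observed values in the same row are scaled by the norm of the observed values only.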

alphastats/DataSet_Preprocess.py

-                              n_jobs=2,
-                              random_state=0,
-                              verbose=0,  #  random forest takes a while print progress
+                          imp = sklearn.ensemble.HistGradientBoostingRegressor(

Contributor

mschwoer May 17, 2024

is this the same as RandomForest ? maybe introduce this as another method ?

Collaborator Author

boopthesnoot May 22, 2024

It's also based on multiple trees, so I don't know what would be right. I could call it gradient boosting, but this could be less known for non-technical people, although it would be more correct

alphastats/DataSet_Preprocess.py

@@ -30,45 +31,58 @@ def preprocess_print_info(self):
                       print(pd.DataFrame(self.preprocessing_info.items()))
                   def _remove_na_values(self, cut_off):
+                      if (
+                          self.preprocessing_info.get("Missing values were removed")

Contributor

mschwoer May 17, 2024

I don't know much about this preprocessing_info, but it seems to store information in human-readable keys.
This could lead to problems (say, I access something as self.preprocessing_info.get("Missing values were removed.") (did you spot the trailing dot? ;-))
I would suggest introducing a set of string constants, e.g.
MISSING_VALUES_REMOVED = "Missing values were removed"
somewhere and access this store exclusively through them

(not now, just for the future)
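The suggestion could look roughly like the following sketch. The class name `PreprocessingStateKeys` and the stand-alone dictionary are hypothetical; only the constant `MISSING_VALUES_REMOVED` and its string value come from the comment above.

```python
class PreprocessingStateKeys:
    """Hypothetical container for the keys used in preprocessing_info."""

    MISSING_VALUES_REMOVED = "Missing values were removed"


# Stand-in for self.preprocessing_info in the real class.
preprocessing_info = {PreprocessingStateKeys.MISSING_VALUES_REMOVED: True}

# A hand-typed key with a trailing dot silently returns None ...
typo_lookup = preprocessing_info.get("Missing values were removed.")

# ... whereas going through the constant always hits the intended entry.
safe_lookup = preprocessing_info.get(PreprocessingStateKeys.MISSING_VALUES_REMOVED)

# A misspelled constant name fails loudly instead of returning None.
try:
    PreprocessingStateKeys.MISSING_VALUE_REMOVED  # note the missing "S"
    misspelling_detected = False
except AttributeError:
    misspelling_detected = True
```

The benefit is that a mistake in the key becomes an `AttributeError` at the point of use, not a silent `None` that changes the preprocessing path.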

alphastats/gui/pages/02_Import Data.py (outdated, resolved)

alphastats/DataSet_Preprocess.py (resolved)

mschwoer reviewed

alphastats/DataSet_Preprocess.py (resolved)


          FIX: pr comments

4eab905

boopthesnoot merged commit dc517ed into main

6 checks passed

boopthesnoot deleted the na_bugfix branch

May 22, 2024 14:10
