Merge pull request #75 from biocore/jointrpca

v0.0.10
biocore · Jan 18, 2024 · 190f89e · 190f89e
2 parents 9ac99ac + 1fde9cf
commit 190f89e
Show file tree

Hide file tree

Showing 83 changed files with 287,323 additions and 369 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,33 @@
+
+v0.0.10 (2023-12-01)
+
+### Bug fixes
+
+* Removed np.int types
+    * fixes bug that prevents use in newer versions of numpy / QIIME2
+        * see issue #71
+
+### Features [experimental]
+
+* rpca projection of new data and with cross validation
+    * A new function for RPCA where new data can be projected into an existing ordination.
+    * Also allows for internal CV, where the hold out data is projected into the training ordination, which is then used to re-build the test data, and the error between the projection and the real test data is calculated.
+        * see issue #70
+* tempted.py and assocated tests/commands/tutorials
+    * Added in a python implementation TEMPTED, more details on the methods inm this paper [here](https://www.biorxiv.org/content/10.1101/2023.07.26.550749v1).
+* joint-rpca and associated tests/commands/tutorials
+    * Joint-RPCA is an extention of RPCA for multiple omics types.
+* qc_rarefaction
+    * This function compares the mantel correlation of distances to the abs. differences in sample sum between rarefied and unrarefied input data. This is an easy check to ensure the results are not bieng significantly altered by non-rarefaction is cases of large differences (e.g., low-biomass) or where the sample sum differences can not be seperated from the phenotypes/low-rank structure of the data (i.e., deep sequencing of controls and shallow of sick).
+        * see issue #70
+
+### Deprecated functionality
+
+* auto_rpca across the package and rank_estimate in optspace.py
+    * was not performing correctly and no reasonable quick fix is available.
+        * see issue #70
+    * **Change from `auto_rpca` to `gemelli.rpca import rpca`**
+
 v0.0.9 (2023-05-24)
 
 ### Bug fixes

diff --git a/README.md b/README.md
@@ -1,9 +1,9 @@
 
 # Gemelli
 
-Gemelli is a tool box for running both Robust Aitchison PCA (RPCA) and Compositional Tensor Factorization (CTF) on _sparse_ compositional omics datasets.
+Gemelli is a tool box for running Robust Aitchison PCA (RPCA), Joint Robust Aitchison PCA (Joint-RPCA), TEMPoral TEnsor Decomposition (TEMPTED), and Compositional Tensor Factorization (CTF) on _sparse_ compositional omics datasets.
 
-RPCA can be used on cross-sectional datasets where each subject is sampled only once. CTF can be used on repeated-measure data where each subject is sampled multiple times (e.g. longitudinal sampling). Both methods are [_unsupervised_](https://en.wikipedia.org/wiki/Unsupervised_learning) and aim to describe sample/subject variation and the biological features that separate them. 
+RPCA can be used on cross-sectional datasets where each subject is sampled only once. CTF can be used on repeated-measure data where each subject is sampled multiple times (e.g. longitudinal sampling). TEMPTED is specifically designed for longitundal (time series) repeated measure studies, especially when samples are irregularly sampled across subjects. Joint-RPCA allows for the exploration of multiple omics datasets with shared samples at once. All these methods are [_unsupervised_](https://en.wikipedia.org/wiki/Unsupervised_learning) and aim to describe sample/subject variation and the biological features that separate them.
 
 The preprocessing transform for both RPCA and CTF is the robust centered log-ratio transform (rlcr) which accounts for sparse data (i.e. many missing/zero values). Details on the rclr can be found [here](https://msystems.asm.org/content/4/1/e00016-19) and a interactive introduction into the transformation can be found [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/introduction.ipynb). In short, the rclr log transforms the observed (nonzero) values before centering. RPCA and CTF then perform a matrix or tensor factorization on only the observed values after rclr transformation, similar to [Aitchison PCA](https://academic.oup.com/biomet/article-abstract/70/1/57/240898?redirectedFrom=fulltext) performed on dense data. If the data also has an associated phylogeny it can be incorporated through the phylogenetic rclr, details can be found [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-RPCA-moving-pictures.ipynb).
 
@@ -20,37 +20,50 @@ To install the most up to date version of gemelli, run the following command
 
 Gemelli can be run standalone or through [QIIME2](https://qiime2.org/) and as a python API or CLI. 
 
-## Cross-sectional study (i.e. one sample per subject) with RPCA
+## Cross-sectional / multi-omics study (i.e. one sample per subject) with RPCA
 
-If you have a [cross-sectional study design](https://en.wikipedia.org/wiki/Cross-sectional_study) with only one sample per subject then RPCA is the appropriate method to use in gemelli.  There are two commands within RPCA. The first is `rpca` and the second is `auto-rpca`. The only difference is that `auto-rpca` automatically estimates the underlying-rank of the matrix and requires no input for the `n_components` parameter. In the `rpca` command the `n_components` must be set explicitly. For examples of using RPCA we provide tutorials below exploring the microbiome between body sites.
+If you have a [cross-sectional study design](https://en.wikipedia.org/wiki/Cross-sectional_study) with only one sample per subject then RPCA is the appropriate method to use in gemelli. For examples of using RPCA we provide tutorials below exploring the microbiome between body sites.
+
+Joint-RPCA allows for the exploration of those feature that seperate jointly across sample groupings and the potential interactions of those features.  
 
 ### Tutorials
 
 #### Tutorials with QIIME2
 
 * [RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures.ipynb)
 * [Phylogenetic RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-RPCA-moving-pictures.ipynb)
+* [Joint-RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-QIIME2-CLI.ipynb)
+* [Joint-RPCA QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-QIIME2-API.ipynb)
+
 
 #### Standalone tutorial outside of QIIME2
 
 * [RPCA Python API & CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures-standalone-cli-and-api.ipynb)
+* [Joint-RPCA API & CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-CLI-API.ipynb)
 
-## Repeated measures study (i.e. multiple sample per subject) with CTF
+## Repeated measures study (i.e. multiple sample per subject) with CTF & TEMPTED
 
 ### Tutorials
 
-If you have a [repeated measures study design](https://en.wikipedia.org/wiki/Repeated_measures_design) with multiple samples per subject over time or space then CTF is the appropriate method to use in gemelli. For optimal results CTF requires samples for each subject in each time or space measurement. In some cases, this can require binning time my larger windows (e.g. instead of days use months). For examples of using CTF we provide a microbiome time series IBD study in the tutorials below.
+If you have a [repeated measures study design](https://en.wikipedia.org/wiki/Repeated_measures_design) with multiple samples per subject over time or space then CTF is the appropriate method to use in gemelli. For optimal results CTF requires samples for each subject in each time or space measurement. If that is not the case and your study has irregular time sampling, then TEMPTED should be used. TEMPTED also allows for the projection of new data into an existing factorization which is necessary for machine learning. For examples, explore the tutorials below.
 
 #### Tutorials with QIIME2
 
 * [CTF QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-QIIME2-CLI.md)
 * [CTF QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-QIIME2-API.ipynb)
 * [Phylogenetic CTF QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-IBD-Tutorial-QIIME2-CLI.ipynb)
 * [Phylogenetic CTF QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-IBD-Tutorial-QIIME2-API.ipynb)
+* [TEMPTED QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/TEMPTED-QIIME2-CLI.ipynb)
+* [TEMPTED QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/TEMPTED-QIIME2-API.ipynb)
 
 #### Standalone tutorial outside of QIIME2
 
 * [CTF Standalone Python API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-standalone-API.ipynb)
+* [TEMPTED R implementation - Intallation and tutorials](https://github.com/pixushi/tempted)
+
+## Performing parameter optimization and QC on results
+
+For an introduction to these QC methods see the tutorial [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-CV-QC-introduction.ipynb). Examples are also provided in the RPCA tutorials [here (RPCA QIIME2 CLI)](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures.ipynb) & [here (RPCA Python API & CLI)](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures-standalone-cli-and-api.ipynb). Users are encrouaged to report the QC/CV results for thier data.
 
 # Citations
 
@@ -95,6 +108,52 @@ Martino, C. et al. A Novel Sparse Compositional Technique Reveals Microbial Pert
 }
 ```
 
+
+## Citation for Phylogenetic RPCA
+
+```
+Martino, C. et al. A Novel Sparse Compositional Technique Reveals Microbial Perturbations. mSystems 4, (2019)
+```
+
+```
+@ARTICLE{Martino2022,
+  author = {Martino, Cameron and McDonald, Daniel and Cantrell, Kalen and
+            Dilmore, Amanda Hazel and Vázquez-Baeza, Yoshiki and Shenhav,
+            Liat and Shaffer, Justin P and Rahman, Gibraan and Armstrong,
+            George and Allaband, Celeste and Song, Se Jin and Knight, Rob},
+  title = {Compositionally Aware Phylogenetic {Beta-Diversity} Measures
+           Better Resolve Microbiomes Associated with Phenotype},
+  volume = {7},
+  number = {3},
+  elocation-id = {e0005022},
+  year =  {2022},
+  doi = {10.1128/msystems.00050-22},
+  publisher = {American Society for Microbiology Journals},
+  URL = {http://dx.doi.org/10.1128/msystems.00050-22},
+  journal = {mSystems},
+}
+```
+
+## Citation for TEMPTED
+
+```
+Shi, p. et al. Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv, (2023)
+```
+
+```
+@ARTICLE{Shi2023,
+  author = {Shi, Pixu and Martino, Cameron and Han, Rungang and Janssen,
+            Stefan and Buck, Gregory and Serrano, Myrna and Owzar, Kouros and
+            Knight, Rob and Shenhav, Liat and Zhang, Anru R},
+  title = {{Time-Informed} Dimensionality Reduction for Longitudinal
+           Microbiome Studies},
+  year =  {2023},
+  doi = {10.1101/2023.07.26.550749},
+  URL = {https://www.biorxiv.org/content/10.1101/2023.07.26.550749v1},
+  journal = {bioRxiv},
+}
+```
+
 ## Other Resources
 
 - The compositional data [wiki](https://en.wikipedia.org/wiki/Compositional_data)

diff --git a/gemelli/__init__.py b/gemelli/__init__.py
@@ -6,4 +6,4 @@
 # The full license is in the file COPYING.txt, distributed with this software.
 # ----------------------------------------------------------------------------
 
-__version__ = "0.0.9"
+__version__ = "0.0.10"
diff --git a/gemelli/_defaults.py b/gemelli/_defaults.py
@@ -10,7 +10,6 @@
 # descriptions. This is used by both the standalone RPCA and QIIME 2 RPCA sides
 # of gemelli.
 
-
 DEFAULT_MTD = 0
 DEFAULT_BL = None
 DEFAULT_COMP = 3
@@ -21,6 +20,20 @@
 DEFAULT_TENSALS_MAXITER = 25
 DEFAULT_FMETA = None
 DEFAULT_COND = None
+DEFAULT_METACV = None
+DEFAULT_COLCV = None
+DEFAULT_TESTS = 10
+DEFAULT_MATCH = True
+DEFAULT_TRNSFRM = True
+DEFAULT_TEMPTED_PC = 1
+DEFAULT_TEMPTED_EP = 1e-4
+DEFAULT_TEMPTED_SMTH = 1e-6
+DEFAULT_TEMPTED_RES = 101
+DEFAULT_TEMPTED_MAXITER = 20
+DEFAULT_TEMPTED_RH = 'random'
+DEFAULT_TEMPTED_RHC = 'random'
+DEFAULT_TEMPTED_SVDC = True
+DEFAULT_TEMPTED_SVDCN = 1
 DESC_BIN = ("The feature table containing the "
             "samples over which metric should be computed.")
 DESC_COUNTS = ("The feature table in biom format containing the "
@@ -103,4 +116,77 @@
               "sum of all children of that node (i.e. Fast UniFrac).")
 QBIPLOT = "A biplot of the (Robust Aitchison) RPCA feature loadings"
 QADIST = "The Aitchison distance of the sample loadings from RPCA."
+QACV = "The cross-validation reconstruction error."
 QRCLR = "A rclr transformed table. Note: zero/missing values have NaNs"
+DESC_METACV = ("Sample metadata file in QIIME2 formatting. "
+               "Containing the columns with training and test labels.")
+DESC_COLCV = ("Sample metadata column containing `train` and `test`"
+              " labels to use for the cross-validation evaluation.")
+DESC_TESTS = ("Number of random samples to choose for test split samples "
+              "if sample metadata and a train-test column are not provided.")
+DESC_TABLES = ("The collection of feature tables containing shared "
+               "samples over which metric should be computed.")
+DESC_TRAINTABLES = ("The tables to be projected on the first"
+                    " principal components previously extracted"
+                    " from a training set.")
+DESC_TRAINTABLE = ("The table to be projected on the first"
+                   " principal components previously extracted"
+                   " from a training set.")
+DESC_TRAINORDS = ("A joint-biplot of the (Robust Aitchison) RPCA"
+                  " feature loadings produced from the training data.")
+DESC_TRAINORD = ("A biplot of the (Robust Aitchison) RPCA"
+                 " feature loadings produced from the training data.")
+DESC_MATCH = ("Subsets the input tables to contain only features used in the"
+              " training data. If set to False and the tables are not"
+              " perfectly. Matched a ValueError will be produced.")
+DESC_TRNSFRM = ("If set to false the function will expect `tables`"
+                "to be dataframes already rclr transformed."
+                " This is used for internal functionality in the "
+                "joint-rpca function and is set to be only False.")
+DESC_MTABLE = ("The biplot ordination in"
+               " skbio.OrdinationResults format to match.")
+DESC_MORD = ("The feature table in biom format containing the"
+             "samples and features to match the ordination with.")
+DESC_FM = "If set to True the features in the ordination will be matched."
+DESC_SM = "If set to True the samples in the ordination will be matched."
+DESC_MORDOUT = "A subset biplot with the input table."
+DESC_CORRTBLORD = ("A joint-biplot or subset joint-biplot"
+                   " of the (Robust Aitchison) RPCA feature loadings.")
+DESC_CORRTBL = "A feature by feature correlation table."
+DESC_TCOND = ("Metadata column containing time points"
+              " across which samples are paired.")
+DESC_REP = ('Choose how replicate samples are handled. If replicates are'
+            'detected, "error" causes method to fail; "drop" will discard'
+            ' all replicated samples; "random" chooses one representative at'
+            ' random from among replicates.')
+DESC_SVD = "Removes the mean structure of the temporal tensor."
+DESC_SVDC = "Rank of approximation for average matrix in svd-centralize."
+DESC_RES = ("Number of time points to evaluate the value"
+            " of the temporal loading function.")
+DESC_SMTH = ("Smoothing parameter for RKHS norm. Larger means "
+             "smoother temporal loading functions.")
+DESC_MXTR = "Maximum number of iteration in for rank-1 calculation."
+DESC_EPS = ("Convergence criteria for difference between iterations "
+            "for each rank-1 calculation.")
+DESC_IO = ("Compositional biplot of subjects as points and"
+           " features as arrows. Where the variation between"
+           " subject groupings is explained by the log-ratio"
+           " between opposing arrows.")
+DESC_PIO = ("Compositional biplot of subjects as points from"
+            " new data projected into a pre-generated space.")
+DESC_SLO = ("Each components temporal loadings across the"
+            "input resolution included as a column called"
+            "'time_interval'.")
+DESC_SVDO = ("The loadings from the SVD centralize"
+             " function, used for projecting new data.")
+DESC_TDIST = "Subject by subject distance matrix (not samples)."
+DESC_PC = ("The pseudocount to add to the table before applying the "
+           "transformation. Default is zero which will add the"
+           " minimum non-zero value to all the values.")
+DESC_TJNT = ("If flagged Joint-RPCA will not use the RCLR "
+             "transformation and will instead assume that "
+             "the data has already been transformed or "
+             "normalized. Disabling the RCLR step will also "
+             "disable any filtering steps. It will be expected "
+             "that all filtering was done previously."
+             "Default is to use RCLR.")
diff --git a/gemelli/factorization.py b/gemelli/factorization.py
@@ -6,14 +6,12 @@
 # The full license is in the file COPYING.txt, distributed with this software.
 # ----------------------------------------------------------------------------
 
-import warnings
 import numpy as np
 import pandas as pd
 from scipy.linalg import svd
 from numpy.linalg import norm
 from .base import _BaseImpute
 from scipy.spatial import distance
-from gemelli.optspace import rank_estimate
 
 
 class TensorFactorization(_BaseImpute):
@@ -524,19 +522,6 @@ def tenals(tensor,
     tensor_dimensions = tensor.shape
     # Frobenius norm initial for ALS minimization.
     initial_tensor_frobenius_norm = norm(tensor)**2
-    # rank est. (for each slice)
-    if len(tensor_dimensions) == 3:  # only for 3D
-        for i in range(tensor_dimensions[0]):
-            obs_tmp = tensor[i, :, :].copy()
-            total_nonzeros = np.count_nonzero(mask[i, :, :].copy())
-            n_, m_ = obs_tmp.shape
-            eps_tmp = total_nonzeros / np.sqrt(n_ * m_)
-            if min(obs_tmp.shape) <= 2:
-                # two-subjects/time is already low-rank
-                continue
-            if rank_estimate(obs_tmp, eps_tmp) >= (min(obs_tmp.shape) - 1):
-                warnings.warn('A component of your data may be high-rank.',
-                              RuntimeWarning)
     # initial by Robust Tensor Power Method (modified for non-symmetric
     # tensors).
     initial_eigvals, initial_loadings = robust_tensor_power_method(

diff --git a/gemelli/matrix_completion.py b/gemelli/matrix_completion.py
@@ -117,12 +117,8 @@ def _fit(self):
             raise ValueError("max_iterations must be at least 1")
 
         # check the settings for n_components
-        if isinstance(self.n_components, str) and \
-           self.n_components.lower() == 'auto':
-            # estimate the rank of the matrix
-            self.n_components = 'auto'
         # check hardset values
-        elif isinstance(self.n_components, int):
+        if isinstance(self.n_components, int):
             if self.n_components > (min(n, m) - 1):
                 raise ValueError("n-components must be at most"
                                  " 1 minus the min. shape of the"
@@ -133,7 +129,7 @@ def _fit(self):
         # otherwise rase an error.
         else:
             raise ValueError("n-components must be "
-                             "an interger or 'auto'")
+                             "an interger.")
 
         # return solved matrix
         self.U, self.s, self.V = OptSpace(n_components=self.n_components,
@@ -163,7 +159,7 @@ def fit_transform(self, X, y=None):
         having right singular vectors as rows. Of shape (N,n_components)
 
         """
-        X_sparse = X.copy().astype(np.float64)
+        X_sparse = X.copy().astype(float)
         self.X_sparse = X_sparse
         self._fit()
         return self.sample_weights