-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Archipelago - dict transformer for vectorizing persistence diagrams #1017
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diagrams according to homology_dimensions. | ||
|
||
Examples | ||
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit reluctant to add more code that uses the old format for persistence diagrams, when we are trying to replace it with something more convenient (see #925, https://github.com/GUDHI/gudhi-devel/blob/master/src/python/gudhi/sklearn/cubical_persistence.py or #691). Of course we could change archipelago later to accept the new format, but it may become complicated if we try to be compatible with both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the new format is but it seems it will be better optimized for Archipelago, so all the better. Do you mean you want me to wait for those PR to come online before processing this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the new format is but it seems it will be better optimized for Archipelago, so all the better.
It is the one we discussed before the holidays. Instead of a list of (dim,(birth,death)), if the user asked for homology in dimensions [i,j,k], they get [dgm_i, dgm_j, dgm_k] where each dgm is a numpy array of shape (*,2).
It is better in that you already have numpy arrays for each dimension. However, because it is a list and not a dict, you do not have access to the labels [i,j,k], the user has to provide them separately if you need them for something. It also makes it ambiguous, if we refer to the 0-th diagram, whether it corresponds to homology in dimension 0, or in dimension i (the first of the list).
Do you mean you want me to wait for those PR to come online before processing this one?
I don't know what I want 😉
|fit on| and |transform = vectorize| list or series of persistence diagrams (in pandas format). | ||
The object is sklearn-API consistent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the pandas format controlled by transform_output
in scikit-learn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, but that feature I have not implemented: at the moment I force the output in pandas format. Will look into that and if it's easy I will implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it is worked out within the TransformerMixin of sklearn, so I branched it up now (not sure if very safe though?) and it nicely gets rid of the annoying pandas dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not sure if very safe though?)
Do you mean "safe" in terms of the version of sklearn that is required, or something else?
(the doc may need a small update to reflect this change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the number (and names) of features out of Archipelago can only be known at transform
run time, I change a class attribute accordingly, and I'm not sure if it is very safe...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the number (and names) of features out of Archipelago can only be known at
transform
run time,
I see. It could be computed at fit
time, which would better match the sklearn philosophy, but that wouldn't be convenient. I have no idea how bad it is, if it can interact strangely with cloning or something. For now, I don't see any obvious problem...
continue | ||
this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags] | ||
vectorized_dgms = self.archipelago_[dimension].transform(this_dim_list_pdiags) | ||
running_transform_names += [f"{dimension} Center {i + 1}" for i in range(vectorized_dgms.shape[1])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "feature" is more generic than "center" if the island is not necessarily atol. Although the name archipelago may look strange if we don't use atol at all.
|
||
class Archipelago(BaseEstimator, TransformerMixin): | ||
""" | ||
Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this may be more natural with the new format)
Maybe we should say that this class is helpful to avoid going through DimensionSelector+FeatureUnion or similar. It could even be nice to have an example somewhere comparing the 2 (we can add it later though).
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))] | ||
>>> pdiagram2 = [(0, (0.0, 3.34)), (0, (0.0, 2.956)), (1, (0.536, 1.856)), (2, (1.202, 2.734))] | ||
>>> pdiagram3 = [(0, (1.0, 4.34)), (0, (2.0, 3.956)), (1, (1.536, 2.856)), (2, (3.202, 4.734))] | ||
>>> list_pdiags = [pdiagram1, pdiagram2, pdiagram3] | ||
>>> archipelago = Archipelago(island=Atol()) | ||
>>> archipelago.fit(X=list_pdiags) | ||
>>> archipelago.transform(X=list_pdiags) | ||
>>> archipelago = Archipelago(island_dict={2: BettiCurve(resolution=4), 0:Atol()}) | ||
>>> import pandas as pd | ||
>>> series_pdiags = pd.Series(list_pdiags) | ||
>>> archipelago.set_output(transform="pandas") | ||
>>> archipelago.fit(X=series_pdiags) | ||
>>> archipelago.transform(X=series_pdiags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html, using the new format of persistence diagrams, and ignoring pandas for now, is this equivalent to the following?
from gudhi.representations import BettiCurve, Atol, Archipelago
import numpy as np
from sklearn.compose import ColumnTransformer
Xpdiagram1 = [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])]
Xpdiagram2 = [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])]
Xpdiagram3 = [np.array([[1.0, 3.34], [2.0, 2.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
Xlist_pdiags = [Xpdiagram1, Xpdiagram2, Xpdiagram3]
ct=ColumnTransformer([("0",Atol(),0),("1",Atol(),1),("2",Atol(),2)])
print(ct.fit_transform(Xlist_pdiags))
ct=ColumnTransformer([("0",Atol(),0),("2",BettiCurve(resolution=4),2)])
print(ct.fit_transform(Xlist_pdiags))
I had to do a couple changes in BettiCurve to let this run (I don't know exactly what not X
is meant to test for)
@@ -381,7 +381,7 @@
if not self.is_fitted():
raise NotFittedError("Not fitted.")
- if not X:
+ if X is None or len(X) == 0:
X = [np.zeros((0, 2))]
N = len(X)
@@ -408,7 +408,7 @@
return np.array(bettis, dtype=int)[:, 0:-1]
- def fit_transform(self, X):
+ def fit_transform(self, X, y=None):
"""
The result is the same as fit(X) followed by transform(X), but potentially faster.
"""
and I get similar results, but they differ from what Archipelago returns for dimension 0 (maybe it is just related to random generators).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, they differ because you made a mistake in copying it should be
Xpdiagram3 = [np.array([[1.0, 4.34], [2.0, 3.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
and then the results are equal indeed.
Also this seems to be working:
ct.set_output(transform="pandas")
Xdf_pdiags = pd.DataFrame(Xlist_pdiags)
print(ct.fit_transform(Xdf_pdiags))
so this is possibly the end for this PR, with the new pdiags format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the one thing this PR does better than your proposition is handle dimensions that fail at fit
, by not registering the key into the archipelago_
dict.
It seems desirable for now?
Which makes me think a lot of methods from vector_methods
are not protected against things like
X is None or len(X) == 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further discussion with Marc, I believe the better philosophy for gudhi is what he suggests:
- if one asks for vectorizing of some homology dimensions (say 0, 1 and 2) we provide the structure for the dimensions (for instance with ColumnTransformers from this conversation),
- and because there can be no homology features at some dimensions, if it turns out there is no data to vectorize, we should print a warning (e.g. "no information in dimension d, problems ahead, consider asking for different homology dimensions") but still let it go to crashing, instead of what I was doing which was removing the problematic dimension.
So we can definitely drop this PR now.
We should (somehow?) salvage the changes to vector_methods: default method for Atol so that it can be initialized with empty arguments, and some tweaks that can be generalized to all vector methods such as implementing get_feature_names_out
so that we can use the set_output
feature from sklearn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should still include the other changes.
For reference, the discussion was mostly about how to handle empty diagrams in some dimensions. If the user asks for persistence in dimensions 0, 1 and 2, but the diagrams in dimension 2 are all empty (especially during fit
), we have several options:
- we can throw an exception (possibly just for the vectorizers that do need fitting like Atol, or Landscape without an explicit
sample_range
) - we can (print a warning and) continue, and for this dimension
transform
will either (preferably) return a harmless constant, or return nonsense (including NaN). - we can (print a warning and) skip that dimension henceforth,
transform
will act as if the user had only asked for dimensions 0 and 1.
@martinroyer chose the last option in this PR. From a pure ML perspective, that option is convenient. A user can ask the pipeline to aggregate the results in several dimensions and not care about the details, if one dimension is useless and the pipeline can automatically ignore it, that user is happy. Ignoring that dimension also has the advantage over returning a constant that it avoids a useless increase in dimension. It also goes well with the dictionary-like pandas.
From a lower level perspective, the last option can be confusing. The output of transform does not have the expected shape (I can ask for 3 landscapes with 5 points each but get a vector of dimension 10 because one dimension was ignored).
Both options have their advantages. Note that this choice can be handled either inside each vectorizer (say Atol(skip_on_fit_error=True)
for which fit([])
defines an empty list of centers, and after that transform
returns a list of empty arrays, which hopefully disappear when ColumnTransformer tries to append them to the rest), or in a wrapper like ColumnTransformer.
In any case, we should check how we handle (lists of) empty diagrams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PS: redrafted to reflect that this is consistent with @mglisse's option number 2]
Adding to that, I believe there are two "empty diagrams" cases to discuss, that of fit
and that of transform
.
As for fit
, if we hold the view that once someone asks for vectorizing (say in dimensions 0 and 1) we should provide the structure for doing so (Marc's option 2 "provide the structure at all costs"):
for Atol that implies that even without data there should always be centers at the end of fit
- so owe could for instance do random generation:
def fit(self, X, y=None, sample_weight=None):
if not hasattr(self.quantiser, 'fit'):
raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))
# In fitting we remove infinite birth/death time points so that every center is finite
X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X if len(dgm)]
if len(X) < self.quantiser.n_clusters:
# If no point to fit, let's arbitrarily put centers in [0, 1)
X += [np.array([[np.random.random(), np.random.random()]]) for _ in range(self.quantiser.n_clusters - len(X))]
(This would ensure that quantiser has enough points to fit n_clusters
regardless of entry X
.)
As for transform
, one convention could be that empty diagrams are thought of as empty measure therefore vectorized as 0 = no mass. This ensures that no error is thrown which would be pretty desirable at transform
time?
For Atol one way of achieving this is to replace empty diagrams with np.array([[np.inf, np.inf]])
.
That looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am globally in favor of option (2), that is, return a constant for all data instance if the corresponding persistence diagrams are all empty. I think it matches better with ML practices, where you usually use homology dimensions without thinking too much beforehand if the persistence diagrams make sense or not. If they do not (and thus constants are returned), pipelines will not crash, but will contain redundant, useless constants, that can be spotted a posteriori with post-analysis of the results, as is sometimes done in other contexts.
That being said, I don't know if this is the behavior of the different representation methods that we have...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentRouvreau hopefully this is the bugfixes you need!
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006)) | ||
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentRouvreau I believe this is what you need part1
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006)) | ||
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10)) | ||
# Atol will do | ||
# X = np.concatenate([a,b,c]) | ||
# kmeans = KMeans(n_clusters=2, random_state=202006).fit(X) | ||
# kmeans = KMeans(n_clusters=2, random_state=202006, n_init=10).fit(X) | ||
# kmeans.labels_ will be : array([1, 0, 1, 0, 0, 1, 0, 0]) | ||
first_cluster = np.asarray([a[0], a[2], b[2]]) | ||
second_cluster = np.asarray([a[1], b[0], b[2], c[0], c[1]]) | ||
second_cluster = np.asarray([a[1], b[0], b[1], c[0], c[1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentRouvreau I believe this is what you need part2
quantiser=KMeans(n_clusters=1, random_state=202006), | ||
quantiser=KMeans(n_clusters=1, random_state=202006, n_init=10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentRouvreau I believe this is what you need part3/3
- arbitrary centers when not enough points - better filtering of infinite birth/death time points - weighting before filtering
So I implemented what we discussed in February in this PR: the ability (for I added a test file that checks if a vectorizer is able to handle various scenarii with input persistence diagram (new format): fit/transform/fit when empty/transform when empty/panda's We could rewrite and keep only the tests that best interest us if we want to have a sort of common interface for vectorization classes down the way, and add the classes to the test once they do what we want them to do. I removed archipelago because I believe it is well handled by scikit-learn's� |
So this PR has now become only a "Atol robust update" coupled with some interesting tests for a shared vectorisation method interface. I can redraft it into one/two PRs if we want things to be clearer. |
This is continued in PR #1096. |
Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods. One provides persistence diagram vectorizers (by way of either
island
orisland_dict
), and the Archipelago object will |fit on| and |transform = vectorize| list or series of persistence diagrams (in pandas format). The object is sklearn-API consistent.So the difference between Archipelago and methods from gudhi.representations.vector_methods is that this operates on the "whole" persistence diagrams e.g.$R^2$ .
[(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
, whereas methods from vector_methods tend to operate on a single dimension e.g. list of numpy arrays inSo far it feels nice to have something like
Archipelago(island_dict={2: BettiCurve(resolution=4), 0:TopologicalVector(threshold=3)})
work, and also it looks to be working on both lists of diagrams and pandas.Series of diagrams.This PR also contains minor edits to Atol like default quantiser.