black readme and update docstring
momonga-ml committed Nov 12, 2023
1 parent 5f4b638 commit 8c740e9
Showing 6 changed files with 55 additions and 17 deletions.
37 changes: 29 additions & 8 deletions README.md
@@ -43,15 +43,16 @@ print(clf.score())

## Usage

For slower but more stable results select `intersection_union_mapper` to combine embedding layers via third UMAP.
Be sure that random seeds are set too!
For slower but more **stable** results, select `intersection_union_mapper` to combine the embedding layers via a third UMAP, which gives equal weight to both numerical and categorical columns. By default, the random seed is set, which prevents UMAP from running in parallel but helps circumvent some of [the randomness](https://umap-learn.readthedocs.io/en/latest/reproducibility.html) of the algorithm.

```python
clf = DenseClus(
umap_combine_method="intersection_union_mapper",
)
```
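
To build some intuition for what combining embedding layers by intersection or union means, here is a toy sketch on fuzzy graphs stored as plain dictionaries. It is not how UMAP or DenseClus implement these operations: the function names `fuzzy_intersection` and `fuzzy_union`, and the product and probabilistic-sum rules, are simplifying assumptions chosen for illustration.

```python
# Toy illustration only: a fuzzy graph is a dict mapping an edge (i, j)
# to a membership strength in [0, 1]. The names and combination rules
# below are assumptions for this sketch, not UMAP's implementation.


def fuzzy_intersection(a, b):
    """Keep edges found in both graphs; multiply their strengths."""
    return {edge: a[edge] * b[edge] for edge in a.keys() & b.keys()}


def fuzzy_union(a, b):
    """Keep edges found in either graph; combine with a probabilistic sum."""
    combined = {}
    for edge in a.keys() | b.keys():
        x, y = a.get(edge, 0.0), b.get(edge, 0.0)
        combined[edge] = x + y - x * y
    return combined


# Made-up edge strengths standing in for the two embedding layers.
numerical_graph = {(0, 1): 0.5, (1, 2): 0.8}
categorical_graph = {(0, 1): 0.4, (2, 3): 0.9}

print(fuzzy_intersection(numerical_graph, categorical_graph))
print(fuzzy_union(numerical_graph, categorical_graph))
```

In this toy setting the intersection keeps only the edge that both graphs agree on, while the union keeps every edge from either graph.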

### Advanced Usage

For advanced users, it's possible to exert more fine-grained control over the underlying algorithms by passing
dictionaries into the `DenseClus` class.

@@ -60,23 +61,43 @@ For example:
from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {'categorical': {'n_neighbors': 15, 'min_dist': 0.1},
'numerical': {'n_neighbors': 20, 'min_dist': 0.1}}
hdbscan_params = {'min_cluster_size': 10}
umap_params = {
"categorical": {"n_neighbors": 15, "min_dist": 0.1},
"numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(umap_combine_method="union"
,umap_params=umap_params
,hdbscan_params=hdbscan_params)
, umap_params=umap_params
, hdbscan_params=hdbscan_params
, random_state=None) # this will run in parallel

clf.fit(df)
```
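
Because these options are passed as plain strings and dictionaries, a typo can be easy to miss. The following is a minimal sketch of a pre-flight check, assuming the four combine methods and the `categorical` / `numerical` keys shown in this README. The helper `build_denseclus_kwargs` is hypothetical and not part of the `denseclus` package, and whether `DenseClus` performs similar validation itself is not stated here.

```python
# Hypothetical helper, not part of denseclus: checks option names before
# they are handed to DenseClus(**kwargs).
VALID_COMBINE_METHODS = (
    "intersection",
    "union",
    "contrast",
    "intersection_union_mapper",
)
VALID_UMAP_KEYS = {"categorical", "numerical"}


def build_denseclus_kwargs(
    umap_combine_method="contrast",
    random_state=42,
    umap_params=None,
    hdbscan_params=None,
):
    """Return a kwargs dict after validating the option names."""
    if umap_combine_method not in VALID_COMBINE_METHODS:
        raise ValueError(f"unknown umap_combine_method: {umap_combine_method!r}")
    unknown = set(umap_params or {}) - VALID_UMAP_KEYS
    if unknown:
        raise ValueError(f"unknown umap_params keys: {sorted(unknown)}")
    return {
        "umap_combine_method": umap_combine_method,
        "random_state": random_state,
        "umap_params": umap_params,
        "hdbscan_params": hdbscan_params,
    }


kwargs = build_denseclus_kwargs(
    umap_combine_method="union",
    umap_params={
        "categorical": {"n_neighbors": 15, "min_dist": 0.1},
        "numerical": {"n_neighbors": 20, "min_dist": 0.1},
    },
    hdbscan_params={"min_cluster_size": 10},
    random_state=None,  # no fixed seed, so UMAP may run in parallel
)
print(kwargs["umap_combine_method"])
```

The resulting dictionary could then be unpacked as `DenseClus(**kwargs)`.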


## Examples

A hands-on example with an overview of how to use is currently available in the form of a [Jupyter Notebook](/notebooks/DenseClus%20Example%20NB.ipynb).
### Notebooks

A hands-on example with an overview of how to use the package is currently available in the form of an [Example Jupyter Notebook](/notebooks/01_DenseClusExampleNB.ipynb).

Should you need to tune HDBSCAN, here is an optional approach: [Tuning with HDBSCAN Notebook](/notebooks/02_TuningwithHDBSCAN.ipynb)

Should you need to validate UMAP embeddings, there is an approach to do so in the [Validation for UMAP Notebook](/notebooks/03_ValidationForUMAP.ipynb)

### Blogs


[AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data](https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/)

[TDS Blog: How To Tune HDBSCAN](https://towardsdatascience.com/tuning-with-hdbscan-149865ac2970)

[TDS Blog: On the Validation of UMAP](https://towardsdatascience.com/on-the-validating-umap-embeddings-2c8907588175)



## References

33 changes: 25 additions & 8 deletions denseclus/DenseClus.py
@@ -52,21 +52,38 @@ class DenseClus(BaseEstimator, ClassifierMixin):
Parameters
----------
random_state : int, default=None
random_state : int, default=42
Random State for both UMAP and numpy.random.
If set to None UMAP will run in Numba in multicore mode but
results may vary between runs.
Setting a fixed random seed may help to offset the stochastic
nature of UMAP.
umap_combine_method : str, default=intersection
umap_combine_method : str, default=contrast
Method by which to combine embedding spaces.
Options include: intersection, union, contrast,
intersection_union_mapper
The latter combines both the intersection and union of
the embeddings.
See:
https://umap-learn.readthedocs.io/en/latest/composing_models.html
Available methods for combining the embeddings are
'intersection', 'union', 'contrast', and 'intersection_union_mapper'.
'intersection' preserves the numerical embeddings more, focusing on the quantitative aspects of
the data. This method is particularly useful when the numerical data is of higher importance or
relevance to the clustering task.
'union' preserves the categorical embeddings more, emphasizing the qualitative aspects of the
data. This method is ideal when the categorical data carries significant weight or importance in
the clustering task.
'contrast' highlights the differences between the numerical and categorical embeddings, providing
a more balanced representation of both. This method is particularly useful when there are
significant differences between the numerical and categorical data, and both types of data are
equally important for the clustering task.
'intersection_union_mapper' is a hybrid method that combines the strengths of both 'intersection'
and 'union'. It first applies the 'intersection' method to preserve the numerical embeddings, then
applies the 'union' method to preserve the categorical embeddings. This method is useful when both
numerical and categorical data are important, and neither type is necessarily more
important than the other.
See: https://umap-learn.readthedocs.io/en/latest/composing_models.html
prediction_data: bool, default=False
Whether to generate extra cached data for predicting labels or
@@ -105,7 +122,7 @@ class DenseClus(BaseEstimator, ClassifierMixin):
def __init__(
self,
random_state: int = 42,
umap_combine_method: str = "intersection",
umap_combine_method: str = "contrast",
prediction_data: bool = False,
verbose: bool = False,
umap_params=None,
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -41,7 +41,7 @@ target-version = "py311"
fix = true
unfixable = []
select = ["E", "W"]
ignore = ["E203", "E231", "E402", "E712", "F401"]
ignore = ["E203", "E231", "E402", "E712", "F401","E501"]
exclude = [
'.git',
'__pycache__',