v1.1.5
gagolews committed Oct 18, 2023
1 parent 1345503 commit b0a6f92
Showing 115 changed files with 2,882 additions and 3,743 deletions.
12 changes: 2 additions & 10 deletions .devel/sphinx/bibliography.bib
@@ -18,17 +18,9 @@ @misc{nca
note = {under review (preprint)}
}

@misc{clustering_benchmarks_v1,
author = {M. Gagolewski and others},
title = {Benchmark Suite for Clustering Algorithms -- Version 1},
year = {2020},
url = {https://github.com/gagolews/clustering-benchmarks},
doi = {10.5281/zenodo.3815066}
}

@misc{Gagolewski2022:clustering-data-v1.1.0,
author = {M. Gagolewski and others},
title = {A benchmark suite for clustering algorithms: Version 1.1.0},
title = {A benchmark suite for clustering algorithms: {V}ersion 1.1.0},
year = {2022},
url = {https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0},
doi = {10.5281/zenodo.7088171}
@@ -47,7 +39,7 @@ @article{clustering-benchmarks

@book{datawranglingpy,
author = {M. Gagolewski},
title = {Minimalist Data Wrangling with Python},
title = {Minimalist Data Wrangling with {P}ython},
doi = {10.5281/zenodo.6451068},
isbn = {978-0-6455719-1-2},
publisher = {Zenodo},
9 changes: 4 additions & 5 deletions .devel/sphinx/news.md
@@ -1,7 +1,6 @@
# Changelog


## 1.1.4.9xxx
## 1.1.5 (2023-10-18)

* [BACKWARD INCOMPATIBILITY] [Python and R] Inequality measures
are no longer referred to as inequity measures.
@@ -66,9 +65,6 @@

## 1.1.0 (2022-09-05)

* [GENERAL] The below-mentioned cluster validity measures are discussed
in more detail at <https://clustering-benchmarks.gagolewski.com>.

* [Python and R] New function: `adjusted_asymmetric_accuracy`.

* [Python and R] Implementations of the so-called internal cluster
@@ -89,6 +85,9 @@
`silhouette_w_index`,
`wcnn_index`.

These cluster validity measures are discussed
in more detail at <https://clustering-benchmarks.gagolewski.com>.
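
  For instance, here is a hedged sketch of how these might be called
  (we assume they reside in a `genieclust.cluster_validity` module and
  accept a data matrix plus an integer label vector; see the documentation
  linked above for the authoritative API):

  ```python
  import numpy as np
  import genieclust
  import genieclust.cluster_validity

  # toy data: three Gaussian blobs
  X = np.random.randn(300, 2) + np.repeat([[0, 0], [5, 0], [0, 5]], 100, axis=0)
  labels = genieclust.Genie(n_clusters=3).fit_predict(X)

  # two of the measures listed above (oriented so that higher is better)
  print(genieclust.cluster_validity.silhouette_index(X, labels))
  print(genieclust.cluster_validity.wcnn_index(X, labels))
  ```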

* [BACKWARD INCOMPATIBILITY] `normalized_confusion_matrix`
now solves the maximal assignment problem instead of applying
the somewhat primitive partial pivoting.
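
  As an illustration of the underlying idea (a sketch, not the package's
  actual code), such a maximal assignment can be obtained with SciPy's
  linear sum assignment solver:

  ```python
  import numpy as np
  from scipy.optimize import linear_sum_assignment

  C = np.array([[1, 9, 0],   # confusion matrix: rows = reference labels,
                [8, 1, 1],   # columns = predicted cluster labels
                [0, 2, 7]])

  # find the column permutation maximising the diagonal sum
  _, cols = linear_sum_assignment(C, maximize=True)
  print(C[:, cols])  # large counts now lie on the diagonal
  ```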
35 changes: 10 additions & 25 deletions .devel/sphinx/weave/Makefile
@@ -3,40 +3,25 @@
FILES_RMD = \
basics.Rmd \
sklearn_toy_example.Rmd \
r.Rmd
noise.Rmd \
r.Rmd \
benchmarks_approx.Rmd \
benchmarks_ar.Rmd \
benchmarks_details.Rmd \
timings.Rmd


FILES_RSTW = \
benchmarks_ar.rstw \
benchmarks_details.rstw \
benchmarks_approx.rstw \
noise.rstw \
timings.rstw

# string.rstw \
# sparse.rstw \
# sparse.Rmd \
# string.Rmd \

RMD_MD_OUTPUTS=$(patsubst %.Rmd,%.md,$(FILES_RMD))
#RMD_RST_OUTPUTS=$(patsubst %.Rmd,%.rst,$(FILES_RMD))

RSTW_RST_OUTPUTS=$(patsubst %.rstw,%.rst,$(FILES_RSTW))

%.md: %.Rmd
./Rmd2md.sh "$<"

#%.rst: %.md
# pandoc -f markdown+grid_tables --wrap=none "$<" -o "$@"

%.rst: %.rstw
./pweave_custom.py "$<" "$@"


all : rmd rstw
all : rmd

rmd : $(RMD_MD_OUTPUTS)

rstw : $(RSTW_RST_OUTPUTS)

clean:
rm -f $(RSTW_RST_OUTPUTS) $(RMD_MD_OUTPUTS)
rm -f $(RMD_MD_OUTPUTS)
.devel/sphinx/weave/benchmarks_approx.Rmd
@@ -1,29 +1,25 @@
Benchmarks — Approximate Method
===============================
# Benchmarks — Approximate Method

In one of the :any:`previous sections <timings>` we have demonstrated that the approximate version
of the Genie algorithm (:class:`genieclust.Genie(exact=False, ...) <genieclust.Genie>`), i.e.,
one which relies on `nmslib <https://github.com/nmslib/nmslib/tree/master/python_bindings>`_\ 's
approximate nearest neighbour search, is much faster than the exact one
on large, high-dimensional datasets. In particular, we have noted that
clustering of 1 million points in a 100d Euclidean space
takes less than 5 minutes on a laptop.
In one of the [previous sections](timings), we have demonstrated that the approximate version
of the Genie algorithm ([`genieclust.Genie(exact=False, ...)`](genieclust.Genie)), i.e.,
one which relies on `nmslib`'s {cite}`nmslib` approximate nearest neighbour search,
is much faster than the exact one on large, high-dimensional datasets.
In particular, we have noted that clustering of 1 million points
in a 100d Euclidean space takes less than 5 minutes on a laptop.

As *fast* does not necessarily mean *meaningful* (tl;dr spoiler alert: in our case, it does),
let's again consider all the datasets
from the `Benchmark Suite for Clustering Algorithms Version 1 <https://github.com/gagolews/clustering-benchmarks>`_
:cite:`clustering_benchmarks_v1`
(except the ``h2mg`` and ``g2mg`` batteries). Features with variance of 0 were
from the [Benchmark Suite for Clustering Algorithms (Version 1.0)](https://clustering-benchmarks.gagolewski.com)
{cite}`clustering-benchmarks`
(except the `h2mg` and `g2mg` batteries). Features with variance of 0 were
removed, datasets were centred at **0** and scaled so that their total
variance was 1. A tiny bit of Gaussian noise was added to each observation.
Clustering was performed with respect to the Euclidean distance.






<<bench-approx-imports,results="hidden",echo=False>>=
```{python bench-approx-imports,results="hide",echo=FALSE}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
@@ -50,11 +46,11 @@ res = pd.read_csv("v1-timings.csv") # see timings.py
dims = pd.read_csv("v1-dims.csv")
dims["dataset"] = dims["battery"]+"/"+dims["dataset"]
dims = dims.loc[:,"dataset":]
@
```



<<approx-diffs-load,results="hidden",echo=False>>=
```{python approx-diffs-load,results="hide",echo=FALSE}
# Load results file:
res = pd.read_csv("v1-scores-approx.csv")
# ari, afm can be negative --> replace negative indexes with 0.0
@@ -80,20 +76,20 @@ params.columns = ["method", "gini_threshold", "run"]
res_max = pd.concat((res_max.drop("method", axis=1), params), axis=1)
res_max["dataset"] = res_max["battery"] + "/" + res_max["dataset"]
res_max = res_max.iloc[:, 1:]
@
```



On each benchmark dataset ("small" and "large" alike),
we have fired 10 runs of the approximate Genie method (``exact=False``)
we performed 10 runs of the approximate Genie method (`exact=False`)
and computed the adjusted Rand (AR) indices to quantify the similarity between the predicted
outputs and the reference ones.

We've computed the differences between each of the 10 AR indices
and the AR index for the exact method. Here is the complete list of datasets
and `gini_threshold`\ s where this discrepancy is seen at least 2 digits of precision:
and `gini_threshold`s where this discrepancy shows up at two or more decimal digits of precision:

<<approx-diffs,results="rst",echo=False>>=
```{python approx-diffs,results="asis",echo=FALSE}
# which similarity measure to report below:
similarity_measure = "ar"
@@ -106,35 +102,35 @@ _dat = diffs_stats.loc[(np.abs(diffs_stats["min"])>=0.0095)|(np.abs(diffs_stats[
#_dat = _dat.drop("count", axis=1)
which_repeated = (_dat.dataset.shift(1) == _dat.dataset)
_dat.loc[which_repeated, "dataset"] = ""
print(tabulate(_dat, _dat.columns, tablefmt="rst", showindex=False), "\n\n")
@
print(tabulate(_dat, _dat.columns, tablefmt="github", showindex=False), "\n\n")
```


The only noteworthy difference is for the ``sipu/birch2`` dataset
The only noteworthy difference is for the `sipu/birch2` dataset
where we observe that the approximate method generates worse results
(although recall that `gini_threshold` of 1 corresponds to the single linkage method).
Interestingly, for ``sipu/worms_64``, the in-exact algorithm with `gini_threshold`
Interestingly, for `sipu/worms_64`, the inexact algorithm with `gini_threshold`
of 0.5 yields a much better outcome than the original one.


Here are the descriptive statistics for the AR indices across all the datasets
(for the approximate method we chose the median AR in each of the 10 runs):

<<approx-ar,results="rst",echo=False>>=
```{python approx-ar,results="asis",echo=FALSE}
_dat = res_max.groupby(["dataset", "method"])[similarity_measure].\
median().reset_index().groupby(["method"]).describe().\
round(3).reset_index()
_dat.columns = [l0 if not l1 else l1 for l0, l1 in _dat.columns]
_dat.method
#_dat.method
#which_repeated = (_dat.gini_threshold.shift(1) == _dat.gini_threshold)
#_dat.loc[which_repeated, "gini_threshold"] = ""
#_dat = _dat.drop("count", axis=1)
print(tabulate(_dat, _dat.columns, tablefmt="rst", showindex=False), "\n\n")
@
print(tabulate(_dat, _dat.columns, tablefmt="github", showindex=False), "\n\n")
```


For the recommended ranges of the `gini_threshold` parameter,
i.e., between 0.1 and 0.5, we see that the approximate version of Genie
behaves as good as the original one.
behaves similarly to the original one.
83 changes: 83 additions & 0 deletions .devel/sphinx/weave/benchmarks_approx.md
@@ -0,0 +1,83 @@




# Benchmarks — Approximate Method

In one of the [previous sections](timings), we have demonstrated that the approximate version
of the Genie algorithm ([`genieclust.Genie(exact=False, ...)`](genieclust.Genie)), i.e.,
one which relies on `nmslib`'s {cite}`nmslib` approximate nearest neighbour search,
is much faster than the exact one on large, high-dimensional datasets.
In particular, we have noted that clustering of 1 million points
in a 100d Euclidean space takes less than 5 minutes on a laptop.
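
As a quick usage sketch (illustrative sizes and parameter values; the
approximate mode additionally requires the optional `nmslib` package):

```python
import numpy as np
import genieclust

# illustrative stand-in for a large, high-dimensional dataset
X = np.random.randn(10_000, 100)

# exact=False requests nmslib's approximate nearest-neighbour search
g = genieclust.Genie(n_clusters=10, gini_threshold=0.3, exact=False)
labels = g.fit_predict(X)  # integer labels 0, 1, ..., 9
```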

As *fast* does not necessarily mean *meaningful* (tl;dr spoiler alert: in our case, it does),
let's again consider all the datasets
from the [Benchmark Suite for Clustering Algorithms (Version 1.0)](https://clustering-benchmarks.gagolewski.com)
{cite}`clustering-benchmarks`
(except the `h2mg` and `g2mg` batteries). Features with variance of 0 were
removed, datasets were centred at **0** and scaled so that their total
variance was 1. A tiny bit of Gaussian noise was added to each observation.
Clustering was performed with respect to the Euclidean distance.
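
A minimal sketch of this preprocessing pipeline; note that the noise
magnitude below is our own illustrative choice, as the text does not
specify it:

```python
import numpy as np

def preprocess(X, noise_sd=1e-6, seed=123):
    X = X[:, X.var(axis=0) > 0]           # drop zero-variance features
    X = X - X.mean(axis=0)                # centre at 0
    X = X / np.sqrt(X.var(axis=0).sum())  # rescale: total variance = 1
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, noise_sd, size=X.shape)  # tiny Gaussian noise
```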

On each benchmark dataset ("small" and "large" alike),
we performed 10 runs of the approximate Genie method (`exact=False`)
and computed the adjusted Rand (AR) indices to quantify the similarity between the predicted
outputs and the reference ones.
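
Each such run might look as follows (a sketch: `X` and `y_true` stand for a
preprocessed dataset and its reference labels; loading the benchmark
batteries is omitted here):

```python
import numpy as np
import genieclust
import genieclust.compare_partitions

def approx_ar_scores(X, y_true, gini_threshold=0.3, n_runs=10):
    scores = []
    for _ in range(n_runs):  # ANN search is randomised, so runs may differ
        g = genieclust.Genie(n_clusters=len(np.unique(y_true)),
                             gini_threshold=gini_threshold, exact=False)
        y_pred = g.fit_predict(X)
        scores.append(genieclust.compare_partitions.adjusted_rand_score(
            y_true, y_pred))
    return scores
```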

We've computed the differences between each of the 10 AR indices
and the AR index for the exact method. Here is the complete list of datasets
and `gini_threshold`s where this discrepancy shows up at two or more decimal digits of precision:

| dataset | gini_threshold | count | mean | std | min | 25% | 50% | 75% | max |
|------------------|------------------|---------|--------|-------|-------|-------|-------|-------|-------|
| sipu/birch2 | 0.7 | 10 | -0.01 | 0.01 | -0.02 | -0.02 | -0.01 | -0.01 | 0 |
| | 1 | 10 | -0.35 | 0.18 | -0.44 | -0.44 | -0.43 | -0.43 | 0 |
| sipu/worms_64 | 0.1 | 10 | -0.03 | 0.01 | -0.06 | -0.03 | -0.02 | -0.02 | -0.02 |
| | 0.3 | 10 | 0.02 | 0.01 | -0.01 | 0.02 | 0.03 | 0.03 | 0.03 |
| | 0.5 | 10 | 0.23 | 0.08 | 0.11 | 0.16 | 0.25 | 0.29 | 0.34 |
| wut/trajectories | 0.1 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.3 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.5 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.7 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 1 | 10 | -0.1 | 0.32 | -1 | 0 | 0 | 0 | 0 |


The only noteworthy difference is for the `sipu/birch2` dataset
where we observe that the approximate method generates worse results
(although recall that `gini_threshold` of 1 corresponds to the single linkage method).
Interestingly, for `sipu/worms_64`, the inexact algorithm with `gini_threshold`
of 0.5 yields a much better outcome than the original one.


Here are the descriptive statistics for the AR indices across all the datasets
(for the approximate method we chose the median AR in each of the 10 runs):

| method | count | mean | std | min | 25% | 50% | 75% | max |
|------------------|---------|--------|-------|-------|-------|-------|-------|-------|
| Genie_0.1 | 79 | 0.728 | 0.307 | 0 | 0.516 | 0.844 | 1 | 1 |
| Genie_0.1_approx | 79 | 0.728 | 0.307 | 0 | 0.516 | 0.844 | 1 | 1 |
| Genie_0.3 | 79 | 0.755 | 0.292 | 0 | 0.555 | 0.9 | 1 | 1 |
| Genie_0.3_approx | 79 | 0.755 | 0.292 | 0 | 0.568 | 0.9 | 1 | 1 |
| Genie_0.5 | 79 | 0.731 | 0.332 | 0 | 0.531 | 0.844 | 1 | 1 |
| Genie_0.5_approx | 79 | 0.734 | 0.326 | 0 | 0.531 | 0.844 | 1 | 1 |
| Genie_0.7 | 79 | 0.624 | 0.376 | 0 | 0.264 | 0.719 | 1 | 1 |
| Genie_0.7_approx | 79 | 0.624 | 0.376 | 0 | 0.264 | 0.719 | 1 | 1 |
| Genie_1.0 | 79 | 0.415 | 0.447 | 0 | 0 | 0.174 | 1 | 1 |
| Genie_1.0_approx | 79 | 0.409 | 0.45 | 0 | 0 | 0.148 | 1 | 1 |


For the recommended ranges of the `gini_threshold` parameter,
i.e., between 0.1 and 0.5, we see that the approximate version of Genie
behaves similarly to the original one.