
We unify several latent block models in a flexible exponential family latent block model (ELBM), which we extend to a sparse version (SELBM) that handles sparse data by revealing a diagonal co-cluster structure. This yields more homogeneous co-clusters and therefore useful, ready-to-use, and easy-to-interpret results.

License: MIT. Related repositories: https://github.com/Saeidhoseinipour/NMTFcoclust and https://github.com/Saeidhoseinipour/EM-typecoclust/tree/main

Paper: A sparse exponential family latent block model for co-clustering (Saeid Hoseinipour et al.)

ELBMcoclust and SELBMcoclust

Sparse and Non-Sparse Exponential Family Latent Block Model for Co-clustering

The statistical approach analyzes the data through an assumed probability distribution. The complete-data log-likelihood functions for the three model versions, LBM, exponential family LBM (ELBM), and sparse exponential family LBM (SELBM), are as follows; a small numerical sketch of the matrix-form terms appears after the list.

  • LBM
$$L^{\text{LBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma})= \sum\limits_{i,k}r_{ik} \log\pi_{k} +\sum\limits_{j,h} c_{jh}\log\rho_{h} + \sum\limits_{i,j,k,h} r_{ik}\, c_{jh}\, \log \varphi(x_{ij};\alpha_{kh}).$$
  • ELBM
$$\begin{align*} L^{\text{ELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto& \sum\limits_{i,k}r_{ik} \log\pi_{k} +\sum\limits_{j,h} c_{jh}\log\rho_{h} + \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{A}_{\boldsymbol{\alpha}} \right)\\ &- \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{E}_{mn}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{F}_{\boldsymbol{\alpha}} \right). \end{align*}$$
  • SELBM
$$\begin{align*} L^{\text{SELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto& \sum\limits_{k} r_{.k} \log\pi_{k} + \sum\limits_{h} c_{.h}\log\rho_{h} + \sum\limits_{k} \left[ \mathbf{R}^{\top}(\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}})\mathbf{C} \right]_{kk} \left( A(\alpha_{kk}) - A(\alpha) \right)\\ &- \sum\limits_{k} \left[\mathbf{R}^{\top} (\mathbf{E}_{mn} \odot \hat{\boldsymbol{\beta}} )\mathbf{C}\right]_{kk} \left( F(A(\alpha_{kk})) -F(A(\alpha)) \right). \end{align*}$$
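To make the matrix-form terms concrete, here is a minimal NumPy sketch of the ELBM data term for the Poisson case. It assumes, following the exponential-family notation above, that the sufficient statistic matrix is S_x = X, that β̂ = E_mn (all ones), and that A(α) = log α with F(A(α)) = exp(A(α)); the variable names are illustrative, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, g, s = 6, 5, 2, 2             # documents, words, row clusters, column clusters
X = rng.poisson(3.0, size=(m, n))   # toy document-word counts

# One-hot partition matrices R (m x g) and C (n x s)
R = np.eye(g)[rng.integers(0, g, m)]
C = np.eye(s)[rng.integers(0, s, n)]

# Block means: alpha_kh = (R^T X C)_kh / (r_.k c_.h)
alpha = (R.T @ X @ C) / (np.outer(R.sum(0), C.sum(0)) + 1e-12) + 1e-12

# Poisson exponential-family pieces: A(alpha) = log(alpha), F(A) = exp(A)
A_alpha = np.log(alpha)
F_alpha = np.exp(A_alpha)
E_mn = np.ones((m, n))

# The two trace terms of L^ELBM with S_x = X and beta-hat = E_mn
data_term = np.trace((R.T @ X @ C).T @ A_alpha)
penalty_term = np.trace((R.T @ E_mn @ C).T @ F_alpha)
print(data_term - penalty_term)  # data-dependent part, up to the mixing-proportion terms
```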

| Dataset | Topics | #Classes | (#Documents, #Words) | Sparsity (%) | Balance |
|---|---|---|---|---|---|
| Classic3 | Medical, Information retrieval, Aeronautical systems | 3 | (3891, 4303) | 98.95 | 0.71 |
| CSTR | Robotics/Vision, Systems, Natural Language Processing, Theory | 4 | (475, 1000) | 96.60 | 0.399 |
| WebACE | 20 different topics from the WebACE project | 20 | (2340, 1000) | 91.83 | 0.169 |
| Reviews | Food, Music, Movies, Radio, Restaurants | 5 | (4069, 18483) | 98.99 | 0.099 |
| Sports | Baseball, Basketball, Bicycling, Boxing, Football, Golfing, Hockey | 7 | (8580, 14870) | 99.14 | 0.036 |
| TDT2 | 30 different topics | 30 | (9394, 36771) | 99.64 | 0.028 |
  • Balance: (#documents in the smallest class)/(#documents in the largest class)
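The last two columns of the table follow directly from their definitions. A minimal sketch, assuming X is a dense (or densified) document-word count matrix and labels holds the true class of each document:

```python
import numpy as np

def sparsity(X):
    """Percentage of zero entries in the document-word matrix."""
    X = np.asarray(X)
    return 100.0 * (X == 0).sum() / X.size

def balance(labels):
    """(#documents in the smallest class) / (#documents in the largest class)."""
    counts = np.bincount(np.asarray(labels))
    counts = counts[counts > 0]          # ignore unused label values
    return counts.min() / counts.max()
```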
```python
from sklearn.metrics import confusion_matrix

from ELBMcoclust.Models.coclust_ELBMcem import CoclustELBMcem
from ELBMcoclust.Models.coclust_SELBMcem import CoclustSELBMcem
from NMTFcoclust.Evaluation.EV import Process_EV

# X_CSTR (document-word matrix) and true_labels are assumed to be loaded
# from the datasets shipped with the repository.
ELBM = CoclustELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
ELBM.fit(X_CSTR)

SELBM = CoclustSELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
SELBM.fit(X_CSTR)

process_ev = Process_EV(true_labels, X_CSTR, ELBM)

# Cluster indices are arbitrary; see the alignment sketch below the output.
confusion_matrix(true_labels, ELBM.row_labels_)
```


       [[101,   0,   0,   0],
        [  4,  52,  15,   0],
        [  0,   0, 178,   0],
        [  0,   0,  34,  91]]
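Because the model's cluster indices are arbitrary, the raw confusion matrix need not be diagonal, and simply sorting the label vector does not align it with the true classes. A standard remedy is to permute the predicted clusters with the Hungarian algorithm (scipy.optimize.linear_sum_assignment) so that the matched diagonal is maximized; a sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(true_labels, ELBM.row_labels_)

# Hungarian matching: permute predicted clusters to maximize the diagonal.
true_ind, pred_ind = linear_sum_assignment(-cm)
mapping = dict(zip(pred_ind, true_ind))
aligned = np.array([mapping[k] for k in ELBM.row_labels_])

print(confusion_matrix(true_labels, aligned))
```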

Visualization

  • Confusion Matrices


  • Vertical Bar chart


  • Horizontal Bar chart


  • Box plots


  • Scatter plots


  • Swarm plots


  • Reorganized 3×3 word clouds of PoissonSELBM for Classic3

Figure: word clouds of the top 60 words per co-cluster in Classic3, obtained by PoissonSELBM.

  • Reorganized 3×3 bar charts of PoissonSELBM for Classic3

Figure: bar charts of the top 50 words per co-cluster in Classic3, obtained by PoissonSELBM.
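The reorganized word clouds and bar charts above can be rebuilt from the fitted partitions: for each diagonal block (k, k), rank the words of column cluster k by their total count over the documents of row cluster k. A minimal sketch, where `vocab` (hypothetical name) is the array of words indexing the columns of the document-word matrix:

```python
import numpy as np

def top_words_per_block(X, row_labels, col_labels, vocab, k, n_top=60):
    """Top words of diagonal co-cluster (k, k): the words of column cluster k,
    ranked by their total count over the documents of row cluster k."""
    X = np.asarray(X)
    rows = np.where(np.asarray(row_labels) == k)[0]
    cols = np.where(np.asarray(col_labels) == k)[0]
    scores = X[np.ix_(rows, cols)].sum(axis=0)     # per-word totals within the block
    order = np.argsort(scores)[::-1][:n_top]
    return [vocab[j] for j in cols[order]]
```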

Contributions

The main contributions of the paper are summarized below:

  • Exponential family Latent Block Model (ELBM) and its sparse version (SELBM): We propose these models, which unify many leading algorithms suited to various data types.

  • Classification Expectation Maximization Approach: Our proposed algorithms use this approach within a general matrix-form framework; a minimal sketch follows this list.

  • Focus on Document-Word Matrices: While the matrix formalism is flexible enough to cover different distributions, we focus on document-word matrices in this work and evaluate ELBM and SELBM on six real document-word matrices and three synthetic datasets.
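For illustration, here is a minimal CEM loop for a plain Poisson latent block model in the matrix notation used above (one-hot R and C, block means alpha). It is a sketch under those assumptions, not the repository's implementation:

```python
import numpy as np

def cem_poisson_lbm(X, g, s, n_iter=50, seed=0, eps=1e-12):
    """Minimal classification EM for a plain Poisson latent block model (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    R = np.eye(g)[rng.integers(0, g, m)]   # one-hot row assignments (m x g)
    C = np.eye(s)[rng.integers(0, s, n)]   # one-hot column assignments (n x s)
    for _ in range(n_iter):
        # M-step: block means and mixing proportions
        alpha = (R.T @ X @ C) / (np.outer(R.sum(0), C.sum(0)) + eps) + eps
        pi, rho = R.mean(0) + eps, C.mean(0) + eps
        # CE-step (rows): assign row i to argmax_k of
        #   sum_h (XC)_ih log(alpha_kh) - sum_h c_.h alpha_kh + log(pi_k)
        row_score = (X @ C) @ np.log(alpha).T - alpha @ C.sum(0) + np.log(pi)
        R = np.eye(g)[row_score.argmax(1)]
        # CE-step (columns), symmetric to the row step
        col_score = (X.T @ R) @ np.log(alpha) - alpha.T @ R.sum(0) + np.log(rho)
        C = np.eye(s)[col_score.argmax(1)]
    return R.argmax(1), C.argmax(1)
```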

Highlights

  • The exponential family Latent Block Model (ELBM) and its sparse version (SELBM) were proposed, unifying many models across various data types.
  • The proposed algorithms, based on the classification expectation maximization approach, share a general matrix-form framework.
  • ELBM and SELBM were compared on six real document-word matrices and three synthetic datasets (Bernoulli, Poisson, Gaussian).
  • All datasets and algorithm code are available on GitHub in the ELBMcoclust repository.

Supplementary materials

Data Availability

The code for all algorithms, all datasets, additional visualizations, and other materials are available in the ELBMcoclust repository. Our experiments were performed on a PC (Intel(R) Core(TM) i7-10510U, 2.30 GHz), and all figures were produced in Python with the Seaborn and Matplotlib libraries.

Cite

Please cite the following paper in your publication if you are using ELBMcoclust in your research:

 @article{ELBMcoclust,
    title   = {A Sparse Exponential Family Latent Block Model for Co-clustering},
    author  = {Hoseinipour, Saeid and Aminghafari, Mina and Mohammadpour, Adel and Nadif, Mohamed},
    journal = {Advances in Data Analysis and Classification},
    pages   = {1--37},
    doi     = {10.1007/s11634-024-00608-3},
    year    = {2024}
 }

References

[1] Ailem, M., et al., Sparse Poisson latent block model for document clustering, IEEE Transactions on Knowledge and Data Engineering (2017a).

[2] Ailem, M., et al., Model-based co-clustering for the effective handling of sparse data, Pattern Recognition (2017b).

[3] Govaert, G. and Nadif, M., Clustering with block mixture models, Pattern Recognition (2003).

[4] Govaert, G. and Nadif, M., Block clustering with Bernoulli mixture models: comparison of different approaches, Computational Statistics and Data Analysis (2008).

[5] Govaert, G. and Nadif, M., Latent block model for contingency table, Communications in Statistics - Theory and Methods (2010).

[6] Govaert, G. and Nadif, M., Co-clustering: Models, Algorithms and Applications, John Wiley and Sons (2013).

[7] Priam, R., et al., Topographic Bernoulli block mixture mapping for binary tables, Pattern Analysis and Applications (2014).

[8] DasGupta, A., The exponential family and statistical applications, in Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics (2011).

[9] Del Buono, N., et al., Non-negative matrix tri-factorization for co-clustering: an analysis of the block matrix, Information Sciences (2015).

[10] Laclau, C., et al., Diagonal latent block model for binary data, Statistics and Computing (2017).

[11] Riverain, P., Fossier, S., et al., Semi-supervised Latent Block Model with pairwise constraints, Machine Learning (2022).

[12] Hartigan, J. A., Direct clustering of a data matrix, Journal of the American Statistical Association (1972).

[13] Hoseinipour, S., et al., Orthogonal parametric non-negative matrix tri-factorization with $\alpha$-divergence for co-clustering, Expert Systems with Applications (2023).
