Sparse and Non-Sparse Exponential Family Latent Block Model for Co-clustering
The goal of the statistical approach is to analyze the behavior of the data through its probability distribution. The complete log-likelihood functions for the three model variants, the LBM, the Exponential family LBM (ELBM), and the Sparse Exponential family LBM (SELBM), are as follows:
- LBM
- ELBM
- SELBM
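For reference, the complete-data log-likelihood of the plain LBM takes the standard form below (notation following Govaert and Nadif: hard row assignments $z_{ik}$, column assignments $w_{j\ell}$, row and column proportions $\pi_k$, $\rho_\ell$, and a block density $\varphi$ with parameter $\alpha_{k\ell}$). The ELBM takes $\varphi$ to be an exponential family density, and the SELBM additionally imposes a sparse block structure:

```math
L_C(\mathbf{z}, \mathbf{w}; \boldsymbol{\theta})
= \sum_{i,k} z_{ik} \log \pi_k
+ \sum_{j,\ell} w_{j\ell} \log \rho_\ell
+ \sum_{i,j,k,\ell} z_{ik} w_{j\ell} \log \varphi(x_{ij}; \alpha_{k\ell})
```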
| Datasets | Topics | #Classes | (#Documents, #Words) | Sparsity (%) | Balance |
|---|---|---|---|---|---|
| Classic3 | Medical, Information retrieval, Aeronautical systems | 3 | (3891, 4303) | 98.95 | 0.71 |
| CSTR | Robotics/Vision, Systems, Natural Language Processing, Theory | 4 | (475, 1000) | 96.60 | 0.399 |
| WebACE | 20 different topics from the WebACE project | 20 | (2340, 1000) | 91.83 | 0.169 |
| Reviews | Food, Music, Movies, Radio, Restaurants | 5 | (4069, 18483) | 98.99 | 0.099 |
| Sports | Baseball, Basketball, Bicycling, Boxing, Football, Golfing, Hockey | 7 | (8580, 14870) | 99.14 | 0.036 |
| TDT2 | 30 different topics | 30 | (9394, 36771) | 99.64 | 0.028 |
- Balance: (#documents in the smallest class)/(#documents in the largest class)
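Both table statistics are straightforward to reproduce. A small illustration on a toy matrix and a hypothetical (unbalanced) labeling, not the benchmark data:

```python
import numpy as np

def sparsity(X):
    """Percentage of zero entries in a document-word matrix."""
    return 100.0 * np.mean(X == 0)

def balance(labels):
    """Size of the smallest class divided by the size of the largest class."""
    counts = np.bincount(labels)
    counts = counts[counts > 0]          # ignore unused label values
    return counts.min() / counts.max()

# Toy example: 6 documents, 4 terms, and a 2-class labeling of sizes 4 and 2.
X = np.array([[1, 0, 0, 2],
              [0, 0, 3, 0],
              [0, 1, 0, 0],
              [4, 0, 0, 0],
              [0, 0, 0, 5],
              [0, 2, 0, 0]])
y = np.array([0, 0, 0, 0, 1, 1])

print(round(sparsity(X), 2))  # 70.83  (17 of 24 entries are zero)
print(round(balance(y), 2))   # 0.5    (smallest class 2 / largest class 4)
```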
```python
from ELBMcoclust.Models.coclust_ELBMcem import CoclustELBMcem
from ELBMcoclust.Models.coclust_SELBMcem import CoclustSELBMcem
from NMTFcoclust.Evaluation.EV import Process_EV
from sklearn.metrics import confusion_matrix

# Fit the exponential family LBM and its sparse version on the CSTR matrix
ELBM = CoclustELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
ELBM.fit(X_CSTR)

SELBM = CoclustSELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
SELBM.fit(X_CSTR)

# Evaluation measures (accuracy, NMI, ARI, ...)
process_ev = Process_EV(true_labels, X_CSTR, ELBM)

# Cluster labels are defined only up to a permutation of the classes
confusion_matrix(true_labels, ELBM.row_labels_)
# [[101,   0,   0,   0],
#  [  4,  52,  15,   0],
#  [  0,   0, 178,   0],
#  [  0,   0,  34,  91]]
```
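Because cluster indices are arbitrary, clustering accuracy is usually computed after matching predicted clusters to true classes. A minimal sketch using brute-force permutation matching on the confusion matrix above (fine for a handful of classes; for many classes the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`, is the usual choice):

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(cm):
    """Best accuracy over all one-to-one mappings of clusters to classes.

    cm[i, j] = number of samples of true class i assigned to cluster j.
    """
    cm = np.asarray(cm)
    best = max(sum(cm[i, p[i]] for i in range(cm.shape[0]))
               for p in permutations(range(cm.shape[1])))
    return best / cm.sum()

# Confusion matrix reported above for CSTR (475 documents, 4 classes)
cm = [[101, 0, 0, 0],
      [4, 52, 15, 0],
      [0, 0, 178, 0],
      [0, 0, 34, 91]]
print(clustering_accuracy(cm))  # 422/475, i.e. about 0.888
```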
- Confusion Matrices
- Vertical Bar chart
- Horizontal Bar chart
- Box plots
- Scatter plots
- Swarm plots
- Reorganized 3×3 word cloud of PoissonSELBM for Classic3
- Reorganized 3×3 bar chart of PoissonSELBM for Classic3
In this paper, we provide a summary of the main contributions:

- **Exponential family Latent Block Model (ELBM) and its sparse version (SELBM):** We propose these models, which unify many leading co-clustering algorithms suited to various data types.
- **Classification expectation-maximization (CEM) approach:** Our proposed algorithms use this approach within a general framework based on matrix forms.
- **Focus on document-word matrices:** While the proposed matrix formalism is flexible enough to cover different models and distributions, this work concentrates on document-word matrices. We evaluate ELBM and SELBM on six real document-word matrices and three synthetic datasets (Bernoulli, Poisson, and Gaussian).
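To make the CEM idea concrete, here is a minimal NumPy sketch of one possible classification EM loop for a plain Poisson latent block model with hard row/column assignments and matrix-form updates. It is an illustration only, not the repository's implementation; the two-cluster farthest-point seeding is a simplification for the demo:

```python
import numpy as np

def two_cluster_seed(M):
    """Illustrative farthest-point 2-partition of the rows of M."""
    d0 = np.abs(M - M[0]).sum(axis=1)      # L1 distance to the first row
    s = M[np.argmax(d0)]                   # farthest row becomes the second seed
    return (np.abs(M - s).sum(axis=1) < d0).astype(int)

def poisson_lbm_cem(X, z, w, g, m, n_iter=20, eps=1e-10):
    """Hard (CEM) co-clustering under x_ij ~ Poisson(alpha_kl).

    z, w : initial hard row / column labels; returns refined labels
    and the (g, m) matrix of block means alpha.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(n_iter):
        Z, W = np.eye(g)[z], np.eye(m)[w]  # one-hot membership matrices
        # M-step: alpha_kl = block sum / block size
        alpha = (Z.T @ X @ W) / (np.outer(Z.sum(0), W.sum(0)) + eps) + eps
        # C-step, rows: score_ik = sum_l [(XW)_il log alpha_kl - n_l alpha_kl]
        z = np.argmax((X @ W) @ np.log(alpha).T - alpha @ W.sum(0), axis=1)
        Z = np.eye(g)[z]
        # C-step, columns (symmetric role)
        w = np.argmax((X.T @ Z) @ np.log(alpha) - alpha.T @ Z.sum(0), axis=1)
    return z, w, alpha

# Synthetic 2 x 2 block structure: Poisson mean 10 on diagonal blocks, 1 elsewhere.
rng = np.random.default_rng(42)
means = np.kron(np.array([[10.0, 1.0], [1.0, 10.0]]), np.ones((20, 10)))
X = rng.poisson(means)                     # 40 documents x 20 words

z0, w0 = two_cluster_seed(X), two_cluster_seed(X.T)
z, w, alpha = poisson_lbm_cem(X, z0, w0, g=2, m=2)
```

With such well-separated blocks, the recovered `alpha` is close to the generating means (up to a permutation of the cluster labels), and the hard partitions match the planted row and column groups.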
- All datasets and the algorithms' code are available on GitHub in the ELBMcoclust repository.
- More details about the Classic3 real-text dataset are available here.
- For additional visualizations, see here.
- The original paper is available in Advances in Data Analysis and Classification (ADAC).
The code for the algorithms, all datasets, additional visualizations, and materials are available in the ELBMcoclust repository. Our experiments were performed on a PC (Intel(R) Core(TM) i7-10510U, 2.30 GHz), and all figures were produced in Python with the Seaborn and Matplotlib libraries.
Please cite the following paper in your publications if you use ELBMcoclust in your research:

@article{ELBMcoclust,
  title   = {A Sparse Exponential Family Latent Block Model for Co-clustering},
  author  = {Saeid Hoseinipour and Mina Aminghafari and Adel Mohammadpour and Mohamed Nadif},
  journal = {Advances in Data Analysis and Classification},
  pages   = {1--37},
  doi     = {10.1007/s11634-024-00608-3},
  year    = {2024}
}