
We unify several latent block models in a flexible exponential family latent block model (ELBM), which we extend to a sparse version (SELBM) that handles sparse data by revealing a diagonal co-cluster structure. This yields more homogeneous co-clusters and therefore useful, ready-to-use, and easy-to-interpret results.

License: MIT. Related repositories: https://github.com/Saeidhoseinipour/NMTFcoclust and https://github.com/Saeidhoseinipour/EM-typecoclust/tree/main

Paper: A sparse exponential family latent block model for co-clustering (Saeid Hoseinipour et al.)

ELBMcoclust and SELBMcoclust

Sparse and Non-Sparse Exponential Family Latent Block Model for Co-clustering

The statistical approach analyzes the data through an assumed probability distribution. The complete-data log-likelihood functions for the three model versions, LBM, exponential family LBM (ELBM), and sparse exponential family LBM (SELBM), are as follows; a small numerical sketch of the matrix-form terms appears after the list.

  • LBM
$$L^{\text{LBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma})= \sum\limits_{i,k}r_{ik} \log\pi_{k} +\sum\limits_{j,h} c_{jh}\log\rho_{h} + \sum\limits_{i,j,k,h} r_{ik}\, c_{jh}\, \log \varphi(x_{ij};\alpha_{kh}).$$
  • ELBM
$$\begin{align*} L^{\text{ELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto& \sum\limits_{i,k}r_{ik} \log\pi_{k} +\sum\limits_{j,h} c_{jh}\log\rho_{h} + \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{A}_{\boldsymbol{\alpha}} \right)\\ &- \text{Tr}\left( (\mathbf{R}^{\top} (\mathbf{E}_{mn}\odot \hat{\boldsymbol{\beta}}) \mathbf{C})^{\top} \mathbf{F}_{\boldsymbol{\alpha}} \right). \end{align*}$$
  • SELBM
$$\begin{align*} L^{\text{SELBM}}(\mathbf{r},\mathbf{c},\boldsymbol{\gamma}) \propto& \sum\limits_{k} r_{.k} \log\pi_{k} + \sum\limits_{h} c_{.h}\log\rho_{h} + \sum\limits_{k} \left[ \mathbf{R}^{\top}(\mathbf{S_{x}}\odot \hat{\boldsymbol{\beta}})\mathbf{C} \right]_{kk} \left( A(\alpha_{kk}) - A(\alpha) \right)\\ &- \sum\limits_{k} \left[\mathbf{R}^{\top} (\mathbf{E}_{mn} \odot \hat{\boldsymbol{\beta}} )\mathbf{C}\right]_{kk} \left( F(A(\alpha_{kk})) -F(A(\alpha)) \right). \end{align*}$$
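To make the matrix-form terms concrete, here is a minimal NumPy sketch of the ELBM data term for the Poisson case. It assumes, following the exponential-family notation above, that the sufficient statistic matrix is S_x = X, that β̂ = E_mn (all ones), and that A(α) = log α with F(A(α)) = exp(A(α)); the variable names are illustrative, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, g, s = 6, 5, 2, 2             # documents, words, row clusters, column clusters
X = rng.poisson(3.0, size=(m, n))   # toy document-word counts

# One-hot partition matrices R (m x g) and C (n x s)
R = np.eye(g)[rng.integers(0, g, m)]
C = np.eye(s)[rng.integers(0, s, n)]

# Block means: alpha_kh = (R^T X C)_kh / (r_.k c_.h)
alpha = (R.T @ X @ C) / (np.outer(R.sum(0), C.sum(0)) + 1e-12) + 1e-12

# Poisson exponential-family pieces: A(alpha) = log(alpha), F(A) = exp(A)
A_alpha = np.log(alpha)
F_alpha = np.exp(A_alpha)
E_mn = np.ones((m, n))

# The two trace terms of L^ELBM with S_x = X and beta-hat = E_mn
data_term = np.trace((R.T @ X @ C).T @ A_alpha)
penalty_term = np.trace((R.T @ E_mn @ C).T @ F_alpha)
print(data_term - penalty_term)  # data-dependent part, up to the mixing-proportion terms
```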

| Dataset | Topics | #Classes | (#Documents, #Words) | Sparsity (%) | Balance |
|---|---|---|---|---|---|
| Classic3 | Medical, Information retrieval, Aeronautical systems | 3 | (3891, 4303) | 98.95 | 0.71 |
| CSTR | Robotics/Vision, Systems, Natural Language Processing, Theory | 4 | (475, 1000) | 96.60 | 0.399 |
| WebACE | 20 different topics from the WebACE project | 20 | (2340, 1000) | 91.83 | 0.169 |
| Reviews | Food, Music, Movies, Radio, Restaurants | 5 | (4069, 18483) | 98.99 | 0.099 |
| Sports | Baseball, Basketball, Bicycling, Boxing, Football, Golfing, Hockey | 7 | (8580, 14870) | 99.14 | 0.036 |
| TDT2 | 30 different topics | 30 | (9394, 36771) | 99.64 | 0.028 |
  • Balance: (#documents in the smallest class)/(#documents in the largest class)
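The last two columns of the table follow directly from their definitions. A minimal sketch, assuming X is a dense (or densified) document-word count matrix and labels holds the true class of each document:

```python
import numpy as np

def sparsity(X):
    """Percentage of zero entries in the document-word matrix."""
    X = np.asarray(X)
    return 100.0 * (X == 0).sum() / X.size

def balance(labels):
    """(#documents in the smallest class) / (#documents in the largest class)."""
    counts = np.bincount(np.asarray(labels))
    counts = counts[counts > 0]          # ignore unused label values
    return counts.min() / counts.max()
```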
```python
from sklearn.metrics import confusion_matrix

from ELBMcoclust.Models.coclust_ELBMcem import CoclustELBMcem
from ELBMcoclust.Models.coclust_SELBMcem import CoclustSELBMcem
from NMTFcoclust.Evaluation.EV import Process_EV

# X_CSTR (document-word matrix) and true_labels are assumed to be loaded
# from the datasets shipped with the repository.
ELBM = CoclustELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
ELBM.fit(X_CSTR)

SELBM = CoclustSELBMcem(n_row_clusters=4, n_col_clusters=4, model="Poisson")
SELBM.fit(X_CSTR)

process_ev = Process_EV(true_labels, X_CSTR, ELBM)

# Cluster indices are arbitrary; see the alignment sketch below the output.
confusion_matrix(true_labels, ELBM.row_labels_)
```


       [[101,   0,   0,   0],
        [  4,  52,  15,   0],
        [  0,   0, 178,   0],
        [  0,   0,  34,  91]]
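Because the model's cluster indices are arbitrary, the raw confusion matrix need not be diagonal, and simply sorting the label vector does not align it with the true classes. A standard remedy is to permute the predicted clusters with the Hungarian algorithm (scipy.optimize.linear_sum_assignment) so that the matched diagonal is maximized; a sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(true_labels, ELBM.row_labels_)

# Hungarian matching: permute predicted clusters to maximize the diagonal.
true_ind, pred_ind = linear_sum_assignment(-cm)
mapping = dict(zip(pred_ind, true_ind))
aligned = np.array([mapping[k] for k in ELBM.row_labels_])

print(confusion_matrix(true_labels, aligned))
```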

Visualization

  • Confusion Matrices


  • Vertical Bar chart


  • Horizontal Bar chart


  • Box plots


  • Scatter plots


  • Swarm plots


  • Reorganized 3×3 word clouds of PoissonSELBM for Classic3

Figure: word clouds of the top 60 words per co-cluster in Classic3, obtained by PoissonSELBM.

  • Reorganized 3×3 bar charts of PoissonSELBM for Classic3

Figure: bar charts of the top 50 words per co-cluster in Classic3, obtained by PoissonSELBM.
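The reorganized word clouds and bar charts above can be rebuilt from the fitted partitions: for each diagonal block (k, k), rank the words of column cluster k by their total count over the documents of row cluster k. A minimal sketch, where `vocab` (hypothetical name) is the array of words indexing the columns of the document-word matrix:

```python
import numpy as np

def top_words_per_block(X, row_labels, col_labels, vocab, k, n_top=60):
    """Top words of diagonal co-cluster (k, k): the words of column cluster k,
    ranked by their total count over the documents of row cluster k."""
    X = np.asarray(X)
    rows = np.where(np.asarray(row_labels) == k)[0]
    cols = np.where(np.asarray(col_labels) == k)[0]
    scores = X[np.ix_(rows, cols)].sum(axis=0)     # per-word totals within the block
    order = np.argsort(scores)[::-1][:n_top]
    return [vocab[j] for j in cols[order]]
```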

Contributions

The main contributions of the paper are summarized below:

  • Exponential family Latent Block Model (ELBM) and its sparse version (SELBM): We propose these models, which unify many leading algorithms suited to various data types.

  • Classification Expectation Maximization Approach: Our proposed algorithms use this approach within a general matrix-form framework; a minimal sketch follows this list.

  • Focus on Document-Word Matrices: While the matrix formalism is flexible enough to cover different distributions, we focus on document-word matrices in this work and evaluate ELBM and SELBM on six real document-word matrices and three synthetic datasets.
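For illustration, here is a minimal CEM loop for a plain Poisson latent block model in the matrix notation used above (one-hot R and C, block means alpha). It is a sketch under those assumptions, not the repository's implementation:

```python
import numpy as np

def cem_poisson_lbm(X, g, s, n_iter=50, seed=0, eps=1e-12):
    """Minimal classification EM for a plain Poisson latent block model (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    R = np.eye(g)[rng.integers(0, g, m)]   # one-hot row assignments (m x g)
    C = np.eye(s)[rng.integers(0, s, n)]   # one-hot column assignments (n x s)
    for _ in range(n_iter):
        # M-step: block means and mixing proportions
        alpha = (R.T @ X @ C) / (np.outer(R.sum(0), C.sum(0)) + eps) + eps
        pi, rho = R.mean(0) + eps, C.mean(0) + eps
        # CE-step (rows): assign row i to argmax_k of
        #   sum_h (XC)_ih log(alpha_kh) - sum_h c_.h alpha_kh + log(pi_k)
        row_score = (X @ C) @ np.log(alpha).T - alpha @ C.sum(0) + np.log(pi)
        R = np.eye(g)[row_score.argmax(1)]
        # CE-step (columns), symmetric to the row step
        col_score = (X.T @ R) @ np.log(alpha) - alpha.T @ R.sum(0) + np.log(rho)
        C = np.eye(s)[col_score.argmax(1)]
    return R.argmax(1), C.argmax(1)
```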

Highlights

  • The exponential family Latent Block Model (ELBM) and its sparse version (SELBM) were proposed, unifying many models across various data types.
  • The proposed algorithms, based on the classification expectation maximization approach, share a general matrix-form framework.
  • ELBM and SELBM were compared on six real document-word matrices and three synthetic datasets (Bernoulli, Poisson, Gaussian).
  • All datasets and algorithm code are available on GitHub in the ELBMcoclust repository.

Supplementary materials

Data Availability

The code for all algorithms, all datasets, additional visualizations, and other materials are available in the ELBMcoclust repository. Our experiments were performed on a PC (Intel(R) Core(TM) i7-10510U, 2.30 GHz), and all figures were produced in Python with the Seaborn and Matplotlib libraries.

Cite

Please cite the following paper in your publication if you are using ELBMcoclust in your research:

 @article{ELBMcoclust,
    title   = {A Sparse Exponential Family Latent Block Model for Co-clustering},
    author  = {Hoseinipour, Saeid and Aminghafari, Mina and Mohammadpour, Adel and Nadif, Mohamed},
    journal = {Advances in Data Analysis and Classification},
    pages   = {1--37},
    doi     = {10.1007/s11634-024-00608-3},
    year    = {2024}
 }

References

[1] Ailem, M., et al., Sparse Poisson latent block model for document clustering, IEEE Transactions on Knowledge and Data Engineering (2017a).

[2] Ailem, M., et al., Model-based co-clustering for the effective handling of sparse data, Pattern Recognition (2017b).

[3] Govaert, G. and Nadif, M., Clustering with block mixture models, Pattern Recognition (2003).

[4] Govaert, G. and Nadif, M., Block clustering with Bernoulli mixture models: comparison of different approaches, Computational Statistics and Data Analysis (2008).

[5] Govaert, G. and Nadif, M., Latent block model for contingency table, Communications in Statistics - Theory and Methods (2010).

[6] Govaert, G. and Nadif, M., Co-clustering: Models, Algorithms and Applications, John Wiley and Sons (2013).

[7] Priam, R., et al., Topographic Bernoulli block mixture mapping for binary tables, Pattern Analysis and Applications (2014).

[8] DasGupta, A., The exponential family and statistical applications, in Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics (2011).

[9] Del Buono, N., et al., Non-negative matrix tri-factorization for co-clustering: an analysis of the block matrix, Information Sciences (2015).

[10] Laclau, C., et al., Diagonal latent block model for binary data, Statistics and Computing (2017).

[11] Riverain, P., Fossier, S., et al., Semi-supervised Latent Block Model with pairwise constraints, Machine Learning (2022).

[12] Hartigan, J. A., Direct clustering of a data matrix, Journal of the American Statistical Association (1972).

[13] Hoseinipour, S., et al., Orthogonal parametric non-negative matrix tri-factorization with $\alpha$-divergence for co-clustering, Expert Systems with Applications (2023).
