Clustering is a fundamental task in unsupervised learning, aiming to group similar observations into meaningful clusters. Gaussian Mixture Models (GMMs) are among the most popular probabilistic approaches for clustering, providing a flexible parametric framework. However, they assume a fixed and predefined number of clusters, which limits their applicability in practice. Dirichlet Process Mixture Models (DPMMs), a Bayesian non-parametric extension of GMMs, overcome this limitation by allowing the number of clusters to be inferred automatically from the data.

When a dataset exhibits a dual structure between observations and features, co-clustering, which simultaneously partitions both rows and columns into homogeneous blocks, outperforms traditional clustering methods that partition only the rows. The Non-Parametric Latent Block Model (NPLBM) extends parametric block mixture models to the Bayesian non-parametric setting, allowing the automatic estimation of the number of row and column clusters.

While inference using Markov Chain Monte Carlo (MCMC) methods, such as collapsed Gibbs sampling, provides strong asymptotic guarantees and high precision, it becomes computationally prohibitive for large datasets. To address this limitation, we introduce DisCGS, a distributed inference algorithm that approximates the collapsed Gibbs sampler using sufficient statistics. Designed for data partitioned horizontally across multiple workers, DisCGS enables efficient and scalable clustering for both continuous and discrete data.

For continuous data, our implementation with Gaussian components runs 100 iterations on a dataset of 100,000 points in approximately 3 minutes, a speedup of more than 200× over a centralized sampler that requires 12 hours. For discrete data, we extend our framework to multinomial components, with an application to text clustering.
Furthermore, our approach generalizes easily to the exponential family of distributions.

We further extend this distributed framework to co-clustering by proposing DisNPLBM, a scalable inference algorithm for Bayesian Non-Parametric Latent Block Models. DisNPLBM employs a master/worker architecture that distributes rows of the data matrix among workers, enabling parallel inference without inter-worker communication. It models latent multivariate Gaussian blocks to simultaneously partition rows and columns, effectively capturing complex co-cluster structures. We validate its scalability and accuracy on synthetic datasets and demonstrate its practical effectiveness on gene expression data.

Beyond scalability, algorithmic reliability is critical for trustworthy clustering results. We investigate the stability of the Expectation-Maximization (EM) algorithm for GMMs through the lens of average sensitivity, quantifying the sensitivity of inferred parameters to small data perturbations such as the removal of a single point. We theoretically prove that EM is stable under fixed, equal mixture proportions and extend this analysis to the more general setting with unknown proportions. Our results assume spherical clusters with known diagonal covariance and do not rely on cluster separability or initialization conditions. Extensive experiments on synthetic and real datasets corroborate our theoretical findings, underscoring the robustness of EM in practical scenarios.
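
As an illustration of why sufficient statistics suit horizontally partitioned data, the following minimal sketch shows that per-worker summaries of a one-dimensional Gaussian cluster can be merged into exactly the summary a centralized pass would produce. It is not the DisCGS algorithm itself; the names `SuffStats`, `summarize`, and `split_rows` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
import random


@dataclass
class SuffStats:
    """Sufficient statistics of a 1-D Gaussian cluster (hypothetical helper)."""
    n: int = 0        # number of points
    s1: float = 0.0   # sum of the points
    s2: float = 0.0   # sum of the squared points

    def add(self, x: float) -> None:
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def merge(self, other: "SuffStats") -> "SuffStats":
        # Merging is plain addition, so workers only need to send three numbers.
        return SuffStats(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def mean(self) -> float:
        return self.s1 / self.n

    def variance(self) -> float:
        m = self.mean()
        return self.s2 / self.n - m * m


def summarize(points):
    """Compute the sufficient statistics of a list of points."""
    stats = SuffStats()
    for x in points:
        stats.add(x)
    return stats


def split_rows(points, n_workers):
    """Assumed horizontal partitioning: each worker receives a disjoint subset of rows."""
    return [points[i::n_workers] for i in range(n_workers)]


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]

# Each worker summarizes its own shard; the master merges the summaries.
worker_stats = [summarize(shard) for shard in split_rows(data, 4)]
merged = SuffStats()
for ws in worker_stats:
    merged = merged.merge(ws)

# Reference: a single centralized pass over all the data.
central = summarize(data)
```

Because the merge is a simple sum, the amount of information exchanged in this sketch depends on the number of clusters rather than on the number of points. The same additive structure holds for other exponential-family distributions.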
Reda Khoufache (Fri,) studied this question.