What question did this study set out to answer?

June 4, 2026 | Open Access

A comparative simulation study of cluster ensemble algorithms integrated with multiple imputation for clustering with missing data

Key Points

The study evaluates how well various cluster ensemble algorithms, combined with multiple imputation, handle missing data in clustering.
The authors numerically compared several cluster ensemble algorithms for integrating k-means++ clustering results across multiply imputed datasets.
The combined approaches were also applied to two real datasets to assess performance under class balance and class imbalance.
The authors recommend running simulation experiments that reflect the dataset's characteristics and the assumed missing-data mechanism before applying these methods to real data.
The non-negative matrix factorization algorithm performed well in balanced class scenarios.
Greedy and agglomerative cluster algorithms were effective in scenarios with class imbalance.
Simulation results highlight the importance of choosing the right algorithm based on data characteristics.

Abstract

Because cluster analysis methods usually cannot be applied directly to data with missing values, various approaches have been investigated to handle the problem. Multiple imputation is one of the standard procedures for addressing missing data. In cluster analysis, cluster ensemble methods have been proposed as a replacement for Rubin's rules when combining results across multiply imputed datasets. However, it remains unclear which cluster ensemble algorithm performs best when integrated with multiple imputation. We therefore conducted numerical comparisons of several algorithms for integrating the results of k-means++ clustering on multiply imputed datasets, and also applied the combined approaches to two real datasets. Our results suggest that the non-negative matrix factorization algorithm may be suitable for scenarios with class balance, whereas the greedy and agglomerative cluster algorithms may be suitable for scenarios with class imbalance. Before application to actual datasets, we still recommend performing simulation experiments in scenarios that reflect the characteristics of the datasets and the assumed missing-data mechanism.
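The pipeline the abstract describes (impute several times, cluster each completed dataset with k-means++, then merge the partitions with a cluster ensemble step) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It rests on several assumptions: the imputation is a simple random hot-deck draw standing in for a proper multiple-imputation model, the ensemble step is average-linkage agglomerative clustering on a co-association matrix (only one possible reading of an "agglomerative" ensemble algorithm), and the function names `impute_once`, `kmeans_pp`, and `consensus_agglomerative` as well as the toy data are hypothetical.

```python
import numpy as np

def impute_once(X, rng):
    """Stand-in for one imputation draw: fill each missing entry with a
    randomly sampled observed value from the same column (hot-deck)."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        observed = X[~miss, j]
        X[miss, j] = rng.choice(observed, size=int(miss.sum()))
    return X

def kmeans_pp(X, k, rng, n_iter=50):
    """Lloyd's algorithm with k-means++ seeding; returns cluster labels."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def consensus_agglomerative(label_sets, k):
    """Merge partitions: average-linkage clustering on 1 - co-association."""
    L = np.asarray(label_sets)                      # shape (m, n)
    co = (L[:, :, None] == L[:, None, :]).mean(axis=0)
    dist = 1.0 - co
    clusters = [[i] for i in range(L.shape[1])]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)              # merge the closest pair
    labels = np.empty(L.shape[1], dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Toy data (hypothetical): two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 3)) + 10.0 * truth[:, None]
X_missing = X.copy()
rows = np.arange(0, 40, 4)                          # every 4th row loses one feature
X_missing[rows, rng.integers(3, size=rows.size)] = np.nan

m, k = 10, 2                                        # imputations, clusters
label_sets = [kmeans_pp(impute_once(X_missing, rng), k, rng) for _ in range(m)]
consensus = consensus_agglomerative(label_sets, k)
# Cluster labels are arbitrary, so score both possible label assignments.
agreement = max(np.mean(consensus == truth), np.mean(consensus != truth))
print(f"agreement with true groups: {agreement:.2f}")
```

The pairwise merge loop is cubic in the number of observations and is only meant for small illustrations. For a real analysis, a model-based imputation method and library implementations of the ensemble algorithms would be more appropriate, and, as the abstract recommends, the choice of ensemble algorithm should be checked by simulation on data resembling the target dataset.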
