What question did this study set out to answer?

The aim is to scale single-linkage hierarchical clustering for large datasets using advanced parallel algorithms.

March 25, 2026Open Access

Advanced Parallel Algorithms for Single‐Linkage Hierarchical Clustering

Key Points

The aim is to scale single-linkage hierarchical clustering for large datasets using advanced parallel algorithms.
Developed novel distributed approaches using WSPD and BCCP computations.
Implemented algorithms on the Apache Spark framework.
Explored algorithmic variations leveraging data partitioning, iterative Kruskal merging, and kd-tree optimizations.
Conducted extensive experiments in local and clustered environments.
Achieved up to 2× speed-ups with MemoGFK compared to baseline methods.
Demonstrated significant improvements in performance from partitioning and pruning strategies.
Showed superior scalability for increasing input sizes.

Abstract

This thesis addresses the challenge of scaling single-linkage hierarchical clustering (SLHC) to very large datasets by developing and evaluating advanced parallel algorithms on distributed systems. Traditional SLHC, while useful for uncovering nested structures in data, suffers from prohibitive time and memory complexity, making it impractical for big data applications. To overcome these limitations, we introduce and analyze novel distributed approaches based on Well-Separated Pair Decomposition (WSPD) and Bichromatic Closest Pair (BCCP) computations, implemented on the Apache Spark framework inspired by the work of Wang et al., 2021. We propose multiple algorithmic variations that exploit data partitioning, iterative Kruskal merging, and kd-tree optimizations to reduce computation and communication costs. Through extensive experiments on both local and clustered environments, we evaluate scalability, speed-up, and memory efficiency across datasets of up to several million points. Results demonstrate that partitioning and pruning strategies significantly improve performance, with MemoGFK achieving up to 2× speed-ups compared to baseline methods, and superior scalability over increasing input sizes. Overall, this work contributes practical distributed algorithms for hierarchical clustering, bridging the gap between theoretical advances in geometric graph structures and real-world big data analytics. It lays the foundation for further research on efficient clustering methods in large-scale, high-dimensional environments.

Read Full Paperexternally

Bookmark

View Full Paper

Cite This Study

Δημήτριος Μάρκου (Wed,) studied this question.

synapsesocial.com/papers/69c37af0b34aaaeb1a67cf1c https://doi.org/https://doi.org/10.26262/heal.auth.ir.370985

Bookmark

View Full Paper