This thesis addresses the challenge of scaling single-linkage hierarchical clustering (SLHC) to very large datasets by developing and evaluating advanced parallel algorithms on distributed systems. Traditional SLHC, while useful for uncovering nested structures in data, suffers from prohibitive time and memory complexity, making it impractical for big data applications. To overcome these limitations, we introduce and analyze novel distributed approaches based on Well-Separated Pair Decomposition (WSPD) and Bichromatic Closest Pair (BCCP) computations, implemented on the Apache Spark framework inspired by the work of Wang et al., 2021. We propose multiple algorithmic variations that exploit data partitioning, iterative Kruskal merging, and kd-tree optimizations to reduce computation and communication costs. Through extensive experiments on both local and clustered environments, we evaluate scalability, speed-up, and memory efficiency across datasets of up to several million points. Results demonstrate that partitioning and pruning strategies significantly improve performance, with MemoGFK achieving up to 2× speed-ups compared to baseline methods, and superior scalability over increasing input sizes. Overall, this work contributes practical distributed algorithms for hierarchical clustering, bridging the gap between theoretical advances in geometric graph structures and real-world big data analytics. It lays the foundation for further research on efficient clustering methods in large-scale, high-dimensional environments.
Δημήτριος Μάρκου (Wed,) studied this question.