March 3, 2026 · Open Access

Hyperparameter Optimization for Big Data: Adapting Sampling Methods to Apache Spark MLlib

Key Points

Tuning MLlib algorithms beyond their default configurations can improve resource usage efficiency and, in some cases, run time.
Monte Carlo and Cross-Entropy sampling are compared empirically with grid search and random search, and the advantages and limitations of each are highlighted.
Tuned algorithms can outperform default settings, but the gain depends on factors such as machine settings, dataset design, and operating system preferences.
Future research may unlock more of the tuning potential of Apache Spark MLlib by addressing the identified performance bottlenecks and promising areas for optimization.

Abstract

MLlib is an Apache Spark library that provides many machine learning algorithms and data processing utilities. Although the default configuration of these algorithms yields satisfactory results for practitioners, further tuning is often needed to improve resource usage efficiency. Furthermore, tuned MLlib algorithms may run faster than those using default configurations. However, this improvement depends on several factors, including machine settings, dataset design, and operating system preferences. Previous studies have generally focused on developing sophisticated tuners for MLlib, evaluating algorithm-focused optimizers for their competitiveness. Although derivative-based and model-free optimizers have been modified for use with MLlib, sampling-based optimizers are generally overlooked. To fill this research gap, this study empirically compares sampling-based and model-free techniques for tuning MLlib. Firstly, Monte Carlo and Cross-Entropy sampling algorithms are adapted to optimize MLlib algorithms. Subsequently, model-free techniques, including grid and random search algorithms, are compared with these sampling-based algorithms. Through extensive experimentation, their advantages and limitations are highlighted. Finally, threats to validity and future directions for unlocking the tuning potential of Apache Spark are discussed by interpreting performance bottlenecks and promising areas for optimization.
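
To make the contrast between plain Monte Carlo sampling and Cross-Entropy sampling concrete, the following is a minimal sketch in Python and not the study's implementation. It assumes a hypothetical objective, `toy_cost`, that stands in for the measured cost of one MLlib training run (for example, validation error or run time) over two illustrative hyperparameters. The bounds, population size, and elite fraction are assumptions chosen for illustration, and no Spark call is made.

```python
import random
import statistics


def toy_cost(params):
    """Hypothetical stand-in for the cost of one MLlib training run.

    A real tuner would train and evaluate a model here; this quadratic
    bowl (minimum at reg=0.1, depth=6) only illustrates the search logic.
    """
    reg, depth = params
    return (reg - 0.1) ** 2 + (depth - 6.0) ** 2 / 25.0


# Assumed search ranges for the two illustrative hyperparameters.
BOUNDS = [(0.0, 1.0), (1.0, 12.0)]


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def monte_carlo_search(cost, bounds, n_samples, rng):
    """Draw independent uniform samples and keep the cheapest one."""
    best_params, best_cost = None, float("inf")
    for _ in range(n_samples):
        params = [rng.uniform(lo, hi) for lo, hi in bounds]
        c = cost(params)
        if c < best_cost:
            best_params, best_cost = params, c
    return best_params, best_cost


def cross_entropy_search(cost, bounds, n_iters, pop_size, elite_frac, rng):
    """Refit a per-dimension Gaussian to the elite samples each iteration."""
    means = [(lo + hi) / 2 for lo, hi in bounds]
    stds = [(hi - lo) / 2 for lo, hi in bounds]
    n_elite = max(2, int(pop_size * elite_frac))
    best_params, best_cost = None, float("inf")
    for _ in range(n_iters):
        # Sample a population from the current distribution, kept in bounds.
        population = [
            [clip(rng.gauss(m, s), lo, hi)
             for m, s, (lo, hi) in zip(means, stds, bounds)]
            for _ in range(pop_size)
        ]
        # Rank by cost and keep the lowest-cost fraction as the elite set.
        elite = sorted(population, key=cost)[:n_elite]
        elite_cost = cost(elite[0])
        if elite_cost < best_cost:
            best_params, best_cost = elite[0], elite_cost
        # Move the sampling distribution toward the elite set.
        means = [statistics.fmean(col) for col in zip(*elite)]
        stds = [max(statistics.pstdev(col), 1e-6) for col in zip(*elite)]
    return best_params, best_cost


rng = random.Random(0)
# Both searches get the same budget of 300 cost evaluations.
mc_params, mc_cost = monte_carlo_search(toy_cost, BOUNDS, 300, rng)
ce_params, ce_cost = cross_entropy_search(toy_cost, BOUNDS, 10, 30, 0.2, rng)
print("Monte Carlo:  ", mc_params, mc_cost)
print("Cross-Entropy:", ce_params, ce_cost)
```

In a real setting each call to the cost function would be a full Spark job, so the evaluation budget dominates the expense. The sketch gives both methods the same budget to keep the comparison fair. Results on this toy objective say nothing about how the methods rank on actual MLlib workloads.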

Authors

M. Maruf Ozturk

Journals

ADBA Computer Science

Institutions

Suleyman Demirel University
