March 3, 2026 · Open Access

New hybrid multi-objective feature selection: Boruta-XGBoost

Key Points

The Boruta-XGBoost framework achieved an R² of 0.88, surpassing all conventional models tested.
The best conventional model, non-linear support vector regression, reached an R² of 0.77 using all 13 features.
Feature selection used the Boruta algorithm, which retained six of the 13 original features (46%).
The approach combines statistical feature filtering with a gradient-boosting predictor to improve predictive performance.
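The R² figures above are coefficients of determination, and the study reports Root Mean Square Error (RMSE) alongside them. As a minimal sketch of how the two metrics are computed, the following uses their standard definitions; the function names and the sample values are hypothetical and are not taken from the study.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Hypothetical body fat percentages, for illustration only (not study data).
y_true = [12.0, 18.5, 25.0, 30.5]
y_pred = [13.0, 17.5, 26.0, 29.5]

print(round(r_squared(y_true, y_pred), 3))
print(round(rmse(y_true, y_pred), 3))
```

A higher R² means the model explains more of the variance in the target, while a lower RMSE means smaller typical errors in the target's own units.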

Abstract

In the era of data-intensive science, feature selection has become a cornerstone of building robust machine learning models, particularly in biomedical fields where datasets are often high-dimensional. Identifying a subset of relevant, non-redundant features does more than reduce computational overhead: it is crucial for improving model accuracy, preventing overfitting, and making results easier to interpret. Despite its importance, finding an optimal feature subset remains a significant challenge. This study introduces and validates a novel two-stage hybrid framework, Boruta-XGBoost (Boruta-Extreme Gradient Boosting), designed to combine feature selection and prediction. We demonstrate the framework on the task of estimating body fat percentage, a problem characterized by correlated anatomical predictors. Our primary goal is to benchmark its performance against a comprehensive suite of established machine learning algorithms.

The proposed framework has two stages. First, we use the Boruta algorithm, a wrapper method based on Random Forest, to identify all statistically relevant features. This step ensures that no potentially important variables, including those with subtle interaction effects, are prematurely discarded. Second, an Extreme Gradient Boosting (XGBoost) model is trained exclusively on this refined feature subset. To establish a benchmark, we evaluated 16 other models (six linear and 10 non-linear) using the complete feature set. All models were assessed with a 70/30 train-test split and validated using 10-fold cross-validation, with the coefficient of determination (R²) and Root Mean Square Error (RMSE) as the primary performance metrics.

When trained on the full set of features, the best-performing conventional model, non-linear support vector regression, yielded an R² of 0.77. In contrast, the Boruta-XGBoost framework achieved an R² of 0.88 and a correspondingly lower RMSE. A key finding is that this improvement was achieved with just six features (46% of the original 13), which were selected automatically by the Boruta stage.

Conclusion: Our results demonstrate that the proposed Boruta-XGBoost hybrid framework offers a substantial improvement over conventional modeling approaches. By decoupling feature selection from prediction, the method yields a more parsimonious and more accurate model. This two-stage strategy underscores the value of integrating rigorous statistical feature filtering with state-of-the-art predictive algorithms, and offers a generalizable approach to complex prediction problems.
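The Boruta stage rests on one idea: a feature counts as relevant only if it is consistently more important than shuffled "shadow" copies of the features. The sketch below illustrates that comparison under loud simplifications and is not the authors' implementation. It substitutes absolute Pearson correlation for Random Forest importance, replaces Boruta's statistical test with a fixed hit-rate threshold, and omits the second (XGBoost) stage. The names `abs_corr`, `shadow_select`, `min_hit_rate`, and the synthetic columns are all hypothetical.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in for Random Forest importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_select(columns, target, n_rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled (shadow) copy in most rounds.

    columns: dict mapping feature name -> list of values.
    Simplification: a fixed hit-rate threshold replaces Boruta's statistical test.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Shadow features: shuffling a column breaks any link to the target.
        best_shadow = 0.0
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_corr(shadow, target))
        # A real feature scores a hit when it outperforms every shadow.
        for name, values in columns.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return [name for name in columns if hits[name] / n_rounds >= min_hit_rate]


# Synthetic data for illustration only: one informative column, two noise columns.
rng = random.Random(42)
n = 200
abdomen = [rng.gauss(90, 10) for _ in range(n)]
noise_a = [rng.gauss(0, 1) for _ in range(n)]
noise_b = [rng.gauss(0, 1) for _ in range(n)]
body_fat = [0.5 * a + rng.gauss(0, 2) for a in abdomen]

columns = {"abdomen": abdomen, "noise_a": noise_a, "noise_b": noise_b}
selected = shadow_select(columns, body_fat)
print(selected)
```

The informative column should be retained, and the noise columns will usually be rejected. In a full pipeline, the second stage would then train the predictor on the selected columns only.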

Authors

Tohid Yousefi

Özlem Varlıklar

Journals

SHILAP Revista de lepidopterología

PeerJ Computer Science
