In the era of data-intensive science, feature selection has become a cornerstone of building robust machine learning models, particularly in biomedical fields where datasets are often high-dimensional. Identifying a subset of relevant, non-redundant features does more than reduce computational overhead: it is crucial for improving model accuracy, preventing overfitting, and making results more interpretable. Despite its importance, finding an optimal feature subset remains a significant challenge. This study introduces and validates a novel two-stage hybrid framework, Boruta-XGBoost (Boruta-Extreme Gradient Boosting), that couples feature selection with prediction. We demonstrate the framework on the task of estimating body fat percentage, a problem characterized by correlated anatomical predictors, and benchmark its performance against a comprehensive suite of established machine learning algorithms.

The framework proceeds in two stages. First, the Boruta algorithm, a wrapper method built on Random Forest, identifies all statistically relevant features, so that potentially important variables, including those with subtle interaction effects, are not prematurely discarded. Second, an Extreme Gradient Boosting (XGBoost) model is trained exclusively on this reduced feature subset. As a benchmark, we evaluated 16 other models (six linear and 10 non-linear) on the complete feature set. All models were assessed with a 70/30 train-test split and validated using 10-fold cross-validation, with the coefficient of determination (R²) and root mean square error (RMSE) as the primary performance metrics.

When trained on the full feature set, the best-performing conventional model, non-linear support vector regression, yielded an R² of 0.77. In contrast, the Boruta-XGBoost framework achieved an R² of 0.88 and a correspondingly lower RMSE. Notably, it did so using just six features (46% of the original 13), which were selected automatically by the Boruta stage.

In conclusion, the results show that the proposed Boruta-XGBoost hybrid framework offers a substantial improvement over conventional modeling approaches. By decoupling feature selection from prediction, the method yields a more parsimonious and more accurate model. This two-stage strategy underscores the value of combining rigorous statistical feature selection with state-of-the-art predictive algorithms, and it offers a generalizable approach to complex prediction problems.
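The selection idea behind the first stage, together with the two reported metrics, can be sketched in a few lines of standard-library Python. This is only an illustration under loud assumptions, not the authors' implementation: absolute Pearson correlation stands in for Random Forest importance, a fixed hit-rate threshold stands in for Boruta's statistical test, and the XGBoost stage is not shown. The names `pearson_abs`, `shadow_select`, `n_rounds`, and `min_hit_rate` are hypothetical, and the data are synthetic.

```python
import random
from statistics import mean


def pearson_abs(x, y):
    """Absolute Pearson correlation (ASSUMPTION: stand-in for RF importance)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (vx * vy) ** 0.5


def shadow_select(columns, target, n_rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled 'shadow' copy in most rounds.

    ASSUMPTION: the fixed hit-rate cutoff simplifies Boruta's statistical test.
    """
    rng = random.Random(seed)
    real_scores = {name: pearson_abs(v, target) for name, v in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Shuffling a column breaks any link to the target, so the best
        # shadow score estimates how high pure noise can score by chance.
        best_shadow = 0.0
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, pearson_abs(shadow, target))
        for name, score in real_scores.items():
            if score > best_shadow:
                hits[name] += 1
    return [n for n in columns if hits[n] / n_rounds >= min_hit_rate]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


# Synthetic demo: one informative column and two pure-noise columns.
rng = random.Random(42)
signal = [rng.gauss(0, 1) for _ in range(200)]
demo_columns = {
    "signal": signal,
    "noise_a": [rng.gauss(0, 1) for _ in range(200)],
    "noise_b": [rng.gauss(0, 1) for _ in range(200)],
}
demo_target = [2.0 * s + rng.gauss(0, 0.5) for s in signal]
print(shadow_select(demo_columns, demo_target))
```

In a fuller version, the columns returned by the selection step would be the only inputs passed to the second-stage regressor, and `r2_score` and `rmse` would be computed on the held-out 30% of the data.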
Tohid Yousefi
Özlem Varlıklar
SHILAP Revista de lepidopterología
PeerJ Computer Science
Yousefi et al. studied this question.
www.synapsesocial.com/papers/69a75c2fc6e9836116a24c77 — DOI: https://doi.org/10.7717/peerj-cs.3463