April 10, 2026 | Open Access

Importance of balanced datasets with feature selection and ensemble methods on heart disease classification using distinctive machine learning techniques: a comparative analysis

Key Points

This research aims to evaluate the effectiveness of balanced datasets and feature selection on heart disease classification using machine learning techniques.
Compared seven machine learning models for heart disease classification: LR, DT, RF, NB, SVM, ANN, and KNN.
Employed dataset balancing with bagging ensemble methods to enhance prediction accuracy.
Tested three feature selection techniques (ANOVA, Chi-Square, and Regression Analysis) in eight union and intersection combinations.
Random Forest achieved the highest accuracy of 92% with a balanced dataset combined with feature selection.
With dataset balancing alone, Decision Tree and Random Forest reached accuracies of 82% and 85%, respectively.
With dataset balancing combined with bagging, recall and AUC scores remained relatively low despite promising accuracy, indicating that many positive cases were missed.
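
The gap between accuracy and recall noted in the last key point can be illustrated with a small arithmetic sketch. The confusion-matrix counts below are hypothetical and are not taken from the study; they assume an evaluation set in which positive cases are a small minority, which is one common way for accuracy to look strong while recall stays low.

```python
# Hypothetical confusion-matrix counts (NOT from the study):
# 1000 evaluation samples, of which only 85 are true heart disease cases.
tp = 15   # positive cases correctly flagged
fn = 70   # positive cases the model missed
fp = 15   # healthy cases wrongly flagged
tn = 900  # healthy cases correctly cleared


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual positive cases that the model detects."""
    return tp / (tp + fn)


acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
print(f"accuracy = {acc:.3f}, recall = {rec:.3f}")
```

With these assumed counts, accuracy is 0.915 while recall is about 0.176: most predictions are correct because negatives dominate, yet most positive cases are missed.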

Abstract

Heart disease, a leading cause of death worldwide, accounts for 31% of global fatalities and requires effective early detection methods to combat its rising prevalence. Early detection and prediction of heart disease remain among the most pressing challenges in current healthcare. In recent years, machine learning (ML) technologies have offered opportunities to address these challenges by improving heart disease detection and prediction capabilities. This study offers a comparative evaluation of seven machine learning models for classifying heart disease: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and K-Nearest Neighbors (KNN). Using the ‘BRFSS 2020 Heart Disease Dataset’, this research examines the effects of dataset balancing, combined with various feature selection techniques and with a bagging-based ensemble method, on classification and prediction accuracy. Three feature selection methods (ANOVA, Chi-Square, and Regression Analysis) were tested through eight different combinations based on the union and intersection of these methods: (i) ANOVA ∪ Chi-Square, (ii) ANOVA ∪ Regression, (iii) Chi-Square ∪ Regression, (iv) ANOVA ∪ Chi-Square ∪ Regression, (v) ANOVA ∩ Chi-Square, (vi) ANOVA ∩ Regression, (vii) Chi-Square ∩ Regression, and (viii) ANOVA ∩ Chi-Square ∩ Regression. Experimental results demonstrate that with a balanced dataset, RF and DT achieved the highest accuracies of 85% and 82%, respectively. In addition, the results for the balanced dataset with feature selection indicate that ANOVA-based feature selection was associated with higher performance under the ANOVA ∪ Chi-Square and ANOVA ∪ Chi-Square ∪ Regression feature combinations, where RF reached the highest accuracy (92%), recall (93%), and AUC score (0.92).

Additionally, bagging-based ensemble techniques improved performance for certain high-variance models (DT, RF, and ANN) when applied to the balanced dataset, although the impact varied across models. Despite promising accuracy when dataset balancing was combined with the ensemble method, the recall and AUC scores were relatively low, indicating that many positive cases were missed. Consequently, dataset balancing combined with feature selection techniques showed comparatively improved performance across several evaluation metrics under the specific experimental setup. These findings provide comparative insights into preprocessing strategies and suitable machine learning models for heart disease classification, which may be helpful for future research.
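
The eight union and intersection combinations listed in the abstract can be sketched with plain Python sets. This is a minimal illustration, not the authors' implementation: the feature names and the features each method is shown as selecting are hypothetical placeholders, and `build_combinations` is a helper name introduced here for the example.

```python
from itertools import combinations

# Hypothetical per-method selections (illustrative only, NOT results from the study).
selected = {
    "ANOVA": {"BMI", "AgeCategory", "PhysicalHealth", "DiffWalking"},
    "Chi-Square": {"AgeCategory", "Smoking", "DiffWalking", "Diabetic"},
    "Regression": {"AgeCategory", "BMI", "Stroke", "DiffWalking"},
}


def build_combinations(selected):
    """Return the union and intersection of every pair and of all three methods.

    Keys are (operation, methods) tuples, e.g. ("union", ("ANOVA", "Chi-Square")).
    """
    combos = {}
    for size in (2, 3):
        for group in combinations(selected, size):
            sets = [selected[name] for name in group]
            combos[("union", group)] = set.union(*sets)
            combos[("intersection", group)] = set.intersection(*sets)
    return combos


combos = build_combinations(selected)
for (operation, group), features in combos.items():
    print(operation, group, sorted(features))
```

Three pairs plus one triple, each taken as a union and as an intersection, give the eight feature subsets. Each subset would then be used to restrict the training columns before fitting a classifier.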

Authors

Jinat Ara

Hanif Bhuiyan

Isfara Islam Roza

Journals

Scientific Reports

Institutions

University of Pannonia

Ahsanullah University of Science and Technology

Gold Coast City Council
