June 1, 2026 | Open Access

Machine learning-based malicious URL detection using feature selection techniques and WHOIS features

Key Points

The research aims to develop a robust method for detecting malicious URLs by analyzing both URL characteristics and WHOIS data.
Analyzed 5,000 URLs (malicious or benign) using multiple feature selection techniques.
Employed five ML classifiers: random forest, logistic regression, SVM, naive Bayes, and KNN.
Utilized WHOIS attributes such as domain age and registration date to enhance classification accuracy.
RFE produced the best KNN classification results, with an F1 score of 0.988 and an accuracy of 0.988.
WHOIS features significantly improved accuracy, precision, recall, and F1-measure for malicious URL detection.
The proposed methodology enhances the effectiveness of cybersecurity practices in detecting malicious domains.
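
As a rough illustration of the two feature families named in the key points above, the following sketch derives a few lexical features from a URL string and two WHOIS-based features from registration data. The specific lexical features, the function names, and the example URL and dates are assumptions made for illustration; this summary does not list the study's actual feature set. The WHOIS fields are assumed to have been retrieved and parsed already.

```python
from datetime import date
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Hypothetical lexical features; not the study's actual feature set."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),                     # dots in the hostname only
        "num_digits": sum(ch.isdigit() for ch in url),   # digits anywhere in the URL
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
    }


def whois_features(creation_date: date, privacy_protected: bool, today: date) -> dict:
    """Domain age in days plus a privacy flag, from already-parsed WHOIS fields."""
    return {
        "domain_age_days": (today - creation_date).days,
        "whois_privacy": int(privacy_protected),
    }


# Made-up example: a two-week-old domain with WHOIS privacy enabled.
features = url_features("http://login.example-bank.com.verify123.info/account")
features.update(whois_features(date(2026, 1, 10), True, today=date(2026, 1, 24)))
print(features)
```

Each URL then becomes one row of numeric features, which is the form that feature selection methods and classifiers such as those listed above typically expect.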

Abstract

In today's cybersecurity landscape, malicious Uniform Resource Locators (URLs) continue to pose a serious threat: they can be used to deliver malware, conduct phishing, and gain unauthorized access to data, all of which can result in significant financial and reputational losses. Traditional detection methods, such as blacklisting and rule-based detection, cannot detect newly created, obfuscated, and temporary URLs, so innovative, data-driven methods are needed. We aimed to create a consistent method for detecting malicious URLs by analyzing both the characteristics of the URL and the information collected from the WHOIS database for that URL. We applied five feature selection methods to a dataset of 5,000 URLs (malicious or benign) and tested how well the selected features performed with five machine learning (ML) classifiers: random forest, logistic regression, support vector machine (SVM), naive Bayes, and k-nearest neighbor (KNN). The feature selection methods were random forest feature importance, chi-squared, mutual information (MI), L1-lasso, and recursive feature elimination (RFE). RFE produced the best results for KNN classification, achieving an F1 score of 0.988, an accuracy of 0.988, and an area under the curve (AUC) of 0.996. We also examined how the inclusion of WHOIS attributes (i.e., domain age, registration date, and privacy) affected the classifiers' ability to classify URLs correctly. The experimental results show a significant improvement in accuracy, precision, recall, and F1-measure when features extracted from WHOIS data are included. WHOIS domain registration details are therefore highly significant for distinguishing between legitimate and malicious websites. Furthermore, we extensively explored 25 distinct combinations of feature selection methods and ML models.
The proposed methodology is a safe, efficient, and interpretable way of detecting malicious domains. Cybersecurity practitioners can design more effective prevention models by leveraging insights from our research (e.g., on model selection, feature importance, and the utility of WHOIS attributes), thereby setting the stage for future research in ML-based cybersecurity methodologies.
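
The evaluation described in the abstract can be sketched in outline: five feature selection methods crossed with five classifiers give the 25 combinations, and each combination is scored with accuracy, precision, recall, and F1. The snippet below only enumerates the grid and computes the metrics from a toy set of labels; the identifier names and the label vectors are made up for illustration, no model is trained, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
from itertools import product

# Names mirror the methods listed in the abstract; the identifiers are illustrative.
SELECTORS = ["rf_importance", "chi_squared", "mutual_information", "l1_lasso", "rfe"]
CLASSIFIERS = ["random_forest", "logistic_regression", "svm", "naive_bayes", "knn"]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, treating label 1 as 'malicious'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Every (feature selector, classifier) pairing in the experiment grid.
combinations = list(product(SELECTORS, CLASSIFIERS))
print(len(combinations))

# Toy labels, not data from the study: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a full pipeline, each pairing in `combinations` would select features on the training split, fit the classifier on those features, and report these metrics on held-out URLs.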
