Malicious Uniform Resource Locators (URLs) remain a serious cybersecurity threat: they are used to deliver malware, carry out phishing, and gain unauthorized access to data, all of which can cause significant financial and reputational losses. Traditional detection methods, such as blacklisting and rule-based detection, often fail to detect newly created, obfuscated, and short-lived URLs, so data-driven methods are needed. We aimed to create a consistent method for detecting malicious URLs by analyzing both the characteristics of the URL itself and the information collected from the WHOIS database for its domain. Using a dataset of 5,000 URLs labeled malicious or benign, we applied five feature selection methods to identify the most informative features and evaluated the selected features with five machine learning (ML) classifiers: random forest, logistic regression, support vector machine (SVM), naive Bayes, and k-nearest neighbor (KNN). The feature selection methods were random forest feature importance, chi-squared, mutual information (MI), L1 (lasso) regularization, and recursive feature elimination (RFE). For KNN classification, RFE was the best-performing feature selection method, achieving an F1 score of 0.988, an accuracy of 0.988, and an area under the curve (AUC) of 0.996. We also examined how including WHOIS attributes (i.e., domain age, registration date, and privacy) affected the classifiers' ability to classify URLs correctly. The experimental results show a significant improvement in accuracy, precision, recall, and F1 score when features extracted from WHOIS data are included. WHOIS domain registration details are therefore highly informative for distinguishing between legitimate and malicious websites. Furthermore, we evaluated all 25 combinations of the five feature selection methods and the five ML models.
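To make the two feature families concrete, the following is a minimal sketch of how URL-level characteristics and WHOIS-derived attributes could be combined into one feature record. It is illustrative only: the specific lexical features, the function names, and the `record` dictionary layout (`creation_date`, `privacy_protected`) are assumptions made for this example, not the feature set or data format used in the study, and a real WHOIS response would first need to be fetched and parsed.

```python
from datetime import date
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few string-level URL features (an illustrative, assumed set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


def whois_features(record: dict, today: date) -> dict:
    """Derive WHOIS-based features from an already-fetched record.

    `record` is a hypothetical dict with keys `creation_date` (a date or None)
    and `privacy_protected` (bool).
    """
    created = record.get("creation_date")
    return {
        # -1 marks a missing creation date
        "domain_age_days": (today - created).days if created else -1,
        "registration_year": created.year if created else -1,
        "privacy_protected": bool(record.get("privacy_protected", False)),
    }


def build_feature_vector(url: str, record: dict, today: date) -> dict:
    """Merge lexical and WHOIS features into a single flat record."""
    return {**lexical_features(url), **whois_features(record, today)}


if __name__ == "__main__":
    example = build_feature_vector(
        "http://login.example.com/a1?x=2",
        {"creation_date": date(2024, 1, 1), "privacy_protected": True},
        today=date(2024, 1, 31),
    )
    print(example)
```

Passing `today` explicitly keeps the domain-age computation reproducible, which matters when the same URLs are re-scored at a later date.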
The proposed methodology is a safe, efficient, and interpretable way of detecting malicious domains. Cybersecurity practitioners can design more effective prevention models by leveraging insights from our research (e.g., on model selection, feature importance, and the utility of WHOIS attributes), thereby setting the stage for future research in ML-based cybersecurity methodologies.
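As a rough illustration of how a five-by-five grid of feature selection methods and classifiers can be evaluated, the sketch below uses scikit-learn pipelines with cross-validated F1. It is not the authors' implementation: the data are synthetic (`make_classification` stands in for the URL dataset), the number of selected features `k`, all hyperparameters, and the helper names are assumptions, and an L1-penalized `LinearSVC` is swapped in as a stand-in for the L1 (lasso) selector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE, SelectFromModel, SelectKBest, chi2, mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC, LinearSVC


def make_selectors(k: int) -> dict:
    """Five selectors, each configured to keep k features (assumed setup)."""
    return {
        # threshold=-inf makes SelectFromModel keep exactly the top-k features
        "rf_importance": SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=0),
            max_features=k, threshold=-np.inf),
        "chi2": SelectKBest(chi2, k=k),
        "mutual_info": SelectKBest(mutual_info_classif, k=k),
        # stand-in for the L1 (lasso) selector: L1-penalized linear SVM
        "l1": SelectFromModel(
            LinearSVC(penalty="l1", dual=False, max_iter=5000),
            max_features=k, threshold=-np.inf),
        "rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
    }


def make_classifiers() -> dict:
    """The five classifier families named in the text, with default-ish settings."""
    return {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "naive_bayes": GaussianNB(),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }


def evaluate_grid(X, y, k: int = 8, cv: int = 5) -> dict:
    """Return mean cross-validated F1 for every selector/classifier pair."""
    scores = {}
    for sel_name, selector in make_selectors(k).items():
        for clf_name, clf in make_classifiers().items():
            pipe = Pipeline([
                ("scale", MinMaxScaler()),  # chi2 needs non-negative inputs
                ("select", selector),
                ("clf", clf),
            ])
            f1 = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
            scores[(sel_name, clf_name)] = float(f1.mean())
    return scores


if __name__ == "__main__":
    # Synthetic stand-in data; the real study used labeled URLs.
    X, y = make_classification(n_samples=300, n_features=20,
                               n_informative=6, random_state=0)
    results = evaluate_grid(X, y)
    for pair, f1 in sorted(results.items(), key=lambda kv: -kv[1])[:5]:
        print(pair, round(f1, 3))
```

Placing the selector inside the pipeline means feature selection is refit on each training fold, which avoids leaking information from the held-out fold into the selected feature set.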
Khedekar et al. (Thu,) studied this question.