April 10, 2026 | Open Access

Evaluating Embedding Representations for Multiclass Code Smell Detection: A Comparative Study of CodeBERT and General-Purpose Embeddings

Key Points

This study compares the effectiveness of different embedding representations for multiclass code smell detection.
The authors extracted source code fragments from the original projects underlying the Crowdsmelling dataset.
They transformed the code fragments into vector representations using CodeBERT and a general-purpose embedding model.
They evaluated the representations with several machine learning classifiers under a stratified group-based validation protocol.
CodeBERT consistently outperformed the general-purpose embeddings in balanced accuracy, macro F1-score, and Matthews correlation coefficient.
CodeBERT achieved a macro F1-score of 0.8619, compared with 0.7622 for the general-purpose embeddings and 0.4555 for a classical TF-IDF baseline.
PCA and t-SNE analyses indicated that CodeBERT organizes code smell instances in a more structured latent space, which facilitates the separation of smell categories.
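
The metrics named in the key points can be made concrete with a small worked example. The sketch below computes macro F1-score, balanced accuracy, and the multiclass Matthews correlation coefficient in plain Python. The label abbreviations (`LM`, `GC`, `FE`) and the toy predictions are assumptions made for illustration only; they are not data or results from the study, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} using a one-vs-rest view of each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every smell counts equally."""
    scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred, labels).values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall over classes present in y_true."""
    recalls = []
    for tp, _fp, fn in per_class_counts(y_true, y_pred, labels).values():
        if tp + fn:
            recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


def multiclass_mcc(y_true, y_pred):
    """Multiclass generalization of the Matthews correlation coefficient."""
    n = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_k = Counter(y_true)
    pred_k = Counter(y_pred)
    classes = set(true_k) | set(pred_k)
    cov_tp = correct * n - sum(true_k[k] * pred_k[k] for k in classes)
    cov_tt = n * n - sum(true_k[k] ** 2 for k in classes)
    cov_pp = n * n - sum(pred_k[k] ** 2 for k in classes)
    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Toy labels (illustrative only): Long Method, God Class, Feature Envy.
LABELS = ["LM", "GC", "FE"]
y_true = ["LM", "LM", "LM", "LM", "GC", "GC", "GC", "FE", "FE", "FE"]
y_pred = ["LM", "LM", "LM", "GC", "GC", "GC", "FE", "FE", "FE", "LM"]

print("macro F1:", round(macro_f1(y_true, y_pred, LABELS), 4))
print("balanced accuracy:", round(balanced_accuracy(y_true, y_pred, LABELS), 4))
print("MCC:", round(multiclass_mcc(y_true, y_pred), 4))
```

Macro averaging weights each smell class equally regardless of how many instances it has, which is why it is a common choice when class sizes differ.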

Abstract

Code smells are indicators of potential design problems in software systems and are commonly used to guide refactoring activities. Recent advances in representation learning have enabled the use of embedding-based models for analyzing source code, offering an alternative to traditional approaches based on manually engineered metrics. However, the effectiveness of different embedding representations for multiclass code smell detection remains insufficiently explored. This study presents an empirical comparison of embedding models for the automatic detection of three widely studied code smells: Long Method, God Class, and Feature Envy. Using the Crowdsmelling dataset as an empirical basis, source code fragments were extracted from the original projects and transformed into vector representations using two embedding approaches: a general-purpose embedding model and the code-specialized CodeBERT model. The resulting representations were evaluated using several machine learning classifiers under a stratified group-based validation protocol. The results show that CodeBERT consistently outperforms the general-purpose embeddings across multiple evaluation metrics, including balanced accuracy, macro F1-score, and Matthews correlation coefficient. Dimensionality reduction analyses using PCA and t-SNE further indicate that CodeBERT organizes code smell instances in a more structured latent representation space, which facilitates the separation of smell categories. In particular, CodeBERT achieved a macro F1-score of 0.8619, outperforming general-purpose embeddings (0.7622) and substantially surpassing a classical TF-IDF baseline (0.4555). These findings highlight the value of this study as a controlled multiclass evaluation of embedding representations and demonstrate the practical value of domain-specific representations for improving automated code smell detection and class separability in real-world software engineering scenarios.
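
A stratified group-based protocol keeps all fragments from one group in the same fold while trying to keep class proportions similar across folds. The abstract does not specify how groups were defined or which implementation was used, so the following is only a minimal sketch under stated assumptions: groups are taken to be source projects, the project names and labels are invented toy data, and the greedy assignment heuristic (and the function name `stratified_group_folds`) is hypothetical rather than the authors' procedure.

```python
from collections import Counter, defaultdict


def stratified_group_folds(labels, groups, n_folds):
    """Greedily assign whole groups to folds, keeping per-class counts even.

    Returns a list with one fold index per sample. Every sample of a given
    group receives the same fold index, so no group is split across folds.
    """
    # Count how many samples of each class every group contributes.
    group_counts = defaultdict(Counter)
    for y, g in zip(labels, groups):
        group_counts[g][y] += 1

    classes = sorted(set(labels))
    fold_counts = [Counter() for _ in range(n_folds)]
    assignment = {}

    # Place the largest groups first; they are the hardest to balance.
    ordered = sorted(
        group_counts,
        key=lambda g: (-sum(group_counts[g].values()), str(g)),
    )
    for g in ordered:
        def spread_if_added(fold):
            # Sum over classes of (max - min) per-fold count after adding g.
            total = 0
            for c in classes:
                per_fold = [
                    fold_counts[i][c] + (group_counts[g][c] if i == fold else 0)
                    for i in range(n_folds)
                ]
                total += max(per_fold) - min(per_fold)
            return total

        best = min(range(n_folds), key=spread_if_added)
        assignment[g] = best
        fold_counts[best].update(group_counts[g])

    return [assignment[g] for g in groups]


# Toy data (illustrative only): (hypothetical project, smell label) pairs.
samples = [
    ("projA", "LM"), ("projA", "LM"), ("projA", "GC"),
    ("projB", "LM"), ("projB", "FE"), ("projB", "FE"),
    ("projC", "GC"), ("projC", "GC"), ("projC", "FE"),
    ("projD", "LM"), ("projD", "GC"),
    ("projE", "FE"), ("projE", "LM"),
    ("projF", "GC"), ("projF", "FE"),
]
groups = [project for project, _ in samples]
labels = [smell for _, smell in samples]

folds = stratified_group_folds(labels, groups, n_folds=3)
for project in sorted(set(groups)):
    print(project, "-> fold", folds[groups.index(project)])
```

Keeping each group inside a single fold prevents fragments from the same project from appearing in both training and test data, which would otherwise tend to inflate the measured scores.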

Authors

Marcela Mosquera

Rodolfo Bojorque

Journals

Applied Sciences

Institutions

National Polytechnic School

Politecnica Salesiana University
