March 3, 2026 · Open Access

Improving medical data quality via synthetic data generation: a review

Key Points

Synthetic data generation can improve the quality and diversity of medical data, which in turn affects the performance of AI algorithms trained on it.
Methods such as SMOTE and its variants are reported to improve classification performance by up to 12%.
The review analyses statistical, probabilistic, and machine learning approaches to generating tabular synthetic data, drawing on peer-reviewed studies.
Synthetic data quality depends on the generation process, the characteristics of the real-world data used in modelling, and the metrics used to evaluate it.

Abstract

Most artificial intelligence (AI) algorithms are trained on large-scale datasets to learn complex underlying patterns and make informed decisions. Because these algorithms are data-intensive, their performance depends heavily on the quality of the training data. Due to strict privacy regulations, large medical datasets are not readily available, which leads to small datasets and under-representation of some classes and demographic groups. These shortcomings, if not handled, are replicated by the AI algorithms, compromising their performance. One potential solution is to augment the data with synthetic samples that possess the same properties as the real-world data. This study therefore explores the synthetic data generation process and existing research, focusing mainly on statistical, probabilistic, and machine learning (ML) based methods for tabular data generation. For this purpose, we analysed high-quality peer-reviewed articles extracted from different databases. The findings show that synthetic data not only increases data volume but also improves data quality by increasing diversity and enhancing the representation of various demographic groups within the datasets. The Synthetic Minority Oversampling Technique (SMOTE) and its variants are commonly used techniques, with up to 12% improvement reported in classification performance. However, the quality of synthetic data depends on the underlying data generation process, the characteristics of the real-world data used for subsequent modelling, and the evaluation metrics used to assess data quality.
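
To illustrate the oversampling idea named in the abstract, the following is a minimal sketch of the core interpolation step behind SMOTE, written in plain Python. It is not the implementation used by any study in the review. The function name `smote_sketch`, its parameters, and the sample values are assumptions made for illustration, and the sketch handles continuous features only.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; assumes numeric (continuous) features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Illustrative two-feature minority class (made-up values, not from the review).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(minority_rows, n_new=6)
```

Because each synthetic row lies on a line segment between two real minority samples, every generated feature value stays within the range observed in the minority class. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn library, would normally be preferred over a hand-written version.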
