Small sample sizes in clinical research make it difficult to achieve statistical precision and reliability. Generative data augmentation is increasingly recommended as a remedy, but evidence of its inferential value is still lacking. We propose an inference-based framework that combines normalizing flow-based generative modeling with a structured statistical evaluation of estimator quality. We applied it to three real-world datasets covering Stroke, Alzheimer, and Dementia populations, each divided by an 80/20 split in which the hold-out data supplied quasi-population reference estimates. Synthetic data were generated at augmentation ratios r = 0, 1, 2, and 5, and estimators were compared in terms of bias, variance, mean squared error (MSE), and coverage probability, using Monte Carlo replication and a nested bootstrap to account for sampling variation and model uncertainty. The effectiveness of augmentation was strongly dataset-dependent and non-monotonic. In the Stroke dataset, moderate augmentation reduced variance by 32–41% with an approximately 5% reduction in bias, yielding an 18–27% lower MSE while preserving near-nominal coverage, a genuine inferential benefit. In the Alzheimer dataset, variance decreases were offset by bias increases of 6–10%, resulting in only modest improvements in the MSE. In the Dementia dataset, by contrast, augmentation amplified bias by about 15%, increased the MSE by 12–25%, and reduced coverage below 90% at higher augmentation ratios, indicating inferential instability. Overall, augmentation is governed by a dataset-dependent bias–variance trade-off, and its effectiveness depends on the fidelity of the generative model and on an appropriate augmentation intensity.
Papavramidou et al. (Sat,) studied this question.
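The evaluation loop described in the abstract (augment at ratio r, re-estimate, then summarize bias, variance, MSE, and coverage over Monte Carlo replications) can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: a Gaussian fitted to each sample stands in for the normalizing flow, the reference value is a known simulated mean rather than a hold-out estimate, the confidence interval is a naive normal interval that treats synthetic points as real observations, and the nested bootstrap is omitted. The function name `evaluate_augmentation` and its parameters are hypothetical.

```python
import numpy as np

def evaluate_augmentation(r, n=40, n_rep=1000, true_mean=0.0, true_sd=1.0, seed=0):
    """Monte Carlo bias, variance, MSE and 95% CI coverage of the sample mean
    when n real points are padded with r * n synthetic points."""
    rng = np.random.default_rng(seed)
    z = 1.959963984540054  # 97.5% quantile of the standard normal
    estimates = np.empty(n_rep)
    covered = np.empty(n_rep, dtype=bool)
    for i in range(n_rep):
        # Simulated "real" sample (assumption: Gaussian population, known mean).
        real = rng.normal(true_mean, true_sd, size=n)
        # Stand-in generator (assumption): a Gaussian fitted to the real sample,
        # used here in place of a trained normalizing flow.
        synth = rng.normal(real.mean(), real.std(ddof=1), size=int(r * n))
        combined = np.concatenate([real, synth])
        est = combined.mean()
        # Naive standard error: counts synthetic points as independent data.
        se = combined.std(ddof=1) / np.sqrt(combined.size)
        estimates[i] = est
        covered[i] = abs(est - true_mean) <= z * se
    bias = estimates.mean() - true_mean
    variance = estimates.var()  # ddof=0, so mse equals bias**2 + variance
    mse = np.mean((estimates - true_mean) ** 2)
    return {"bias": bias, "variance": variance, "mse": mse,
            "coverage": covered.mean()}

# Same augmentation ratios as in the abstract.
results = {r: evaluate_augmentation(r) for r in (0, 1, 2, 5)}
for r, metrics in results.items():
    print(r, {k: round(float(v), 4) for k, v in metrics.items()})
```

Because the MSE decomposes as squared bias plus variance, the returned dictionary makes the trade-off in the abstract directly inspectable: any variance reduction from augmentation only helps if the generator does not add a comparable amount of bias. In this toy setup the naive interval shrinks as r grows even though no new information is added, which is one simple mechanism by which coverage can fall below the nominal level.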