Synthetic patient data offer a promising avenue for clinical research, but their usefulness depends on preserving statistical fidelity, biomedical plausibility, and patient privacy. To address this, a dual adversarial autoencoder is employed to generate longitudinal synthetic datasets from real-world clinical data of nearly one million individuals with diabetes from the Andalusian Population Health Database. A multi-faceted evaluation assesses data utility in a machine learning task, predicting chronic kidney disease onset, and evaluates the biomedical plausibility of generated disease trajectories. Models trained exclusively on synthetic data demonstrate predictive performance comparable to those trained on real data and show stability in feature importance rankings, indicating clinical coherence. However, bias and domain-specific sex-stratified analyses reveal inconsistencies not discernible through standard metrics, while data augmentation provides no performance benefit, as data saturation is reached given the large source population. These findings demonstrate that while synthetic data can replicate predictive performance, a robust validation framework combining machine learning utility with domain-specific biomedical evaluation is essential. This work supports the use of synthetic data for large-scale, privacy-preserving research to enable a collaborative healthcare data ecosystem.
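The abstract reports that feature importance rankings stay stable between models trained on real and on synthetic data, but it does not name the statistic used. One common way to quantify such stability is Spearman's rank correlation. The sketch below is an illustration under that assumption, not the authors' procedure; the function names, feature names, and importance values are all hypothetical.

```python
from typing import Dict


def rank_features(importances: Dict[str, float]) -> Dict[str, int]:
    """Map each feature to its rank (1 = most important).

    Ties are broken alphabetically by feature name, which is a simplification.
    """
    ordered = sorted(importances, key=lambda f: (-importances[f], f))
    return {feature: pos for pos, feature in enumerate(ordered, start=1)}


def spearman_rho(real: Dict[str, float], synthetic: Dict[str, float]) -> float:
    """Spearman rank correlation between two feature-importance mappings.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is exact
    only when there are no tied importances.
    """
    if set(real) != set(synthetic):
        raise ValueError("both models must report the same features")
    n = len(real)
    if n < 2:
        raise ValueError("need at least two features to compare rankings")
    real_ranks = rank_features(real)
    synth_ranks = rank_features(synthetic)
    squared_diffs = sum((real_ranks[f] - synth_ranks[f]) ** 2 for f in real)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical importances, made up for illustration only (not from the paper).
real_model = {"egfr": 0.40, "age": 0.30, "hba1c": 0.20, "bmi": 0.10}
synthetic_model = {"egfr": 0.38, "age": 0.22, "hba1c": 0.27, "bmi": 0.13}

rho = spearman_rho(real_model, synthetic_model)
```

A value near 1 means the two models order the features almost identically. Running the same comparison separately within each sex stratum is one way such a check could surface the kind of inconsistencies that the abstract says aggregate metrics miss.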
Ortuño et al. studied this question.
www.synapsesocial.com/papers/69ba42cf4e9516ffd37a3648 — DOI: https://doi.org/10.1002/advs.202516196
Francisco Ortuño
Víctor Manuel de la Oliva Roque
Javier‐Ignacio Ramirez‐Lopez
Advanced Science
Universidad de Granada
Universidad de Sevilla
Institute of Molecular Biotechnology