Abstract
Synthetic data generation is a critical area of research in domains where real data are scarce or cannot be used directly. Various techniques have been developed to produce high-quality, realistic synthetic datasets that retain the statistical properties of the original data. State-of-the-art approaches rely on neural networks to capture a latent representation of the data. While recent advances in deep learning motivate their application in this context, this paper proposes a naive baseline that generates synthetic data from pairwise probabilities. We name the technique DISCO (Discrete Intersection Synthesizer through Combinatorial Optimization), a novel synthetic generator for tabular data. DISCO models data by optimizing the intersection of pairwise probabilities for each generated row so that the output resembles the original dataset. The approach preserves marginal and pairwise distributions and, as a result, reproduces the original data with high fidelity despite its simplicity. Evaluation on various synthetic and real-world datasets, as well as on regression and classification tasks, demonstrates DISCO’s ability to generate high-quality data that rivals state-of-the-art models in both statistical accuracy and machine learning efficacy.
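To make the idea of generating rows from pairwise probabilities more concrete, the following is a minimal sketch and not the authors' algorithm: the abstract does not specify the optimization procedure, so this example substitutes a simple greedy sequential sampler. It assumes a small discrete table stored as a list of tuples. The function names (`pairwise_counts`, `sample_row`, `synthesize`) and the `smoothing` parameter are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def pairwise_counts(rows):
    """Count co-occurrences of values for every column pair (i, j) with i < j."""
    n_cols = len(rows[0])
    return {
        (i, j): Counter((r[i], r[j]) for r in rows)
        for i, j in combinations(range(n_cols), 2)
    }


def sample_row(rows, counts, rng, smoothing=1e-6):
    """Build one synthetic row column by column (illustrative assumption).

    Each candidate value is weighted by the product of its empirical pairwise
    co-occurrence frequencies with the values already chosen for this row.
    """
    n = len(rows)
    n_cols = len(rows[0])
    row = []
    for j in range(n_cols):
        candidates = sorted({r[j] for r in rows})
        weights = []
        for v in candidates:
            if j == 0:
                # First column: fall back to the marginal frequency.
                w = sum(1 for r in rows if r[0] == v) / n
            else:
                w = 1.0
                for i in range(j):
                    # Smoothing keeps every weight strictly positive.
                    w *= counts[(i, j)][(row[i], v)] / n + smoothing
            weights.append(w)
        row.append(rng.choices(candidates, weights=weights, k=1)[0])
    return tuple(row)


def synthesize(rows, n_samples, seed=0):
    """Generate n_samples synthetic rows from a discrete table."""
    rng = random.Random(seed)
    counts = pairwise_counts(rows)
    return [sample_row(rows, counts, rng) for _ in range(n_samples)]


# Toy table in which column 0 fully determines column 1.
real = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
synthetic = synthesize(real, 200)
```

In this toy table the pairs `("a", "y")` and `("b", "x")` never occur, so they receive only the smoothing weight and should appear very rarely in `synthetic`. A combinatorial optimizer, as the name DISCO suggests, would presumably search for the best value combination per row rather than sampling greedily.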
Josep Maria Salvia Hornos
Cèsar Fernández Camón
Carles Mateu Piñol
International Journal of Data Science and Analytics
www.synapsesocial.com/papers/69f6e5f38071d4f1bdfc69ca — DOI: https://doi.org/10.1007/s41060-026-01063-3