What question did this study set out to answer?

This research aims to develop a new method for generating high-quality synthetic data that retains the statistical properties of the original datasets.

May 3, 2026 · Open Access

Synthetic data through combinatorial optimization of pairwise probabilities

Key Points

Proposes DISCO (Discrete Intersection Synthesizer through Combinatorial Optimization) for synthetic tabular data generation.
Optimizes the intersection of pairwise probabilities on each generated row to emulate the original dataset.
Evaluated on various synthetic and real-world datasets and on regression and classification tasks.
DISCO generates synthetic data that rivals state-of-the-art models in statistical accuracy.
Evaluation results show high fidelity in generated data compared to original datasets.
Effective in preserving marginal and pairwise distributions.
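
One way to check the preservation of marginal and pairwise distributions is to compare empirical frequencies between the real and synthetic tables. The sketch below is illustrative only and is not the evaluation protocol of the paper: the choice of total variation distance and the helper names `distribution`, `total_variation`, and `fidelity_report` are assumptions made for this example. Rows are assumed to be tuples of discrete values.

```python
from collections import Counter
from itertools import combinations


def distribution(rows, cols):
    """Empirical joint distribution over the given column indices."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = len(rows)
    return {key: n / total for key, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint support."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def fidelity_report(real, synthetic):
    """Distance per single column (marginal) and per column pair (pairwise)."""
    n_cols = len(real[0])
    marginal = {
        c: total_variation(distribution(real, (c,)), distribution(synthetic, (c,)))
        for c in range(n_cols)
    }
    pairwise = {
        pair: total_variation(distribution(real, pair), distribution(synthetic, pair))
        for pair in combinations(range(n_cols), 2)
    }
    return marginal, pairwise


# Hypothetical toy data: the marginals match, the pairwise structure does not.
real = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
synthetic = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
marginal, pairwise = fidelity_report(real, synthetic)
```

In this toy case both marginal distances are 0 while the pairwise distance for columns (0, 1) is 0.5, which shows why matching marginals alone does not guarantee fidelity.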

Abstract

The generation of synthetic data is a critical area of research in domains where real data are either not available in large quantities or cannot be used directly. Various techniques have been developed to produce high-quality, realistic synthetic datasets that retain the statistical properties of the original data. State-of-the-art approaches rely on neural networks to capture the latent space extracted from the data. While recent advances in deep learning motivate their application in this context, this paper proposes a naive baseline that generates synthetic data from pairwise probabilities. We name the technique DISCO (Discrete Intersection Synthesizer through Combinatorial Optimization), a novel synthetic data generator for tabular data. DISCO models data by optimizing the intersection of pairwise probabilities on each generated row so that the result resembles the original dataset. The approach preserves marginal and pairwise distributions and, as a result, reproduces the original data with high fidelity despite its simplicity. Evaluation on various synthetic and real-world datasets, as well as on regression and classification tasks, demonstrates DISCO’s ability to generate high-quality data that rivals state-of-the-art models in both statistical accuracy and machine learning efficacy.
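
To make the idea of building rows from pairwise probabilities concrete, the following is a minimal sketch and not the DISCO algorithm itself, whose optimization procedure the abstract does not detail. It assumes discrete columns, combines pairwise conditional frequencies by taking their product, and fills each row greedily from left to right. The function names `fit_pairwise`, `sample_row`, and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_pairwise(rows):
    """Count marginal and pairwise co-occurrence frequencies in the real data."""
    n_cols = len(rows[0])
    marginals = [Counter(row[c] for row in rows) for c in range(n_cols)]
    # (i, j, value_i) -> Counter of values seen in column j alongside value_i
    pair_counts = defaultdict(Counter)
    for row in rows:
        for i in range(n_cols):
            for j in range(n_cols):
                if i != j:
                    pair_counts[(i, j, row[i])][row[j]] += 1
    return marginals, pair_counts


def sample_row(marginals, pair_counts, rng):
    """Greedy stand-in for the per-row optimization: fill columns one by one."""
    values, weights = zip(*marginals[0].items())
    row = [rng.choices(values, weights=weights)[0]]
    for j in range(1, len(marginals)):
        candidates = list(marginals[j])
        scores = []
        for v in candidates:
            # Assumption: score a candidate by the product of the pairwise
            # conditional frequencies given every value already in the row.
            score = 1.0
            for i, u in enumerate(row):
                cond = pair_counts[(i, j, u)]
                score *= cond[v] / sum(cond.values())
            scores.append(score)
        if sum(scores) == 0:
            # No candidate is compatible with all chosen values:
            # fall back to the marginal frequencies of column j.
            scores = [marginals[j][v] for v in candidates]
        row.append(rng.choices(candidates, weights=scores)[0])
    return tuple(row)


def generate(rows, n, seed=0):
    """Generate n synthetic rows from the pairwise statistics of `rows`."""
    rng = random.Random(seed)
    marginals, pair_counts = fit_pairwise(rows)
    return [sample_row(marginals, pair_counts, rng) for _ in range(n)]
```

On a table whose columns are perfectly correlated, for example rows `("a", "x", "p")` and `("b", "y", "q")`, every generated row is one of the original combinations, because any other candidate value receives a score of zero. A combinatorial optimizer, as the name DISCO suggests, would presumably search over whole rows rather than commit to one column at a time.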

Authors

Josep Maria Salvia Hornos

Cèsar Fernández Camón

Carles Mateu Piñol

Journals

International Journal of Data Science and Analytics
