High-throughput sequencing has created an omics data boom, opening the door for biology to adopt a data-driven perspective. In human evolutionary genetics, however, uneven representation across populations, restricted data accessibility, and stringent privacy constraints have resulted in a paradox: abundant genomic data exist, but their use remains highly constrained. To address these challenges, we investigate the use of modern generative models, such as Generative Adversarial Networks (GANs) and diffusion models, trained on diverse real genomic sequences to produce Artificial Genomes (AGs): synthetic haplotypes that statistically reproduce real data without exposing sensitive information. Our study focuses on two central challenges: scalability to very high-dimensional genomic sequences and privacy preservation in the generated data. First, we introduce scalable generators for long haplotype sequences that combine dimensionality reduction with generative modeling in a low-dimensional latent space, as well as a frugal variant whose performance, with respect to population genetics metrics, is comparable to state-of-the-art approaches while using far fewer parameters. Second, we propose rigorous evaluation tools for synthetic genomic data: an information-theoretic measure of local haplotypic diversity to quantify biological realism, and PRIVET, a modality-agnostic, sample-level privacy metric leveraging extreme value statistics of nearest-neighbor distances.
Beyond a quantitative risk estimate, PRIVET offers interpretable, individual-level privacy scores and reliably detects memorization and some forms of privacy leakage across diverse data modalities, including genetic data. Finally, we demonstrate the practical value of these artificial genomes in local ancestry inference (LAI): models trained solely on AGs match the performance of those trained on real data, while augmenting limited real datasets with AGs substantially improves accuracy. The long-term goal is to establish Artificial Genomes as privacy-preserving alternatives to real biobanks, providing a means to broaden access, mitigate biases, and ensure confidentiality in population genomics.
Antoine Szatkownik (Mon,) studied this question.