Abstract Large-scale, multi-dimensional mixed datasets typically follow a long-tail distribution, which leaves the subspaces defined by multi-dimensional attribute combinations sparsely populated. This sparsity severely hinders data-driven analysis. Existing data-augmentation methods focus primarily on single dimensions and ignore the intrinsic multi-dimensional correlations of real-world data, so the samples they generate lack logical consistency and realism. To address this challenge, we propose a systematic, on-demand, fine-grained sample augmentation framework. Its core idea is to locate and augment data-sparse regions from a multi-dimensional combinatorial perspective. For textual data, we design two flexible augmentation modes. The first, "augmentation by subset scope", adopts a strategy of model merging and incremental updates. The second, "augmentation by topic", uses a heuristic search algorithm based on the "Explore-Exploit" paradigm. For numerical data, we pre-construct a global distribution index to identify sparse intervals efficiently. In the sample generation phase, we combine Large Language Models, Retrieval-Augmented Generation, and Chain-of-Thought techniques so that the generated samples are high-fidelity in semantic logic and contextual style. Extensive experiments on real-world datasets show that our method outperforms baseline approaches in query response speed and in the efficiency of sparse region discovery, while maintaining topic coherence and accuracy.
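To make the notion of sparsity in attribute-combination subspaces concrete, the following is a minimal sketch and not the authors' method. It counts how often each combination of k attribute values occurs and flags those below a threshold. The function name `sparse_subspaces`, the parameters `k` and `min_count`, and the toy records are all hypothetical, and the sketch only reports combinations that occur at least once, so fully empty subspaces are not listed.

```python
from collections import Counter
from itertools import combinations


def sparse_subspaces(records, attributes, k=2, min_count=2):
    """Return observed k-attribute value combinations seen fewer than min_count times.

    Hypothetical helper for illustration only; the paper's framework is not
    specified at this level of detail in the abstract.
    """
    sparse = {}
    # Examine every subset of k attributes (the "combinatorial perspective").
    for attrs in combinations(attributes, k):
        # Count how many records fall into each value combination.
        counts = Counter(tuple(r[a] for a in attrs) for r in records)
        for values, n in counts.items():
            if n < min_count:
                # Key the result by (attribute, value) pairs for readability.
                sparse[tuple(zip(attrs, values))] = n
    return sparse


# Toy data (invented): three records in one subspace, one in another.
records = [
    {"region": "north", "product": "a"},
    {"region": "north", "product": "a"},
    {"region": "north", "product": "a"},
    {"region": "south", "product": "b"},
]

result = sparse_subspaces(records, ["region", "product"], k=2, min_count=2)
```

In this toy input only the (south, b) combination falls below the threshold, so it is the single subspace that an augmentation step would target.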
Kun Wu
Yaxi Hou
Shan Yang
Data Science and Engineering
Wu et al. studied this question.
www.synapsesocial.com/papers/69d896046c1944d70ce0738c — DOI: https://doi.org/10.1007/s41019-025-00334-6