What question did this study set out to answer?

The aim is to enhance data-driven analysis by addressing data sparsity in multi-dimensional datasets through targeted augmentation.

April 10, 2026 · Open Access

On-Demand Augmentation for Long-Tail Data Through Subset- and Topic-Driven Sparsity Identification Framework

Key Points

Proposed a systematic on-demand sample augmentation framework.
Developed two augmentation modes for textual data: by subset scope and by topic.
Created a global distribution index for efficient identification of sparse intervals in numerical data.
Utilized large language models and retrieval-augmented generation during sample generation.
Outperformed baseline approaches in query response speed.
Improved efficiency in discovering sparse regions.
Maintained high topic coherence and accuracy in generated samples.

Abstract

Large-scale, multi-dimensional mixed datasets are characterized by a pervasive "long-tail distribution." This phenomenon produces data sparsity in subspaces defined by multi-dimensional attribute combinations, and that sparsity severely hinders data-driven analysis and insight. Existing data-augmentation methods primarily focus on single dimensions and ignore the complex, intrinsic multi-dimensional correlations of the real world, so the samples they generate lack logic and realism. To address this challenge, we propose a systematic, on-demand, and fine-grained sample augmentation framework. Our core idea is to precisely locate and augment data-sparse regions from a multi-dimensional combinatorial perspective. For textual data, we design two flexible augmentation modes. The first, "augmentation by subset scope", adopts a strategy of model merging and incremental updates. The second, "augmentation by topic", uses a heuristic search algorithm based on the "Explore-Exploit" paradigm. For numerical data, we pre-construct a global distribution index to identify sparse intervals efficiently. In the sample generation phase, we combine Large Language Models, Retrieval-Augmented Generation, and Chain-of-Thought techniques so that the generated samples meet high-fidelity standards in semantic logic and contextual style. Extensive experiments on real-world datasets demonstrate that our method outperforms baseline approaches in query response speed and in the efficiency of sparse region discovery, while maintaining topic coherence and accuracy.
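The abstract does not describe how the global distribution index is built, so the following is only a minimal sketch of the general idea, assuming a fixed-width histogram stands in for the index. The function names `build_distribution_index` and `sparse_intervals`, the bin count, and the `min_count` threshold are all hypothetical and are not taken from the paper.

```python
def build_distribution_index(values, lo, hi, num_bins):
    """Precompute per-bin counts over [lo, hi) once.

    Assumption: a fixed-width histogram is used here as a stand-in for the
    paper's "global distribution index"; the actual structure is not given.
    """
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if lo <= v < hi:
            # Clamp guards against floating-point rounding at the top edge.
            counts[min(int((v - lo) / width), num_bins - 1)] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


def sparse_intervals(edges, counts, min_count):
    """Return merged (start, end) intervals whose bins each hold fewer
    than min_count values (hypothetical sparsity criterion)."""
    intervals = []
    for i, c in enumerate(counts):
        if c < min_count:
            start, end = edges[i], edges[i + 1]
            if intervals and intervals[-1][1] == start:
                # Adjacent sparse bins are merged into one interval.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals


# Example: values cluster at both ends, leaving the middle range thin.
values = [1, 2, 3, 4, 5, 11, 12, 35, 36, 37, 38, 39]
edges, counts = build_distribution_index(values, lo=0, hi=40, num_bins=4)
print(sparse_intervals(edges, counts, min_count=3))
```

Because the counts are computed once up front, each later sparsity query only scans the bins rather than the raw data, which is one plausible reading of why a pre-constructed index would help query response speed.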

Authors

Kun Wu

Yaxi Hou

Shan Yang

Journals

Data Science and Engineering
