Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de facto method for equipping LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
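To make the idea of an attention-based dependency measurement concrete, the following is a minimal sketch and not the LADM scoring function itself, which the abstract does not specify. It assumes a causal attention matrix is already available for a document and scores the document by the average attention mass that each token places on tokens at least `min_distance` positions earlier. The names `long_range_dependency_score`, `uniform_causal_attention`, and `min_distance` are hypothetical and introduced only for illustration.

```python
import numpy as np


def long_range_dependency_score(attn, min_distance):
    """Illustrative score (an assumption, not the paper's metric).

    attn: (T, T) causal attention matrix whose rows sum to 1.
    min_distance: how far back a key must be to count as "distant".
    Returns the mean attention mass that queries place on distant keys.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2 or attn.shape[0] != attn.shape[1]:
        raise ValueError("attn must be a square 2-D matrix")
    seq_len = attn.shape[0]
    if not 0 < min_distance < seq_len:
        raise ValueError("min_distance must satisfy 0 < min_distance < T")

    query_pos = np.arange(seq_len)[:, None]
    key_pos = np.arange(seq_len)[None, :]
    # True where the key lies at least min_distance tokens before the query.
    distant = (query_pos - key_pos) >= min_distance

    # Attention mass each query assigns to distant keys.
    distant_mass = (attn * distant).sum(axis=1)
    # Queries earlier than min_distance have no distant keys, so skip them.
    return float(distant_mass[min_distance:].mean())


def uniform_causal_attention(seq_len):
    """Toy attention: each token attends equally to itself and all earlier tokens."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask / mask.sum(axis=1, keepdims=True)


# Toy comparison: purely local attention versus attention spread over the context.
local_attn = np.eye(8)                    # every token attends only to itself
spread_attn = uniform_causal_attention(8)
scores = {
    "local": long_range_dependency_score(local_attn, min_distance=4),
    "spread": long_range_dependency_score(spread_attn, min_distance=4),
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup, the document whose attention reaches further back ranks first. A real selection pipeline would need to obtain attention weights from a model and aggregate them across heads and layers, and those choices are outside what the abstract describes.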
Jianghao Chen
Junhong Wu
Yangyifan Xu
Chen et al. studied this question.
www.synapsesocial.com/papers/68f5fcce8d54a28a75cf1a56 — DOI: https://doi.org/10.48550/arxiv.2503.02502