As the context windows of foundation language models grow to hundreds of thousands of tokens, high-quality long-context training data has become increasingly scarce. Existing synthetic augmentation methods rely mainly on localized question-answer tasks, similar in mechanism to needle-in-a-haystack evaluations. Such tasks do little to teach a model document-level narrative structure or long-range semantic dependencies. To address this limitation, we present Next Chunk Prediction (NCP), a data generation pipeline that removes the dependence on naturally occurring long-form data. NCP builds training samples of arbitrary length by concatenating multiple independent short documents; the conventional approach of shuffling segments within a single contiguous document is a special case of this more general framework. The multi-document composite is split into discrete chunks, the chunks are randomly permuted across the whole sample, and the model is trained to reconstruct the original chunk order of each constituent document. This objective requires the model to track thematic progression and to identify document boundaries. By shifting the learning objective from local retrieval to recovery of global structure, NCP offers a scalable, resource-efficient way to train long-context understanding without relying on scarce long-text data.
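The sample construction described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not specify the chunk size, the tokenization, or the format of the reconstruction target, so the following are all assumptions made for illustration: fixed-size chunks, documents given as token lists, a target expressed as per-document lists of shuffled-slot indices, and the hypothetical names `split_into_chunks` and `build_ncp_sample`.

```python
import random


def split_into_chunks(tokens, chunk_size):
    """Split one document (a list of tokens) into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def build_ncp_sample(documents, chunk_size, seed=None):
    """Build one shuffled multi-document sample and its reordering target.

    Assumed format (not taken from the source):
      - returns `shuffled`, the list of chunks in random global order, and
      - `target`, mapping each document index to the slots in `shuffled`
        that hold its chunks, listed in the document's original order.
    """
    rng = random.Random(seed)

    # Tag every chunk with the document it came from and its position there.
    tagged = []
    for doc_id, doc in enumerate(documents):
        for pos, piece in enumerate(split_into_chunks(doc, chunk_size)):
            tagged.append((doc_id, pos, piece))

    # Randomize the global arrangement across all documents.
    rng.shuffle(tagged)

    # For each document, collect (original position, shuffled slot) pairs.
    pairs_by_doc = {}
    for slot, (doc_id, pos, _) in enumerate(tagged):
        pairs_by_doc.setdefault(doc_id, []).append((pos, slot))

    # Sorting by original position yields the slot order the model must recover.
    target = {
        doc_id: [slot for _, slot in sorted(pairs)]
        for doc_id, pairs in pairs_by_doc.items()
    }
    shuffled = [piece for _, _, piece in tagged]
    return shuffled, target


if __name__ == "__main__":
    # Three toy "documents"; characters stand in for tokens.
    docs = [list("abcdefg"), list("hijkl"), list("mnopqrstuv")]
    shuffled, target = build_ncp_sample(docs, chunk_size=3, seed=0)
    for slot, piece in enumerate(shuffled):
        print(slot, "".join(piece))
    print(target)
```

Passing a single document in `documents` reduces this sketch to within-document segment shuffling, the special case the abstract mentions. Because sample length is controlled by how many short documents are concatenated, no naturally long text is needed.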
Leon Mitchell
Leon Mitchell (Sat,) studied this question.
www.synapsesocial.com/papers/69df2c88e4eeef8a2a6b1ada — DOI: https://doi.org/10.5281/zenodo.19550415