Rotary Position Embedding (RoPE) enables Transformer models to extrapolate beyond their training sequence lengths, yet the resulting generation quality remains under-characterized. We present a systematic empirical study of RoPE-based extrapolation using multiple complementary metrics. Using a 10.4M-parameter GPT-style model trained on 256-token sequences (TinyShakespeare), we evaluate generation at 2× and 4× the training length (512 and 1024 tokens) with perplexity, n-gram repetition, and a heuristic coherence proxy. We observe a pronounced divergence among metrics: self-perplexity can even improve at 4×, while repetition rises sharply (a 6× increase in trigram repetition from baseline to 4×) and coherence remains low and unstable at 4× (as low as 0.83%). A supplemental validation-set perplexity (PPL) analysis shows monotonic degradation with length, confirming that self-PPL can be misleading under degenerative loops. This indicates that perplexity alone is unreliable for judging long-range generation quality: models may retain local statistical fit while losing global discourse structure. We additionally report robustness across seeds, positional-encoding baselines (ALiBi and sinusoidal), dataset diversity (WikiText-103), decoding strategy comparisons, scale sensitivity across 1.3B and 7B models, and qualitative samples. We contribute: (1) a multi-metric evaluation framework for extrapolation quality, (2) empirical evidence of metric divergence under extrapolation, and (3) practical guidelines suggesting that modest extrapolation may be usable, whereas extreme extrapolation requires careful quality safeguards.
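To illustrate why the two metrics can diverge, the following is a minimal sketch of one common way to compute them. It is not the authors' implementation: the abstract does not define its repetition metric, so the function names `ngram_repetition_rate` and `perplexity`, the repetition definition (one minus the ratio of distinct to total n-grams), and the toy inputs are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_repetition_rate(tokens, n=3):
    """Share of n-grams that duplicate another n-gram in the same sequence.

    Assumed definition: 1 - (distinct n-grams / total n-grams).
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # sequence too short to contain any n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(counts) / total


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical toy inputs: a looping sequence versus a varied one.
looping = "to be or not to be or not to be or not".split()
varied = "shall I compare thee to a summer's day".split()

print(ngram_repetition_rate(looping))  # 0.6: 4 distinct trigrams out of 10
print(ngram_repetition_rate(varied))   # 0.0: every trigram is unique

# A model that assigns high probability to its own loop scores a low
# self-perplexity even though the text is degenerate.
print(perplexity([math.log(0.9)] * 12))
```

Under these assumptions, a looping sample is both highly repetitive and highly predictable to the model that produced it, which is consistent with the abstract's observation that self-perplexity can improve while repetition worsens.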
Mianzhu Peng
Quanfa Li
IEEE Access
SHILAP Revista de lepidopterología
Quanzhou Normal University
Yang-En University
Peng et al. (Thu,) studied this question.
www.synapsesocial.com/papers/69a75bdbc6e9836116a23eed — DOI: https://doi.org/10.1109/access.2026.3658428