Sequential generation systems that produce outputs independently at each timestep, including vision-language-action (VLA) models in robotics and per-frame motion generators in animation, exhibit pronounced temporal discontinuity even when trained on smooth demonstrations. We develop a first-principles kinematic framework that explains this phenomenon through four propositions with exact, zero-parameter predictions. Our key theoretical result is that per-step independent generation drives the velocity autocorrelation toward a universal limit of -0.5 and maximizes jerk among all same-energy error processes, making it the spectral worst case. We derive closed-form scaling laws: temporal ensemble reduces jerk as 12σ²/N², and action chunking dilutes boundary discontinuity as 1/K. We validate all four predictions in two domains: (i) robot manipulation, where three VLA families (OpenVLA, Octo, π₀) confirm the theory with R² > 0.99, including a controlled experiment isolating inference mechanism from model weights; and (ii) human motion generation, where a pre-registered experiment on HumanML3D yields Cohen's d = 9.0. Our framework provides design principles explaining why action chunking, diffusion, and temporal ensembling all improve smoothness, grounded in first principles rather than empirical tuning.
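The -0.5 limit can be illustrated with a small simulation. The sketch below is not the paper's code; it assumes a simple additive model in which each output equals a smooth target plus independent zero-mean noise of standard deviation `sigma`, and it interprets "velocity autocorrelation" as the lag-1 autocorrelation of first differences. Under that assumption the velocity error is v_t = e_{t+1} - e_t, so Cov(v_t, v_{t+1}) = -σ² and Var(v_t) = 2σ², giving a ratio of -0.5 whenever the noise dominates the per-step motion of the target. The function name, trajectory, and parameter values are all hypothetical choices made for illustration.

```python
import numpy as np


def velocity_lag1_autocorr(positions):
    """Lag-1 autocorrelation of the first differences (velocities)."""
    v = np.diff(positions)
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))


# Assumed setup (illustrative only): a smooth sinusoidal target sampled
# densely, so the per-step motion is tiny compared with the noise.
rng = np.random.default_rng(0)
T = 200_000
t = np.linspace(0.0, 1.0, T)
smooth = np.sin(2.0 * np.pi * t)

# Per-step independent generation, modeled as i.i.d. Gaussian error
# added to the target at every timestep.
sigma = 0.05
noisy = smooth + rng.normal(0.0, sigma, size=T)

rho_smooth = velocity_lag1_autocorr(smooth)  # close to +1 for a smooth path
rho = velocity_lag1_autocorr(noisy)          # close to -0.5 under this model

print(f"smooth trajectory: {rho_smooth:.3f}")
print(f"independent per-step errors: {rho:.3f}")
```

In this toy model the value depends on neither `sigma` nor the noise distribution, only on the errors being independent across timesteps, which is consistent with the abstract's description of the limit as universal and zero-parameter.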
Woojin Jung
Woojin Jung (Mon,) studied this question.
www.synapsesocial.com/papers/69ba434a4e9516ffd37a465d — DOI: https://doi.org/10.5281/zenodo.19050964