What question did this study set out to answer?

The aim is to identify and characterize structural conditions that prevent failures in multi-task training, making convergence predictable.

April 10, 2026 · Open Access

Paving the Loss Landscape

Key Points

Three structural conditions are characterized: an analytic network, a squared-norm certificate, and orthogonal task decomposition.
Loss landscape properties are analyzed under these conditions.
Convergence behavior is tested on a proof-of-concept task (a Shakespeare char-LM) with K = 2, 3, 5, and 7 tasks.
Under these conditions, convergence is predictable: for K ≥ 2 tasks, any critical point with L ≠ 0 and d > 2K is a saddle, so no spurious minima arise.
The strongest results require realizability (zero loss achievable), which holds in PINNs, robotics control, and overparameterised regression, but not in language modeling or classification with intrinsic entropy.
With K = 7 tasks, a certificate controller that monitors gradient correlation achieves 17% lower validation cross-entropy than the uncontrolled baseline.

Abstract

When multi-task training fails (loss stalls, tasks conflict silently, convergence degrades without warning), the root cause is rarely the optimizer. It is the learning environment: the interaction between task decomposition, loss aggregation, and network architecture that determines whether the loss landscape is navigable. This paper characterises the structural conditions on the environment that make such failures predictable and preventable. We identify three conditions (an analytic network, the squared-norm certificate h = (1/K) ∑Φₖ², and orthogonal task decomposition) that collectively eliminate the residual Hessian (S ≡ 0) via complexification (the CANON condition). Under these conditions, convergence becomes predictable: Gauss–Newton curvature is exact, landscape smoothness is self-bounding, and in the realizable regime where zero loss is achievable, certification of global optimality reduces to polynomial time.

Scope. The strongest results require realizability (zero loss achievable), which holds in PINNs, robotics control, and overparameterised regression, but not in language modelling or classification with intrinsic entropy. In non-realizable settings, the framework still provides well-conditioned gradient flow and monotone h-decrease, but the certification guarantee does not apply.

AdamW's per-parameter variance already tracks Gauss–Newton curvature (Spearman ρ > 0.88), consistent with its broad empirical success. The framework complements this by characterising failure modes: when task-gradient orthogonality is violated (κ(G) ≫ 1), convergence degrades predictably.

Three structural consequences follow: (i) No spurious minima: for K ≥ 2 tasks, any critical point with L ≠ 0 and d > 2K is a saddle; (ii) Per-step safety: a closed-form adaptive learning rate bound with emergent warmup; (iii) Self-bounding convergence: smoothness shrinks as training progresses, giving linear convergence in ℝᵈ and quadratic convergence under exact Newton in ℂᵈ.
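
For reference, the residual Hessian S can be made concrete with the textbook decomposition for a sum of squares. The sketch below assumes each Φₖ is a real-valued, twice-differentiable scalar residual of the parameters θ ∈ ℝᵈ; the abstract does not state the type of Φₖ, so this is an illustrative assumption and not necessarily the paper's setting.

```latex
% Assumption: \Phi_k : \mathbb{R}^d \to \mathbb{R} is a twice-differentiable scalar residual.
\begin{align}
h(\theta) &= \frac{1}{K} \sum_{k=1}^{K} \Phi_k(\theta)^2, \\
\nabla h(\theta) &= \frac{2}{K} \sum_{k=1}^{K} \Phi_k(\theta)\, \nabla \Phi_k(\theta), \\
\nabla^2 h(\theta) &=
  \underbrace{\frac{2}{K} \sum_{k=1}^{K} \nabla \Phi_k(\theta)\, \nabla \Phi_k(\theta)^{\top}}_{\text{Gauss--Newton term}}
  + \underbrace{\frac{2}{K} \sum_{k=1}^{K} \Phi_k(\theta)\, \nabla^2 \Phi_k(\theta)}_{S(\theta)}.
\end{align}
```

In this real-valued picture, S vanishes at any zero-loss point because every Φₖ is zero there, and wherever S vanishes the Hessian coincides with the Gauss–Newton term. How the paper's complexification argument obtains S ≡ 0 away from zero loss is not reproduced here.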
On a Shakespeare char-LM proof-of-concept (d = 64, K ∈ {2, 3, 5}), exact task-gradient orthogonality (κ(G) = 1) yields the best convergence; deliberate violations degrade the certificate by up to 14% and conditioning by up to 28×. With K = 7 tasks (100K steps), a certificate controller that monitors gradient correlation achieves 17% lower validation CE than the uncontrolled baseline.
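
To make the two monitored quantities concrete, here is a minimal NumPy sketch of the certificate h and a condition number κ(G). It rests on stated assumptions: the Φₖ are real scalars, and G is taken to be the Gram matrix of unit-normalized task gradients, which is one plausible reading of κ(G) that the abstract does not confirm. The function names `certificate` and `gram_condition_number` are hypothetical and do not come from the paper.

```python
import numpy as np


def certificate(phi):
    """Squared-norm certificate h = (1/K) * sum_k phi_k**2.

    Assumes real scalar residuals phi_k (an illustrative assumption).
    """
    phi = np.asarray(phi, dtype=float)
    return float(np.mean(phi ** 2))


def gram_condition_number(task_grads):
    """Condition number of the Gram matrix of unit-normalized task gradients.

    task_grads has shape (K, d), one gradient per task. Treating G as this
    normalized Gram matrix is an assumption, not the paper's stated definition.
    Returns 1.0 for mutually orthogonal gradients and grows as they align.
    """
    J = np.asarray(task_grads, dtype=float)
    norms = np.linalg.norm(J, axis=1, keepdims=True)
    if np.any(norms == 0.0):
        raise ValueError("every task gradient must be nonzero")
    U = J / norms                    # rows scaled to unit length
    G = U @ U.T                      # (K, K) matrix of pairwise cosines
    eig = np.linalg.eigvalsh(G)      # eigenvalues in ascending order
    if eig[0] <= 0.0:
        return float("inf")          # linearly dependent gradients
    return float(eig[-1] / eig[0])


if __name__ == "__main__":
    orthogonal = np.eye(3, 8)                        # three orthogonal gradients in R^8
    correlated = orthogonal.copy()
    correlated[1] = orthogonal[0] + 0.1 * orthogonal[1]  # task 2 nearly parallel to task 1
    print("h:", certificate([1.0, -1.0, 2.0]))
    print("kappa, orthogonal:", gram_condition_number(orthogonal))
    print("kappa, correlated:", gram_condition_number(correlated))
```

Normalizing the rows separates directional conflict from differences in gradient scale; an unnormalized Gram matrix would mix the two, and either choice could match the paper's G.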

Authors

Been Seo
