When multi-task training fails (loss stalls, tasks conflict silently, convergence degrades without warning), the root cause is rarely the optimizer. It is the learning environment: the interaction between task decomposition, loss aggregation, and network architecture that determines whether the loss landscape is navigable. This paper characterises the structural conditions on the environment that make such failures predictable and preventable. We identify three conditions (an analytic network, the squared-norm certificate h = (1/K) ∑ₖ Φₖ², and orthogonal task decomposition) that collectively eliminate the residual Hessian (S ≡ 0) via complexification (the CANON condition). Under these conditions, convergence becomes predictable: Gauss–Newton curvature is exact, landscape smoothness is self-bounding, and in the realizable regime, where zero loss is achievable, certification of global optimality reduces to polynomial time.

Scope. The strongest results require realizability, which holds in PINNs, robotics control, and overparameterised regression, but not in language modelling or classification with intrinsic entropy. In non-realizable settings, the framework still provides well-conditioned gradient flow and monotone h-decrease, but the certification guarantee does not apply. AdamW's per-parameter variance already tracks Gauss–Newton curvature (Spearman ρ > 0.88), consistent with its broad empirical success. The framework complements this by characterising failure modes: when task-gradient orthogonality is violated (κ(G) ≫ 1), convergence degrades predictably.

Three structural consequences follow: (i) No spurious minima: for K ≥ 2 tasks, any critical point with L ≠ 0 and d > 2K is a saddle; (ii) Per-step safety: a closed-form adaptive learning-rate bound with emergent warmup; (iii) Self-bounding convergence: smoothness shrinks as training progresses, giving linear convergence in ℝᵈ and quadratic convergence under exact Newton in ℂᵈ.
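The role of the residual Hessian S can be made concrete with the textbook second-order expansion of a mean of squares. What follows is a sketch in real variables only, assuming each Φₖ is a twice-differentiable scalar function of the parameters θ. The symbols θ and H_GN, and the identification of S with the second term, are assumptions made for illustration and are not notation confirmed by the source. The complexification argument itself is not reproduced here.

```latex
\begin{align}
h(\theta) &= \frac{1}{K}\sum_{k=1}^{K}\Phi_k(\theta)^2, \\
\nabla h(\theta) &= \frac{2}{K}\sum_{k=1}^{K}\Phi_k(\theta)\,\nabla\Phi_k(\theta), \\
\nabla^2 h(\theta) &=
  \underbrace{\frac{2}{K}\sum_{k=1}^{K}\nabla\Phi_k(\theta)\,\nabla\Phi_k(\theta)^{\top}}_{H_{\mathrm{GN}}\ \text{(Gauss--Newton term)}}
  \;+\;
  \underbrace{\frac{2}{K}\sum_{k=1}^{K}\Phi_k(\theta)\,\nabla^2\Phi_k(\theta)}_{S\ \text{(residual term)}}.
\end{align}
```

Read this way, S ≡ 0 means the Hessian of h coincides with the Gauss–Newton term, which is the sense in which Gauss–Newton curvature would be exact. In the realizable regime every Φₖ vanishes at a zero-loss point, so the second term is zero there regardless of the other conditions.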
On a Shakespeare char-LM proof-of-concept (d = 64, K ∈ {2, 3, 5}), exact task-gradient orthogonality (κ(G) = 1) yields the best convergence; deliberate violations degrade the certificate by up to 14% and conditioning by up to 28×. With K = 7 tasks (100K steps), a certificate controller that monitors gradient correlation achieves 17% lower validation CE than the uncontrolled baseline.
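The two monitored quantities, the certificate h and the conditioning κ(G), can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's implementation. It assumes that G is the Gram matrix of unit-normalised per-task gradients, so that exactly orthogonal gradients give κ(G) = 1. The function names `certificate` and `gradient_condition_number` are hypothetical.

```python
import numpy as np


def certificate(phi):
    """Squared-norm certificate h = (1/K) * sum_k phi_k**2 for K task residuals."""
    phi = np.asarray(phi, dtype=float)
    return float(np.mean(phi ** 2))


def gradient_condition_number(task_grads):
    """Condition number of the Gram matrix of unit-normalised task gradients.

    task_grads: array of shape (K, d), one non-zero gradient per row.
    Assumption: kappa(G) is defined on this Gram matrix; the source does not
    spell out its definition of G.
    """
    g = np.asarray(task_grads, dtype=float)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)  # unit-length rows
    gram = g @ g.T                                    # (K, K) cosine matrix
    eig = np.linalg.eigvalsh(gram)                    # eigenvalues, ascending
    if eig[0] <= 0.0:
        return float("inf")                           # linearly dependent gradients
    return float(eig[-1] / eig[0])


if __name__ == "__main__":
    # K = 3 mutually orthogonal gradients in d = 64 dimensions.
    orthogonal = np.eye(3, 64)
    # Deliberate violation: tilt the second gradient towards the first.
    correlated = orthogonal.copy()
    correlated[1] += 0.9 * correlated[0]

    print("h:", certificate([0.5, -0.2, 0.1]))
    print("kappa (orthogonal):", gradient_condition_number(orthogonal))
    print("kappa (correlated):", gradient_condition_number(correlated))
```

Normalising the rows first makes the measure depend only on the angles between task gradients, not on their magnitudes. Whether the paper also normalises is not stated, so treat that as a design choice of this sketch.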
Been Seo
Been Seo (Tue,) studied this question.
www.synapsesocial.com/papers/69d895046c1944d70ce06040 — DOI: https://doi.org/10.5281/zenodo.19454889