HyDRA v4: Hyperbolic Distillation with Riemannian Adaptation - Channel Attribution and Proof by Elimination

HyDRA is a compact autoregressive language model (≈12M parameters) that
operates natively on the Lorentz hyperboloid H¹²⁸ and is trained via
knowledge distillation from a frozen GPT-2-small teacher (117M parameters).
Every layer (attention, feed-forward, and residual connections) preserves
the Riemannian manifold constraint to float64 precision. Training on
WikiText-2 achieves a best validation perplexity of 282.0 (Variant F,
44,000 steps), with Minkowski constraint violations below 10⁻⁶ throughout.

─────────────────────────────────────────────────────────────
MAIN FINDING: Channel 2 (LM Head) is the Dominant Cause of DegEq
─────────────────────────────────────────────────────────────

We identify, characterise, and causally attribute Degenerate Equilibrium
(DegEq): a stable fixed point of KL-based hyperbolic distillation in which
angular alignment stabilises while radial dynamics remain active, yielding a
geometrically valid but semantically degraded configuration.

Five from-scratch experiments spanning the complete space of loss-side
interventions all converge to the same fixed point (rdc* ≈ 10, relative
deviation <5%): standard KL (Variant F), Projective KL (D1), Decoupled
Radial-Angular (D3), Origin-Tangent Euclidean Distillation (OTED), and OTED
with radial anchor (V5-D). This constitutes a proof by exhaustion that no
loss-side intervention prevents DegEq.

A complete 2×2 channel-isolation matrix (V5) then surgically attributes the
attractor to its architectural source:

  V5-D (no fix):         rdc* = 10.74 (DegEq baseline)
  V5-B (Ch2 fix only):   rdc* = 0.96 (91% reduction) ★
  V5-A (Ch1 fix only):   RadiusCollapse (Ch1 alone is unstable)
  V5-C (both channels):  NaN explosion (numerical instability)

V5-B activates only AngularLMHead (Channel 2 fix), leaving the optimizer
unchanged.
The result is a 91% reduction in rdc*, establishing Channel 2 (the LM head
radial gradient ∂logit/∂r = cosh(r) ≠ 0) as the dominant and sufficient
cause of DegEq. Fixing Channel 2 alone is necessary and sufficient to
eliminate the attractor.

─────────────────────────────────────────────────────────────
MECHANISTIC EXPLANATION
─────────────────────────────────────────────────────────────

Two independent radial channels drive DegEq:

Channel 1 (optimizer): First-order parallel transport of AdamW momentum
accumulates a radially biased approximation error εₜ ∝ xₜ via the
Christoffel symbol Γʳ_θθ = −sinh(r) cosh(r). Correcting this channel delays
DegEq onset but cannot neutralise the attractor. When the Channel 1 fix is
applied without the Channel 2 fix, it triggers RadiusCollapse, a pathology
distinct from DegEq.

Channel 2 (LM head): The vocabulary projection computes
∂logitₖ/∂r = cosh(rₕ) ≠ 0, injecting a radial gradient that bypasses any
loss-side or optimizer-side intervention. This is the dominant channel:
zeroing it via AngularLMHead reduces the attractor value by 91%.

Additionally, the Krioukov (2010) curvature–Zipf equilibrium predicts
K* = 1/(4(γ−1)²) = 66.4 for WikiText-2 (γ ≈ 1.06), versus the model's
fixed K = 1. This mismatch of 65.4 units may explain the specific
attractor value rdc* ≈ 10 as a thermodynamic equilibrium between manifold
curvature and corpus statistics.

─────────────────────────────────────────────────────────────
STRUCTURAL CONTRIBUTIONS
─────────────────────────────────────────────────────────────

(1) Radial Drift Coefficient (RDC). Real-time diagnostic proxy
    RDC = σ_logit / (L_hidden + ε), EMA β = 0.95, with Lyapunov potential
    Lq = ½·rdc². Predicts DegEq onset 500–1,000 steps in advance.

(2) Riemannian Natural Gradient Correction. r/sinh(r) scaling of AdamW
    updates on manifold parameters (Amari, 1998). Delays DegEq onset from
    step ≈5,400 to beyond step 33,400 in the extended Variant F run.

(3) AngularLMHead. Cosine-similarity vocabulary head with ∂logit/∂r = 0
    exactly.
    Eliminates Channel 2, the dominant DegEq source. V5-B result:
    rdc* 10.74 → 0.96 (91% reduction).

(4) EarlyStoppingV3. Dual-EMA stopping (fast β = 0.3, slow β = 0.9) with
    detrended noise estimation. Eliminates false positives on true loss
    plateaus where single-reference EMA fires spuriously.

(5) Origin-Tangent Euclidean Distillation (OTED). All objectives computed
    in TₒHⁿ ≅ ℝⁿ, eliminating Christoffel symbols from the backward pass
    entirely. Reaches rdc* = 10.67, confirming that loss geometry is not
    the causal channel.

(6) cgt.diagnostics. Post-training DegEq analysis module: Krioukov K*
    equilibrium (k_equilibrium_from_zipf), Khrulkov frequency–radius
    correlation (freq_radius_correlation), and the DegEqDiagnostics unified
    report. Purely additive: no training code modified.

─────────────────────────────────────────────────────────────
GEOMETRIC–LINGUISTIC DECOUPLING (Negative Result)
─────────────────────────────────────────────────────────────

Geometric fidelity is neither sufficient for nor predictive of linguistic
competence. Despite 9/10 geometry tests passing and Minkowski violations
below 10⁻⁶ throughout 44,000 steps, generated text remains incoherent.
Good geometry is cheap (enforcing Riemannian correctness under
architectural constraints requires no special effort) but does not imply
meaningful representation learning.

─────────────────────────────────────────────────────────────
LIMITATIONS
─────────────────────────────────────────────────────────────

Single-seed results (SEED=42). DegEq characterisation is empirical and
restricted to the tested architecture (4L×128d) and dataset (WikiText-2).
V5-C (Ch1+Ch2) suffered a NaN explosion at step 13: a numerical instability
in the Ch1+Ch2+OTED interaction, distinct from DegEq and unresolved.
The Krioukov K* prediction (learnable K shifting rdc* to a data-dependent
fixed point) remains falsifiable but untested.

Code: https://github.com/gokuhayda/MyShowCase/tree/main/hyperbolic-intelligence
License: CC BY-NC-SA 4.0
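
The Channel 2 mechanism summarised in the abstract can be illustrated with a
small numerical check. This is a minimal sketch and not the repository's
AngularLMHead implementation. It assumes a radius-sensitive head of the form
logit = ⟨spatial part of x, w⟩, which for a unit-norm w aligned with the
hidden direction gives logit = sinh(r) and hence ∂logit/∂r = cosh(r), and
compares it with a cosine-similarity head. The function names
(lift_to_hyperboloid, spatial_dot_logit, angular_logit, radial_derivative)
and the test vectors are hypothetical and chosen only for illustration.

```python
import math


def lift_to_hyperboloid(direction, r):
    """Point at geodesic radius r from the origin of the K = 1 Lorentz
    hyperboloid, along the given spatial direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [math.cosh(r)] + [math.sinh(r) * u for u in unit]


def minkowski_norm_sq(x):
    """Lorentz quadratic form; equals -1 for points on the hyperboloid."""
    return -x[0] * x[0] + sum(v * v for v in x[1:])


def spatial_dot_logit(x, w):
    """ASSUMED radius-sensitive head: dot product of the spatial part with w."""
    return sum(a * b for a, b in zip(x[1:], w))


def angular_logit(x, w):
    """Cosine similarity between the spatial part of x and w (direction only)."""
    xs = x[1:]
    dot = sum(a * b for a, b in zip(xs, w))
    xs_norm = math.sqrt(sum(a * a for a in xs))
    w_norm = math.sqrt(sum(b * b for b in w))
    return dot / (xs_norm * w_norm)


def radial_derivative(logit_fn, direction, w, r, h=1e-5):
    """Central finite-difference estimate of d(logit)/dr."""
    hi = logit_fn(lift_to_hyperboloid(direction, r + h), w)
    lo = logit_fn(lift_to_hyperboloid(direction, r - h), w)
    return (hi - lo) / (2.0 * h)


direction = [3.0, 4.0, 0.0]  # normalised internally to (0.6, 0.8, 0.0)
weight = [0.6, 0.8, 0.0]     # hypothetical unit-norm vocabulary vector
r = 2.0

print(minkowski_norm_sq(lift_to_hyperboloid(direction, r)))        # close to -1
print(radial_derivative(spatial_dot_logit, direction, weight, r))  # close to cosh(r)
print(radial_derivative(angular_logit, direction, weight, r))      # close to 0
```

Under these assumptions the dot-product head's radial derivative grows like
cosh(r), while the cosine head's derivative vanishes up to floating-point
noise, which is the property the abstract attributes to AngularLMHead.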
ERIC GUSTAVO REIS DE SENA
ERIC GUSTAVO REIS DE SENA (Fri,) studied this question.
www.synapsesocial.com/papers/69db38534fe01fead37c688a — DOI: https://doi.org/10.5281/zenodo.19501160