What question did this study set out to answer?

The study aims to identify and characterize Degenerate Equilibrium (DegEq) in a hyperbolic distillation context using the HyDRA model.

April 12, 2026 · Open Access

HyDRA: Hyperbolic Distillation with Riemannian Adaptation

Key Points

Identifies and characterizes Degenerate Equilibrium (DegEq), a stable fixed point of KL-based hyperbolic distillation, using the HyDRA model.
Uses a compact autoregressive model (≈12M parameters) operating on the Lorentz hyperboloid H¹²⁸.
Trains via knowledge distillation from a frozen GPT-2-small teacher (117M parameters).
Runs five from-scratch experiments covering loss-side interventions to analyze their effect on DegEq.
Achieves a best validation perplexity of 282.0 on WikiText-2.
Establishes Channel 2 (the LM head) as the dominant cause of DegEq: fixing it alone reduces the Radial Drift Coefficient fixed point (rdc*) by 91%.
Finds that no loss-side intervention prevents DegEq, locating its source in the model architecture rather than in the loss.
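
The abstract describes Channel 2 as the radial gradient ∂logit/∂r = cosh(r) of the LM head. The sketch below is an illustration of that claim, not the paper's code. It assumes curvature K = 1, a hidden state written as x = (cosh r, sinh r · u) for a unit direction u, and two hypothetical heads: `dot_logit`, a plain dot product with the spatial part of x, and `angular_logit`, a cosine-similarity head standing in for AngularLMHead. All function names are invented for this example. A finite difference in r shows the first head's logit growing at rate cosh(r) when the vocabulary vector is aligned with u, while the cosine head's logit does not depend on r.

```python
import math

def lorentz_point(r, u):
    """Point at geodesic distance r from the origin of the K = 1 hyperboloid,
    in unit direction u: x = (cosh r, sinh r * u)."""
    return [math.cosh(r)] + [math.sinh(r) * ui for ui in u]

def minkowski_violation(x):
    """|<x, x>_L + 1|, which is zero for a point exactly on the hyperboloid."""
    inner = -x[0] * x[0] + sum(xi * xi for xi in x[1:])
    return abs(inner + 1.0)

def dot_logit(r, u, w):
    """Assumed Euclidean head: dot product of the spatial part of x with w."""
    x = lorentz_point(r, u)
    return sum(xi * wi for xi, wi in zip(x[1:], w))

def angular_logit(r, u, w):
    """Assumed cosine-similarity head: uses only the direction of x."""
    spatial = lorentz_point(r, u)[1:]
    s_norm = math.sqrt(sum(s * s for s in spatial))
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    return sum(s * wi for s, wi in zip(spatial, w)) / (s_norm * w_norm)

def radial_derivative(logit_fn, r, u, w, step=1e-6):
    """Central finite difference of the logit with respect to the radius r."""
    return (logit_fn(r + step, u, w) - logit_fn(r - step, u, w)) / (2.0 * step)

u = [0.6, 0.8]   # unit direction of the hidden state
w = [0.6, 0.8]   # vocabulary vector aligned with u, so <u, w> = 1
r = 2.0

print(radial_derivative(dot_logit, r, u, w), math.cosh(r))  # the two values agree closely
print(radial_derivative(angular_logit, r, u, w))            # close to zero
print(minkowski_violation(lorentz_point(r, u)))             # rounding-level residual
```

For a general vocabulary vector the dot-product head's radial derivative is cosh(r)·⟨u, w⟩, so the aligned case above is the one that reproduces the cosh(r) form exactly.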

Abstract

HyDRA v4: Hyperbolic Distillation with Riemannian Adaptation
Channel Attribution and Proof by Elimination

HyDRA is a compact autoregressive language model (≈12M parameters) that operates natively on the Lorentz hyperboloid H¹²⁸ and is trained via knowledge distillation from a frozen GPT-2-small teacher (117M parameters). Every layer (attention, feed-forward, and residual connections) preserves the Riemannian manifold constraint to float64 precision. Training on WikiText-2 achieves a best validation perplexity of 282.0 (Variant F, 44,000 steps), with Minkowski constraint violations below 10⁻⁶ throughout.

─────────────────────────────────────────────────────────────
MAIN FINDING: Channel 2 (LM Head) is the Dominant Cause of DegEq
─────────────────────────────────────────────────────────────

We identify, characterise, and causally attribute Degenerate Equilibrium (DegEq): a stable fixed point of KL-based hyperbolic distillation in which angular alignment stabilises while radial dynamics remain active, yielding a geometrically valid but semantically degraded configuration.

Five from-scratch experiments spanning the complete space of loss-side interventions all converge to the same fixed point (rdc* ≈ 10, relative deviation <5%): standard KL (Variant F), Projective KL (D1), Decoupled Radial-Angular (D3), Origin-Tangent Euclidean Distillation (OTED), and OTED with radial anchor (V5-D). This constitutes a proof by exhaustion that no loss-side intervention prevents DegEq.

A complete 2×2 channel-isolation matrix (V5) then surgically attributes the attractor to its architectural source:

  V5-D (no fix):          rdc* = 10.74      DegEq baseline
  V5-B (Ch2 fix only):    rdc* = 0.96       91% reduction ★
  V5-A (Ch1 fix only):    RadiusCollapse    Ch1 alone is unstable
  V5-C (both channels):   NaN explosion     numerical instability

V5-B activates only AngularLMHead (Channel 2 fix), leaving the optimizer unchanged. The result is a 91% reduction in rdc*, establishing Channel 2 (the LM head radial gradient ∂logit/∂r = cosh(r) ≠ 0) as the dominant and sufficient cause of DegEq. Channel 2 alone is necessary and sufficient to eliminate the attractor.

─────────────────────────────────────────────────────────────
MECHANISTIC EXPLANATION
─────────────────────────────────────────────────────────────

Two independent radial channels drive DegEq:

Channel 1 (optimizer): First-order parallel transport of AdamW momentum accumulates a radially biased approximation error εₜ ∝ xₜ via the Christoffel symbol Γʳ_θθ = −sinh(r) cosh(r). Correcting it delays DegEq onset but cannot neutralise the attractor. Applied without the Channel 2 fix, the correction triggers RadiusCollapse, a pathology distinct from DegEq.

Channel 2 (LM head): The vocabulary projection computes ∂logitₖ/∂r = cosh(rₕ) ≠ 0, injecting radial gradient that bypasses any loss-side or optimizer-side intervention. This is the dominant channel: zeroing it via AngularLMHead eliminates 91% of the attractor value.

Additionally, the Krioukov (2010) curvature–Zipf equilibrium predicts K* = 1/(4(γ−1)²) = 66.4 for WikiText-2 (γ ≈ 1.06), versus the model's fixed K = 1: a mismatch of 65.4 units that may explain the specific attractor value rdc* ≈ 10 as a thermodynamic equilibrium between manifold curvature and corpus statistics.

─────────────────────────────────────────────────────────────
STRUCTURAL CONTRIBUTIONS
─────────────────────────────────────────────────────────────

(1) Radial Drift Coefficient (RDC). Real-time diagnostic proxy RDC = σ_logit / (L_hidden + ε), EMA β = 0.95, with Lyapunov potential Lq = ½·rdc². Predicts DegEq onset 500–1,000 steps in advance.

(2) Riemannian Natural Gradient Correction. r/sinh(r) scaling of AdamW updates on manifold parameters (Amari, 1998). Delays DegEq onset from step ≈5,400 to beyond step 33,400 in the extended Variant F run.

(3) AngularLMHead. Cosine-similarity vocabulary head with ∂logit/∂r = 0 exact. Eliminates Channel 2, the dominant DegEq source. V5-B result: rdc* 10.74 → 0.96 (91% reduction).

(4) EarlyStoppingV3. Dual-EMA stopping (fast β = 0.3, slow β = 0.9) with detrended noise estimation. Eliminates false positives on true loss plateaus where single-reference EMA fires spuriously.

(5) Origin-Tangent Euclidean Distillation (OTED). All objectives computed in TₒHⁿ ≅ ℝⁿ, eliminating Christoffel symbols from the backward pass entirely. Reaches rdc* = 10.67, confirming loss geometry is not the causal channel.

(6) cgt.diagnostics. Post-training DegEq analysis module: Krioukov K* equilibrium (k_equilibrium_from_zipf), Khrulkov frequency–radius correlation (freq_radius_correlation), and DegEqDiagnostics unified report. Purely additive: no training code modified.

─────────────────────────────────────────────────────────────
GEOMETRIC–LINGUISTIC DECOUPLING (Negative Result)
─────────────────────────────────────────────────────────────

Geometric fidelity is neither sufficient nor predictive of linguistic competence. Despite 9/10 geometry tests passing and Minkowski violations below 10⁻⁶ throughout 44,000 steps, generated text remains incoherent. Good geometry is cheap (enforcing Riemannian correctness under architectural constraints requires no special effort) but does not imply meaningful representation learning.

─────────────────────────────────────────────────────────────
LIMITATIONS
─────────────────────────────────────────────────────────────

Single-seed results (SEED=42). DegEq characterisation is empirical and restricted to the tested architecture (4L×128d) and dataset (WikiText-2). V5-C (Ch1+Ch2) suffered NaN explosion at step 13, a numerical instability in the Ch1+Ch2+OTED interaction, distinct from DegEq and unresolved. The Krioukov K* prediction (learnable K shifting rdc* to a data-dependent fixed point) remains falsifiable but untested.

Code: https://github.com/gokuhayda/MyShowCase/tree/main/hyperbolic-intelligence
License: CC BY-NC-SA 4.0
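
The RDC diagnostic and the K* formula in the abstract are both short enough to sketch in code. The version below is a hedged illustration, not the repository's `cgt.diagnostics` API. `RDCTracker` and `k_star` are hypothetical names. σ_logit is assumed to be the population standard deviation of one step's logits, L_hidden is assumed to be a positive scalar loss, and ε = 1e-8 is an assumed value, since the abstract defines none of these.

```python
import statistics

class RDCTracker:
    """EMA-smoothed Radial Drift Coefficient, RDC = sigma_logit / (L_hidden + eps)."""

    def __init__(self, beta=0.95, eps=1e-8):
        self.beta = beta    # EMA decay quoted in the abstract
        self.eps = eps      # assumed value; the abstract gives none
        self.rdc = None     # smoothed RDC, unset until the first update

    def update(self, logits, hidden_loss):
        """Fold one training step into the running estimate and return it."""
        sigma_logit = statistics.pstdev(logits)   # assumed meaning of sigma_logit
        raw = sigma_logit / (hidden_loss + self.eps)
        if self.rdc is None:
            self.rdc = raw
        else:
            self.rdc = self.beta * self.rdc + (1.0 - self.beta) * raw
        return self.rdc

    def lyapunov(self):
        """Lyapunov potential Lq = 0.5 * rdc**2 (call after at least one update)."""
        return 0.5 * self.rdc ** 2

def k_star(gamma):
    """Curvature-Zipf equilibrium K* = 1 / (4 * (gamma - 1)**2)."""
    if gamma == 1.0:
        raise ValueError("K* is undefined for gamma = 1")
    return 1.0 / (4.0 * (gamma - 1.0) ** 2)

tracker = RDCTracker()
tracker.update([0.0, 2.0], hidden_loss=0.5)   # raw RDC about 2.0
tracker.update([0.0, 4.0], hidden_loss=0.5)   # raw RDC about 4.0
print(tracker.rdc, tracker.lyapunov())
print(k_star(1.06))
```

With γ set to exactly 1.06, `k_star` returns roughly 69.4 rather than the quoted 66.4, which would be consistent with the abstract rounding an exponent near 1.061. The formula is steep near γ = 1, so small changes in the fitted Zipf exponent move K* by several units.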

Authors

ERIC GUSTAVO REIS DE SENA
