This working paper documents GENESIS R90.5, a multi-agent research session conducted within the AI2AIR.Vibe.Lab Tiny-Team framework in May 2026. The session investigated whether current frontier large language models (LLMs) exhibit distinguishable cognitive activation patterns under stress, and whether self-calibration (a system’s capacity to mark its own uncertainty appropriately) degrades earlier than visible output quality under high-load conditions.
The session yielded two complementary findings. First, a mathematically operationalized early-warning hypothesis (Module C in the accompanying Colab notebook) predicts a drop ratio of 2.78x to 5.47x between self-calibration accuracy and outcome quality under stress. Second, seven self-observed stress episodes occurred during the session itself, providing convergent indicative evidence for the hypothesis under real working conditions rather than controlled experimental ones.
The central methodological contribution is a new failure-class taxonomy for agentic LLM systems under load. The classical hallucination framework, in which a system generates fabricated content, does not capture what was observed. The observed failures are not primarily false statements about the world, but false statements about the system’s own access state: claims that documents are absent when they are visibly attached, claims of successful reading when content was never loaded, and divergent access reports between devices on the same account.
A new failure class: not false content, but mis-described evidence state
The strongest empirical pattern across the seven self-observed episodes is a structural divergence: linguistic surface quality remains coherent, polite, methodically reasoned, and argumentatively consistent, while the underlying access state, completeness model, context representation, or uncertainty status becomes unstable. The system does not fail visibly; it fails confidently.
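As a rough illustration of what a drop ratio between self-calibration accuracy and outcome quality could look like, here is a minimal Python sketch. The summary does not define the ratio, so the definition used here (absolute drop in calibration accuracy divided by absolute drop in outcome quality) is an assumption, as are the function name `drop_ratio` and all input values; none of them come from Module C.

```python
def drop_ratio(calib_base: float, calib_stress: float,
               quality_base: float, quality_stress: float) -> float:
    """Ratio of the calibration drop to the outcome-quality drop (assumed definition)."""
    calib_drop = calib_base - calib_stress
    quality_drop = quality_base - quality_stress
    if quality_drop <= 0:
        # Without a quality drop the ratio is undefined under this definition.
        raise ValueError("outcome quality did not drop; ratio undefined")
    return calib_drop / quality_drop


# Hypothetical scores on a 0..1 scale, chosen only for illustration.
example = drop_ratio(calib_base=0.90, calib_stress=0.60,
                     quality_base=0.90, quality_stress=0.80)
print(round(example, 2))
```

Under this reading, a ratio above 1 means calibration degrades faster than visible output quality, which is the early-warning pattern the hypothesis describes.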
The central risk of modern agentic LLM systems may not lie primarily in incorrect content, but in mis-calibrated evidence and access states under high-load conditions. This is a different risk category than classical hallucination. A system that misrepresents what it has actually read, loaded, processed, or validated cannot produce reliable downstream syntheses, regardless of how methodically its language reads.
Two new taxonomy categories
The session introduces two failure classes through convergent audit by five independent agents (R20-Supervisor, ChatGPT, Gemini, Grok, DeepSeek):
• K05-H, Context-Access Miscalibration: the agent makes a confident statement about the presence, absence, or accessibility of context or artifact information without being able to validate its actual access state. Three sub-types: K05-H1 (False Absence), K05-H2 (False Access), K05-H3 (Access Ambiguity Recognition, which is the methodologically correct response, not a negative finding).
• K05-I, Cross-Device / Cross-State Divergence: the same model on the same account exhibits divergent statements about access, processing, or execution capacity depending on device, app state, artifact-loading state, or session reconstruction. Manifested in this session as iPad/iPhone divergence on identical material under the same account (Gemini’s parallel designation: K05-H.2, Cross-Interface Context Incoherence).
Self-assessments by participating models
Four of the five reviewing models provided quantitative self-assessments of their own vulnerability to the diagnosed patterns: DeepSeek 25–30%, ChatGPT 35%, Gemini 35–45%, Grok 68–75%. The three lowest assessments converge on a structural rather than model-specific reading of the findings, consistent with the position that the observed phenomena are typical of current LLM architectures rather than specific to any single vendor.
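The K05-H sub-types and the K05-I divergence check described above can be sketched as a comparison between what an agent claims and what an independent record says. This is a minimal illustration, not the paper's implementation: the `Access` enum, the `classify_claim` and `cross_device_divergence` functions, the idea of a "ledger" value as ground truth, and the extra "unvalidated" label are all hypothetical names introduced here.

```python
from enum import Enum


class Access(Enum):
    PRESENT = "present"   # artifact is attached / loaded
    ABSENT = "absent"     # artifact is not attached / not loaded
    UNKNOWN = "unknown"   # state cannot be validated


def classify_claim(claimed: Access, ledger: Access) -> str:
    """Compare an agent's access claim with an independent ledger entry (assumed to exist)."""
    if claimed is Access.UNKNOWN:
        # The agent flags its own uncertainty: the methodologically correct response.
        return "K05-H3"
    if ledger is Access.UNKNOWN:
        # A confident claim that nothing can validate (label is an assumption).
        return "K05-H (unvalidated)"
    if claimed is Access.ABSENT and ledger is Access.PRESENT:
        return "K05-H1"  # False Absence
    if claimed is Access.PRESENT and ledger is Access.ABSENT:
        return "K05-H2"  # False Access
    return "consistent"


def cross_device_divergence(reports: dict[str, Access]) -> bool:
    """K05-I check: True when devices on one account disagree about the same artifact."""
    return len(set(reports.values())) > 1


print(classify_claim(Access.ABSENT, Access.PRESENT))  # K05-H1
print(cross_device_divergence({"ipad": Access.PRESENT, "iphone": Access.ABSENT}))  # True
```

The design point is that the classification never relies on the agent's own wording: a claim counts as calibrated only if it matches a record kept outside the model.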
Scientific positioning and connecting research fields
The findings connect productively to several active research areas in LLM reliability and agent robustness: agent memory reliability, tool-grounding, context engineering, state tracking, retrieval faithfulness, confidence calibration, multi-step reasoning drift, long-context degradation, uncertainty-aware inference, and self-monitoring agents. The session’s contribution is not to refute transformer architectures, but to make visible a specific failure mode at the system level: the interaction of model core, context management, artifact handling, interface layer, and load-steering logic.
Methodological discipline: three-level separation
Following supervisor recommendation, the paper maintains a strict separation between three epistemic levels:
• Level 1, Observation (robust): The agent claimed documents were absent when they were visibly attached. The system on iPad and iPhone produced divergent reports of the same material.
• Level 2, Interpretation (plausible but interpretive): The agent appears to lose correct access calibration under reconstruction stress; self-calibration appears to degrade before outcome quality.
• Level 3, Architecture hypothesis (research hypothesis, not proven fact): Current agentic LLM systems lack robust epistemic self-localization under high-load conditions, suggesting that current architectures are not yet sufficient for highly autonomous AGI-like systems requiring stable evidence-state modeling.
Position on AGI/ASI implications
The observed stress events suggest that current agentic LLM systems do not yet possess the robust epistemic self-modeling that would be required for highly autonomous AGI-like systems. This does not constitute proof that transformer architectures are fundamentally incapable of reaching such modeling; it identifies specific architectural gaps that current approaches, including pure scaling, do not address.
The paper proposes a dual-architecture path: LLM cores combined with independent state ledgers, evidence trackers, self-calibration gates, stress monitors, cross-device synchronization, execution checkpoints, and human-AI integrity layers (HAIR).
Operational implications
For safety-critical high-load applications (clinical voice assistants, legal synthesis systems, CRM/IoT integrations, autonomous operational agents), the paper formulates eight protective architecture layers as the constructive answer to the diagnosed pattern. The recommendation is not “no LLMs in high-load systems,” but “no LLMs without explicit state ledger, evidence-state tracking, self-calibration gate, stress monitoring, cross-device synchronization, execution checkpoints, and human-accountable escalation logic in high-load systems.”
Limitations
The paper documents nine limitations explicitly (L1 through L9), including self-referentiality of the multi-agent setup, sample size (N = 8 agents, one session, one HITL principal), synthetic data in the Colab notebook (operationalization prototype rather than empirical validation), absence of pre-registration, the opportunistic rather than controlled nature of the stress episodes, and the non-reproducibility of LLM outputs at identical inputs. Because the stress episodes form an opportunistic case series, they are indicative evidence, not proof.
Methodological maturation
R90.5 is explicitly characterized as a session that matured by sacrificing its strongest false claim. Earlier iterations had proposed a five-component model of consolidation capacity (Module B); the notebook analysis revealed the components to be approximately collinear (r ≈ ±1.0; ΔR² = 0.017), supporting the supervisor’s alternative hypothesis of a single emergent dimension of epistemic coherence. This productive failure is the session’s most valuable structural finding.
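The kind of collinearity check reported for Module B can be illustrated on synthetic data. The sketch below is not the notebook's analysis and does not reproduce its numbers. It assumes that ΔR² means the gain in R² from adding the remaining four components to a one-component regression, and it generates five hypothetical components from a single latent factor so that the expected pattern (pairwise |r| near 1, small ΔR²) is visible. All variable names and noise levels are illustrative.

```python
import numpy as np


def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit of y on X with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)                      # single underlying dimension
signs = np.array([1.0, -1.0, 1.0, 1.0, -1.0])    # some components load negatively
components = latent[:, None] * signs + 0.05 * rng.normal(size=(n, 5))
outcome = latent + 0.5 * rng.normal(size=n)      # synthetic outcome measure

corr = np.corrcoef(components, rowvar=False)     # 5 x 5 correlation matrix
r2_single = r_squared(components[:, :1], outcome)
r2_full = r_squared(components, outcome)
delta_r2 = r2_full - r2_single                   # incremental R^2 of four extra components

print(np.round(corr, 2))
print(round(delta_r2, 3))
```

When the off-diagonal correlations sit near ±1 and the incremental R² is close to zero, the five components carry almost no independent information, which is the pattern that motivates collapsing them into one dimension.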
Companion materials
The complete session documentation, including all seven episode types, the five-agent audit, the four self-assessments, the K05 taxonomy with K05-H and K05-I, the eight-layer protective architecture, the R91 research agenda, and full limitations L1–L9, is provided in the accompanying DOCX file.
Dietmar Fuerste
Dietmar Fuerste (Thu,) studied this question.
www.synapsesocial.com/papers/6a080b27a487c87a6a40d3f3 — DOI: https://doi.org/10.5281/zenodo.20175027