Rule-based evaluators applied to large language model (LLM) outputs systematically misclassify outputs that satisfy the communicative intent of a task while failing the surface form required by the rule. We identify this as the rule–semantic labeling gap and describe a lightweight pattern, Shadow Semantic Review, that addresses it through three steps: (1) a second LLM pass that reviews rule-assigned labels for semantic correctness, (2) classification of disagreements into a named failure motif taxonomy, and (3) forward injection of those motif names as generation constraints in the next prompt cycle. We instantiate this pattern in a deployed financial signal evaluation system where a local 7-billion-parameter LLM generates structured market analyses that are evaluated by a deterministic rule-based harness and audited by a shadow LLM reviewer. Early observations from 14 shadow-reviewed rows reveal that a single motif class, thematic mismatch, accounts for 78.6% of rule–semantic disagreements, indicating the gap is systematic rather than random. We release the pattern and motif taxonomy as prior art; a 90-day longitudinal ablation study comparing pre- and post-injection recurrence rates is active, with a full results paper to follow.
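The three steps in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation. Every name here (`Row`, `shadow_review`, `build_constraints`, `stub_reviewer`, the `"pass"`/`"fail"` labels, and the wording of the constraint string) is a hypothetical choice made for illustration, and the stub reviewer is a stand-in for the second LLM pass.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Row:
    output: str      # text produced by the generator LLM
    rule_label: str  # label from the deterministic harness (assumed "pass"/"fail")


# Assumed reviewer interface: returns (semantic_label, motif_name or None).
Reviewer = Callable[[Row], tuple[str, Optional[str]]]


def shadow_review(rows: list[Row], reviewer: Reviewer) -> Counter:
    """Steps 1 and 2: re-label each row, then count motifs on disagreements."""
    motifs: Counter = Counter()
    for row in rows:
        semantic_label, motif = reviewer(row)
        if semantic_label != row.rule_label:
            # A disagreement without a named motif is kept visible, not dropped.
            motifs[motif or "unclassified"] += 1
    return motifs


def build_constraints(motifs: Counter, top_k: int = 3) -> list[str]:
    """Step 3: turn the most frequent motif names into prompt constraints."""
    return [
        f"Avoid the known failure motif: {name}"
        for name, _count in motifs.most_common(top_k)
    ]


def stub_reviewer(row: Row) -> tuple[str, Optional[str]]:
    """Stand-in for the shadow LLM: overturns any rule-level 'fail'."""
    if row.rule_label == "fail":
        return "pass", "thematic_mismatch"
    return row.rule_label, None


if __name__ == "__main__":
    rows = [Row("analysis A", "fail"), Row("analysis B", "pass")]
    motifs = shadow_review(rows, stub_reviewer)
    for line in build_constraints(motifs):
        print(line)
```

In a real deployment the reviewer would wrap a call to the shadow model, and the returned constraint lines would be appended to the generator's prompt for the next cycle. How the motif names are phrased as constraints is a design choice this sketch does not settle.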
Sami Allan Kaurila (Wed,) studied this question.
www.synapsesocial.com/papers/69d895ea6c1944d70ce07139 — DOI: https://doi.org/10.5281/zenodo.19474649
Sami Allan Kaurila