Rule-based evaluators applied to large language model (LLM) outputs systematically misclassify outputs that satisfy the communicative intent of a task while failing the surface form required by the rule. We identify this as the rule–semantic labeling gap and describe a lightweight pattern, Shadow Semantic Review, that addresses it through three steps: (1) a second LLM pass that reviews rule-assigned labels for semantic correctness, (2) classification of disagreements into a named failure motif taxonomy, and (3) forward injection of those motif names as generation constraints in the next prompt cycle. We instantiate this pattern in a deployed financial signal evaluation system where a local 7-billion-parameter LLM generates structured market analyses that are evaluated by a deterministic rule-based harness and audited by a shadow LLM reviewer. Early observations from 14 shadow-reviewed rows reveal that a single motif class, thematic mismatch, accounts for 78.6% of rule–semantic disagreements, indicating the gap is systematic rather than random. We release the pattern and motif taxonomy as prior art; a 90-day longitudinal ablation study comparing pre- and post-injection recurrence rates is active, with a full results paper to follow.
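The three steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deployed system: the `Row` type, the `Reviewer` signature, the `stub_reviewer` logic, the `"unclassified"` fallback, and the prompt wording are all hypothetical names introduced here, and the real shadow reviewer would be an LLM call rather than a string check.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Row:
    output: str      # text generated by the primary LLM
    rule_label: str  # label assigned by the deterministic rule-based harness


# Hypothetical reviewer signature: returns (semantic_label, motif_name or None).
Reviewer = Callable[[Row], Tuple[str, Optional[str]]]


def shadow_review(rows: List[Row], reviewer: Reviewer) -> Counter:
    """Steps 1 and 2: re-label each row, then tally a motif per disagreement."""
    motifs: Counter = Counter()
    for row in rows:
        semantic_label, motif = reviewer(row)
        if semantic_label != row.rule_label:
            # Disagreements without a named motif are counted separately.
            motifs[motif or "unclassified"] += 1
    return motifs


def inject_motifs(base_prompt: str, motifs: Counter, top_k: int = 3) -> str:
    """Step 3: append the most frequent motif names as generation constraints."""
    if not motifs:
        return base_prompt
    names = [name for name, _ in motifs.most_common(top_k)]
    constraints = "\n".join(f"- Avoid: {name}" for name in names)
    return f"{base_prompt}\n\nKnown failure motifs from the last cycle:\n{constraints}"


def stub_reviewer(row: Row) -> Tuple[str, Optional[str]]:
    # Stand-in for the shadow LLM call; this rule is invented for illustration.
    if "off-topic" in row.output:
        return "fail", "thematic mismatch"
    return row.rule_label, None
```

In use, `shadow_review(rows, stub_reviewer)` yields a motif tally for one evaluation cycle, and `inject_motifs(prompt, tally)` produces the prompt for the next cycle. Keeping the reviewer as a plain callable is a design choice of this sketch so that the tallying and injection logic can be exercised without a model.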
Sami Allan Kaurila
Sami Allan Kaurila (Wed,) studied this question.
www.synapsesocial.com/papers/69d895ea6c1944d70ce07139 — DOI: https://doi.org/10.5281/zenodo.19474649