This paper started as a bug. During development of Potato, a local AI agent, Claude Code misunderstood an implementation instruction and proposed mirroring a GPU diffusion pipeline entirely in text: three language model steps standing in for three GPU steps. The intent had been different. The mistake was better. A text-only pipeline that simulates the same three steps as a visual pipeline is not a fallback. It is a baseline. You cannot measure what visual grounding contributes without something that lacks it.

That recognition produced the experiment described here. Force a language model to commit to a detailed spatial description of a physical scene before it predicts what happens. If the model is doing real physics reasoning, the extra step should not change the answer. If it is pattern-matching keywords to likely outcomes, the forced description breaks that shortcut and the prediction changes. The size of that change is the diverging signal.

Across 102 scenarios and six models, predictions change 29 to 71 percent of the time. The range spans autoregressive models from 8 billion to frontier scale, a reasoning model with explicit chain-of-thought, and Mercury 2, a diffusion LLM with a fundamentally different generation architecture. Two independent grounding methods, one built on language and one built on diffusion models rendering actual pixels, converge on similar answers while both diverge from the ungrounded shortcut. The grounded answers are the consistent ones. The direct answers are the outlier.

The results sort into three classes: categories where models genuinely know the physics and grounding changes nothing; categories where the knowledge is present but does not surface without spatial commitment; and categories that are genuinely ambiguous and resist confident prediction under any method. Each class has a different practical implication for how a system should route queries.

Two null results matter as much as the positive ones. Resolution does not affect grounding quality. Diffusion model size does not affect grounding quality. A 256px output from a compact model carries the same physics signal as a 768px output from a larger one. That means the pipeline runs on edge hardware.

The practical output is a routing architecture. Not every physics question needs the full pipeline. The category divergence table tells you which ones do.
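
The divergence measurement and the routing decision can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the category labels, the toy records, and the 0.25 threshold are all assumptions made for illustration, and the toy records are not results from the experiment. The sketch compares each direct prediction with its grounded counterpart, computes the fraction that changed per category, and sends high-divergence categories to the full pipeline.

```python
from collections import defaultdict


def divergence_by_category(records):
    """Return, per category, the fraction of scenarios whose grounded
    prediction differs from the direct (ungrounded) prediction.

    records: iterable of (category, direct_prediction, grounded_prediction).
    Predictions are compared as normalized strings, which is a simplifying
    assumption; a real comparison may need semantic matching.
    """
    changed = defaultdict(int)
    total = defaultdict(int)
    for category, direct, grounded in records:
        total[category] += 1
        if direct.strip().lower() != grounded.strip().lower():
            changed[category] += 1
    return {category: changed[category] / total[category] for category in total}


def route(category, rates, threshold=0.25):
    """Pick a path for a query in the given category.

    The threshold is hypothetical. Categories with no measured divergence
    fall back to the grounded pipeline as the conservative choice.
    """
    rate = rates.get(category)
    if rate is None or rate > threshold:
        return "grounded"
    return "direct"


# Toy data for illustration only; not taken from the 102 scenarios.
records = [
    ("stacking", "falls", "falls"),
    ("stacking", "falls", "Falls "),
    ("fluids", "spills", "stays"),
    ("fluids", "spills", "spills"),
]

rates = divergence_by_category(records)
decisions = {category: route(category, rates) for category in rates}
```

A single threshold separates only stable categories from divergent ones. Telling apart the second and third classes (latent knowledge versus genuine ambiguity) would need a further signal, such as whether the two grounding methods agree with each other.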
Brian Riggleman (Mon,) studied this question.