What question did this study set out to answer?

To investigate how spatial grounding influences the physics predictions of large language models.

March 18, 2026 | Open Access

Spatial Grounding and the Physics Prediction Gap in Large Language Models

Key Points

Examined 102 scenarios across six language models
Compared predictions with and without spatial descriptions
Evaluated responses using two independent grounding methods
Analyzed categories based on model performance with spatial grounding
Developed routing architecture for query handling
Forcing a spatial description changed predictions 29 to 71 percent of the time, depending on the model
Consistent grounded answers diverged from ungrounded shortcut predictions
Three classes of prediction outcomes were identified, each with different implications for query routing
Neither output resolution nor diffusion model size affected grounding quality, so the pipeline can run on edge hardware

Abstract

This paper started as a bug. During development of Potato, a local AI agent, Claude Code misunderstood an implementation instruction and proposed mirroring a GPU diffusion pipeline entirely in text: three language model steps standing in for three GPU steps. The intent had been different. The mistake was better. A text-only pipeline that simulates the same three steps as a visual pipeline is not a fallback. It is a baseline. You cannot measure what visual grounding contributes without something that lacks it. That recognition produced the experiment described here.

Force a language model to commit to a detailed spatial description of a physical scene before it predicts what happens. If the model is doing real physics reasoning, the extra step should not change the answer. If it is pattern-matching keywords to likely outcomes, the forced description breaks that shortcut and the prediction changes. The size of that change is the diverging signal.

Across 102 scenarios and six models, predictions change 29 to 71 percent of the time. The range spans autoregressive models from 8 billion to frontier scale, a reasoning model with explicit chain-of-thought, and Mercury 2, a diffusion LLM with a fundamentally different generation architecture. Two independent grounding methods, one built on language and one built on diffusion models rendering actual pixels, converge on similar answers while both diverge from the ungrounded shortcut. The grounded answers are the consistent ones. The direct answers are the outlier.

The results sort into three classes: categories where models genuinely know the physics and grounding changes nothing; categories where the knowledge is present but does not surface without spatial commitment; and categories that are genuinely ambiguous and resist confident prediction under any method. Each class has a different practical implication for how a system should route queries.

Two null results matter as much as the positive ones. Resolution does not affect grounding quality. Diffusion model size does not affect grounding quality. A 256px output from a compact model carries the same physics signal as a 768px output from a larger one. That means the pipeline runs on edge hardware. The practical output is a routing architecture. Not every physics question needs the full pipeline. The category divergence table tells you which ones do.
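
The measurement and routing idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Trial` record, the category labels, the exact-match comparison of answers, and both thresholds. It computes, per category, how often the direct prediction differs from the text-grounded one and how often the two grounding methods agree, then maps those two rates onto the three classes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One scenario answered three ways (hypothetical record layout)."""
    category: str        # scenario category label
    direct: str          # prediction without a spatial description
    text_grounded: str   # prediction after a forced text description
    pixel_grounded: str  # prediction after image-based grounding


def category_rates(trials):
    """Return {category: (change_rate, grounded_agreement_rate)}.

    Answers are compared by exact string match, which is a simplification.
    """
    buckets = defaultdict(list)
    for t in trials:
        buckets[t.category].append(t)
    rates = {}
    for cat, ts in buckets.items():
        changed = sum(t.direct != t.text_grounded for t in ts)
        agree = sum(t.text_grounded == t.pixel_grounded for t in ts)
        rates[cat] = (changed / len(ts), agree / len(ts))
    return rates


def route(change_rate, agreement_rate,
          change_threshold=0.2, agree_threshold=0.7):
    """Map a category's rates to a routing decision.

    Both thresholds are illustrative placeholders, not values from the paper.
    """
    if change_rate <= change_threshold:
        return "direct"          # grounding changes nothing
    if agreement_rate >= agree_threshold:
        return "grounded"        # answer surfaces only with spatial commitment
    return "flag_ambiguous"      # grounding methods disagree with each other


if __name__ == "__main__":
    demo = [
        Trial("falling", "down", "down", "down"),
        Trial("stacking", "stable", "topples", "topples"),
        Trial("fluid", "spills", "stays", "overflows"),
    ]
    for cat, (chg, agr) in category_rates(demo).items():
        print(cat, route(chg, agr))
```

In this sketch, a category is sent through the full pipeline only when grounding changes the answer and the two grounding methods agree; categories where they disagree are flagged rather than answered with confidence.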

Cite This Study

Brian Riggleman studied this question.

synapsesocial.com/papers/69ba44654e9516ffd37a60e4 https://doi.org/10.5281/zenodo.19043970
