What question did this study set out to answer?

The central aim is to evaluate the effectiveness of large language models in embodied reasoning tasks via a structured benchmarking process.

April 7, 2026 | Open Access

LLM-Based Control for Simulated Physical Reasoning: Modular Evaluation in the NeurIPS Embodied Agent Interface Challenge

Key Points

The study evaluates how effectively large language models perform embodied reasoning tasks under a structured benchmarking process.
The system used GPT-4.1 for BEHAVIOR, GPT-4.1-mini for VirtualHome, and GPT-5-mini in later exploratory experiments.
The pipeline has four stages: goal interpretation, subgoal decomposition, action sequencing, and transition modeling.
Tasks ran in the BEHAVIOR and VirtualHome simulators, which use constrained action vocabularies and symbolic state representations.
Performance was scored under the challenge's standard evaluation protocol and reported on a public leaderboard.
The system ranked eighteenth overall on the final public leaderboard with a score of 57.92.
It scored 68.88 on BEHAVIOR and 46.96 on VirtualHome.
The results point to both strengths and limitations of current language models for embodied reasoning.
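
The three reported scores are consistent with the overall score being an unweighted mean of the two simulator scores. The source does not state the aggregation formula, so the averaging below is an assumption used only as an arithmetic check:

```python
# Scores reported on the final public leaderboard.
behavior_score = 68.88
virtualhome_score = 46.96

# Assumption: the overall score is the unweighted mean of the two simulators.
overall = (behavior_score + virtualhome_score) / 2

print(round(overall, 2))  # 57.92
```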

Abstract

Benchmark-driven evaluation helps distinguish planning quality from interface reliability when large language models are used for embodied reasoning in simulation. Our submission to the Embodied Agent Interface Challenge (EAI) is evaluated across the four stages of the pipeline: goal interpretation, subgoal decomposition, action sequencing, and transition modeling. The tasks run in the BEHAVIOR and VirtualHome simulators, which use constrained action vocabularies, fixed object inventories, and symbolic state representations within a standard evaluation protocol. Our system accesses the OpenAI API, using GPT-4.1 for BEHAVIOR, GPT-4.1-mini for VirtualHome, and GPT-5-mini in later exploratory experiments across both environments. A schema for each task determines how the outputs are structured, and outputs that do not follow the specification are regenerated. On the final public leaderboard, our system ranked eighteenth overall with a score of 57.92, achieving 68.88 on BEHAVIOR and 46.96 on VirtualHome. In this paper, we describe our approach and discuss what these observations suggest about the strengths and limitations of current language models when used for embodied reasoning.
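
The regenerate-on-invalid-output step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the action vocabulary, the JSON plan format, the `generate` callable (standing in for a model call), and the retry limit are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical action vocabulary; the real simulators define their own.
ALLOWED_ACTIONS = {"WALK", "GRAB", "OPEN", "CLOSE"}


def is_valid_plan(raw: str) -> bool:
    """Check that raw is a non-empty JSON list of {"action", "object"} steps."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(plan, list) or not plan:
        return False
    for step in plan:
        if not isinstance(step, dict):
            return False
        action = step.get("action")
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            return False
        if not isinstance(step.get("object"), str):
            return False
    return True


def generate_with_retries(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[list]:
    """Call generate until its output passes validation or attempts run out."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        if is_valid_plan(raw):
            return json.loads(raw)
    return None  # Caller decides how to handle a persistent failure.


# Usage with a stub model that fails once, then returns a valid plan.
_replies = iter(["not json", '[{"action": "GRAB", "object": "cup"}]'])
plan = generate_with_retries(lambda prompt: next(_replies), "pick up the cup")
print(plan)
```

Keeping validation separate from generation makes it possible to count how often outputs are rejected, which is one way to tell interface failures apart from planning failures.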

Authors

Hilmi Demirhan

Wlodek Zadrozny

Institutions

University of North Carolina at Charlotte

University of North Carolina Wilmington
