Benchmark-driven evaluation helps distinguish planning quality from interface reliability when large language models are used for embodied reasoning in simulation. Our submission to the Embodied Agent Interface Challenge (EAI) is evaluated across four stages of the pipeline: goal interpretation, subgoal decomposition, action sequencing, and transition modeling. The tasks run in the BEHAVIOR and VirtualHome simulators, which use constrained action vocabularies, fixed object inventories, and symbolic state representations within a standard evaluation protocol. Our system calls the OpenAI API, using GPT-4.1 for BEHAVIOR, GPT-4.1-mini for VirtualHome, and GPT-5-mini in later exploratory experiments across both environments. A task-specific schema determines how each output is structured, and outputs that do not follow the specification are regenerated. On the final public leaderboard, our system ranked eighteenth overall with a score of 57.92, achieving 68.88 on BEHAVIOR and 46.96 on VirtualHome. In this paper, we describe our approach and discuss what these results suggest about the strengths and limitations of current language models when used for embodied reasoning.
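The schema-checked regeneration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the JSON plan format, the `ALLOWED_ACTIONS` vocabulary, the function names, and the retry limit are all assumptions made for illustration, and the model call is abstracted as a plain callable so that no particular API signature is implied.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical action vocabulary; each simulator defines its own real one.
ALLOWED_ACTIONS = {"WALK", "GRAB", "OPEN", "CLOSE", "PUTIN"}


def validate_plan(raw: str) -> Optional[List[dict]]:
    """Return the parsed plan if it matches the assumed schema, else None.

    Assumed schema: a non-empty JSON list of objects, each with an
    "action" string from ALLOWED_ACTIONS and an "args" list of strings.
    """
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, list) or not plan:
        return None
    for step in plan:
        if not isinstance(step, dict):
            return None
        action = step.get("action")
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            return None
        args = step.get("args")
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            return None
    return plan


def generate_with_retries(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[Optional[List[dict]], int]:
    """Call `generate` until its output validates or attempts run out.

    `generate` stands in for the model call (assumption). Returns the
    validated plan (or None) and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        plan = validate_plan(generate(prompt))
        if plan is not None:
            return plan, attempt
    return None, max_attempts
```

Keeping validation separate from generation means the number of attempts can be logged per task, which is one way to tell interface failures (malformed output) apart from planning failures (well-formed but incorrect plans).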
Hilmi Demirhan
Wlodek Zadrozny
AI
University of North Carolina at Charlotte
University of North Carolina Wilmington
Demirhan et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69d49f1cb33cc4c35a227a28 — DOI: https://doi.org/10.3390/ai7040131