Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work, violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about "AI societies."
Jiaxu Zhou
Jen-tse Huang
Chao Zhou
Zhou et al. studied this question.
www.synapsesocial.com/papers/68e02f2cf0e39f13e7fa1e7b — DOI: https://doi.org/10.48550/arxiv.2509.18052