How closely do the subjective perceptions simulated by Artificial Intelligence align with those of human participants when evaluating an urban environment? This pilot study explores how well multimodal Large Language Models can model human responses to visual stimuli based on subjective criteria. The research is exploratory: it tests the feasibility of the methodology rather than providing a definitive standard. By focusing on a small set of detailed audits, the small-scale experiment examines in depth, and qualitatively, how machine and human assessments compare in specific situations. To conduct the comparison, ratings of urban scenes were collected from human participants and from two multimodal Large Language Models, ChatGPT and Gemini. Each appraiser was shown an image of a sidewalk and rated it on a Likert scale against a set of proposed statements; three sidewalks were rated in this way. The investigation focuses on seven statements that subjectively characterize walkability factors, the overall friendliness of an area, and the environment’s influence on well-being. To establish a human baseline, each participant rated each image once for all statements. The models’ scores were generated using the exact same prompt, repeated multiple times to account for non-determinism. We then compared the models’ scores to the distribution of human scores and evaluated their alignment across different experiential qualities and diverse visual environments.
Belaroussi et al. studied this question.
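The comparison of repeated model scores against the distribution of human scores can be illustrated with a minimal sketch. It is not the authors' analysis: the function name `summarize_alignment`, the 1-to-5 Likert range, the choice of median and interquartile range as summary statistics, and all ratings below are assumptions made purely for illustration.

```python
from statistics import mean, median, quantiles


def summarize_alignment(human_ratings, model_ratings):
    """Compare repeated model ratings with a human rating distribution.

    Hypothetical helper: both arguments are lists of Likert scores for
    one statement applied to one image.
    """
    # Quartiles of the human baseline (inclusive method: data treated
    # as the full population of raters).
    q1, _, q3 = quantiles(human_ratings, n=4, method="inclusive")
    human_median = median(human_ratings)
    # Repeated runs of the same prompt are collapsed to a median.
    model_median = median(model_ratings)
    return {
        "human_median": human_median,
        "human_iqr": (q1, q3),
        "model_median": model_median,
        "model_mean": mean(model_ratings),
        "median_gap": abs(model_median - human_median),
        "within_human_iqr": q1 <= model_median <= q3,
    }


# Invented placeholder ratings on an assumed 1-5 scale (not study data).
human = [4, 3, 4, 5, 4, 2, 4, 3]   # one rating per participant
model = [4, 4, 3, 4, 4]            # same prompt repeated five times

result = summarize_alignment(human, model)
print(result)
```

Summarizing the repeated runs by their median keeps the result on the ordinal Likert scale; other choices, such as comparing full rating histograms, would be equally defensible for a study of this kind.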