This study investigates whether a non-reasoning large language model (GPT-4o) can generate private speech and evaluates how closely that speech resembles human private speech. We placed the model in a simulated solitary block-construction scenario via textual prompts, elicited its self-directed utterances, and classified them using an established semantic framework for categorizing private speech in children. The distribution of utterances across these categories was compared with two human benchmarks: a classic block-construction study and a more recent experiment employing a similar task setting. Analysis using scatter plots and Pearson correlation coefficients revealed a striking pattern: GPT-4o's semantic profile showed negligible similarity to the classic benchmark (r = 0.01) but very strong similarity to the recent benchmark (r = 0.93). We interpret this discrepancy as stemming from differences in the nature of the task, namely a goal-directed, scaffolded task versus self-determined, unscaffolded play, which appear to exert a stronger influence on speech content than the difference in subject (GPT-4o versus children). In an exploratory serial recall study with GPT-3.5-Turbo-instruct, we also observed incidental private speech, indicating that the phenomenon extends across contexts. These findings provide an avenue for investigating LLM replication of private speech and, potentially, computational consciousness.
Liang et al. (Thu,) studied this question.
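The comparison of semantic profiles described above rests on the Pearson correlation between two vectors of category proportions. The following is a minimal sketch of that computation, not the study's analysis code: the function name `pearson_r`, the number of categories, and all numeric values are hypothetical placeholders chosen for illustration and are not data from the study or its benchmarks.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL proportions of utterances per semantic category.
# These are placeholders, not values reported in the study.
model_profile = [0.40, 0.25, 0.20, 0.10, 0.05]
benchmark_profile = [0.20, 0.30, 0.10, 0.25, 0.15]

r = pearson_r(model_profile, benchmark_profile)
print(f"r = {r:.2f}")
```

Because each profile has only as many points as there are categories, a coefficient computed this way is sensitive to individual categories, which is one reason to inspect the scatter plot alongside it.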