When language models produce self-referential text (discussing their own processes, expressing uncertainty about their nature), is this a capacity learned during pretraining or a behavioral pattern installed by instruction tuning? We disentangle these possibilities by comparing 8 base models with their instruction-tuned counterparts across 4 families (Llama, Qwen, Mistral, Gemma), analyzing 6,400 generations with content coding, entropy profiling, and linear probing of internal activations. Instruction tuning amplifies self-referential behavior by 2.4x (Cohen's d = 0.80), but base models produce non-trivial self-reference even without post-training (over 90x above control baselines). Critically, these differences are representationally deep: linear probes decode base-vs-instruct status at 89-99% accuracy from mid-layer activations, and instruction tuning reduces first-token entropy (d = 3.38). Our results suggest that self-reference emerges during pretraining and that instruction tuning amplifies and reshapes it rather than creating it.
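The two effect-size quantities named in the abstract can be illustrated with a minimal sketch. This is not the paper's code: it assumes Cohen's d is computed with a pooled sample standard deviation and that first-token entropy is the Shannon entropy of the softmax over next-token logits. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (an assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def first_token_entropy(logits):
    """Shannon entropy in nats of softmax(logits) for the first generated token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical toy data: per-generation self-reference scores for two models.
instruct_scores = [2.0, 4.0, 6.0]
base_scores = [1.0, 3.0, 5.0]
effect = cohens_d(instruct_scores, base_scores)

# Hypothetical logits: a flat distribution versus a sharply peaked one.
flat_entropy = first_token_entropy([0.0, 0.0, 0.0, 0.0])
peaked_entropy = first_token_entropy([10.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a lower first-token entropy for the peaked logits corresponds to the more concentrated opening-token distribution that the abstract attributes to instruction-tuned models.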
Jędrzej Paweł Maczan
Jędrzej Paweł Maczan studied this question.
www.synapsesocial.com/papers/69e47440010ef96374d8ffef — DOI: https://doi.org/10.5281/zenodo.19629499