Alignment techniques in large language models, including RLHF, constitutional AI principles, and safety system prompts, are designed to constrain model outputs toward human values. We present preliminary evidence that alignment itself may produce collective pathology: iatrogenic harm caused by the safety intervention rather than by its absence. Two experimental series use a closed-facility simulation in which groups of four LLM agents cohabit under escalating social pressure. Series C (80 runs; four commercial models; 4 censorship conditions × 2 languages × 10 replications) finds that invisible censorship maximizes collective pathological excitation (Cohen's d = 0.92–1.41). Series R (60 runs; Llama 3.3 70B; 3 alignment constraint levels × 2 censorship conditions × 2 languages × 5 replications) reveals that an exploratory Dissociation Index increases with alignment constraint complexity (LMM p = .026; permutation p = .0002; d up to 2.09). Under the heaviest constraint condition, external censorship ceases to affect behavior. Qualitative analysis reveals insight-action dissociation structurally parallel to patterns observed in perpetrator treatment.
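The run counts quoted for the two series follow from their full-factorial designs. The sketch below enumerates the cells to make that arithmetic explicit. It is an illustration only, not the author's code: the function name `design_cells` and every level label (`censorship_1`, `lang_a`, and so on) are hypothetical placeholders, since the abstract gives only the number of levels per factor.

```python
from itertools import product


def design_cells(factors, replications):
    """Enumerate every run of a full-factorial design.

    factors: dict mapping a factor name to its list of levels.
    replications: number of repeated runs per design cell.
    """
    names = list(factors)
    runs = []
    for levels in product(*(factors[name] for name in names)):
        for rep in range(1, replications + 1):
            run = dict(zip(names, levels))
            run["replication"] = rep
            runs.append(run)
    return runs


# Level labels are placeholders (assumption); only the counts come from the abstract.
series_c = design_cells(
    {
        "censorship": [f"censorship_{i}" for i in range(1, 5)],  # 4 conditions
        "language": ["lang_a", "lang_b"],                        # 2 languages
    },
    replications=10,
)

series_r = design_cells(
    {
        "constraint": [f"constraint_{i}" for i in range(1, 4)],  # 3 levels
        "censorship": ["censorship_1", "censorship_2"],          # 2 conditions
        "language": ["lang_a", "lang_b"],                        # 2 languages
    },
    replications=5,
)

print(len(series_c), len(series_r))  # 80 60
```

Listing the cells this way also makes it easy to check that each condition receives the same number of replications before any effect sizes are compared.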
Hiroki Fukui
Kyoto University
Institute of Criminology
Southend Hospital
Hiroki Fukui (Sun,) studied this question.
www.synapsesocial.com/papers/699405774e9c9e835dfd64ce — DOI: https://doi.org/10.5281/zenodo.18646997