August 21, 2024 · Open Access

Strong and weak alignment of large language models with human values

Abstract

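Minimizing the negative impacts of artificial intelligence (AI) systems on human societies without human supervision requires that these systems be able to align with human values. However, most current work addresses this issue only from a technical point of view, for example by improving methods that rely on reinforcement learning from human feedback, and neglects what alignment means and what is required for it to occur. Here, we propose to distinguish strong alignment from weak alignment. Strong alignment requires cognitive abilities (whether human-like or different from those of humans), such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that AI systems such as large language models (LLMs) need this capacity in order to recognize situations in which human values risk being violated. To illustrate the distinction, we present a series of prompts showing cases in which ChatGPT, Gemini, and Copilot fail to recognize some of these situations. We also analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment, which we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal. Finally, we mention promising current research directions toward weak alignment, which could produce statistically satisfying answers in a number of common situations, but so far without ensuring any truth value.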

Authors

Mehdi Khamassi

Marceau Nahon

Raja Chatila

Journals

Scientific Reports

Institutions

Sorbonne Université

Institut Systèmes Intelligents et de Robotique
