September 5, 2022 · Open Access

Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples

Abstract

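Recent advances in large language models have given the public access to state-of-the-art pre-trained language models (PLMs), including Generative Pre-trained Transformer 3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT). In practice, however, evaluations of PLMs have shown that they are susceptible to adversarial attacks during the training and fine-tuning stages of development. Such attacks can lead to erroneous outputs, model-generated hate speech, and the exposure of users' sensitive information. While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, little information is available on attacks made between these two development phases. In this work, we highlight a major security vulnerability in the public release of GPT-3 and further investigate this vulnerability in other state-of-the-art PLMs. We restrict our work to pre-trained models that have not undergone fine-tuning. Further, we underscore token distance-minimized perturbations as an effective adversarial approach that bypasses both supervised and unsupervised quality measures. Following this approach, we observe a significant decrease in text classification quality when evaluating for semantic similarity.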
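
To make the idea of a token distance-minimized perturbation concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes whitespace tokenization and token-level Levenshtein distance as the distance measure; the abstract specifies neither. The function names `token_edit_distance` and `closest_perturbation` are hypothetical.

```python
def token_edit_distance(a, b):
    """Levenshtein distance over whitespace tokens (an assumed tokenization)."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, tok_x in enumerate(x, start=1):
        curr = [i]
        for j, tok_y in enumerate(y, start=1):
            cost = 0 if tok_x == tok_y else 1
            curr.append(min(prev[j] + 1,          # delete tok_x
                            curr[j - 1] + 1,      # insert tok_y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def closest_perturbation(original, candidates):
    """Pick the candidate that changes the original by the fewest tokens.

    Candidates identical to the original (distance 0) are skipped, since
    they are not perturbations. Raises ValueError if none remain.
    """
    changed = [c for c in candidates if token_edit_distance(original, c) > 0]
    return min(changed, key=lambda c: token_edit_distance(original, c))
```

Given a handcrafted set of candidate rewrites, `closest_perturbation` returns the one nearest to the original in token edits. A perturbation of this kind stays close to the input on the surface while potentially changing how a model classifies it.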

Authors

Hezekiah J. Branch

Jonathan Rodriguez Cefalù

Jeremy McHugh
