Large language models (LLMs) have emerged as a powerful tool for AI. A key ability is in-context learning (ICL): they can perform well on unseen tasks given a brief series of task examples, without any adjustment to the model parameters. One recent, intriguing observation is that models of different scales may exhibit different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically, aiming to improve the understanding of LLMs and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer transformers with multiple attention heads (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.
Shi et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e67e28b6db643587608195 — DOI: https://doi.org/10.48550/arxiv.2405.19592
Zhenmei Shi
Junyi Wei
Zhuoyan Xu