Asset Details
How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective
by Zhou, Yunhua; Guo, Qipeng; Qiu, Xipeng; Li, Ruixiao; Chen, Mingshu; Peng, Runyu
in Large language models
2026
Paper
Overview
Large Language Models (LLMs) often allocate disproportionate attention to specific tokens, a phenomenon commonly referred to as the attention sink. While such sinks are generally considered detrimental, prior studies have identified a notable exception: the model's consistent emphasis on the first token of the input sequence. This structural bias can influence a wide range of downstream applications and warrants careful consideration. Despite its prevalence, the precise mechanisms underlying the emergence and persistence of attention sinks remain poorly understood. In this work, we trace the formation of attention sinks around the first token of the input. We identify a simple mechanism, referred to as the P0 Sink Circuit, that enables the model to recognize the token at position zero and induce an attention sink within two transformer blocks, without relying on any semantic information. This mechanism serves as the basis for the attention sink at position zero. Furthermore, by analyzing training traces from a 30B A3B MoE model trained from scratch, we find that this mechanism emerges early in training and becomes increasingly concentrated in the first two layers, suggesting a possible signal for tracking pre-training convergence.
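To make the phenomenon described in the abstract concrete, here is a minimal toy sketch of what "disproportionate attention to the first token" looks like numerically. This is not the paper's P0 Sink Circuit; all values and the norm-scaling trick used to fake a sink are hypothetical, chosen only so the softmax concentrates its mass on position 0, as a real attention sink does.

```python
import numpy as np

def attention_weights(q, K):
    # scaled dot-product attention for one query over all keys
    scores = K @ q / np.sqrt(q.shape[0])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16                                   # toy head dimension
K = rng.normal(size=(8, d))              # keys for 8 tokens
K[0] *= 4.0                              # exaggerate the first key so it attracts attention
q = K[0] / np.linalg.norm(K[0])          # a query aligned with the first key

w = attention_weights(q, K)
sink_mass = w[0]                         # fraction of attention landing on position 0
print(f"attention on token 0: {sink_mass:.2f}")
```

In a real LLM the analogous measurement averages the attention that queries at later positions place on token 0, across heads and layers; a head whose mass on position 0 stays near 1 regardless of input content is exhibiting the sink behavior the paper studies.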
Publisher
Cornell University Library, arXiv.org
Subject