Bibliographic Details
Main Author: Leonardo Cofone
Format: Digital resource
Language: English
Published: Zenodo 2026
Links: https://doi.org/10.5281/zenodo.19954051
Table of Contents:
Dense self-attention assigns strictly positive weights to all tokens within the context window via the softmax operation, regardless of their semantic relevance. As a result, representations aggregate information from both relevant and irrelevant tokens, and this effect compounds across heads and layers in deep Transformer architectures. Building on the rank collapse analysis of Dong et al., we formalize how such accumulation contributes to progressive representational homogenization in dense attention models. We further hypothesize that this loss of representational distinctiveness may be related to degradation phenomena observed in long-context language modeling, including hallucination-like behavior and performance drops reported in prior work. While this connection remains conjectural, we provide a mechanistic interpretation grounded in information propagation through attention layers.

To address these limitations, we propose DSALT (Dynamic Sparse Attention with Landmark Tokens), a sparse attention mechanism that combines local windowed attention with a small set of dynamically selected global landmark tokens. Landmark selection is performed using a hybrid energy-based scoring function that balances representational magnitude and output relevance. By restricting attention to structured subsets of tokens, DSALT reduces redundant interactions while preserving long-range dependencies. From a computational perspective, DSALT reduces the attention complexity from O(n²d) to O(n(w + k)d), where n is the sequence length, d the model dimension, w the local window size, and k the number of landmark tokens, enabling more efficient scaling to long sequences while maintaining expressive contextual modeling.
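The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the author's implementation. The scoring formula, the `alpha` mixing weight, the use of value-vector norms as a stand-in for "output relevance", the symmetric (non-causal) window, and the names `select_landmarks` and `dsalt_like_attention` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def select_landmarks(x, v, k, alpha=0.5):
    """Return the indices of k landmark tokens (hypothetical scoring rule).

    Assumption: the "hybrid energy" is a convex mix of the normalized input
    norm (representational magnitude) and the normalized value norm (a crude
    proxy for output relevance). The paper's actual function may differ.
    """
    magnitude = np.linalg.norm(x, axis=-1)
    relevance = np.linalg.norm(v, axis=-1)
    magnitude = magnitude / (magnitude.max() + 1e-9)
    relevance = relevance / (relevance.max() + 1e-9)
    score = alpha * magnitude + (1.0 - alpha) * relevance
    # Keep the k highest-scoring positions, returned in ascending order.
    return np.sort(np.argsort(score)[-k:])


def dsalt_like_attention(x, wq, wk, wv, w, k, alpha=0.5):
    """Single-head attention restricted to a local window plus k landmarks.

    Each query attends to at most (w + 1 + k) keys, so the total cost is
    O(n (w + k) d) instead of the O(n^2 d) of dense attention.
    """
    n = x.shape[0]
    q, key, v = x @ wq, x @ wk, x @ wv
    landmarks = select_landmarks(x, v, k, alpha)
    scale = np.sqrt(q.shape[-1])
    out = np.empty_like(v)
    for i in range(n):
        # Assumed symmetric window of radius w // 2 around position i.
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        idx = np.union1d(np.arange(lo, hi), landmarks)
        logits = key[idx] @ q[i] / scale
        weights = np.exp(logits - logits.max())  # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[idx]
    return out, landmarks


# Small usage example with random inputs.
rng = np.random.default_rng(0)
n, d = 32, 8
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out, landmarks = dsalt_like_attention(x, wq, wk, wv, w=4, k=3)
print(out.shape, landmarks)
```

The per-query loop is written for readability. A practical version would presumably batch the gathers, but the point here is only that each row of the attention matrix has at most w + 1 + k nonzero entries, which is where the O(n(w + k)d) cost comes from.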