Bibliographic Details
Main Authors: Wang, Jingtao; Wang, Yucong; Ding, Jun; Cai, Rui; Wang, Xun
Format: Preprint
Published: 2026
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.11067
author Wang, Jingtao
Wang, Yucong
Ding, Jun
Cai, Rui
Wang, Xun
contents Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. Such methods rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
format Preprint
id arxiv_https___arxiv_org_abs_2603_11067
institution arXiv
publishDate 2026
record_format arxiv
title Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2603.11067