Staff View: ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Qiao, Haoyu; Zhang, Hao; Mao, Shanwen; Cheng, Siyao; Liu, Jie
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.21237

_version_	1866912977717821440
author	Qiao, Haoyu; Zhang, Hao; Mao, Shanwen; Cheng, Siyao; Liu, Jie
author_facet	Qiao, Haoyu; Zhang, Hao; Mao, Shanwen; Cheng, Siyao; Liu, Jie
contents	Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_21237
institution	arXiv
publishDate	2026
record_format	arxiv
title	ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.21237

Similar Items