| Main Authors: | Lu, Hanzhen; Fan, Lishui; Chen, Jiachi; Chen, Qiuyuan; Wei, Zhao; Liu, Zhongxin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Software Engineering |
| Online Access: | https://arxiv.org/abs/2603.05974 |
| _version_ | 1866915845207228416 |
|---|---|
| author | Lu, Hanzhen; Fan, Lishui; Chen, Jiachi; Chen, Qiuyuan; Wei, Zhao; Liu, Zhongxin |
| author_facet | Lu, Hanzhen; Fan, Lishui; Chen, Jiachi; Chen, Qiuyuan; Wei, Zhao; Liu, Zhongxin |
| contents | Line-level code completion requires a critical balance between high accuracy and low latency. Existing methods suffer from a trade-off: large language models (LLMs) provide high-quality suggestions but incur high latency, while small language models (SLMs) are fast but often suboptimal. We propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local SLM with a cloud-based LLM. To achieve effective cascading, MCCom leverages user actions as a novel signal to trigger the LLM only when the SLM fails, significantly reducing cloud computation costs. Furthermore, we introduce a two-stage speculative decoding strategy and an iterative retrieval mechanism to enhance collaboration between the models. We also train a 121M-parameter lightweight model, which achieves 73.8% of the performance of a 7B state-of-the-art model. Evaluated on RepoEval and a new real-world benchmark StmtEval, MCCom reduces inference latency by up to 47.9% and LLM usage by 46.3%, while improving the LLM's exact match rate by 8.9% through effective collaboration. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_05974 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading |
| topic | Software Engineering |
| url | https://arxiv.org/abs/2603.05974 |
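
The cascading idea summarized in the abstract (serve the local SLM's suggestion first, and call the cloud LLM only when a user action signals that the suggestion failed) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: `cascade_complete`, `CascadeResult`, and the `slm`, `llm`, and `user_accepts` callables are not taken from the paper, and the sketch omits the two-stage speculative decoding and iterative retrieval that the abstract also mentions.

```python
from dataclasses import dataclass
from typing import Callable

# A completer maps the code context (prefix) to a suggested line.
Completer = Callable[[str], str]


@dataclass
class CascadeResult:
    completion: str
    source: str  # "slm" if the local suggestion was kept, "llm" otherwise


def cascade_complete(
    prefix: str,
    slm: Completer,                       # hypothetical local small model
    llm: Completer,                       # hypothetical cloud large model
    user_accepts: Callable[[str], bool],  # stand-in for the user-action signal
) -> CascadeResult:
    """Show the fast local suggestion first; escalate only on rejection."""
    draft = slm(prefix)
    if user_accepts(draft):
        # The user kept the local suggestion, so no cloud call is made.
        return CascadeResult(draft, "slm")
    # The user rejected the draft: fall back to the slower, stronger model.
    return CascadeResult(llm(prefix), "llm")


if __name__ == "__main__":
    # Stub models; a real system would run inference here.
    local = lambda prefix: "return a + b"
    cloud = lambda prefix: "return a + b + c"
    result = cascade_complete("def add(a, b, c):\n    ", local, cloud,
                              user_accepts=lambda s: "c" in s)
    print(result.source, "->", result.completion)
```

In this sketch the cloud model is invoked only on the rejection path, which is the property the abstract credits for the reduced LLM usage.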