Staff View: HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement :: Library Catalog

Bibliographic Details
Main Authors:	Ge, Danying; Gao, Jianhua; Yang, Yixue; Ji, Weixing
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; C.4; E.4; I.2
Online Access:	https://arxiv.org/abs/2510.20878

_version_	1866915573471903744
author	Ge, Danying; Gao, Jianhua; Yang, Yixue; Ji, Weixing
author_facet	Ge, Danying; Gao, Jianhua; Yang, Yixue; Ji, Weixing
contents	Retrieval-Augmented Generation (RAG) improves model output accuracy by leveraging external knowledge bases, serving as an effective solution to hallucination issues and knowledge-update delays in Large Language Models (LLMs). However, the introduction of external knowledge bases presents RAG with challenges in long-context processing, significantly increasing memory consumption and inference latency. Existing research accelerates inference by precomputing the Key and Value (KV) of the knowledge base and loading them on demand during inference. Based on the access frequency of different KV chunks within the external knowledge base, this paper proposes a hotness-aware RAG (HA-RAG) inference optimization system. First, leveraging the numerical distribution of KV chunks, we introduce a hotness-aware mixed-precision compression and loading method to reduce disk I/O and memory access overhead. Second, we design a hotness-aware data placement strategy that prioritizes storing frequently accessed KV chunks in high-speed memory to improve data access efficiency. Experimental results demonstrate that, compared with TurboRAG, the proposed HA-RAG achieves an average speedup of 2.10x and a maximum speedup of 10.49x in Time-To-First-Token (TTFT) with negligible accuracy loss.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20878
institution	arXiv
publishDate	2025
record_format	arxiv
title	HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement
topic	Machine Learning; Artificial Intelligence; C.4; E.4; I.2
url	https://arxiv.org/abs/2510.20878
