Staff View: Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression :: Library Catalog

Bibliographic Details
Main Authors:	Hou, Haowen; Ma, Fei; Bai, Binwen; Zhu, Xinxin; Yu, Fei
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2408.15491

_version_	1866909299182141440
author	Hou, Haowen; Ma, Fei; Bai, Binwen; Zhu, Xinxin; Yu, Fei
author_facet	Hou, Haowen; Ma, Fei; Bai, Binwen; Zhu, Xinxin; Yu, Fei
contents	Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate a retrieval-augmented pipeline to provide them with rich external knowledge and context. Nevertheless, challenges stem from the inaccurate and coarse-grained context returned by the retriever. Supplying irrelevant context to the LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved with the use of the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_15491
institution	arXiv
publishDate	2024
record_format	arxiv
title	Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
topic	Computation and Language
url	https://arxiv.org/abs/2408.15491

Similar Items