Staff View: TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation :: Library Catalog

Bibliographic Details
Main Authors:	Ruiz, Alfredo Garrachón; de la Rosa, Tomás; Borrajo, Daniel
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2412.07682

_version_	1866914167639769088
author	Ruiz, Alfredo Garrachón; de la Rosa, Tomás; Borrajo, Daniel
author_facet	Ruiz, Alfredo Garrachón; de la Rosa, Tomás; Borrajo, Daniel
contents	The high inference cost of Large Language Models (LLMs) poses challenges, especially for tasks requiring lengthy outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that LLMs can generate distilled language (i.e., concise outputs that retain essential meaning) when prompted appropriately. We propose TRIM, a pipeline for saving computational cost in which the LLM omits a predefined set of semantically irrelevant and easily inferable words based on the context during inference. Then, a specifically trained smaller language model with lower inference cost reconstructs the distilled answer into the ideal answer. Our experiments show promising results, particularly on the proposed NaLDA evaluation dataset focused on the reconstruction task, with 19.4% saved tokens on average for GPT-4o and only a tiny decrease in evaluation metrics. This suggests that the approach can effectively balance efficiency and accuracy in language processing tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_07682
institution	arXiv
publishDate	2024
record_format	arxiv
title	TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
topic	Computation and Language
url	https://arxiv.org/abs/2412.07682
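
The abstract in the contents field describes omitting a predefined set of easily inferable words and reports the share of tokens saved. The following is a minimal sketch of how such a saving could be measured, not the authors' implementation. The word set `OMITTABLE` and the helpers `distill` and `saved_fraction` are hypothetical names introduced for illustration, whitespace-separated words stand in for model tokens, and the reconstruction step performed by the smaller language model is not shown.

```python
# Hypothetical set of omittable words; the paper's actual predefined
# word list is not given in this record.
OMITTABLE = {"the", "a", "an", "of", "to", "is", "are", "that"}


def distill(text: str) -> str:
    """Drop omittable words, keeping the remaining words in order."""
    kept = [w for w in text.split() if w.lower() not in OMITTABLE]
    return " ".join(kept)


def saved_fraction(original: str, distilled: str) -> float:
    """Fraction of whitespace-separated words removed by distillation.

    Assumption: words approximate tokens; a real tokenizer would give
    different counts.
    """
    n_orig = len(original.split())
    n_dist = len(distilled.split())
    return (n_orig - n_dist) / n_orig if n_orig else 0.0


full = "The model omits a set of words that are easy to infer"
short = distill(full)
print(short)
print(f"{saved_fraction(full, short):.1%}")
```

In the pipeline the abstract describes, the large model is prompted to produce the distilled text directly, so the saving comes from tokens that are never generated rather than from filtering after the fact as this sketch does.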

Similar Items