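Staff View: An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios :: Library Catalog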

Bibliographic Details
Main Authors:	Li, Zongjie; Qiu, Wenying; Ma, Pingchuan; Li, Yichen; Li, You; He, Sijia; Jiang, Baozheng; Wang, Shuai; Gu, Weixi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2402.01723

_version_	1866911770021462016
author	Li, Zongjie; Qiu, Wenying; Ma, Pingchuan; Li, Yichen; Li, You; He, Sijia; Jiang, Baozheng; Wang, Shuai; Gu, Weixi
author_facet	Li, Zongjie; Qiu, Wenying; Ma, Pingchuan; Li, Yichen; Li, You; He, Sijia; Jiang, Baozheng; Wang, Shuai; Gu, Weixi
contents	Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Looking ahead, one of the key future applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study on the accuracy and robustness of LLMs in the context of Chinese industrial production. We manually collected 1,200 domain-specific problems from eight industrial sectors to evaluate LLM accuracy. Furthermore, we designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, totaling 13,631 questions with variants, to evaluate LLM robustness. In total, we evaluated nine LLMs developed by Chinese vendors, as well as four LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less than 0.6. (2) The robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities. Global LLMs are more robust under logic-related variants, while advanced local LLMs perform better on problems related to understanding Chinese industrial terminology. Our study results provide valuable guidance for understanding and promoting the industrial domain capabilities of LLMs from both development and industrial enterprise perspectives. The results further motivate possible research directions and tooling support.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_01723
institution	arXiv
publishDate	2024
record_format	arxiv
title	An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2402.01723

Similar Items