Staff View: Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents :: Library Catalog

Bibliographic Details
Main Authors:	Lee, Andrew H.; Semnani, Sina J.; Castillo-López, Galo; de Chalendar, Gäel; Choudhury, Monojit; Dua, Ashna; Kavitha, Kapil Rajesh; Kim, Sungkyun; Kodali, Prashant; Kumaraguru, Ponnurangam; Lombard, Alexis; Moradshahi, Mehrad; Park, Gihyun; Semmar, Nasredine; Seo, Jiwon; Shen, Tianhao; Shrivastava, Manish; Xiong, Deyi; Lam, Monica S.
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2405.17840

_version_	1866916289569619968
author	Lee, Andrew H.; Semnani, Sina J.; Castillo-López, Galo; de Chalendar, Gäel; Choudhury, Monojit; Dua, Ashna; Kavitha, Kapil Rajesh; Kim, Sungkyun; Kodali, Prashant; Kumaraguru, Ponnurangam; Lombard, Alexis; Moradshahi, Mehrad; Park, Gihyun; Semmar, Nasredine; Seo, Jiwon; Shen, Tianhao; Shrivastava, Manish; Xiong, Deyi; Lam, Monica S.
author_facet	Lee, Andrew H.; Semnani, Sina J.; Castillo-López, Galo; de Chalendar, Gäel; Choudhury, Monojit; Dua, Ashna; Kavitha, Kapil Rajesh; Kim, Sungkyun; Kodali, Prashant; Kumaraguru, Ponnurangam; Lombard, Alexis; Moradshahi, Mehrad; Park, Gihyun; Semmar, Nasredine; Seo, Jiwon; Shen, Tianhao; Shrivastava, Manish; Xiong, Deyi; Lam, Monica S.
contents	Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of acquiring training data. Following the research trend of improving training data efficiency, we show for the first time that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which achieve 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_17840
institution	arXiv
publishDate	2024
record_format	arxiv
title	Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
topic	Computation and Language
url	https://arxiv.org/abs/2405.17840

Similar Items