Bibliographic Details
Main Authors: Lee, Andrew H., Semnani, Sina J., Castillo-López, Galo, de Chalendar, Gäel, Choudhury, Monojit, Dua, Ashna, Kavitha, Kapil Rajesh, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Lombard, Alexis, Moradshahi, Mehrad, Park, Gihyun, Semmar, Nasredine, Seo, Jiwon, Shen, Tianhao, Shrivastava, Manish, Xiong, Deyi, Lam, Monica S.
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2405.17840
author Lee, Andrew H.
Semnani, Sina J.
Castillo-López, Galo
de Chalendar, Gäel
Choudhury, Monojit
Dua, Ashna
Kavitha, Kapil Rajesh
Kim, Sungkyun
Kodali, Prashant
Kumaraguru, Ponnurangam
Lombard, Alexis
Moradshahi, Mehrad
Park, Gihyun
Semmar, Nasredine
Seo, Jiwon
Shen, Tianhao
Shrivastava, Manish
Xiong, Deyi
Lam, Monica S.
contents Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show for the first time that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which range from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
format Preprint
id arxiv_https___arxiv_org_abs_2405_17840
institution arXiv
publishDate 2024
record_format arxiv
title Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
topic Computation and Language
url https://arxiv.org/abs/2405.17840