Bibliographic Details
Main Authors: Golany, Lotem; Galgani, Filippo; Mamo, Maya; Parasol, Nimrod; Vandsburger, Omer; Bar, Nadav; Dagan, Ido
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2405.01121
Abstract:
  • Automating data generation with Large Language Models (LLMs) has become increasingly popular. In this work, we investigate the feasibility and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, which adds to the task complexity. Since automating attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by human verification and identification of attribution spans. Using this approach, we created MISeD (Meeting Information Seeking Dialogs), a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives response generation quality comparable to finetuning on fully manual data, while improving attribution quality and reducing time and effort.