Staff View: Under the Surface: Tracking the Artifactuality of LLM-Generated Data :: Library Catalog

Bibliographic Details
Main Authors:	Das, Debarati, De Langis, Karin, Martin-Boyle, Anna, Kim, Jaehyung, Lee, Minhwa, Kim, Zae Myung, Hayati, Shirley Anugrah, Owan, Risako, Hu, Bin, Parkar, Ritik, Koo, Ryan, Park, Jonginn, Tyagi, Aahan, Ferland, Libby, Roy, Sanjali, Liu, Vincent, Kang, Dongyeop
Format:	Preprint
Published:	2024
Subjects:	Computation and Language, Artificial Intelligence
Online Access:	https://arxiv.org/abs/2401.14698

_version_	1866929227959369728
author	Das, Debarati De Langis, Karin Martin-Boyle, Anna Kim, Jaehyung Lee, Minhwa Kim, Zae Myung Hayati, Shirley Anugrah Owan, Risako Hu, Bin Parkar, Ritik Koo, Ryan Park, Jonginn Tyagi, Aahan Ferland, Libby Roy, Sanjali Liu, Vincent Kang, Dongyeop
author_facet	Das, Debarati De Langis, Karin Martin-Boyle, Anna Kim, Jaehyung Lee, Minhwa Kim, Zae Myung Hayati, Shirley Anugrah Owan, Risako Hu, Bin Parkar, Ritik Koo, Ryan Park, Jonginn Tyagi, Aahan Ferland, Libby Roy, Sanjali Liu, Vincent Kang, Dongyeop
contents	This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_14698
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Under the Surface: Tracking the Artifactuality of LLM-Generated Data Das, Debarati De Langis, Karin Martin-Boyle, Anna Kim, Jaehyung Lee, Minhwa Kim, Zae Myung Hayati, Shirley Anugrah Owan, Risako Hu, Bin Parkar, Ritik Koo, Ryan Park, Jonginn Tyagi, Aahan Ferland, Libby Roy, Sanjali Liu, Vincent Kang, Dongyeop Computation and Language Artificial Intelligence This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page.
title	Under the Surface: Tracking the Artifactuality of LLM-Generated Data
topic	Computation and Language Artificial Intelligence
url	https://arxiv.org/abs/2401.14698

Similar Items