Bibliographic Details
Main Authors: Das, Debarati, De Langis, Karin, Martin-Boyle, Anna, Kim, Jaehyung, Lee, Minhwa, Kim, Zae Myung, Hayati, Shirley Anugrah, Owan, Risako, Hu, Bin, Parkar, Ritik, Koo, Ryan, Park, Jonginn, Tyagi, Aahan, Ferland, Libby, Roy, Sanjali, Liu, Vincent, Kang, Dongyeop
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2401.14698
contents This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page.
format Preprint
id arxiv_https___arxiv_org_abs_2401_14698
institution arXiv
publishDate 2024
record_format arxiv
title Under the Surface: Tracking the Artifactuality of LLM-Generated Data
topic Computation and Language
Artificial Intelligence