Staff View: TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation :: Library Catalog

Bibliographic Details
Main Authors:	Hutiri, Wiebke; Cimpoi, Mircea; Scheuerman, Morgan; Matthews, Victoria; Xiang, Alice
Format:	Preprint
Published:	2025
Subjects:	Computers and Society; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2505.17841

_version_	1866910964402618368
author	Hutiri, Wiebke Cimpoi, Mircea Scheuerman, Morgan Matthews, Victoria Xiang, Alice
author_facet	Hutiri, Wiebke Cimpoi, Mircea Scheuerman, Morgan Matthews, Victoria Xiang, Alice
contents	Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that impact trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI) that facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insights into the factors that shape trustworthy and ethical dimensions across datasets. We find that only a select few datasets have documented attributes and practices pertaining to consent, privacy, and harmful content indicators. The extent to which these and other ethical indicators are addressed varies based on the data collection method, with documentation of datasets collected via crowdsourced and direct collection approaches being more likely to mention them. Scraping dominates scale at the cost of ethical indicators, but is not the only viable collection method. Our approach and empirical insights contribute to increasing dataset transparency along trustworthy and ethical dimensions and pave the way for automating the tedious task of extracting information from dataset documentation in future.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_17841
institution	arXiv
publishDate	2025
record_format	arxiv
title	TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
topic	Computers and Society Artificial Intelligence Audio and Speech Processing
url	https://arxiv.org/abs/2505.17841
