Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Nezhad, Sina Bagheri
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.26013
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866913086634459136
author	Nezhad, Sina Bagheri
author_facet	Nezhad, Sina Bagheri
contents	Multilingual NLP is often treated as a route to global inclusion, but linguistic coverage and cultural competence frequently diverge. This paper synthesizes over 50 papers spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal benchmarks, benchmark-design critique, and community-grounded data practices. Across this literature, training data coverage remains important, but outcomes are also shaped by tokenization, prompt language, translated benchmark design, culturally grounded supervision, modality, and who authors or validates evaluation data. We argue that culturally grounded NLP should move beyond treating languages as isolated rows in benchmark tables and instead model communicative ecologies: the institutions, scripts, domains, modalities, and communities through which language is used. We propose a layered evaluation and reporting agenda centered on representation audits, mixed elicitation, ecological validity, community validation, adaptation provenance, within-language variation, and maintenance of living cultural resources.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_26013
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Toward Culturally Grounded Natural Language Processing Nezhad, Sina Bagheri Computation and Language Multilingual NLP is often treated as a route to global inclusion, but linguistic coverage and cultural competence frequently diverge. This paper synthesizes over 50 papers spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal benchmarks, benchmark-design critique, and community-grounded data practices. Across this literature, training data coverage remains important, but outcomes are also shaped by tokenization, prompt language, translated benchmark design, culturally grounded supervision, modality, and who authors or validates evaluation data. We argue that culturally grounded NLP should move beyond treating languages as isolated rows in benchmark tables and instead model communicative ecologies: the institutions, scripts, domains, modalities, and communities through which language is used. We propose a layered evaluation and reporting agenda centered on representation audits, mixed elicitation, ecological validity, community validation, adaptation provenance, within-language variation, and maintenance of living cultural resources.
title	Toward Culturally Grounded Natural Language Processing
topic	Computation and Language
url	https://arxiv.org/abs/2603.26013

Similar Items