Bibliographic Details
Main Author: Nezhad, Sina Bagheri
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.26013
author Nezhad, Sina Bagheri
contents Multilingual NLP is often treated as a route to global inclusion, but linguistic coverage and cultural competence frequently diverge. This paper synthesizes over 50 papers spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal benchmarks, benchmark-design critique, and community-grounded data practices. Across this literature, training data coverage remains important, but outcomes are also shaped by tokenization, prompt language, translated benchmark design, culturally grounded supervision, modality, and who authors or validates evaluation data. We argue that culturally grounded NLP should move beyond treating languages as isolated rows in benchmark tables and instead model communicative ecologies: the institutions, scripts, domains, modalities, and communities through which language is used. We propose a layered evaluation and reporting agenda centered on representation audits, mixed elicitation, ecological validity, community validation, adaptation provenance, within-language variation, and maintenance of living cultural resources.
format Preprint
id arxiv_https___arxiv_org_abs_2603_26013
institution arXiv
publishDate 2026
record_format arxiv
title Toward Culturally Grounded Natural Language Processing
topic Computation and Language
url https://arxiv.org/abs/2603.26013