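| Main Authors: | Lin, Yi-Cheng; Chen, Yu-Hua; Dong, Jia-Kai; Huang, Yueh-Hsuan; Chen, Szu-Chi; Chen, Yu-Chen; Chen, Chih-Yao; Lin, Yu-Jung; Chen, Yu-Ling; Chen, Zih-Yu; Tsai, I-Ning; Wang, Hsiu-Hsuan; Chung, Ho-Lam; Lu, Ke-Han; Lee, Hung-yi |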
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Audio and Speech Processing; Computation and Language; Machine Learning; Sound |
| Online Access: | https://arxiv.org/abs/2509.26329 |
| _version_ | 1866909816748769280 |
|---|---|
| author | Lin, Yi-Cheng; Chen, Yu-Hua; Dong, Jia-Kai; Huang, Yueh-Hsuan; Chen, Szu-Chi; Chen, Yu-Chen; Chen, Chih-Yao; Lin, Yu-Jung; Chen, Yu-Ling; Chen, Zih-Yu; Tsai, I-Ning; Wang, Hsiu-Hsuan; Chung, Ho-Lam; Lu, Ke-Han; Lee, Hung-yi |
| contents | Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2509_26329 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics |
| topic | Audio and Speech Processing; Computation and Language; Machine Learning; Sound |
| url | https://arxiv.org/abs/2509.26329 |