Bibliographic Details
Main authors: Kim, Geewook; Seo, Minjoon
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Multimedia; Sound
Online access: https://arxiv.org/abs/2509.17901
_version_ 1866914418292424704
author Kim, Geewook
Seo, Minjoon
contents Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
format Preprint
id arxiv_https___arxiv_org_abs_2509_17901
institution arXiv
publishDate 2025
record_format arxiv
title Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
topic Computer Vision and Pattern Recognition
Multimedia
Sound
url https://arxiv.org/abs/2509.17901