Bibliographic Details
Main authors:	Kim, Geewook; Seo, Minjoon
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Multimedia; Sound
Online access:	https://arxiv.org/abs/2509.17901

_version_	1866914418292424704
author	Kim, Geewook Seo, Minjoon
author_facet	Kim, Geewook Seo, Minjoon
contents	Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_17901
institution	arXiv
publishDate	2025
record_format	arxiv
title	Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
topic	Computer Vision and Pattern Recognition Multimedia Sound
url	https://arxiv.org/abs/2509.17901
