Bibliographic Details
Main Authors: Allauzen, Cyril, Bagby, Tom, Heigold, Georg, Variani, Ehsan, Wu, Ke
Format: Preprint
Published: 2026
Subjects: Sound, Machine Learning
Online Access: https://arxiv.org/abs/2605.04556
_version_ 1866918485935783936
author Allauzen, Cyril
Bagby, Tom
Heigold, Georg
Variani, Ehsan
Wu, Ke
author_facet Allauzen, Cyril
Bagby, Tom
Heigold, Georg
Variani, Ehsan
Wu, Ke
contents The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs, including members of the Gemini and GPT families, across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.
format Preprint
id arxiv_https___arxiv_org_abs_2605_04556
institution arXiv
publishDate 2026
record_format arxiv
title Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)
topic Sound
Machine Learning
url https://arxiv.org/abs/2605.04556