Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Sun, Bochao, Xiao, Yang, Yin, Han
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2601.06829
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917194908041216
author	Sun, Bochao Xiao, Yang Yin, Han
author_facet	Sun, Bochao Xiao, Yang Yin, Han
contents	Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal tructures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which is time-consuming. To solve this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model achieves the first rank in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: https://github.com/S-Orion/MOESCORE.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_06829
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation Sun, Bochao Xiao, Yang Yin, Han Sound Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal tructures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which is time-consuming. To solve this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model achieves the first rank in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: https://github.com/S-Orion/MOESCORE.
title	MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation
topic	Sound
url	https://arxiv.org/abs/2601.06829

Similar Items