Bibliographic Details
Main Authors: Sun, Bochao, Xiao, Yang, Yin, Han
Format: Preprint
Published: 2026
Subjects: Sound
Online Access: https://arxiv.org/abs/2601.06829
_version_ 1866917194908041216
author Sun, Bochao
Xiao, Yang
Yin, Han
author_facet Sun, Bochao
Xiao, Yang
Yin, Han
contents Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods rely primarily on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranked first in the XACLE Challenge, with an SRCC of 0.6402 on the test dataset (an improvement of 30.6% over the challenge baseline). Code is available at: https://github.com/S-Orion/MOESCORE.
format Preprint
id arxiv_https___arxiv_org_abs_2601_06829
institution arXiv
publishDate 2026
record_format arxiv
title MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation
topic Sound
url https://arxiv.org/abs/2601.06829
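
Note on the reported metric: the abstract in the contents field gives its result as an SRCC (Spearman rank correlation coefficient) between predicted and human relevance scores. The sketch below shows one way such a value could be computed in plain Python. It is not the authors' evaluation code; the function names and the example scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical scores for five text-audio pairs (not data from the paper).
human_scores = [8.0, 3.5, 6.0, 1.0, 9.5]
predicted_scores = [7.1, 4.0, 4.5, 2.2, 6.8]
print(round(srcc(human_scores, predicted_scores), 4))
```

Because SRCC depends only on rank order, an evaluator is rewarded for ordering text-audio pairs the way human listeners do, even if its raw scores sit on a different scale.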