Enregistré dans:
Détails bibliographiques
Auteurs principaux: Sun, Bochao, Wang, Dong, Yang, ZhanLong, Yang, Jun, Yin, Han
Format: Preprint
Publié: 2025
Sujets:
Accès en ligne:https://arxiv.org/abs/2508.15632
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866909752151244800
author Sun, Bochao
Wang, Dong
Yang, ZhanLong
Yang, Jun
Yin, Han
author_facet Sun, Bochao
Wang, Dong
Yang, ZhanLong
Yang, Jun
Yin, Han
contents Acoustic Scene Classification (ASC) is a fundamental problem in computational audition, which seeks to classify environments based on the distinctive acoustic features. In the ASC task of the APSIPA ASC 2025 Grand Challenge, the organizers introduce a multimodal ASC task. Unlike traditional ASC systems that rely solely on audio inputs, this challenge provides additional textual information as inputs, including the location where the audio is recorded and the time of recording. In this paper, we present our proposed system for the ASC task in the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. The proposed ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by a dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all the participating teams and achieves a 6.2% improvement over the baseline. Code, model and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
format Preprint
id arxiv_https___arxiv_org_abs_2508_15632
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification
Sun, Bochao
Wang, Dong
Yang, ZhanLong
Yang, Jun
Yin, Han
Sound
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition, which seeks to classify environments based on the distinctive acoustic features. In the ASC task of the APSIPA ASC 2025 Grand Challenge, the organizers introduce a multimodal ASC task. Unlike traditional ASC systems that rely solely on audio inputs, this challenge provides additional textual information as inputs, including the location where the audio is recorded and the time of recording. In this paper, we present our proposed system for the ASC task in the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. The proposed ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by a dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all the participating teams and achieves a 6.2% improvement over the baseline. Code, model and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
title ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification
topic Sound
url https://arxiv.org/abs/2508.15632