| Main Authors: | Seki, Kentaro, Okamoto, Yuki, Yamaoka, Kouei, Saito, Yuki, Takamichi, Shinnosuke, Saruwatari, Hiroshi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.14785 |
Similar Items
RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio
by: Kanamori, Yusuke, et al.
Published: (2025)
Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals
by: Seki, Kentaro, et al.
Published: (2024)
Human-CLAP: Human-perception-based contrastive language-audio pretraining
by: Takano, Taisei, et al.
Published: (2025)
Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment
by: Igarashi, Takuto, et al.
Published: (2024)
TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
by: Seki, Kentaro, et al.
Published: (2025)
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
by: Nakata, Wataru, et al.
Published: (2024)
SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark
by: Saito, Yuki, et al.
Published: (2024)
Active Learning for Text-to-Speech Synthesis with Informative Sample Collection
by: Seki, Kentaro, et al.
Published: (2025)
Building speech corpus with diverse voice characteristics for its prompt-based representation
by: Watanabe, Aya, et al.
Published: (2024)
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
by: Xin, Detai, et al.
Published: (2023)
BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec
by: Xin, Detai, et al.
Published: (2024)
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
by: Saeki, Takaaki, et al.
Published: (2024)
Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
by: Yamauchi, Kazuki, et al.
Published: (2024)
Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing
by: Nakata, Wataru, et al.
Published: (2025)
DistilMOS: Layer-Wise Self-Distillation For Self-Supervised Learning Model-Based MOS Prediction
by: Yang, Jianing, et al.
Published: (2026)
AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences
by: Kishi, Minoru, et al.
Published: (2025)
Drum-to-Vocal Percussion Sound Conversion and Its Evaluation Methodology
by: Nobukawa, Rinka, et al.
Published: (2025)
DNN-based ensemble singing voice synthesis with interactions between singers
by: Hyodo, Hiroaki, et al.
Published: (2024)
Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features
by: Tsunoo, Emiru, et al.
Published: (2024)
JaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus
by: Nakamura, Tomohiko, et al.
Published: (2022)
Geneses: Unified Generative Speech Enhancement and Separation
by: Asai, Kohei, et al.
Published: (2026)
Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays
by: Nishikori, Hirotaka, et al.
Published: (2026)
The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
by: Baba, Kaito, et al.
Published: (2024)
Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer
by: Nishikawa, Go, et al.
Published: (2025)
DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
by: Nakata, Wataru, et al.
Published: (2026)
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
by: Yang, Dong, et al.
Published: (2025)
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
by: Yang, Jianing, et al.
Published: (2025)
SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis
by: Take, Osamu, et al.
Published: (2024)
Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains
by: Wada, Aogu, et al.
Published: (2025)
Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation
by: Ishikawa, Yuto, et al.
Published: (2024)
Construction and Analysis of Impression Caption Dataset for Environmental Sounds
by: Okamoto, Yuki, et al.
Published: (2024)
Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora
by: Suda, Hitoshi, et al.
Published: (2025)
Who Finds This Voice Attractive? A Large-Scale Experiment Using In-the-Wild Data
by: Suda, Hitoshi, et al.
Published: (2024)
Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch
by: Imamura, Kanami, et al.
Published: (2026)
Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
by: Manabe, Toranosuke, et al.
Published: (2026)
Analysing the Language of Neural Audio Codecs
by: Park, Joonyong, et al.
Published: (2025)
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models
by: Kando, Shunsuke, et al.
Published: (2025)
Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification
by: Wu, Bin, et al.
Published: (2024)
ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks
by: Jing, Xin, et al.
Published: (2024)
Learning Spatially-Aware Language and Audio Embeddings
by: Devnani, Bhavika, et al.
Published: (2024)