| Main Authors: | Stooke, Adam; Prabhavalkar, Rohit; Sim, Khe Chai; Mengibar, Pedro Moreno |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.05232 |
Similar Items
Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models
by: Munkhdalai, Tsendsuren, et al.
Published: (2024)
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models
by: Prabhavalkar, Rohit, et al.
Published: (2024)
Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
by: Lou, Haowei, et al.
Published: (2024)
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
by: Lepage, Theo, et al.
Published: (2025)
Label-Looping: Highly Efficient Decoding for Transducers
by: Bataev, Vladimir, et al.
Published: (2024)
Piano Transcription by Hierarchical Language Modeling with Pretrained Roll-based Encoders
by: Li, Dichucheng, et al.
Published: (2025)
Scaling Self-Supervised Representation Learning for Symbolic Piano Performance
by: Bradshaw, Louis, et al.
Published: (2025)
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
by: Alex, Tony, et al.
Published: (2025)
Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport
by: Torres, Bernardo, et al.
Published: (2025)
Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2023)
PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
by: Riou, Alain, et al.
Published: (2025)
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
by: Lin, Tzu-Quan, et al.
Published: (2024)
Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
by: Fu, Szu-Wei, et al.
Published: (2024)
RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
by: Bataev, Vladimir
Published: (2025)
DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective
by: Chi, Hyung Gun, et al.
Published: (2025)
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
by: Chen, Wenxi, et al.
Published: (2024)
Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures
by: Ioannides, Georgios, et al.
Published: (2026)
AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
by: Zhang, Junan, et al.
Published: (2025)
Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
by: Grigoryan, Lilit, et al.
Published: (2025)
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection
by: Yan, Sheng, et al.
Published: (2024)
Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT
by: Jafarzadeh, Pourya, et al.
Published: (2024)
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
by: Andrusenko, Andrei, et al.
Published: (2024)
Model as Loss: A Self-Consistent Training Paradigm
by: Phaye, Saisamarth Rajesh, et al.
Published: (2025)
Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
by: Nihal, Ragib Amin, et al.
Published: (2025)
JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
by: Ioannides, Georgios, et al.
Published: (2025)
Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
by: Bitra, Venkat Suprabath, et al.
Published: (2026)
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
by: Neekhara, Paarth, et al.
Published: (2023)
DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
by: Fujita, Yoto, et al.
Published: (2024)
Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework
by: Kundu, Niloy Kumar, et al.
Published: (2024)
Whisfusion: Parallel ASR Decoding via a Diffusion Transformer
by: Kwon, Taeyoun, et al.
Published: (2025)
Masked Audio Generation using a Single Non-Autoregressive Transformer
by: Ziv, Alon, et al.
Published: (2024)
DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening
by: Lu, Zhixiang, et al.
Published: (2025)
Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network
by: Cai, Jinjin, et al.
Published: (2024)
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
by: Segal-Feldman, Yael, et al.
Published: (2024)
Comparative Analysis of CNN and Transformer Architectures with Heart Cycle Normalization for Automated Phonocardiogram Classification
by: Sondermann, Martin, et al.
Published: (2025)
Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers
by: Işık, Atakan, et al.
Published: (2025)
Multi-blank Transducers for Speech Recognition
by: Xu, Hainan, et al.
Published: (2022)
TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms
by: Sui, Yueyuan, et al.
Published: (2024)
Audio Transformers
by: Verma, Prateek, et al.
Published: (2021)