| Main authors: | Xie, Zhifei, Pang, Kaiyu, Zhang, Haobin, Ye, Deheng, Hu, Xiaobin, Yan, Shuicheng, Miao, Chunyan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online access: | https://arxiv.org/abs/2605.19833 |
Similar items
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
by: Xie, Zhifei, et al.
Published: (2025)
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
by: Li, Yuanchao, et al.
Published: (2024)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)
Low-latency Speech Enhancement via Speech Token Generation
by: Xue, Huaying, et al.
Published: (2023)
Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
by: Cui, Yang, et al.
Published: (2025)
Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition
by: Nishida, Naoto, et al.
Published: (2025)
Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
by: Lu, Wenhuan, et al.
Published: (2025)
Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining
by: Zhou, Rui, et al.
Published: (2024)
MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)
Conformer-based Ultrasound-to-Speech Conversion
by: Ibrahimov, Ibrahim, et al.
Published: (2025)
A Survey on Multimodal Music Emotion Recognition
by: Liyanarachchi, Rashini, et al.
Published: (2025)
MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
by: Chen, Chen, et al.
Published: (2024)
Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
by: Salvi, Davide, et al.
Published: (2024)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
by: Biyani, Ishan D., et al.
Published: (2025)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
by: Liu, Rui, et al.
Published: (2025)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
SonicSense: Object Perception from In-Hand Acoustic Vibration
by: Liu, Jiaxun, et al.
Published: (2024)
Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
by: Ye, Zhen, et al.
Published: (2025)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
by: Chen, Ziyang, et al.
Published: (2024)
Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
by: Chou, Huang-Cheng, et al.
Published: (2024)
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
by: Zhang, Hezhao, et al.
Published: (2026)
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
by: Ma, Ziyang, et al.
Published: (2024)