| Main Author: | NAVER Cloud HyperCLOVA X Team |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.01792 |
Similar Items
HyperCLOVA X 32B Think
by: NAVER Cloud HyperCLOVA X Team
Published: (2026)
HyperCLOVA X THINK Technical Report
by: NAVER Cloud HyperCLOVA X Team
Published: (2025)
HyperCLOVA X Technical Report
by: Yoo, Kang Min, et al.
Published: (2024)
LongCat-Flash-Omni Technical Report
by: Meituan LongCat Team, et al.
Published: (2025)
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
by: Gu, Zijin, et al.
Published: (2025)
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
by: Xie, Zhifei, et al.
Published: (2024)
Ming-Omni: A Unified Multimodal Model for Perception and Generation
by: AI, Inclusion, et al.
Published: (2025)
Learning When to Think While Listening in Large Audio-Language Models
by: Song, Zhiyuan, et al.
Published: (2026)
MAEB: Massive Audio Embedding Benchmark
by: Assadi, Adnan El, et al.
Published: (2026)
DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
by: Lou, Yuxuan, et al.
Published: (2026)
WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
by: Della Libera, Luca, et al.
Published: (2026)
EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
by: Modi, Smit Nautambhai, et al.
Published: (2026)
Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer
by: Vavaroutsos, Petros, et al.
Published: (2026)
Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods
by: Shendabadi, Ali, et al.
Published: (2026)
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
by: Bogavelli, Tara, et al.
Published: (2026)
AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?
by: Ok, Hyunjong, et al.
Published: (2025)
Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
by: Kumar, Gokul Karthik, et al.
Published: (2025)
PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
by: He, Jiajun, et al.
Published: (2025)
Synthetic Audio Helps for Cognitive State Tasks
by: Soubki, Adil, et al.
Published: (2025)
Automatic Time Signature Determination for New Scores Using Lyrics for Latent Rhythmic Structure
by: Liao, Callie C., et al.
Published: (2023)
LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
by: Sun, Yirong, et al.
Published: (2025)
Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models
by: Mehta, Atharva, et al.
Published: (2025)
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
by: Deng, Qixin, et al.
Published: (2024)
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
by: Le-Duc, Khai, et al.
Published: (2025)
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
by: Fang, Zheng, et al.
Published: (2026)
Arabic Little STT: Arabic Children Speech Recognition Dataset
by: Alkadri, Mouhand, et al.
Published: (2025)
Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments
by: Choudhury, Ritabrata Roy
Published: (2024)
Kimi-Audio Technical Report
by: KimiTeam, et al.
Published: (2025)
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
by: Kuan, Chun-Yi, et al.
Published: (2026)
SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models
by: Hussain, Aafiya, et al.
Published: (2026)
AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
by: Kuan, Chun-Yi, et al.
Published: (2026)
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
by: Kuan, Chun-Yi, et al.
Published: (2026)
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
by: Zhu, Yonggang, et al.
Published: (2026)
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
by: Kim, Taesoo, et al.
Published: (2025)
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
by: Yoo, Suho, et al.
Published: (2025)
Sentiment Reasoning for Healthcare
by: Nguyen, Khai-Nguyen, et al.
Published: (2024)
Dual Knowledge Distillation for Efficient Sound Event Detection
by: Xiao, Yang, et al.
Published: (2024)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)