| Main Authors: | Xu, Manjie, Li, Chenxing, Tu, Xinyi, Ren, Yong, Fu, Ruibo, Liang, Wei, Yu, Dong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.09401 |
Similar Items
Video-to-Audio Generation with Hidden Alignment
by: Xu, Manjie, et al.
Published: (2024)
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
by: Xu, Le, et al.
Published: (2025)
SRC-gAudio: Sampling-Rate-Controlled Audio Generation
by: Li, Chenxing, et al.
Published: (2024)
Prompt-guided Precise Audio Editing with Diffusion Models
by: Xu, Manjie, et al.
Published: (2024)
Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning
by: Xu, Manjie, et al.
Published: (2026)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Information-Theoretic Complementary Prompts for Improved Continual Text Classification
by: Zhang, Duzhen, et al.
Published: (2025)
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
by: He, Xiang, et al.
Published: (2026)
Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning
by: Shao, Chenyang, et al.
Published: (2025)
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
by: Hu, Jiliang, et al.
Published: (2025)
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
by: Liu, Jizhong, et al.
Published: (2024)
Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images
by: Yu, Xiaofei, et al.
Published: (2024)
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
by: Xiao, Cihan, et al.
Published: (2026)
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
by: Tang, Changli, et al.
Published: (2025)
Federated Incremental Named Entity Recognition
by: Zhang, Duzhen, et al.
Published: (2024)
SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
by: Liu, Zheng, et al.
Published: (2024)
Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
by: Tian, Jinchuan, et al.
Published: (2026)
BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
by: Guo, Tianyu, et al.
Published: (2025)
MM-LLMs: Recent Advances in MultiModal Large Language Models
by: Zhang, Duzhen, et al.
Published: (2024)
Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition
by: Zhang, Duzhen, et al.
Published: (2025)
Enhancing Multimodal Continual Instruction Tuning with BranchLoRA
by: Zhang, Duzhen, et al.
Published: (2025)
Improving Text-To-Audio Models with Synthetic Captions
by: Kong, Zhifeng, et al.
Published: (2024)
MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
by: Govindarajan, Vijay, et al.
Published: (2025)
LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
by: Chen, Jizheng, et al.
Published: (2026)
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)
DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models
by: He, Shaokai, et al.
Published: (2026)
Learning to Plan with Personalized Preferences
by: Xu, Manjie, et al.
Published: (2025)
RECAP: Retrieval-Augmented Audio Captioning
by: Ghosh, Sreyan, et al.
Published: (2023)
Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning
by: Yan, Kaiying, et al.
Published: (2025)
Flexora: Flexible Low Rank Adaptation for Large Language Models
by: Wei, Chenxing, et al.
Published: (2024)
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
by: Lavoie, Samuel, et al.
Published: (2024)
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
by: Wu, Tsung-Han, et al.
Published: (2024)
Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)
CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
by: Munakata, Hokuto, et al.
Published: (2025)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
by: Su, Yi, et al.
Published: (2025)
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
by: Lu, Ke-Han, et al.
Published: (2025)
HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol
by: Mou, Xinyi, et al.
Published: (2026)