| Main Authors: | Zhang, Dengming, You, Weitao, Li, Jingxiong, Lin, Weishen, Shi, Wenda, Zhao, Xue, Zuo, Heda, Wu, Junxian, Sun, Lingyun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.12077 |
Similar Items
Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
by: Wu, Junxian, et al.
Published: (2025)
Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning
by: Zhang, Dengming, et al.
Published: (2024)
GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
by: Zuo, Heda, et al.
Published: (2025)
Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
by: Jia, Yanhao, et al.
Published: (2025)
FonTS: Text Rendering with Typography and Style Controls
by: Shi, Wenda, et al.
Published: (2024)
WordCon: Word-level Typography Control in Scene Text Rendering
by: Shi, Wenda, et al.
Published: (2025)
AnySurf: Any Surface Generation with Directed Edge
by: Shi, Wenda, et al.
Published: (2026)
Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms
by: Loakman, Tyler, et al.
Published: (2025)
With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models
by: Loakman, Tyler, et al.
Published: (2024)
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
by: Chen, Kai, et al.
Published: (2024)
Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration
by: Teng, Zhuyu, et al.
Published: (2026)
Efficient and Scalable Chinese Vector Font Generation via Component Composition
by: Song, Jinyu, et al.
Published: (2024)
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
by: Nguyen, Le Thien Phuc, et al.
Published: (2025)
Green Energy and State Power: The Case of Zhanatas Wind Power Project in Kazakhstan
by: Weishen Zeng
Published: (2025)
EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation
by: Zhang, Cheng, et al.
Published: (2025)
Multilevel constructions of constant dimension codes based on one-factorization of complete graphs
by: Xu, Dengming, et al.
Published: (2025)
Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
by: Limpijankit, Marvin, et al.
Published: (2026)
The Audio-Visual BatVision Dataset for Research on Sight and Sound
by: Brunetto, Amandine, et al.
Published: (2023)
It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models
by: Zhao, Xiangyu, et al.
Published: (2025)
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
by: Park, Sooyoung, et al.
Published: (2025)
Viscometric investigations and molecular interactions of some derivatives of 5-substituted indole dihydropyrimidines in mixed organic solvents
by: L. C. Heda
Published: (2010)
SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
by: Wang, Qiaolin, et al.
Published: (2025)
Spray Coating of Thick Perovskite Films for Photodetectors: The Aerosol–Liquid–Solid Mechanisms and Sensing Applications
by: Wei Qian, et al.
Published: (2026)
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
by: Wang, Haozhe, et al.
Published: (2026)
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
by: Deng, Shijian, et al.
Published: (2024)
HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation
by: Yang, Hongji, et al.
Published: (2026)
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
by: Lyu, Zesen, et al.
Published: (2025)
Large Language Models Implicitly Learn to See and Hear Just By Reading
by: Verma, Prateek, et al.
Published: (2025)
Do Audio-Visual Large Language Models Really See and Hear?
by: Selvakumar, Ramaneswaran, et al.
Published: (2026)
ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
by: Yuan, Zhengqing, et al.
Published: (2023)
Vision Language Models See What You Want but not What You See
by: Gao, Qingying, et al.
Published: (2024)
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)
See Me, Hear Me: Skype in the Classroom
by: Foote, Carolyn
Published: (2008)
v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
by: Shi, Zhengpeng, et al.
Published: (2025)
UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
by: Chen, Fengjiao, et al.
Published: (2025)
Meta-aware Learning in text-to-SQL Large Language Model
by: Zhang, Wenda
Published: (2025)
BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
by: Srikrishnan, Tharun Adithya, et al.
Published: (2025)
Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models
by: Zhang, Wenda, et al.
Published: (2026)
How Self‐Congruity Elicits Tourists' Country Attachment, Patriotism, and Intention to Continuous Participation in Red Tourism
by: Dengming Xie, et al.
Published: (2025)
InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
by: Zhang, Huaxiang, et al.
Published: (2024)