Library Catalog

Bibliographic Details
Main Authors:	Zhang, Dengming; You, Weitao; Li, Jingxiong; Lin, Weishen; Shi, Wenda; Zhao, Xue; Zuo, Heda; Wu, Junxian; Sun, Lingyun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.12077

Similar Items

Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
by: Wu, Junxian, et al.
Published: (2025)

Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning
by: Zhang, Dengming, et al.
Published: (2024)

GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
by: Zuo, Heda, et al.
Published: (2025)

Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
by: Jia, Yanhao, et al.
Published: (2025)

FonTS: Text Rendering with Typography and Style Controls
by: Shi, Wenda, et al.
Published: (2024)

WordCon: Word-level Typography Control in Scene Text Rendering
by: Shi, Wenda, et al.
Published: (2025)

AnySurf: Any Surface Generation with Directed Edge
by: Shi, Wenda, et al.
Published: (2026)

Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms
by: Loakman, Tyler, et al.
Published: (2025)

With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models
by: Loakman, Tyler, et al.
Published: (2024)

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
by: Chen, Kai, et al.
Published: (2024)

Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration
by: Teng, Zhuyu, et al.
Published: (2026)

Efficient and Scalable Chinese Vector Font Generation via Component Composition
by: Song, Jinyu, et al.
Published: (2024)

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
by: Nguyen, Le Thien Phuc, et al.
Published: (2025)

Green Energy and State Power: The Case of Zhanatas Wind Power Project in Kazakhstan
by: Weishen Zeng
Published: (2025)

EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation
by: Zhang, Cheng, et al.
Published: (2025)

Multilevel constructions of constant dimension codes based on one-factorization of complete graphs
by: Xu, Dengming, et al.
Published: (2025)

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
by: Limpijankit, Marvin, et al.
Published: (2026)

The Audio-Visual BatVision Dataset for Research on Sight and Sound
by: Brunetto, Amandine, et al.
Published: (2023)

It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models
by: Zhao, Xiangyu, et al.
Published: (2025)

Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
by: Park, Sooyoung, et al.
Published: (2025)

Viscometric investigations and molecular interactions of some derivatives of 5-substituted indole dihydropyrimidines in mixed organic solvents
by: L. C. Heda
Published: (2010)

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
by: Wang, Qiaolin, et al.
Published: (2025)

Spray Coating of Thick Perovskite Films for Photodetectors: The Aerosol–Liquid–Solid Mechanisms and Sensing Applications
by: Wei Qian, et al.
Published: (2026)

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
by: Wang, Haozhe, et al.
Published: (2026)

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
by: Deng, Shijian, et al.
Published: (2024)

HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation
by: Yang, Hongji, et al.
Published: (2026)

Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
by: Lyu, Zesen, et al.
Published: (2025)

Large Language Models Implicitly Learn to See and Hear Just By Reading
by: Verma, Prateek, et al.
Published: (2025)

Do Audio-Visual Large Language Models Really See and Hear?
by: Selvakumar, Ramaneswaran, et al.
Published: (2026)

ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
by: Yuan, Zhengqing, et al.
Published: (2023)

Vision Language Models See What You Want but not What You See
by: Gao, Qingying, et al.
Published: (2024)

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)

See Me, Hear Me: Skype in the Classroom
by: Foote, Carolyn
Published: (2008)

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
by: Shi, Zhengpeng, et al.
Published: (2025)

UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
by: Chen, Fengjiao, et al.
Published: (2025)

Meta-aware Learning in text-to-SQL Large Language Model
by: Zhang, Wenda
Published: (2025)

BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
by: Srikrishnan, Tharun Adithya, et al.
Published: (2025)

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models
by: Zhang, Wenda, et al.
Published: (2026)

How Self‐Congruity Elicits Tourists' Country Attachment, Patriotism, and Intention to Continuous Participation in Red Tourism
by: Dengming Xie, et al.
Published: (2025)

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
by: Zhang, Huaxiang, et al.
Published: (2024)