| Main Authors: | Zhao, Zijian, Jin, Dian, Zhou, Zijing, Zhang, Xiaoyu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.03660 |
Similar Items
Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?
by: Zhao, Zijian, et al.
Published: (2025)
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
by: Zhao, Zijian, et al.
Published: (2025)
Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
by: Jin, Zeyu, et al.
Published: (2026)
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
by: Wang, Xiaohan, et al.
Published: (2026)
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
by: Li, Haitian, et al.
Published: (2026)
PlanLLM: Video Procedure Planning with Refinable Large Language Models
by: Yang, Dejie, et al.
Published: (2024)
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
by: Zheng, Sixiao, et al.
Published: (2025)
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
by: Cai, Minghong, et al.
Published: (2024)
SimLabel: Similarity-Weighted Iterative Framework for Multi-annotator Learning with Missing Annotations
by: Zhang, Liyun, et al.
Published: (2025)
ELF: A Family of Encoder-Free ECG-Language Models
by: Han, William, et al.
Published: (2026)
Adaptive Social Metaverse Streaming based on Federated Multi-Agent Deep Reinforcement Learning
by: Long, Zijian, et al.
Published: (2025)
Stage-Adaptive Reliability Modeling for Continuous Valence-Arousal Estimation
by: Lee, Yubeen, et al.
Published: (2026)
LightThinker: Thinking Step-by-Step Compression
by: Zhang, Jintian, et al.
Published: (2025)
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
by: Chen, Jianhao, et al.
Published: (2026)
TIP and Polish: Text-Image-Prototype Guided Multi-Modal Generation via Commonality-Discrepancy Modeling and Refinement
by: Ma, Zhiyong, et al.
Published: (2025)
EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction
by: Jing, Chong, et al.
Published: (2026)
Sequence-to-Sequence Multi-Modal Speech In-Painting
by: Elyaderani, Mahsa Kadkhodaei, et al.
Published: (2024)
LightThinker++: From Reasoning Compression to Memory Management
by: Zhu, Yuqi, et al.
Published: (2026)
A Split-Window Transformer for Multi-Model Sequence Spammer Detection using Multi-Model Variational Autoencoder
by: Yang, Zhou, et al.
Published: (2025)
AniME: Adaptive Multi-Agent Planning for Long Animation Generation
by: Zhang, Lisai, et al.
Published: (2025)
Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation
by: Xie, Bingyan, et al.
Published: (2025)
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
by: Zhao, Jinghua, et al.
Published: (2025)
CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition
by: Peng, Cheng, et al.
Published: (2023)
M3TR: Temporal Retrieval Enhanced Multi-Modal Micro-video Popularity Prediction
by: Lu, Jiacheng, et al.
Published: (2024)
Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios
by: Zhang, Yuan, et al.
Published: (2024)
A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects
by: Ruan, Shulan, et al.
Published: (2025)
Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data
by: Kumar, Puneet, et al.
Published: (2024)
Retrieval-Augmented Generation for Electrocardiogram-Language Models
by: Song, Xiaoyu, et al.
Published: (2025)
Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach
by: Elyaderani, Mahsa Kadkhodaei, et al.
Published: (2024)
HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning
by: Zheng, Chuhang, et al.
Published: (2025)
Semantic-Guided Unsupervised Video Summarization
by: Liu, Haizhou, et al.
Published: (2026)
Morphe: High-Fidelity Generative Video Streaming with Vision Foundation Model
by: Gong, Tianyi, et al.
Published: (2026)
MM-HSD: Multi-Modal Hate Speech Detection in Videos
by: Céspedes-Sarrias, Berta, et al.
Published: (2025)
OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
by: Chen, Junzhe, et al.
Published: (2025)
FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection
by: Zhou, Ziyi, et al.
Published: (2024)
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education
by: Jia, Yanhao, et al.
Published: (2025)
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification
by: Nguyen, Truong Thanh Hung, et al.
Published: (2026)
BlobCtrl: Taming Controllable Blob for Element-level Image Editing
by: Li, Yaowei, et al.
Published: (2025)
A Multi-modal Fusion Network for Terrain Perception Based on Illumination Aware
by: Wang, Rui, et al.
Published: (2025)
Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms
by: Wen, Penghui, et al.
Published: (2023)