| Main Authors: | Wang, Liping; Ye, Cheng; Chen, Weidong; Song, Peipei; Hu, Bo; Mao, Zhendong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.18988 |
Similar Items
FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning
by: Chen, Weidong, et al.
Published: (2026)
Dual-path Collaborative Generation Network for Emotional Video Captioning
by: Ye, Cheng, et al.
Published: (2024)
HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
by: Qin, Zheng, et al.
Published: (2025)
Beyond Emotion Recognition: A Multi-Turn Multimodal Emotion Understanding and Reasoning Benchmark
by: Hu, Jinpeng, et al.
Published: (2025)
HRR: Hierarchical Retrospection Refinement for Generated Image Detection
by: Yuan, Peipei, et al.
Published: (2025)
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
by: Chen, Weidong, et al.
Published: (2026)
Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
by: Fu, Fengyi, et al.
Published: (2024)
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
by: Jiang, Haichao, et al.
Published: (2026)
MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing
by: Qiu, Zhaoyuan, et al.
Published: (2026)
Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation
by: Wang, Junjie, et al.
Published: (2026)
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis
by: Guo, Yijie, et al.
Published: (2025)
Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing
by: Li, Hao, et al.
Published: (2024)
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
by: Shao, Shitong, et al.
Published: (2025)
MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
by: Ling, Zeyu, et al.
Published: (2023)
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
by: Chen, Nan, et al.
Published: (2025)
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
by: Wang, Zeheng, et al.
Published: (2026)
Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering
by: Song, Jiayi, et al.
Published: (2025)
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
by: Lin, Ronghao, et al.
Published: (2025)
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
by: Du, Hao, et al.
Published: (2025)
LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning
by: Fu, Fengyi, et al.
Published: (2025)
Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering
by: Song, Peipei, et al.
Published: (2025)
MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding
by: Cao, Shiwen, et al.
Published: (2025)
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
by: Shi, Changyue, et al.
Published: (2025)
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
by: Khalid, Umar, et al.
Published: (2024)
D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation
by: Jia, Weinan, et al.
Published: (2025)
NativeTok: Native Visual Tokenization for Improved Image Generation
by: Wu, Bin, et al.
Published: (2026)
VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
by: Kelly, Chris, et al.
Published: (2024)
EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation
by: Mao, Qi, et al.
Published: (2025)
Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
by: Yu, Xinlei, et al.
Published: (2025)
GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
by: Jiang, Kaixun, et al.
Published: (2026)
MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
by: Wang, Shijian, et al.
Published: (2026)
Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
by: Tang, Song, et al.
Published: (2023)
CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation
by: Du, Yuxuan, et al.
Published: (2025)
Scene-Agnostic Traversability Labeling and Estimation via a Multimodal Self-supervised Framework
by: Fang, Zipeng, et al.
Published: (2025)
All in One Framework for Multimodal Re-identification in the Wild
by: Li, He, et al.
Published: (2024)
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
by: Wang, Shengqin, et al.
Published: (2026)
DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation
by: Huang, Mengqi, et al.
Published: (2022)
Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection
by: Luo, Dengyan, et al.
Published: (2024)
Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents
by: Wang, Zhao, et al.
Published: (2025)
HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models
by: Zhuang, Shuhan, et al.
Published: (2025)