| Main Authors: | Zhu, Warren, Ramezani, Aida, Xu, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.11473 |
Similar Items
StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling
by: Jing, Liqi, et al.
Published: (2026)
Unsupervised Audio-Visual Segmentation with Modality Alignment
by: Bhosale, Swapnil, et al.
Published: (2024)
MITracker: Multi-View Integration for Visual Object Tracking
by: Xu, Mengjie, et al.
Published: (2025)
Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification
by: Xu, Qin, et al.
Published: (2025)
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
by: Dai, Ming, et al.
Published: (2024)
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
by: Chen, Pingyi, et al.
Published: (2024)
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)
Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning
by: Wang, Xi, et al.
Published: (2026)
Bridge then Begin Anew: Generating Target-relevant Intermediate Model for Source-free Visual Emotion Adaptation
by: Zhu, Jiankun, et al.
Published: (2024)
DTL: Disentangled Transfer Learning for Visual Recognition
by: Fu, Minghao, et al.
Published: (2023)
Learning Physical Dynamics for Object-centric Visual Prediction
by: Xu, Huilin, et al.
Published: (2024)
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
by: Zhan, Yufei, et al.
Published: (2025)
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
by: Zhu, Xuanyu, et al.
Published: (2026)
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
by: Gao, Mingjian, et al.
Published: (2026)
DOGR: Towards Versatile Visual Document Grounding and Referring
by: Zhou, Yinan, et al.
Published: (2024)
Contextual inference from single objects in Vision-Language models
by: Vilas, Martina G., et al.
Published: (2026)
Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking
by: Tang, Zhangyong, et al.
Published: (2025)
Exploring Task-Level Optimal Prompts for Visual In-Context Learning
by: Zhu, Yan, et al.
Published: (2025)
Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
by: Yu, Yinfeng, et al.
Published: (2025)
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
by: Xu, Haoran, et al.
Published: (2026)
Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments
by: Yang, Xiao, et al.
Published: (2025)
Visual-ERM: Reward Modeling for Visual Equivalence
by: Liu, Ziyu, et al.
Published: (2026)
STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
by: Ding, Zijun, et al.
Published: (2025)
MemoNav: Working Memory Model for Visual Navigation
by: Li, Hongxin, et al.
Published: (2024)
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
by: Suo, Yucheng, et al.
Published: (2025)
An adversarial feature learning based semantic communication method for Human 3D Reconstruction
by: Liu, Shaojiang, et al.
Published: (2024)
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
by: Zhan, Yufei, et al.
Published: (2024)
GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
by: Zheng, Shurong, et al.
Published: (2026)
Mutual Information guided Visual Contrastive Learning
by: Chen, Hanyang, et al.
Published: (2025)
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)
Visual Space Optimization for Zero-shot Learning
by: Wang, Xinsheng, et al.
Published: (2019)
ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
by: Yang, Ziteng, et al.
Published: (2025)
Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning
by: Lu, Wenting, et al.
Published: (2026)
VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes
by: Chen, Jingru, et al.
Published: (2026)
Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement
by: Ma, Wei, et al.
Published: (2026)
Dual Latent Memory for Visual Multi-agent System
by: Yu, Xinlei, et al.
Published: (2026)
Adversarial Error Correction for Visual Autoregressive Generation
by: Bi, Ligong, et al.
Published: (2026)
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
by: Xi, Suyang, et al.
Published: (2026)
Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis
by: Chen, Weiming, et al.
Published: (2025)