| Main Authors: | Sun, Ye, Zhang, Hao, Ding, Henghui, Zhang, Tiehua, Ma, Xingjun, Jiang, Yu-Gang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.18812 |
Similar Items
UnSeg: One Universal Unlearnable Example Generator is Enough against All Image Segmentation
by: Sun, Ye, et al.
Published: (2024)
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
by: Sun, Ye, et al.
Published: (2026)
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
by: Mou, Tingshu, et al.
Published: (2026)
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models
by: Shuai, Xincheng, et al.
Published: (2024)
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
by: Shuai, Xincheng, et al.
Published: (2025)
Artemis: Towards Referential Understanding in Complex Videos
by: Qiu, Jihao, et al.
Published: (2024)
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)
White-box Multimodal Jailbreaks Against Large Vision-Language Models
by: Wang, Ruofan, et al.
Published: (2024)
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
by: Zhang, Xinyao, et al.
Published: (2026)
FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models
by: Zhai, Kun, et al.
Published: (2025)
LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
by: Saxena, Pranav, et al.
Published: (2025)
SafeVid: Toward Safety Aligned Video Large Multimodal Models
by: Wang, Yixu, et al.
Published: (2025)
SegPoint: Segment Any Point Cloud via Large Language Model
by: He, Shuting, et al.
Published: (2024)
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
by: Ding, Henghui, et al.
Published: (2025)
Grounding Language in Multi-Perspective Referential Communication
by: Tang, Zineng, et al.
Published: (2024)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
by: Qu, Mengxue, et al.
Published: (2024)
BadPatch: Diffusion-Based Generation of Physical Adversarial Patches
by: Wang, Zhixiang, et al.
Published: (2024)
ROSE: Retrieval-Oriented Segmentation Enhancement
by: Tang, Song, et al.
Published: (2026)
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
by: He, Shuting, et al.
Published: (2024)
NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models
by: Zhang, Jiaming, et al.
Published: (2025)
SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
by: Shen, Ruiqi, et al.
Published: (2026)
AIM: Additional Image Guided Generation of Transferable Adversarial Attacks
by: Li, Teng, et al.
Published: (2025)
Adversarial Prompt Tuning for Vision-Language Models
by: Zhang, Jiaming, et al.
Published: (2023)
GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
by: Ding, Henghui, et al.
Published: (2026)
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
by: Yue, Tongtian, et al.
Published: (2024)
Adversarial Prompt Distillation for Vision-Language Models
by: Luo, Lin, et al.
Published: (2024)
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
by: Wang, Guankun, et al.
Published: (2025)
PixelSmile: Toward Fine-Grained Facial Expression Editing
by: Hua, Jiabin, et al.
Published: (2026)
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation
by: He, Shuting, et al.
Published: (2024)
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
by: Zhang, Haochen, et al.
Published: (2025)
IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting
by: Fu, Hao, et al.
Published: (2025)
Extracting Training Data from Unconditional Diffusion Models
by: Chen, Yunhao, et al.
Published: (2024)
MOVE: Motion-Guided Few-Shot Video Object Segmentation
by: Ying, Kaining, et al.
Published: (2025)
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
by: Ding, Henghui, et al.
Published: (2025)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
by: Wang, Ruofan, et al.
Published: (2024)
TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
by: Wang, Xin, et al.
Published: (2026)
T2UE: Generating Unlearnable Examples from Text Descriptions
by: Ma, Xingjun, et al.
Published: (2025)
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
by: Sun, Yiming, et al.
Published: (2024)
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
by: Maaz, Muhammad, et al.
Published: (2023)
PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer
by: Feng, Qian, et al.
Published: (2024)