:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yichi; Chen, Gongwei; Zhu, Jun; Wan, Jia; Nie, Liqiang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.24372

Similar Items

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
by: Zhang, Renshan, et al.
Published: (2025)

Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
by: Chen, Yanda, et al.
Published: (2025)

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
by: Shen, Leyang, et al.
Published: (2024)

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
by: Lyu, Yibo, et al.
Published: (2025)

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
by: Zhang, Renshan, et al.
Published: (2024)

DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation
by: Shi, Haoxiang, et al.
Published: (2025)

PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
by: Lyu, Yibo, et al.
Published: (2026)

Reliable Representation Learning for Incomplete Multi-View Missing Multi-Label Classification
by: Liu, Chengliang, et al.
Published: (2023)

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
by: Li, Zaijing, et al.
Published: (2026)

Less is More: Empowering GUI Agent with Context-Aware Simplification
by: Chen, Gongwei, et al.
Published: (2025)

Object-Shot Enhanced Grounding Network for Egocentric Video
by: Feng, Yisen, et al.
Published: (2025)

Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling
by: Zou, Lexiao, et al.
Published: (2025)

MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning
by: Wu, Chengfei, et al.
Published: (2025)

Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
by: Wu, Gang, et al.
Published: (2025)

A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)

Embodied Crowd Counting
by: Long, Runling, et al.
Published: (2025)

Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
by: Li, Zhenyang, et al.
Published: (2024)

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition
by: Tan, Hao, et al.
Published: (2024)

Text-promptable Object Counting via Quantity Awareness Enhancement
by: Shi, Miaojing, et al.
Published: (2025)

CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion
by: Li, Yanyu, et al.
Published: (2025)

Quantity-Aware Coarse-to-Fine Correspondence for Image-to-Point Cloud Registration
by: Yao, Gongxin, et al.
Published: (2023)

Multimodal Reference Visual Grounding
by: Lu, Yangxiao, et al.
Published: (2025)

HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
by: Shao, Rui, et al.
Published: (2026)

UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
by: Wen, Haokun, et al.
Published: (2026)

Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
by: Pei, Gensheng, et al.
Published: (2024)

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels
by: Ruan, Haoxian, et al.
Published: (2024)

Rethinking Model Ensemble in Transfer-based Adversarial Attacks
by: Chen, Huanran, et al.
Published: (2023)

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026
by: Fu, Zhiheng, et al.
Published: (2026)

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)

Beyond Referring Expressions: Scenario Comprehension Visual Grounding
by: He, Ruozhen, et al.
Published: (2026)

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries
by: Zhu, Yijie, et al.
Published: (2025)

Scaling Laws for Black box Adversarial Attacks
by: Liu, Chuan, et al.
Published: (2024)

Stage-wise Adaptive Label Distribution for Facial Age Estimation
by: Wu, Bo, et al.
Published: (2025)

Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
by: Wang, Yuchen, et al.
Published: (2025)

RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
by: Zhao, Tianyi, et al.
Published: (2025)

Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding
by: Guo, Hao, et al.
Published: (2025)

OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs
by: Zhang, Yichi, et al.
Published: (2025)

DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
by: Ren, Li, et al.
Published: (2025)

Beyond Pixel-Wise Supervision for Medical Image Segmentation: From Traditional Models to Foundation Models
by: Shi, Yuyan, et al.
Published: (2024)

Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
by: Li, Xuchen, et al.
Published: (2026)