| Main Authors: | Zhang, Yichi, Chen, Gongwei, Zhu, Jun, Wan, Jia, Nie, Liqiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.24372 |
Similar Items
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
by: Zhang, Renshan, et al.
Published: (2025)
Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
by: Chen, Yanda, et al.
Published: (2025)
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
by: Shen, Leyang, et al.
Published: (2024)
PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
by: Lyu, Yibo, et al.
Published: (2025)
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
by: Zhang, Renshan, et al.
Published: (2024)
DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation
by: Shi, Haoxiang, et al.
Published: (2025)
PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
by: Lyu, Yibo, et al.
Published: (2026)
Reliable Representation Learning for Incomplete Multi-View Missing Multi-Label Classification
by: Liu, Chengliang, et al.
Published: (2023)
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
by: Li, Zaijing, et al.
Published: (2026)
Less is More: Empowering GUI Agent with Context-Aware Simplification
by: Chen, Gongwei, et al.
Published: (2025)
Object-Shot Enhanced Grounding Network for Egocentric Video
by: Feng, Yisen, et al.
Published: (2025)
Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling
by: Zou, Lexiao, et al.
Published: (2025)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning
by: Wu, Chengfei, et al.
Published: (2025)
Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
by: Wu, Gang, et al.
Published: (2025)
A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)
Embodied Crowd Counting
by: Long, Runling, et al.
Published: (2025)
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
by: Li, Zhenyang, et al.
Published: (2024)
PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition
by: Tan, Hao, et al.
Published: (2024)
Text-promptable Object Counting via Quantity Awareness Enhancement
by: Shi, Miaojing, et al.
Published: (2025)
CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion
by: Li, Yanyu, et al.
Published: (2025)
Quantity-Aware Coarse-to-Fine Correspondence for Image-to-Point Cloud Registration
by: Yao, Gongxin, et al.
Published: (2023)
Multimodal Reference Visual Grounding
by: Lu, Yangxiao, et al.
Published: (2025)
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
by: Shao, Rui, et al.
Published: (2026)
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
by: Wen, Haokun, et al.
Published: (2026)
Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
by: Pei, Gensheng, et al.
Published: (2024)
Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels
by: Ruan, Haoxian, et al.
Published: (2024)
Rethinking Model Ensemble in Transfer-based Adversarial Attacks
by: Chen, Huanran, et al.
Published: (2023)
EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026
by: Fu, Zhiheng, et al.
Published: (2026)
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)
Beyond Referring Expressions: Scenario Comprehension Visual Grounding
by: He, Ruozhen, et al.
Published: (2026)
UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries
by: Zhu, Yijie, et al.
Published: (2025)
Scaling Laws for Black box Adversarial Attacks
by: Liu, Chuan, et al.
Published: (2024)
Stage-wise Adaptive Label Distribution for Facial Age Estimation
by: Wu, Bo, et al.
Published: (2025)
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
by: Wang, Yuchen, et al.
Published: (2025)
RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
by: Zhao, Tianyi, et al.
Published: (2025)
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding
by: Guo, Hao, et al.
Published: (2025)
OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs
by: Zhang, Yichi, et al.
Published: (2025)
DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
by: Ren, Li, et al.
Published: (2025)
Beyond Pixel-Wise Supervision for Medical Image Segmentation: From Traditional Models to Foundation Models
by: Shi, Yuyan, et al.
Published: (2024)
Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
by: Li, Xuchen, et al.
Published: (2026)