| Main Authors: | Zhang, Yunkai, Li, Linda, Cui, Yingxin, Ruan, Xiyuan, Zheng, Zeyu, Chen, Kezhen, Zhang, Yi, Yang, Diji |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.09687 |
Similar Items
VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
by: Yu, Haorui, et al.
Published: (2026)
Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
by: Lu, Jiaying, et al.
Published: (2023)
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
by: Gao, Chongyang, et al.
Published: (2026)
EvoVLA: Self-Evolving Vision-Language-Action Model
by: Liu, Zeting, et al.
Published: (2025)
GenIR: Generative Visual Feedback for Mental Image Retrieval
by: Yang, Diji, et al.
Published: (2025)
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
by: Thapa, Rahul, et al.
Published: (2024)
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
by: Ye, Angen, et al.
Published: (2025)
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
by: Bao, Wenxuan, et al.
Published: (2026)
Dynamic Execution Commitment of Vision-Language-Action Models
by: Chen, Feng, et al.
Published: (2026)
HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
by: Ruan, Hao, et al.
Published: (2025)
Hybrid Primal Sketch: Combining Analogy, Qualitative Representations, and Computer Vision for Scene Understanding
by: Forbus, Kenneth D., et al.
Published: (2024)
Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
by: Wang, Xiyuan, et al.
Published: (2025)
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
by: Lai, Yingxin, et al.
Published: (2026)
Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
by: Qian, Zhaofang, et al.
Published: (2025)
FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
by: Dang, Yunkai, et al.
Published: (2025)
CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
by: Yin, Xiangyu, et al.
Published: (2025)
Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback
by: Yang, Diji, et al.
Published: (2024)
VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models
by: Ren, Yufan, et al.
Published: (2025)
Cascade Prompt Learning for Vision-Language Model Adaptation
by: Wu, Ge, et al.
Published: (2024)
Image Fusion via Vision-Language Model
by: Zhao, Zixiang, et al.
Published: (2024)
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
by: Zhou, Yuchen, et al.
Published: (2025)
MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model
by: Wang, Xinyang, et al.
Published: (2024)
TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models
by: Yin, Xiangyu, et al.
Published: (2025)
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
by: Li, Zheng, et al.
Published: (2024)
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
by: Deng, Yunlong, et al.
Published: (2025)
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
by: Huang, Ting, et al.
Published: (2025)
GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
by: Ma, Guoqing, et al.
Published: (2026)
LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
by: Zhao, Zihui, et al.
Published: (2025)
Conceptual Codebook Learning for Vision-Language Models
by: Zhang, Yi, et al.
Published: (2024)
Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models
by: Cao, Sihan, et al.
Published: (2026)
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
by: Dang, Yunkai, et al.
Published: (2026)
Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
by: Yang, Yunkai, et al.
Published: (2025)
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
by: Liu, Zheng, et al.
Published: (2025)
Concept-Guided Prompt Learning for Generalization in Vision-Language Models
by: Zhang, Yi, et al.
Published: (2024)
VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation
by: Zhang, Ruiyang, et al.
Published: (2024)
In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing
by: Zheng, Qiaojie, et al.
Published: (2025)
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
by: Pan, Yu, et al.
Published: (2026)
MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models
by: Zhao, Shiji, et al.
Published: (2025)
Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution
by: Wei, Timothy, et al.
Published: (2024)
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
by: Ruan, Jiacheng, et al.
Published: (2025)