| Main Authors: | Zhang, Bo, Li, Shuo, Tian, Runhe, Yang, Yang, Tang, Jixin, Zhou, Jinhao, Ma, Lin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.09498 |
Similar Items
AgriGPT-VL: Agricultural Vision-Language Understanding Suite
by: Yang, Bo, et al.
Published: (2025)
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
by: Tang, Yixin, et al.
Published: (2026)
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
by: Cui, Cheng, et al.
Published: (2025)
VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer
by: Zhong, Humen, et al.
Published: (2024)
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
by: Chen, Jiuhai, et al.
Published: (2024)
Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism
by: Zhang, Siyu, et al.
Published: (2025)
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
by: Trinh, Quoc-Huy, et al.
Published: (2026)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
by: Wang, Peng, et al.
Published: (2024)
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
by: Li, Lei, et al.
Published: (2024)
Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
by: Dong, Daxiang, et al.
Published: (2025)
DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models
by: Miao, Hongyi, et al.
Published: (2026)
Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
by: Gao, Kuofeng, et al.
Published: (2024)
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
by: Lu, Xinhua, et al.
Published: (2025)
BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
by: Tian, Qingyao, et al.
Published: (2026)
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
by: Wei, Zhixiang, et al.
Published: (2026)
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
by: He, Yefei, et al.
Published: (2024)
A-VL: Adaptive Attention for Large Vision-Language Models
by: Zhang, Junyang, et al.
Published: (2024)
Skeleton Detection Using Dual Radars with Integration of Dual-View CNN Models and mmPose
by: Kodama, Masaharu, et al.
Published: (2024)
Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception
by: Thushara, Rusiru, et al.
Published: (2026)
3VL: Using Trees to Improve Vision-Language Models' Interpretability
by: Yellinek, Nir, et al.
Published: (2023)
VL4Gaze: Unleashing Vision-Language Models for Gaze Following
by: Wang, Shijing, et al.
Published: (2025)
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
by: Zeng, Lunbin, et al.
Published: (2025)
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
by: Gao, Mingjian, et al.
Published: (2026)
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
by: Yao, Yuan, et al.
Published: (2026)
SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
by: Cheng, Shuang, et al.
Published: (2025)
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
by: Lu, Jinghui, et al.
Published: (2026)
TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
by: Ye, Jinlun, et al.
Published: (2026)
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
by: Tao, Hongyuan, et al.
Published: (2025)
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
by: Ye, Jiacheng, et al.
Published: (2025)
EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework
by: Wang, Junjue, et al.
Published: (2026)
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
by: Zhou, Yuchen, et al.
Published: (2025)
Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection
by: Lai, Runhe, et al.
Published: (2025)
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
by: Li, Tianbin, et al.
Published: (2024)
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
by: Zhu, Xingyu, et al.
Published: (2026)
FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
by: Li, Jianjian, et al.
Published: (2025)
ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
by: Yi, Jingwei, et al.
Published: (2025)
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
by: Wu, Zekai, et al.
Published: (2026)
LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
by: Huang, Jiangyong, et al.
Published: (2025)
VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation
by: Zhang, Ruiyang, et al.
Published: (2024)
VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection
by: Zhong, Liangyu, et al.
Published: (2024)