| Main Authors: | Park, Junghyun, Nguyen, Tuan Anh, Min, Dugki |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.05333 |
Similar Items
STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
by: Nguyen-Nhu, Tinh-Anh, et al.
Published: (2025)
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
by: Kang, Donggoo, et al.
Published: (2024)
GeoWorld-VLM: Geometry from World Models for Vision-Language Models
by: Gu, Renjie, et al.
Published: (2026)
VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition
by: Sellam, Abdellah Zakaria, et al.
Published: (2025)
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
by: Saxena, Rohit, et al.
Published: (2026)
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
by: Zhu, William Yicheng, et al.
Published: (2024)
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
by: Zhang, Jianke, et al.
Published: (2026)
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding
by: Nguyen-Truong, Hai, et al.
Published: (2024)
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
by: Collins, Brandon, et al.
Published: (2026)
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
by: Zhou, Shijie, et al.
Published: (2025)
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
by: Cao, Tri, et al.
Published: (2026)
Facial Expression Recognition Using Residual Masking Network
by: Pham, Luan, et al.
Published: (2026)
Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues
by: Vu, Tuan-Anh, et al.
Published: (2025)
DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving
by: Fu, Yongjie, et al.
Published: (2024)
RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
by: Liao, Zijun, et al.
Published: (2025)
SpecVLM: Fast Speculative Decoding in Vision-Language Models
by: Huang, Haiduo, et al.
Published: (2025)
ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
by: Xi, Gongli, et al.
Published: (2026)
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
by: Le, Anh Duy, et al.
Published: (2026)
FastVLM: Efficient Vision Encoding for Vision Language Models
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2024)
TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
by: Wilson, Bibin
Published: (2026)
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
by: Liu, Hanqing, et al.
Published: (2026)
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
by: Xu, Wanting, et al.
Published: (2024)
RadVLM: A Multitask Conversational Vision-Language Model for Radiology
by: Deperrois, Nicolas, et al.
Published: (2025)
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
by: Zhang, Jusheng, et al.
Published: (2025)
HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
by: Song, Harris, et al.
Published: (2025)
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
by: Le, Huy, et al.
Published: (2023)
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)
WalkVLM:Aid Visually Impaired People Walking by Vision Language Model
by: Yuan, Zhiqiang, et al.
Published: (2024)
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
by: Huang, Mingzhe, et al.
Published: (2026)
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
by: Chu, Xiangxiang, et al.
Published: (2024)
VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
by: Hyeon-Woo, Nam, et al.
Published: (2024)
GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding
by: Mathew, Athul M., et al.
Published: (2025)
Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
by: Tang, Yin, et al.
Published: (2025)
KRAST: Knowledge-Augmented Robotic Action Recognition with Structured Text for Vision-Language Models
by: Nguyen, Son Hai, et al.
Published: (2025)
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
by: Lee, Junghyun, et al.
Published: (2026)
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
by: Sun, Fan-Yun, et al.
Published: (2024)
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
by: Sarowar, Md Selim, et al.
Published: (2025)
Novel 3D Binary Indexed Tree for Volume Computation of 3D Reconstructed Models from Volumetric Data
by: Nguyen-Le, Quoc-Bao, et al.
Published: (2024)
Re-Scoring Using Image-Language Similarity for Few-Shot Object Detection
by: Jung, Min Jae, et al.
Published: (2023)
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
by: Nguyen, Thong, et al.
Published: (2025)