| Main Authors: | Mishra, Abhijit, Li, Mingda, Fu, Hsiang, Noh, Richard, Kim, Minji |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.14780 |
Similar Items
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
by: Abaskohi, Amirhossein, et al.
Published: (2026)
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
by: Xu, Zhiyang, et al.
Published: (2024)
Improved Baselines with Visual Instruction Tuning
by: Liu, Haotian, et al.
Published: (2023)
ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
by: Wan, Zifu, et al.
Published: (2025)
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
by: Fan, Zhiwen, et al.
Published: (2025)
PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis
by: Lokesh, K, et al.
Published: (2026)
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
by: Kim, Donghoon, et al.
Published: (2025)
GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
by: Cho, Seunghyuk, et al.
Published: (2025)
SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference
by: Mishra, Abhijit, et al.
Published: (2023)
Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer
by: Li, Mingda, et al.
Published: (2024)
Re:Verse -- Can Your VLM Read a Manga?
by: Baranwal, Aaditya, et al.
Published: (2025)
SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
by: Liu, Zheng, et al.
Published: (2024)
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
by: Wei, Ziming, et al.
Published: (2025)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
by: Jiang, Ziyan, et al.
Published: (2024)
Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline
by: Jia, Qi, et al.
Published: (2024)
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
by: Tian, Yu, et al.
Published: (2024)
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
by: Park, ChaeHun, et al.
Published: (2024)
Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models
by: Liu, Zikang, et al.
Published: (2025)
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
by: Du, Yifan, et al.
Published: (2023)
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
by: Choi, Juhwan, et al.
Published: (2024)
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
by: Meng, Rui, et al.
Published: (2025)
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
by: Li, Bin, et al.
Published: (2025)
OViP: Online Vision-Language Preference Learning for VLM Hallucination
by: Liu, Shujun, et al.
Published: (2025)
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
by: Tanaka, Ryota, et al.
Published: (2024)
PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
by: Liang, Jiafeng, et al.
Published: (2024)
A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation
by: Zhao, Yi, et al.
Published: (2026)
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
by: Atuhurra, Jesse, et al.
Published: (2024)
From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
by: Sheta, Hala, et al.
Published: (2025)
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
by: Hu, Zhe, et al.
Published: (2025)
LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
by: Wang, Yanan, et al.
Published: (2025)
Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
by: Jang, Jihyoung, et al.
Published: (2025)
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
by: Fu, Xingyu, et al.
Published: (2025)
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
by: Lim, Gyubeum, et al.
Published: (2025)
Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning
by: Singh, Ayush, et al.
Published: (2024)
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
by: Zhang, Di, et al.
Published: (2024)
Annotation-Free Reinforcement Learning Query Rewriting via Verifiable Search Reward
by: Cha, Sungguk, et al.
Published: (2025)
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
by: Li, Junxian, et al.
Published: (2025)