:: Library Catalog

Bibliographic Details
Main Authors:	Brock, James; Zhang, Ce; Anantrasirichai, Nantheera
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2601.04497

Similar Items

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)

Zero-TIG: Temporal Consistency-Aware Zero-Shot Illumination-Guided Low-light Video Enhancement
by: Li, Yini, et al.
Published: (2025)

An InSAR Phase Unwrapping Framework for Large-scale and Complex Events
by: Song, Yijia, et al.
Published: (2026)

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
by: Ma, Tianyi, et al.
Published: (2025)

Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns
by: Kim, Yunsoo, et al.
Published: (2024)

Cost-effective Instruction Learning for Pathology Vision and Language Analysis
by: Chen, Kaitao, et al.
Published: (2024)

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)

A Comprehensive Study of Object Tracking in Low-Light Environments
by: Yi, Anqi, et al.
Published: (2023)

RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator
by: Liu, Zhiming, et al.
Published: (2025)

DeTurb: Atmospheric Turbulence Mitigation with Deformable 3D Convolutions and 3D Swin Transformers
by: Zou, Zhicheng, et al.
Published: (2024)

Do Visual Imaginations Improve Vision-and-Language Navigation Agents?
by: Perincherry, Akhil, et al.
Published: (2025)

Revisiting the Role of Language Priors in Vision-Language Models
by: Lin, Zhiqiu, et al.
Published: (2023)

CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation
by: Liang, Xiwen, et al.
Published: (2023)

HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models
by: Sarch, Gabriel, et al.
Published: (2024)

Cross-Modal Adapter for Vision-Language Retrieval
by: Jiang, Haojun, et al.
Published: (2022)

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)

Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)

Enhancing Vision Models for Text-Heavy Content Understanding and Interaction
by: TG, Adithya, et al.
Published: (2024)

Data Metabolism: An Efficient Data Design Schema For Vision Language Model
by: Zhang, Jingyuan, et al.
Published: (2025)

Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
by: Lu, Yujie, et al.
Published: (2024)

Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)

Superpixel Semantics Representation and Pre-training for Vision-Language Task
by: Zhang, Siyu, et al.
Published: (2023)

Panoptic Vision-Language Feature Fields
by: Chen, Haoran, et al.
Published: (2023)

Survey on Vision-Language-Action Models
by: Adilkhanov, Adilzhan, et al.
Published: (2025)

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)

Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning
by: Gu, Zishan, et al.
Published: (2024)

VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis
by: Mehta, Manas, et al.
Published: (2025)

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
by: Du, Yiyang, et al.
Published: (2026)

ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
by: Li, Yanwei, et al.
Published: (2024)

General Scene Adaptation for Vision-and-Language Navigation
by: Hong, Haodong, et al.
Published: (2025)

Benchmarking Vision Language Models for Cultural Understanding
by: Nayak, Shravan, et al.
Published: (2024)

VLind-Bench: Measuring Language Priors in Large Vision-Language Models
by: Lee, Kang-il, et al.
Published: (2024)

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
by: Li, Hongxing, et al.
Published: (2025)

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
by: Cahyawijaya, Samuel, et al.
Published: (2026)

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)

PyVision: Agentic Vision with Dynamic Tooling
by: Zhao, Shitian, et al.
Published: (2025)