:: Library Catalog

Bibliographic Details
Main Authors:	Kim, Keon; Chelikavada, Krish
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.15376

Similar Items

AttZoom: Attention Zoom for Better Visual Features
by: DeAlcala, Daniel, et al.
Published: (2025)

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
by: Pei, Siqi, et al.
Published: (2026)

Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
by: Jiang, Zhiyuan, et al.
Published: (2025)

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
by: Zhou, Yue, et al.
Published: (2026)

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
by: Dai, Ming, et al.
Published: (2025)

Training-Free Consistency Pipeline for Fashion Repose
by: Aghilar, Potito, et al.
Published: (2025)

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
by: Tang, Fei, et al.
Published: (2026)

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming
by: Erzurumlu, Yunus Talha, et al.
Published: (2026)

SFUOD: Source-Free Unknown Object Detection
by: Park, Keon-Hee, et al.
Published: (2025)

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
by: Thapa, Rahul, et al.
Published: (2024)

CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)

Zoom and Shift are All You Need
by: Qin, Jiahao
Published: (2024)

STEVE: A Step Verification Pipeline for Computer-use Agent Training
by: Lu, Fanbin, et al.
Published: (2025)

WonderZoom: Multi-Scale 3D World Generation
by: Cao, Jin, et al.
Published: (2025)

MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery
by: Li, Yansheng, et al.
Published: (2025)

Seeing the Unseen: Zooming in the Dark with Event Cameras
by: Kai, Dachun, et al.
Published: (2026)

Progressive Language-guided Visual Learning for Multi-Task Visual Grounding
by: Wang, Jingchao, et al.
Published: (2025)

A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage
by: Harris, Levi
Published: (2024)

LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering
by: Chen, Yuhan, et al.
Published: (2024)

Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
by: Rahman, Md Ashikur, et al.
Published: (2026)

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
by: Wu, Qianhui, et al.
Published: (2025)

Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
by: Jain, Chayan, et al.
Published: (2025)

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
by: Kim, Bryan Sangwoo, et al.
Published: (2025)

AnatomicalNets: A Multi-Structure Segmentation and Contour-Based Distance Estimation Pipeline for Clinically Grounded Lung Cancer T-Staging
by: Chowdhury, Saniah Kayenat, et al.
Published: (2025)

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
by: Lim, Byeonggeuk, et al.
Published: (2026)

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
by: Kang, Seil, et al.
Published: (2025)

Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)

Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement
by: Xu, Jian, et al.
Published: (2025)

GreenEye: Development of Real-Time Traffic Signal Recognition System for Visual Impairments
by: Kim, Danu
Published: (2024)

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
by: Li, Haocheng, et al.
Published: (2026)

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
by: Dai, Ming, et al.
Published: (2024)

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
by: Wei, Lai, et al.
Published: (2026)

YOLO-Based Pipeline Monitoring in Challenging Visual Environments
by: Dhungana, Pragya, et al.
Published: (2025)

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
by: Lv, Guannan, et al.
Published: (2026)

PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency
by: Chen, Minbing, et al.
Published: (2026)

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
by: Yao, Zhengjian, et al.
Published: (2026)

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
by: Kim, Keuntae, et al.
Published: (2026)

A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders
by: Wang, Zhongying, et al.
Published: (2026)

One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
by: Sun, Yujing, et al.
Published: (2025)