| Main Authors: | Kim, Keon; Chelikavada, Krish |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.15376 |
Similar Items
AttZoom: Attention Zoom for Better Visual Features
by: DeAlcala, Daniel, et al.
Published: (2025)
AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
by: Pei, Siqi, et al.
Published: (2026)
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
by: Jiang, Zhiyuan, et al.
Published: (2025)
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
by: Zhou, Yue, et al.
Published: (2026)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
by: Dai, Ming, et al.
Published: (2025)
Training-Free Consistency Pipeline for Fashion Repose
by: Aghilar, Potito, et al.
Published: (2025)
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
by: Tang, Fei, et al.
Published: (2026)
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)
Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming
by: Erzurumlu, Yunus Talha, et al.
Published: (2026)
SFUOD: Source-Free Unknown Object Detection
by: Park, Keon-Hee, et al.
Published: (2025)
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
by: Thapa, Rahul, et al.
Published: (2024)
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)
Zoom and Shift are All You Need
by: Qin, Jiahao
Published: (2024)
STEVE: A Step Verification Pipeline for Computer-use Agent Training
by: Lu, Fanbin, et al.
Published: (2025)
WonderZoom: Multi-Scale 3D World Generation
by: Cao, Jin, et al.
Published: (2025)
MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery
by: Li, Yansheng, et al.
Published: (2025)
Seeing the Unseen: Zooming in the Dark with Event Cameras
by: Kai, Dachun, et al.
Published: (2026)
Progressive Language-guided Visual Learning for Multi-Task Visual Grounding
by: Wang, Jingchao, et al.
Published: (2025)
A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage
by: Harris, Levi
Published: (2024)
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering
by: Chen, Yuhan, et al.
Published: (2024)
Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
by: Rahman, Md Ashikur, et al.
Published: (2026)
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
by: Wu, Qianhui, et al.
Published: (2025)
Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
by: Jain, Chayan, et al.
Published: (2025)
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
by: Kim, Bryan Sangwoo, et al.
Published: (2025)
AnatomicalNets: A Multi-Structure Segmentation and Contour-Based Distance Estimation Pipeline for Clinically Grounded Lung Cancer T-Staging
by: Chowdhury, Saniah Kayenat, et al.
Published: (2025)
VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
by: Lim, Byeonggeuk, et al.
Published: (2026)
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
by: Kang, Seil, et al.
Published: (2025)
Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)
Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement
by: Xu, Jian, et al.
Published: (2025)
GreenEye: Development of Real-Time Traffic Signal Recognition System for Visual Impairments
by: Kim, Danu
Published: (2024)
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
by: Li, Haocheng, et al.
Published: (2026)
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
by: Dai, Ming, et al.
Published: (2024)
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
by: Wei, Lai, et al.
Published: (2026)
YOLO-Based Pipeline Monitoring in Challenging Visual Environments
by: Dhungana, Pragya, et al.
Published: (2025)
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
by: Lv, Guannan, et al.
Published: (2026)
PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency
by: Chen, Minbing, et al.
Published: (2026)
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
by: Yao, Zhengjian, et al.
Published: (2026)
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
by: Kim, Keuntae, et al.
Published: (2026)
A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders
by: Wang, Zhongying, et al.
Published: (2026)
One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
by: Sun, Yujing, et al.
Published: (2025)