:: Library Catalog

Bibliographic Details
Main Authors:	Salamatian, Ali; Fuller, Anthony; Sarkar, Pritam; Green, James R.; Sigal, Leonid; Shelhamer, Evan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2605.06809

Similar Items

LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
by: Fuller, Anthony, et al.
Published: (2025)

Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
by: Fuller, Anthony, et al.
Published: (2025)

LookSharp: Attention Entropy Minimization for Test-Time Adaptation
by: Mali, Yash, et al.
Published: (2025)

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
by: Li, Zefeng, et al.
Published: (2026)

Self-Distillation of Hidden Layers for Self-Supervised Representation Learning
by: Lowe, Scott C., et al.
Published: (2026)

Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
by: Tseng, Gabriel, et al.
Published: (2025)

LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
by: Fuller, Anthony, et al.
Published: (2024)

Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
by: Sarkar, Pritam, et al.
Published: (2025)

VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
by: Sarkar, Pritam, et al.
Published: (2025)

ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
by: Salamatian, Ali, et al.
Published: (2025)

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
by: Luo, Jiayun, et al.
Published: (2024)

What and When to Look?: Temporal Span Proposal Network for Video Relation Detection
by: Woo, Sangmin, et al.
Published: (2021)

Implicit and Explicit Commonsense for Multi-sentence Video Captioning
by: Chou, Shih-Han, et al.
Published: (2023)

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models
by: Shen, Yuxiang, et al.
Published: (2026)

No One Knows the State of the Art in Geospatial Foundation Models
by: Corley, Isaac, et al.
Published: (2026)

When to Think and When to Look: Uncertainty-Guided Lookback
by: Bi, Jing, et al.
Published: (2025)

StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)

Show Me When and Where: Towards Referring Video Object Segmentation in the Wild
by: Gao, Mingqi, et al.
Published: (2026)

Factorized Video Autoencoders for Efficient Generative Modelling
by: Suhail, Mohammed, et al.
Published: (2024)

What Happens When: Learning Temporal Orders of Events in Videos
by: Ahn, Daechul, et al.
Published: (2025)

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
by: Rahman, Tanzila, et al.
Published: (2026)

When and Where do Events Switch in Multi-Event Video Generation?
by: Liao, Ruotong, et al.
Published: (2025)

When Dance Video Archives Challenge Computer Vision
by: Colantoni, Philippe, et al.
Published: (2025)

Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
by: Bhatt, Gaurav, et al.
Published: (2024)

ProtoTTA: Prototype-Guided Test-Time Adaptation
by: Abootorabi, Mohammad Mahdi, et al.
Published: (2026)

TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking
by: Goyal, Raghav, et al.
Published: (2023)

AI-Generated Images: What Humans and Machines See When They Look at the Same Image
by: Poletti, Silvia, et al.
Published: (2026)

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
by: Yang, Siqi, et al.
Published: (2025)

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
by: Chandhok, Shivam, et al.
Published: (2025)

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
by: Mahdizadeh, Ailar, et al.
Published: (2026)

How Animals Dance (When You're Not Looking)
by: Wang, Xiaojuan, et al.
Published: (2025)

When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
by: Fang, Pengcheng, et al.
Published: (2025)

When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models
by: Liao, Haicheng, et al.
Published: (2024)

Self-Soupervision: Cooking Model Soups without Labels
by: Fuller, Anthony, et al.
Published: (2026)

GUI Action Narrator: Where and When Did That Action Take Place?
by: Wu, Qinchen, et al.
Published: (2024)

GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
by: Duan, Yuxiang, et al.
Published: (2025)

SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)

ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains
by: Vray, Guillaume, et al.
Published: (2025)

Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities
by: Chandhok, Shivam, et al.
Published: (2024)