| Main Authors: | Salamatian, Ali, Fuller, Anthony, Sarkar, Pritam, Green, James R., Sigal, Leonid, Shelhamer, Evan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.06809 |
Similar Items
LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
by: Fuller, Anthony, et al.
Published: (2025)
Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
by: Fuller, Anthony, et al.
Published: (2025)
LookSharp: Attention Entropy Minimization for Test-Time Adaptation
by: Mali, Yash, et al.
Published: (2025)
A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
by: Li, Zefeng, et al.
Published: (2026)
Self-Distillation of Hidden Layers for Self-Supervised Representation Learning
by: Lowe, Scott C., et al.
Published: (2026)
Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
by: Tseng, Gabriel, et al.
Published: (2025)
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
by: Fuller, Anthony, et al.
Published: (2024)
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
by: Sarkar, Pritam, et al.
Published: (2025)
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
by: Sarkar, Pritam, et al.
Published: (2025)
ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
by: Salamatian, Ali, et al.
Published: (2025)
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
by: Luo, Jiayun, et al.
Published: (2024)
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection
by: Woo, Sangmin, et al.
Published: (2021)
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
by: Chou, Shih-Han, et al.
Published: (2023)
LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models
by: Shen, Yuxiang, et al.
Published: (2026)
No One Knows the State of the Art in Geospatial Foundation Models
by: Corley, Isaac, et al.
Published: (2026)
When to Think and When to Look: Uncertainty-Guided Lookback
by: Bi, Jing, et al.
Published: (2025)
StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)
Show Me When and Where: Towards Referring Video Object Segmentation in the Wild
by: Gao, Mingqi, et al.
Published: (2026)
Factorized Video Autoencoders for Efficient Generative Modelling
by: Suhail, Mohammed, et al.
Published: (2024)
What Happens When: Learning Temporal Orders of Events in Videos
by: Ahn, Daechul, et al.
Published: (2025)
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
by: Rahman, Tanzila, et al.
Published: (2026)
When and Where do Events Switch in Multi-Event Video Generation?
by: Liao, Ruotong, et al.
Published: (2025)
When Dance Video Archives Challenge Computer Vision
by: Colantoni, Philippe, et al.
Published: (2025)
Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
by: Bhatt, Gaurav, et al.
Published: (2024)
ProtoTTA: Prototype-Guided Test-Time Adaptation
by: Abootorabi, Mohammad Mahdi, et al.
Published: (2026)
TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking
by: Goyal, Raghav, et al.
Published: (2023)
AI-Generated Images: What Humans and Machines See When They Look at the Same Image
by: Poletti, Silvia, et al.
Published: (2026)
Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
by: Yang, Siqi, et al.
Published: (2025)
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
by: Chandhok, Shivam, et al.
Published: (2025)
Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
by: Mahdizadeh, Ailar, et al.
Published: (2026)
How Animals Dance (When You're Not Looking)
by: Wang, Xiaojuan, et al.
Published: (2025)
When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
by: Fang, Pengcheng, et al.
Published: (2025)
When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models
by: Liao, Haicheng, et al.
Published: (2024)
Self-Soupervision: Cooking Model Soups without Labels
by: Fuller, Anthony, et al.
Published: (2026)
GUI Action Narrator: Where and When Did That Action Take Place?
by: Wu, Qinchen, et al.
Published: (2024)
GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
by: Duan, Yuxiang, et al.
Published: (2025)
SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)
ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains
by: Vray, Guillaume, et al.
Published: (2025)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities
by: Chandhok, Shivam, et al.
Published: (2024)