:: Library Catalog

Bibliographic Details
Main Authors:	Ravi, Sahithya; Sarch, Gabriel; Vineet, Vibhav; Wilson, Andrew D.; Kumaravel, Balasaravanan Thoravi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.24257

Similar Items

Grounding Task Assistance with Multimodal Cues from a Single Demonstration
by: Sarch, Gabriel, et al.
Published: (2025)

Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind
by: Plizzari, Chiara, et al.
Published: (2024)

Multi-Object Advertisement Creative Generation
by: Gao, Jialu, et al.
Published: (2026)

Doc To The Future: Infomorphs for Interactive, Multimodal Document Transformation and Generation
by: Kumaravel, Balasaravanan Thoravi
Published: (2025)

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)

Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)

Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)

Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
by: Ma, Ziqi, et al.
Published: (2026)

SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending
by: Numan, Nels, et al.
Published: (2024)

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
by: Chen, Kaijin, et al.
Published: (2026)

LookOut: Real-World Humanoid Egocentric Navigation
by: Pan, Boxiao, et al.
Published: (2025)

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)

StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)

Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
by: Bouzidi, Halima, et al.
Published: (2026)

A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)

OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
by: Chinchure, Aditya, et al.
Published: (2024)

LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models
by: Duan, Zicheng, et al.
Published: (2026)

BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
by: Rajaram, Shwetha, et al.
Published: (2024)

Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution
by: Xu, Tianshuo, et al.
Published: (2026)

PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)

Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology
by: Sadman, Nafiz, et al.
Published: (2025)

Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
by: Wang, Boyang, et al.
Published: (2025)

On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)

Spatial-Conditioned Reasoning in Long-Egocentric Videos
by: Tribble, James, et al.
Published: (2026)

Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)

Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)

OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising
by: Zhang, Haichao, et al.
Published: (2024)

Advancing Egocentric Video Question Answering with Multimodal Large Language Models
by: Patel, Alkesh, et al.
Published: (2025)

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
by: Moukheiber, Lama, et al.
Published: (2026)

SpatialTree: How Spatial Abilities Branch Out in MLLMs
by: Xiao, Yuxi, et al.
Published: (2025)

SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)

Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)

Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
by: Huang, Yixuan, et al.
Published: (2023)

Common Inpainted Objects In-N-Out of Context
by: Yang, Tianze, et al.
Published: (2025)

Placing Objects in Context via Inpainting for Out-of-distribution Segmentation
by: de Jorge, Pau, et al.
Published: (2024)

MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
by: Pan, Zhenyu, et al.
Published: (2025)

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
by: Bhatia, Mehar, et al.
Published: (2024)