| Main Authors: | Ravi, Sahithya, Sarch, Gabriel, Vineet, Vibhav, Wilson, Andrew D., Kumaravel, Balasaravanan Thoravi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.24257 |
Similar Items
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
by: Sarch, Gabriel, et al.
Published: (2025)
Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind
by: Plizzari, Chiara, et al.
Published: (2024)
Multi-Object Advertisement Creative Generation
by: Gao, Jialu, et al.
Published: (2026)
Doc To The Future: Infomorphs for Interactive, Multimodal Document Transformation and Generation
by: Kumaravel, Balasaravanan Thoravi
Published: (2025)
Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)
Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)
Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
by: Ma, Ziqi, et al.
Published: (2026)
SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending
by: Numan, Nels, et al.
Published: (2024)
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
by: Chen, Kaijin, et al.
Published: (2026)
LookOut: Real-World Humanoid Egocentric Navigation
by: Pan, Boxiao, et al.
Published: (2025)
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)
StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
by: Bouzidi, Halima, et al.
Published: (2026)
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)
OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
by: Chinchure, Aditya, et al.
Published: (2024)
LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models
by: Duan, Zicheng, et al.
Published: (2026)
BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
by: Rajaram, Shwetha, et al.
Published: (2024)
Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution
by: Xu, Tianshuo, et al.
Published: (2026)
PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)
Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology
by: Sadman, Nafiz, et al.
Published: (2025)
Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
by: Wang, Boyang, et al.
Published: (2025)
On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)
Spatial-Conditioned Reasoning in Long-Egocentric Videos
by: Tribble, James, et al.
Published: (2026)
Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)
Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)
OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising
by: Zhang, Haichao, et al.
Published: (2024)
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
by: Patel, Alkesh, et al.
Published: (2025)
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
by: Moukheiber, Lama, et al.
Published: (2026)
SpatialTree: How Spatial Abilities Branch Out in MLLMs
by: Xiao, Yuxi, et al.
Published: (2025)
SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)
Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)
Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
by: Huang, Yixuan, et al.
Published: (2023)
Common Inpainted Objects In-N-Out of Context
by: Yang, Tianze, et al.
Published: (2025)
Placing Objects in Context via Inpainting for Out-of-distribution Segmentation
by: de Jorge, Pau, et al.
Published: (2024)
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
by: Pan, Zhenyu, et al.
Published: (2025)
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
by: Bhatia, Mehar, et al.
Published: (2024)