:: Library Catalog

Bibliographic Details
Main Authors:	Sarch, Gabriel; Kumaravel, Balasaravanan Thoravi; Ravi, Sahithya; Vineet, Vibhav; Wilson, Andrew D.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.01578

Similar Items

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
by: Ravi, Sahithya, et al.
Published: (2025)

Doc To The Future: Infomorphs for Interactive, Multimodal Document Transformation and Generation
by: Kumaravel, Balasaravanan Thoravi
Published: (2025)

Multi-Object Advertisement Creative Generation
by: Gao, Jialu, et al.
Published: (2026)

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)

Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)

Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)

StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)

MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
by: Joshi, Siddharth, et al.
Published: (2025)

A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)

OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)

Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)

SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending
by: Numan, Nels, et al.
Published: (2024)

BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
by: Rajaram, Shwetha, et al.
Published: (2024)

OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
by: Wu, Jing, et al.
Published: (2026)

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)

On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)

SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)

PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)

Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
by: Bhatia, Mehar, et al.
Published: (2024)

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
by: Doss, Tamil Sudaravan Mohan, et al.
Published: (2026)

What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection
by: Gothe, Sourabh Vasant, et al.
Published: (2024)

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
by: Chinchure, Aditya, et al.
Published: (2024)

Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues
by: Girmaji, Rohit, et al.
Published: (2025)

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
by: Hasani, Hosein, et al.
Published: (2025)

Robustness Analysis on Foundational Segmentation Models
by: Schiappa, Madeline Chantry, et al.
Published: (2023)

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)

Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)

PhyGaP: Physically-Grounded Gaussians with Polarization Cues
by: Wu, Jiale, et al.
Published: (2026)

Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
by: Li, Xiang, et al.
Published: (2025)

From Videos to Conversations: Egocentric Instructions for Task Assistance
by: Aggarwal, Lavisha, et al.
Published: (2026)

HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models
by: Sarch, Gabriel, et al.
Published: (2024)

DreamDistribution: Learning Prompt Distribution for Diverse In-distribution Generation
by: Zhao, Brian Nlong, et al.
Published: (2023)

ODIN: A Single Model for 2D and 3D Segmentation
by: Jain, Ayush, et al.
Published: (2024)

Reanimating Images using Neural Representations of Dynamic Stimuli
by: Yeung, Jacob, et al.
Published: (2024)

Advancing Egocentric Video Question Answering with Multimodal Large Language Models
by: Patel, Alkesh, et al.
Published: (2025)

Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
by: Zhang, Yue, et al.
Published: (2025)

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
by: Wake, Naoki, et al.
Published: (2023)

Generalizable Entity Grounding via Assistance of Large Language Model
by: Qi, Lu, et al.
Published: (2024)