| Main Authors: | Sarch, Gabriel, Kumaravel, Balasaravanan Thoravi, Ravi, Sahithya, Vineet, Vibhav, Wilson, Andrew D. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.01578 |
Similar Items
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
by: Ravi, Sahithya, et al.
Published: (2025)
Doc To The Future: Infomorphs for Interactive, Multimodal Document Transformation and Generation
by: Kumaravel, Balasaravanan Thoravi
Published: (2025)
Multi-Object Advertisement Creative Generation
by: Gao, Jialu, et al.
Published: (2026)
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)
Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)
Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)
StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
by: Joshi, Siddharth, et al.
Published: (2025)
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)
OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)
Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)
SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending
by: Numan, Nels, et al.
Published: (2024)
BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
by: Rajaram, Shwetha, et al.
Published: (2024)
OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
by: Wu, Jing, et al.
Published: (2026)
Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
by: Chinchure, Aditya, et al.
Published: (2025)
On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)
SPIKE-RL: Video-LLMs meet Bayesian Surprise
by: Ravi, Sahithya, et al.
Published: (2025)
PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)
Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
by: Bhatia, Mehar, et al.
Published: (2024)
MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
by: Doss, Tamil Sudaravan Mohan, et al.
Published: (2026)
What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection
by: Gothe, Sourabh Vasant, et al.
Published: (2024)
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
by: Chinchure, Aditya, et al.
Published: (2024)
Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues
by: Girmaji, Rohit, et al.
Published: (2025)
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
by: Hasani, Hosein, et al.
Published: (2025)
Robustness Analysis on Foundational Segmentation Models
by: Schiappa, Madeline Chantry, et al.
Published: (2023)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)
Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
by: Wu, Jiale, et al.
Published: (2026)
Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
by: Li, Xiang, et al.
Published: (2025)
From Videos to Conversations: Egocentric Instructions for Task Assistance
by: Aggarwal, Lavisha, et al.
Published: (2026)
HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models
by: Sarch, Gabriel, et al.
Published: (2024)
DreamDistribution: Learning Prompt Distribution for Diverse In-distribution Generation
by: Zhao, Brian Nlong, et al.
Published: (2023)
ODIN: A Single Model for 2D and 3D Segmentation
by: Jain, Ayush, et al.
Published: (2024)
Reanimating Images using Neural Representations of Dynamic Stimuli
by: Yeung, Jacob, et al.
Published: (2024)
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
by: Patel, Alkesh, et al.
Published: (2025)
Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
by: Zhang, Yue, et al.
Published: (2025)
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
by: Wake, Naoki, et al.
Published: (2023)
Generalizable Entity Grounding via Assistance of Large Language Model
by: Qi, Lu, et al.
Published: (2024)