:: Library Catalog

Bibliographic Details
Main Authors:	Routray, Sandeep; Pan, Hengkai; Jain, Unnat; Bahl, Shikhar; Pathak, Deepak
Format:	Preprint
Published:	2025
Subjects:	Robotics; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2511.07732

Similar Items

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
by: Patel, Shivansh, et al.
Published: (2025)

CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
by: Lin, Leo, et al.
Published: (2026)

From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
by: Zhao, Zhida, et al.
Published: (2025)

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)

HRP: Human Affordances for Robotic Pre-Training
by: Srirama, Mohan Kumar, et al.
Published: (2024)

Video Diffusion Alignment via Reward Gradients
by: Prabhudesai, Mihir, et al.
Published: (2024)

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)

CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model
by: Yu, Zhuoyuan, et al.
Published: (2025)

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)

3D-VLA: A 3D Vision-Language-Action Generative World Model
by: Zhen, Haoyu, et al.
Published: (2024)

Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
by: Sun, Qi, et al.
Published: (2024)

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
by: Lian, Shijie, et al.
Published: (2026)

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
by: Wang, Jun, et al.
Published: (2026)

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
by: Guo, Dingkun, et al.
Published: (2024)

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills
by: Wang, Yunfei, et al.
Published: (2026)

Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
by: Kåsene, Vebjørn Haug, et al.
Published: (2025)

Learning from Massive Human Videos for Universal Humanoid Pose Control
by: Mao, Jiageng, et al.
Published: (2024)

REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation
by: Yuan, Puzhen, et al.
Published: (2025)

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
by: Liu, Yibin, et al.
Published: (2026)

Towards Predicting Any Human Trajectory In Context
by: Fujii, Ryo, et al.
Published: (2025)

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
by: Cui, Zichen Jeff, et al.
Published: (2024)

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
by: Schroeder, Philip, et al.
Published: (2025)

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
by: Zhang, Shiduo, et al.
Published: (2024)

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
by: Song, Chan Hee, et al.
Published: (2024)

FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
by: Xu, Yunzhe, et al.
Published: (2024)

Generative Image as Action Models
by: Shridhar, Mohit, et al.
Published: (2024)

SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting
by: Park, Sung-Yeon, et al.
Published: (2025)

VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
by: Bartoccioni, Florent, et al.
Published: (2025)

Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Expert's Feedback
by: Raina, Deepak, et al.
Published: (2024)

Vidar: Embodied Video Diffusion Model for Generalist Manipulation
by: Feng, Yao, et al.
Published: (2025)

LangNav: Language as a Perceptual Representation for Navigation
by: Pan, Bowen, et al.
Published: (2023)

Recognizing Actions from Robotic View for Natural Human-Robot Interaction
by: Wang, Ziyi, et al.
Published: (2025)

Aligning Text-to-Image Diffusion Models with Reward Backpropagation
by: Prabhudesai, Mihir, et al.
Published: (2023)

J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
by: Atuhurra, Jesse, et al.
Published: (2025)

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
by: Tavella, Federico, et al.
Published: (2025)

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
by: Tao, Tony, et al.
Published: (2025)

HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
by: Li, Yi, et al.
Published: (2025)