| Main Authors: | Routray, Sandeep, Pan, Hengkai, Jain, Unnat, Bahl, Shikhar, Pathak, Deepak |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.07732 |
Similar Items
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
by: Patel, Shivansh, et al.
Published: (2025)
CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
by: Lin, Leo, et al.
Published: (2026)
From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
by: Zhao, Zhida, et al.
Published: (2025)
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)
HRP: Human Affordances for Robotic Pre-Training
by: Srirama, Mohan Kumar, et al.
Published: (2024)
Video Diffusion Alignment via Reward Gradients
by: Prabhudesai, Mihir, et al.
Published: (2024)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)
CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model
by: Yu, Zhuoyuan, et al.
Published: (2025)
Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)
3D-VLA: A 3D Vision-Language-Action Generative World Model
by: Zhen, Haoyu, et al.
Published: (2024)
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
by: Sun, Qi, et al.
Published: (2024)
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
by: Lian, Shijie, et al.
Published: (2026)
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)
Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
by: Wang, Jun, et al.
Published: (2026)
PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
by: Guo, Dingkun, et al.
Published: (2024)
When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills
by: Wang, Yunfei, et al.
Published: (2026)
Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
by: Kåsene, Vebjørn Haug, et al.
Published: (2025)
Learning from Massive Human Videos for Universal Humanoid Pose Control
by: Mao, Jiageng, et al.
Published: (2024)
REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation
by: Yuan, Puzhen, et al.
Published: (2025)
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
by: Liu, Yibin, et al.
Published: (2026)
Towards Predicting Any Human Trajectory In Context
by: Fujii, Ryo, et al.
Published: (2025)
DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
by: Cui, Zichen Jeff, et al.
Published: (2024)
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
by: Schroeder, Philip, et al.
Published: (2025)
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
by: Zhang, Shiduo, et al.
Published: (2024)
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
by: Song, Chan Hee, et al.
Published: (2024)
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
by: Xu, Yunzhe, et al.
Published: (2024)
Generative Image as Action Models
by: Shridhar, Mohit, et al.
Published: (2024)
SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting
by: Park, Sung-Yeon, et al.
Published: (2025)
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
by: Bartoccioni, Florent, et al.
Published: (2025)
Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Expert's Feedback
by: Raina, Deepak, et al.
Published: (2024)
Vidar: Embodied Video Diffusion Model for Generalist Manipulation
by: Feng, Yao, et al.
Published: (2025)
LangNav: Language as a Perceptual Representation for Navigation
by: Pan, Bowen, et al.
Published: (2023)
Recognizing Actions from Robotic View for Natural Human-Robot Interaction
by: Wang, Ziyi, et al.
Published: (2025)
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
by: Prabhudesai, Mihir, et al.
Published: (2023)
J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
by: Atuhurra, Jesse, et al.
Published: (2025)
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
by: Tavella, Federico, et al.
Published: (2025)
DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
by: Tao, Tony, et al.
Published: (2025)
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
by: Li, Yi, et al.
Published: (2025)