| Main Authors: | Shridhar, Mohit, Lo, Yat Long, James, Stephen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2407.07875 |
Similar Items
PerAct2: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks
by: Grotz, Markus, et al.
Published: (2024)
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)
GenSim: Generating Robotic Simulation Tasks via Large Language Models
by: Wang, Lirui, et al.
Published: (2023)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)
Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning
by: Vosylius, Vitalis, et al.
Published: (2024)
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
by: Huang, Chengyue, et al.
Published: (2025)
ViPRA: Video Prediction for Robot Actions
by: Routray, Sandeep, et al.
Published: (2025)
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)
Redundancy-aware Action Spaces for Robot Learning
by: Mazzaglia, Pietro, et al.
Published: (2024)
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
by: Gupta, Gunshi, et al.
Published: (2024)
EMMA: End-to-End Multimodal Model for Autonomous Driving
by: Hwang, Jyh-Jing, et al.
Published: (2024)
Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
by: Sha, Hao, et al.
Published: (2023)
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)
Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
by: Li, Ming, et al.
Published: (2024)
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
by: Lisondra, Matthew, et al.
Published: (2025)
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)
Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models
by: Mansour, Malak, et al.
Published: (2025)
Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations
by: Grover, Shresth, et al.
Published: (2025)
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
by: Li, Zongxia, et al.
Published: (2025)
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
by: Hong, Yining, et al.
Published: (2024)
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
by: Zhang, Zhengshen, et al.
Published: (2025)
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
by: Cho, Jaemin, et al.
Published: (2023)
Hybrid Training for Vision-Language-Action Models
by: Mazzaglia, Pietro, et al.
Published: (2025)
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)
Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)
Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
by: Xi, Jiajun, et al.
Published: (2024)
DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
by: Li, Jianxiong, et al.
Published: (2024)
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)
ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
by: Byun, Ye Won, et al.
Published: (2024)
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)
RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
by: Nasiriany, Soroush, et al.
Published: (2024)
LEGENT: Open Platform for Embodied Agents
by: Cheng, Zhili, et al.
Published: (2024)