:: Library Catalog

Bibliographic Details
Main Authors:	Shridhar, Mohit; Lo, Yat Long; James, Stephen
Format:	Preprint
Published:	2024
Subjects:	Robotics; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2407.07875

Similar Items

PerAct2: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks
by: Grotz, Markus, et al.
Published: (2024)

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)

GenSim: Generating Robotic Simulation Tasks via Large Language Models
by: Wang, Lirui, et al.
Published: (2023)

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)

Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning
by: Vosylius, Vitalis, et al.
Published: (2024)

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
by: Huang, Chengyue, et al.
Published: (2025)

ViPRA: Video Prediction for Robot Actions
by: Routray, Sandeep, et al.
Published: (2025)

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)

Redundancy-aware Action Spaces for Robot Learning
by: Mazzaglia, Pietro, et al.
Published: (2024)

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
by: Gupta, Gunshi, et al.
Published: (2024)

EMMA: End-to-End Multimodal Model for Autonomous Driving
by: Hwang, Jyh-Jing, et al.
Published: (2024)

Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
by: Sha, Hao, et al.
Published: (2023)

Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)

Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
by: Li, Ming, et al.
Published: (2024)

Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
by: Lisondra, Matthew, et al.
Published: (2025)

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)

Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models
by: Mansour, Malak, et al.
Published: (2025)

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations
by: Grover, Shresth, et al.
Published: (2025)

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
by: Li, Zongxia, et al.
Published: (2025)

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
by: Hong, Yining, et al.
Published: (2024)

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
by: Zhang, Zhengshen, et al.
Published: (2025)

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
by: Cho, Jaemin, et al.
Published: (2023)

Hybrid Training for Vision-Language-Action Models
by: Mazzaglia, Pietro, et al.
Published: (2025)

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)

Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
by: Xi, Jiajun, et al.
Published: (2024)

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
by: Li, Jianxiong, et al.
Published: (2024)

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)

ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
by: Byun, Ye Won, et al.
Published: (2024)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)

RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
by: Nasiriany, Soroush, et al.
Published: (2024)

LEGENT: Open Platform for Embodied Agents
by: Cheng, Zhili, et al.
Published: (2024)