:: Library Catalog

Bibliographic Details
Main Authors:	Guruprasad, Pranav; Sikka, Harshvardhan; Song, Jaewoo; Wang, Yangyue; Liang, Paul Pu
Format:	Preprint
Published:	2024
Subjects:	Robotics; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2411.05821

Similar Items

Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
by: Guruprasad, Pranav, et al.
Published: (2025)

An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models
by: Guruprasad, Pranav, et al.
Published: (2025)

GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
by: Wang, Yangyue, et al.
Published: (2026)

Benchmarking the Generality of Vision-Language-Action Models
by: Guruprasad, Pranav, et al.
Published: (2025)

Improving Vision-Language-Action Model with Online Reinforcement Learning
by: Guo, Yanjiang, et al.
Published: (2025)

ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
by: Zhou, Zhongyi, et al.
Published: (2025)

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
by: Niu, Dantong, et al.
Published: (2024)

PVI: Plug-in Visual Injection for Vision-Language-Action Models
by: Zhang, Zezhou, et al.
Published: (2026)

A Survey on Efficient Vision-Language-Action Models
by: Yu, Zhaoshu, et al.
Published: (2025)

Tactile Modality Fusion for Vision-Language-Action Models
by: Morissette, Charlotte, et al.
Published: (2026)

Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
by: Li, Qixiu, et al.
Published: (2025)

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
by: Wang, Beichen, et al.
Published: (2024)

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
by: Liang, Zhixuan, et al.
Published: (2025)

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
by: Kim, Moo Jin, et al.
Published: (2025)

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
by: Kawaharazuka, Kento, et al.
Published: (2025)

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
by: Yang, Ruihan, et al.
Published: (2025)

Test-Time Training for Visual Foresight Vision-Language-Action Models
by: Park, Sangwu, et al.
Published: (2026)

Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking
by: Chib, Pranav Singh, et al.
Published: (2024)

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
by: An, Xinyuan, et al.
Published: (2026)

PointVLA: Injecting the 3D World into Vision-Language-Action Models
by: Li, Chengmeng, et al.
Published: (2025)

AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
by: Xing, Shuo, et al.
Published: (2024)

Hybrid Training for Vision-Language-Action Models
by: Mazzaglia, Pietro, et al.
Published: (2025)

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
by: Shiba, Takuya
Published: (2026)

Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
by: Taherin, Amir, et al.
Published: (2025)

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
by: Lin, Haitao, et al.
Published: (2026)

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
by: Zhang, Zhengshen, et al.
Published: (2025)

PerAct2: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks
by: Grotz, Markus, et al.
Published: (2024)

MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
by: Zhou, Xunlan, et al.
Published: (2026)

ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
by: Zhao, Yanpeng, et al.
Published: (2026)

Robustness Evaluation of Machine Learning Models for Robot Arm Action Recognition in Noisy Environments
by: Motamedi, Elaheh, et al.
Published: (2024)

Interactive Post-Training for Vision-Language-Action Models
by: Tan, Shuhan, et al.
Published: (2025)

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
by: Zhang, Wenbo, et al.
Published: (2025)

VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching
by: Xu, Siyu, et al.
Published: (2025)

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
by: Cheang, Chi-Lam, et al.
Published: (2024)

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
by: Zhao, Qingqing, et al.
Published: (2025)

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
by: Xiao, Lei, et al.
Published: (2025)

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
by: Larchenko, Ilia, et al.
Published: (2025)

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
by: Luo, Hao, et al.
Published: (2025)

GenSim: Generating Robotic Simulation Tasks via Large Language Models
by: Wang, Lirui, et al.
Published: (2023)