Library Catalog

Bibliographic Details
Main Author:	Feng, Qi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning; Robotics
Online Access:	https://arxiv.org/abs/2505.12363

Similar Items

Visuospatial Cognitive Assistant
by: Feng, Qi
Published: (2025)

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
by: Li, Jianxiong, et al.
Published: (2024)

Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
by: Li, Ming, et al.
Published: (2024)

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
by: Huang, Chengyue, et al.
Published: (2025)

Audio-3DVG: Unified Audio -- Point Cloud Fusion for 3D Visual Grounding
by: Cao-Dinh, Duc, et al.
Published: (2025)

Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
by: Lisondra, Matthew, et al.
Published: (2025)

Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)

OceanGym: A Benchmark Environment for Underwater Embodied Agents
by: Xue, Yida, et al.
Published: (2025)

TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
by: Rajabi, Navid, et al.
Published: (2025)

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
by: Tavella, Federico, et al.
Published: (2025)

Temporal Preference Optimization for Long-Form Video Understanding
by: Li, Rui, et al.
Published: (2025)

Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
by: Sah, Chandan Kumar, et al.
Published: (2025)

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
by: Li, Zongxia, et al.
Published: (2025)

Neuro-Symbolic Concepts
by: Mao, Jiayuan, et al.
Published: (2025)

ViPRA: Video Prediction for Robot Actions
by: Routray, Sandeep, et al.
Published: (2025)

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
by: Lu, Xiaoya, et al.
Published: (2025)

Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models
by: Mansour, Malak, et al.
Published: (2025)

See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
by: Hu, Chih Yao, et al.
Published: (2025)

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
by: Hong, Yining, et al.
Published: (2024)

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)

Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
by: Xi, Jiajun, et al.
Published: (2024)

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
by: Sha, Hao, et al.
Published: (2023)

Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)

EMMA: End-to-End Multimodal Model for Autonomous Driving
by: Hwang, Jyh-Jing, et al.
Published: (2024)

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)

ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
by: Byun, Ye Won, et al.
Published: (2024)

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
by: Padhan, Swagat, et al.
Published: (2026)

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)

RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
by: Nasiriany, Soroush, et al.
Published: (2024)