| Main Author: | Feng, Qi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.12363 |
Similar Items
Visuospatial Cognitive Assistant
by: Feng, Qi
Published: (2025)
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
by: Li, Qixiu, et al.
Published: (2024)
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
by: Hong, Yining, et al.
Published: (2026)
DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
by: Li, Jianxiong, et al.
Published: (2024)
Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
by: Li, Ming, et al.
Published: (2024)
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
by: Huang, Chengyue, et al.
Published: (2025)
Audio-3DVG: Unified Audio -- Point Cloud Fusion for 3D Visual Grounding
by: Cao-Dinh, Duc, et al.
Published: (2025)
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
by: Lisondra, Matthew, et al.
Published: (2025)
Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)
OceanGym: A Benchmark Environment for Underwater Embodied Agents
by: Xue, Yida, et al.
Published: (2025)
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
by: Rajabi, Navid, et al.
Published: (2025)
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
by: Chow, Wei, et al.
Published: (2025)
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
by: Tavella, Federico, et al.
Published: (2025)
Temporal Preference Optimization for Long-Form Video Understanding
by: Li, Rui, et al.
Published: (2025)
Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
by: Sah, Chandan Kumar, et al.
Published: (2025)
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
by: Li, Zongxia, et al.
Published: (2025)
Neuro-Symbolic Concepts
by: Mao, Jiayuan, et al.
Published: (2025)
ViPRA: Video Prediction for Robot Actions
by: Routray, Sandeep, et al.
Published: (2025)
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
by: Lu, Xiaoya, et al.
Published: (2025)
Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models
by: Mansour, Malak, et al.
Published: (2025)
See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
by: Hu, Chih Yao, et al.
Published: (2025)
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
by: Hou, Yuchen, et al.
Published: (2026)
Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
by: Alakuijala, Minttu, et al.
Published: (2024)
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
by: Hong, Yining, et al.
Published: (2024)
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
by: Chen, Yi, et al.
Published: (2024)
Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
by: Xi, Jiajun, et al.
Published: (2024)
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
by: Sha, Hao, et al.
Published: (2023)
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)
EMMA: End-to-End Multimodal Model for Autonomous Driving
by: Hwang, Jyh-Jing, et al.
Published: (2024)
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)
ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
by: Byun, Ye Won, et al.
Published: (2024)
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
by: Padhan, Swagat, et al.
Published: (2026)
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)
RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
by: Nasiriany, Soroush, et al.
Published: (2024)