| Main Authors: | Xing, Shuo, Sun, Zezhou, Xie, Shuangyu, Chen, Kaiyuan, Huang, Yanjia, Wang, Yuping, Li, Jiachen, Song, Dezhen, Tu, Zhengzhong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.14607 |
Similar Items
VISTAv2: World Imagination for Indoor Vision-and-Language Navigation
by: Huang, Yanjia, et al.
Published: (2025)
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
by: Wang, Yuping, et al.
Published: (2025)
VISTA: Generative Visual Imagination for Vision-and-Language Navigation
by: Huang, Yanjia, et al.
Published: (2025)
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
by: Xing, Shuo, et al.
Published: (2025)
Energy Efficient Planning for Repetitive Heterogeneous Tasks in Precision Agriculture
by: Xie, Shuangyu, et al.
Published: (2025)
GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model
by: Abouzeid, Ali, et al.
Published: (2025)
OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving
by: Xing, Shuo, et al.
Published: (2024)
PANDORA: Diffusion Policy Learning for Dexterous Robotic Piano Playing
by: Huang, Yanjia, et al.
Published: (2025)
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
by: Zhu, Tinghui, et al.
Published: (2024)
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
by: Xing, Shuo, et al.
Published: (2025)
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
by: Xing, Shuo, et al.
Published: (2024)
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)
DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
by: Qian, Chengxuan, et al.
Published: (2025)
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
by: Pan, Chenbin, et al.
Published: (2025)
AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition
by: Huang, Ying, et al.
Published: (2025)
LAVQA: A Latency-Aware Visual Question Answering Framework for Shared Autonomy in Self-Driving Vehicles
by: Xie, Shuangyu, et al.
Published: (2025)
STAMP: Scalable Task And Model-agnostic Collaborative Perception
by: Gao, Xiangbo, et al.
Published: (2025)
Data Cleaning Using Large Language Models
by: Zhang, Shuo, et al.
Published: (2024)
A Large Vision-Language Model based Environment Perception System for Visually Impaired People
by: Chen, Zezhou, et al.
Published: (2025)
Let the Abyss Stare Back: Adaptive Falsification for Autonomous Scientific Discovery
by: Li, Peiran, et al.
Published: (2026)
CoMamba: Real-time Cooperative Perception Unlocked with State Space Models
by: Li, Jinlong, et al.
Published: (2024)
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)
Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization
by: Xing, Shuo, et al.
Published: (2025)
Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models
by: Shen, Minghe, et al.
Published: (2025)
FORGE-Tree: Diffusion-Forcing Tree Search for Long-Horizon Robot Manipulation
by: Huang, Yanjia, et al.
Published: (2025)
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
by: Wang, Zehao, et al.
Published: (2026)
Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models
by: Li, Yueyan, et al.
Published: (2025)
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models
by: Meng, Yu, et al.
Published: (2025)
MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training
by: Yin, Zhenhan, et al.
Published: (2025)
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
by: Jiang, Huaide, et al.
Published: (2026)
PVI: Plug-in Visual Injection for Vision-Language-Action Models
by: Zhang, Zezhou, et al.
Published: (2026)
Dynamic Pyramid Network for Efficient Multimodal Large Language Model
by: Ai, Hao, et al.
Published: (2025)
JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models
by: Qian, Jiachen
Published: (2026)
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
by: Wu, Mingyang, et al.
Published: (2026)
3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation
by: He, Yunhong, et al.
Published: (2025)
Imagination at Inference: Synthesizing In-Hand Views for Robust Visuomotor Policy Inference
by: Ding, Haoran, et al.
Published: (2025)
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences
by: Lin, Fangzhou, et al.
Published: (2026)
TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)
Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking
by: Hu, Chan-Wei, et al.
Published: (2026)
TRINS: Towards Multimodal Language Models that Can Read
by: Zhang, Ruiyi, et al.
Published: (2024)