| Main Authors: | Dong, Shaoqi, Fu, Chaoyou, Gao, Haihan, Zhang, Yi-Fan, Yan, Chi, Wu, Chu, Liu, Xiaoyu, Shen, Yunhang, Huo, Jing, Jiang, Deqiang, Cao, Haoyu, Gao, Yang, Sun, Xing, He, Ran, Shan, Caifeng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.09607 |
Similar Items
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
by: Liu, Xiaoyu, et al.
Published: (2025)
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
by: Fu, Chaoyou, et al.
Published: (2025)
VITA: Towards Open-Source Interactive Omni Multimodal LLM
by: Fu, Chaoyou, et al.
Published: (2024)
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
by: Li, Lijiang, et al.
Published: (2026)
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
by: Shen, Yunhang, et al.
Published: (2025)
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
by: Yang, Ruoliu, et al.
Published: (2026)
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
by: Long, Zuwei, et al.
Published: (2025)
VITA: Vision-to-Action Flow Matching Policy
by: Gao, Dechen, et al.
Published: (2025)
PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
by: Zhou, Chenyu, et al.
Published: (2024)
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
by: Fu, Chaoyou, et al.
Published: (2023)
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
by: Liu, Ruohan, et al.
Published: (2026)
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
by: Bu, Qingwen, et al.
Published: (2025)
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
by: Fu, Chaoyou, et al.
Published: (2026)
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
by: Jiang, Yuhua, et al.
Published: (2026)
LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models
by: Li, Zuolei, et al.
Published: (2025)
VITA - Vocational Innovation through Teaching with AI
by: Ravotto, Pierfranco
Published: (2025)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
by: Gao, Timin, et al.
Published: (2024)
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
by: Hu, Yutong, et al.
Published: (2026)
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
by: Gao, Heting, et al.
Published: (2025)
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
by: Wang, Xiong, et al.
Published: (2024)
HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies
by: Du, Zhiying, et al.
Published: (2025)
EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
by: Ng, Chi Kit, et al.
Published: (2025)
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
by: Jiang, Yuhua, et al.
Published: (2025)
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
by: Ye, Wencheng, et al.
Published: (2025)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
by: Fu, Chaoyou, et al.
Published: (2024)
Select before Act: Spatially Decoupled Action Repetition for Continuous Control
by: Nie, Buqing, et al.
Published: (2025)
VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
by: Gao, Chongkai, et al.
Published: (2025)
An Effective End-to-End Solution for Multimodal Action Recognition
by: Wang, Songping, et al.
Published: (2025)
ST4VLA: Spatially Guided Training for Vision-Language-Action Models
by: Ye, Jinhui, et al.
Published: (2026)
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
by: Gao, Mingjian, et al.
Published: (2026)
FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models
by: Wang, Xin, et al.
Published: (2025)
Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
by: Bai, Shuanghao, et al.
Published: (2026)
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
by: Bai, Zechen, et al.
Published: (2025)
SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios
by: Gao, Tian, et al.
Published: (2026)
FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
by: Zhao, Ruiteng, et al.
Published: (2026)
CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance
by: Li, Jinming, et al.
Published: (2024)
Epidermal ET-1 signal induces activation of resting hair follicles by upregulating the PI3K/AKT pathway in the dermis
by: Ying Gao, et al.
Published: (2024)
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
by: Zhang, Jianke, et al.
Published: (2026)
StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems
by: Ye, Jinhui, et al.
Published: (2026)