| Main Authors: | Huang, Chi-Pin, Man, Yunze, Yu, Zhiding, Chen, Min-Hung, Kautz, Jan, Wang, Yu-Chiang Frank, Yang, Fu-En |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.09708 |
Similar Items
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
by: Huang, Chi-Pin, et al.
Published: (2025)
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
by: Man, Yunze, et al.
Published: (2025)
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
by: Wang, Shihao, et al.
Published: (2026)
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
by: Man, Yunze, et al.
Published: (2025)
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
by: Chang, Kai-Po, et al.
Published: (2025)
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
by: Wu, Yen-Siang, et al.
Published: (2025)
Situational Awareness Matters in 3D Vision Language Reasoning
by: Man, Yunze, et al.
Published: (2024)
Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
by: Bai, Shuanghao, et al.
Published: (2026)
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
by: Huang, De-An, et al.
Published: (2025)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2025)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2024)
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
by: Yu, Yu-Chu, et al.
Published: (2024)
VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
by: Huang, Chi-Pin, et al.
Published: (2025)
TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
by: Cheng, Wei-Yuan, et al.
Published: (2026)
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
by: Ling, Yiran, et al.
Published: (2026)
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
by: Huang, Chi-Pin, et al.
Published: (2023)
QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models
by: Chi, Tien-Yu, et al.
Published: (2025)
TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
by: Cao, Shiteng, et al.
Published: (2026)
TPA3D: Triplane Attention for Fast Text-to-3D Generation
by: Wu, Bin-Shih, et al.
Published: (2023)
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models
by: Tan, Xudong, et al.
Published: (2025)
Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
by: Liu, Shih-Wen, et al.
Published: (2025)
ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
by: Xu, Xin, et al.
Published: (2026)
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
by: Lin, Ci-Siang, et al.
Published: (2025)
Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation
by: Yue, Junrong, et al.
Published: (2025)
Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
by: Zhang, Shaokun, et al.
Published: (2025)
Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains
by: Tan, Wenhui, et al.
Published: (2025)
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
by: Fu, Tianyu, et al.
Published: (2025)
Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
by: Izzo, Riccardo Andrea, et al.
Published: (2026)
R2SM: Referring and Reasoning for Selective Masks
by: Shih, Yu-Lin, et al.
Published: (2025)
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
by: Dong, Shaoqi, et al.
Published: (2025)
LITA: Language Instructed Temporal-Localization Assistant
by: Huang, De-An, et al.
Published: (2024)
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
by: Gupta, Vinayak, et al.
Published: (2024)
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
by: Zhang, Haichao, et al.
Published: (2026)
VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
by: Wang, Chaoyang, et al.
Published: (2026)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
by: Hatamizadeh, Ali, et al.
Published: (2024)
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
by: Xu, Changhua, et al.
Published: (2026)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models
by: Tsai, Chung-Ting, et al.
Published: (2024)
Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
by: Fu, Szu-Wei, et al.
Published: (2024)
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
by: Li, Zhiqi, et al.
Published: (2023)
V"Mean"ba: Visual State Space Models only need 1 hidden dimension
by: Chi, Tien-Yu, et al.
Published: (2024)