| Main Authors: | Wu, Fengyi, Dong, Yifei, Dai, Yilong, Chen, Guangyu, Wu, Qifeng, Huang, Huiting, Wang, Hang, Dai, Qi, Hauptmann, Alexander G., Cheng, Zhi-Qi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.09547 |
Similar Items
Language-Conditioned World Modeling for Visual Navigation
by: Dong, Yifei, et al.
Published: (2026)
Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
by: Dong, Yifei, et al.
Published: (2025)
HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
by: Dong, Yifei, et al.
Published: (2025)
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
by: Li, Heng, et al.
Published: (2024)
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
by: Dong, Yifei, et al.
Published: (2025)
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
by: Cheng, Zebang, et al.
Published: (2024)
Instruction-based Image Editing with Planning, Reasoning, and Generation
by: Ji, Liya, et al.
Published: (2026)
Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding
by: Peng, Xiaojiang, et al.
Published: (2026)
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
by: Wu, Linquan, et al.
Published: (2026)
Think before Go: Hierarchical Reasoning for Image-goal Navigation
by: Li, Pengna, et al.
Published: (2026)
Multimodal Reranking for Knowledge-Intensive Visual Question Answering
by: Wen, Haoyang, et al.
Published: (2024)
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation
by: Chen, Bolei, et al.
Published: (2025)
SHIELD: LLM-Driven Schema Induction for Predictive Analytics in EV Battery Supply Chain Disruptions
by: Cheng, Zhi-Qi, et al.
Published: (2024)
Navigating Beyond Instructions: Vision-and-Language Navigation in Obstructed Environments
by: Hong, Haodong, et al.
Published: (2024)
WebNavigator: Global Web Navigation via Interaction Graph Retrieval
by: Zhang, Xuanwang, et al.
Published: (2026)
SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
by: Fang, Rongyao, et al.
Published: (2025)
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)
ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation
by: Wang, Xuerui, et al.
Published: (2026)
ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
by: Lu, Yichen, et al.
Published: (2025)
LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking
by: Wu, Junhong, et al.
Published: (2025)
Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement
by: Dong, Yuran, et al.
Published: (2026)
Taming Spontaneous Stop-and-Go Traffic Waves: A Computational Mechanism Design Perspective
by: Shen, Di, et al.
Published: (2025)
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
by: Xing, Zhen, et al.
Published: (2024)
VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
by: Zhao, Xunyi, et al.
Published: (2025)
Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
by: Yakun, Cui, et al.
Published: (2025)
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model
by: She, Yifei, et al.
Published: (2025)
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
by: Wang, Lihong, et al.
Published: (2025)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
by: Bi, Jing, et al.
Published: (2025)
Spatial-Aware Conditioned Fusion for Audio-Visual Navigation
by: Wu, Shaohang, et al.
Published: (2026)
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
by: Zhou, Hao, et al.
Published: (2024)
ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
by: Tong, Haoyu, et al.
Published: (2026)
Learning Visual-Semantic Subspace Representations
by: Moreira, Gabriel, et al.
Published: (2024)
Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning
by: Wu, Rujie, et al.
Published: (2026)
NavBench: Probing Multimodal Large Language Models for Embodied Navigation
by: Qiao, Yanyuan, et al.
Published: (2025)
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
by: Yin, Hang, et al.
Published: (2025)
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
by: Xu, Ziqiang, et al.
Published: (2025)
Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
by: Shu, Yan, et al.
Published: (2025)