:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Fengyi; Dong, Yifei; Dai, Yilong; Chen, Guangyu; Wu, Qifeng; Huang, Huiting; Wang, Hang; Dai, Qi; Hauptmann, Alexander G.; Cheng, Zhi-Qi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.09547

Similar Items

Language-Conditioned World Modeling for Visual Navigation
by: Dong, Yifei, et al.
Published: (2026)

Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
by: Dong, Yifei, et al.
Published: (2025)

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
by: Dong, Yifei, et al.
Published: (2025)

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
by: Li, Heng, et al.
Published: (2024)

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
by: Dong, Yifei, et al.
Published: (2025)

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
by: Cheng, Zebang, et al.
Published: (2024)

Instruction-based Image Editing with Planning, Reasoning, and Generation
by: Ji, Liya, et al.
Published: (2026)

Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding
by: Peng, Xiaojiang, et al.
Published: (2026)

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
by: Wu, Linquan, et al.
Published: (2026)

Think before Go: Hierarchical Reasoning for Image-goal Navigation
by: Li, Pengna, et al.
Published: (2026)

Multimodal Reranking for Knowledge-Intensive Visual Question Answering
by: Wen, Haoyang, et al.
Published: (2024)

Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation
by: Chen, Bolei, et al.
Published: (2025)

SHIELD: LLM-Driven Schema Induction for Predictive Analytics in EV Battery Supply Chain Disruptions
by: Cheng, Zhi-Qi, et al.
Published: (2024)

Navigating Beyond Instructions: Vision-and-Language Navigation in Obstructed Environments
by: Hong, Haodong, et al.
Published: (2024)

WebNavigator: Global Web Navigation via Interaction Graph Retrieval
by: Zhang, Xuanwang, et al.
Published: (2026)

SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
by: Fang, Rongyao, et al.
Published: (2025)

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)

ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation
by: Wang, Xuerui, et al.
Published: (2026)

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
by: Lu, Yichen, et al.
Published: (2025)

LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking
by: Wu, Junhong, et al.
Published: (2025)

Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement
by: Dong, Yuran, et al.
Published: (2026)

Taming Spontaneous Stop-and-Go Traffic Waves: A Computational Mechanism Design Perspective
by: Shen, Di, et al.
Published: (2025)

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
by: Xing, Zhen, et al.
Published: (2024)

VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
by: Zhao, Xunyi, et al.
Published: (2025)

Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
by: Yakun, Cui, et al.
Published: (2025)

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)

Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model
by: She, Yifei, et al.
Published: (2025)

ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
by: Wang, Lihong, et al.
Published: (2025)

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
by: Bi, Jing, et al.
Published: (2025)

Spatial-Aware Conditioned Fusion for Audio-Visual Navigation
by: Wu, Shaohang, et al.
Published: (2026)

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
by: Zhou, Hao, et al.
Published: (2024)

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
by: Tong, Haoyu, et al.
Published: (2026)

Learning Visual-Semantic Subspace Representations
by: Moreira, Gabriel, et al.
Published: (2024)

Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning
by: Wu, Rujie, et al.
Published: (2026)

NavBench: Probing Multimodal Large Language Models for Embodied Navigation
by: Qiao, Yanyuan, et al.
Published: (2025)

UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
by: Yin, Hang, et al.
Published: (2025)

ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
by: Xu, Ziqiang, et al.
Published: (2025)

Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
by: Shu, Yan, et al.
Published: (2025)