| Main Authors: | Zhang, Xu, Li, Danyang, Dong, Xiaohang, Wu, Tianhao, Yu, Hualong, Wang, Jianye, Li, Qicheng, Li, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.02607 |
Similar Items
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
by: Zhang, Xu, et al.
Published: (2026)
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)
Text‐to‐3D City: Plan‐then‐Execute Urban Generation With LLM Planners and Procedural Synthesis
by: Xiaohang Dong, et al.
Published: (2026)
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
by: Xu, Shilin, et al.
Published: (2025)
Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
by: Li, Xiaohe, et al.
Published: (2026)
UniVS: Unified and Universal Video Segmentation with Prompts as Queries
by: Li, Minghan, et al.
Published: (2024)
Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training
by: Bawazir, Ameera, et al.
Published: (2024)
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
by: Baraldi, Lorenzo, et al.
Published: (2025)
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
by: Xu, Xiao, et al.
Published: (2024)
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
by: Cai, Hengxing, et al.
Published: (2024)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
by: Chen, Jiaxing, et al.
Published: (2024)
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
by: Wu, Yuhang, et al.
Published: (2024)
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
by: Pan, Xichen, et al.
Published: (2023)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
MultiClimate: Multimodal Stance Detection on Climate Change Videos
by: Wang, Jiawen, et al.
Published: (2024)
LLAVADI: What Matters For Multimodal Large Language Models Distillation
by: Xu, Shilin, et al.
Published: (2024)
StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model
by: Li, Zongrong, et al.
Published: (2024)
Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
by: Qin, Luozheng, et al.
Published: (2025)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
by: Lee, Segyu, et al.
Published: (2026)
Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images
by: Yu, Xiaofei, et al.
Published: (2024)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
by: Yu, Wenwen, et al.
Published: (2025)
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
by: HyperAI Team, et al.
Published: (2025)
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
by: Qi, Ji, et al.
Published: (2023)
Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models
by: Liu, Dingning, et al.
Published: (2024)
Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations
by: Zhu, Kangyu, et al.
Published: (2025)
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
by: Xu, Jiacong, et al.
Published: (2025)
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
by: Hu, Jinyi, et al.
Published: (2023)
RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
by: Liu, Fanfan, et al.
Published: (2024)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
by: Li, Jinke, et al.
Published: (2025)
LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
by: Zhao, Zihui, et al.
Published: (2025)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
by: Niu, Yuwei, et al.
Published: (2025)
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
by: Qi, Daiqing, et al.
Published: (2024)
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
by: Li, Lei, et al.
Published: (2024)
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
by: You, Ling, et al.
Published: (2025)
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
by: Liu, Jianyu, et al.
Published: (2025)