:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Xu; Li, Danyang; Dong, Xiaohang; Wu, Tianhao; Yu, Hualong; Wang, Jianye; Li, Qicheng; Li, Xiang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2511.02607

Similar Items

OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
by: Zhang, Xu, et al.
Published: (2026)

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)

UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)

Text‐to‐3D City: Plan‐then‐Execute Urban Generation With LLM Planners and Procedural Synthesis
by: Dong, Xiaohang, et al.
Published: (2026)

Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
by: Xu, Shilin, et al.
Published: (2025)

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
by: Li, Xiaohe, et al.
Published: (2026)

UniVS: Unified and Universal Video Segmentation with Prompts as Queries
by: Li, Minghan, et al.
Published: (2024)

Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training
by: Bawazir, Ameera, et al.
Published: (2024)

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
by: Baraldi, Lorenzo, et al.
Published: (2025)

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
by: Xu, Xiao, et al.
Published: (2024)

Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
by: Cai, Hengxing, et al.
Published: (2024)

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
by: Chen, Jiaxing, et al.
Published: (2024)

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
by: Wu, Yuhang, et al.
Published: (2024)

Kosmos-G: Generating Images in Context with Multimodal Large Language Models
by: Pan, Xichen, et al.
Published: (2023)

Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)

MultiClimate: Multimodal Stance Detection on Climate Change Videos
by: Wang, Jiawen, et al.
Published: (2024)

LLAVADI: What Matters For Multimodal Large Language Models Distillation
by: Xu, Shilin, et al.
Published: (2024)

StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model
by: Li, Zongrong, et al.
Published: (2024)

Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
by: Qin, Luozheng, et al.
Published: (2025)

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)

UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
by: Lee, Segyu, et al.
Published: (2026)

Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images
by: Yu, Xiaofei, et al.
Published: (2024)

OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
by: Yu, Wenwen, et al.
Published: (2025)

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
by: HyperAI Team, et al.
Published: (2025)

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
by: Qi, Ji, et al.
Published: (2023)

Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models
by: Liu, Dingning, et al.
Published: (2024)

Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations
by: Zhu, Kangyu, et al.
Published: (2025)

Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
by: Xu, Jiacong, et al.
Published: (2025)

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
by: Hu, Jinyi, et al.
Published: (2023)

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
by: Liu, Fanfan, et al.
Published: (2024)

Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
by: Li, Jinke, et al.
Published: (2025)

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
by: Zhao, Zihui, et al.
Published: (2025)

Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
by: Niu, Yuwei, et al.
Published: (2025)

Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
by: Qi, Daiqing, et al.
Published: (2024)

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
by: Li, Lei, et al.
Published: (2024)

TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
by: You, Ling, et al.
Published: (2025)

DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
by: Liu, Jianyu, et al.
Published: (2025)