:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Tianxiang; Nie, Minxin; Cao, Ziqiang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.23089

Similar Items

Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
by: Wu, Mingrui, et al.
Published: (2024)

Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings
by: Dai, Tianxiang, et al.
Published: (2026)

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
by: Jiang, Songtao, et al.
Published: (2024)

Elysium: Exploring Object-level Perception in Videos via MLLM
by: Wang, Han, et al.
Published: (2024)

IPCV: Information-Preserving Compression for MLLM Visual Encoders
by: Chen, Yuan, et al.
Published: (2025)

RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
by: Nie, Yunshuang, et al.
Published: (2026)

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
by: Slyman, Eric, et al.
Published: (2025)

EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
by: Sun, Yueru, et al.
Published: (2026)

MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
by: Li, Xu, et al.
Published: (2025)

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
by: Liu, Xinyu, et al.
Published: (2024)

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)

InstructX: Towards Unified Visual Editing with MLLM Guidance
by: Mou, Chong, et al.
Published: (2025)

Structure Causal Models and LLMs Integration in Medical Visual Question Answering
by: Xu, Zibo, et al.
Published: (2025)

The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
by: Zhang, Yuchen, et al.
Published: (2025)

Robust MLLM Unlearning via Visual Knowledge Distillation
by: Wang, Yuhang, et al.
Published: (2025)

MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
by: Cao, Shuyu, et al.
Published: (2025)

EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting
by: Zhang, Wei, et al.
Published: (2025)

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
by: Wu, Yixuan, et al.
Published: (2024)

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
by: Munasinghe, Shehan, et al.
Published: (2024)

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
by: Xin, Yi, et al.
Published: (2025)

AttriPrompter: Auto-Prompting with Attribute Semantics for Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
by: Wu, Yongjian, et al.
Published: (2024)

MM-IFEngine: Towards Multimodal Instruction Following
by: Ding, Shengyuan, et al.
Published: (2025)

Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition
by: Choi, Jae Young, et al.
Published: (2026)

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
by: Peng, Taiying, et al.
Published: (2025)

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
by: Zhou, Yang, et al.
Published: (2024)

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
by: Fang, Rongyao, et al.
Published: (2024)

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
by: Chang, Boyu, et al.
Published: (2026)

PIP: Prototypes-Injected Prompt for Federated Class Incremental Learning
by: Ma'sum, Muhammad Anwar, et al.
Published: (2024)

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
by: Zhang, Haoyu, et al.
Published: (2025)

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
by: Zhao, Pengfei, et al.
Published: (2025)

D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
by: Zhang, Evelyn, et al.
Published: (2025)

MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
by: Yin, Xingyilang, et al.
Published: (2026)

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
by: Huang, Runhui, et al.
Published: (2025)

Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
by: Liu, Jinming, et al.
Published: (2025)

RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
by: Lu, Yi, et al.
Published: (2025)

HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
by: Yao, Lei, et al.
Published: (2026)

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
by: Xu, Jingwei, et al.
Published: (2024)

Visual Hallucinations of Multi-modal Large Language Models
by: Huang, Wen, et al.
Published: (2024)

Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent
by: Wu, Junda, et al.
Published: (2025)