| Main Authors: | Zhang, Jun, Meng, Desen, Zhang, Zhengming, Huang, Zhenpeng, Wu, Tao, Wang, Limin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.04449 |
Similar Items
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)
$γ-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
by: Luo, Yaxin, et al.
Published: (2024)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
by: Wu, Shiwei, et al.
Published: (2024)
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
by: Chaubey, Ashutosh, et al.
Published: (2026)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
by: Huang, Yushi, et al.
Published: (2025)
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
by: Chen, Zeren, et al.
Published: (2023)
MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction
by: Zhou, Heng, et al.
Published: (2024)
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
by: Bao, Zhijie, et al.
Published: (2026)
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
by: Huang, Zhenpeng, et al.
Published: (2026)
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
by: Fang, I-Sheng, et al.
Published: (2025)
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
by: Zhang, Huanyu, et al.
Published: (2025)
Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)
Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
by: Shi, Baorong, et al.
Published: (2026)
Unhackable Temporal Rewarding for Scalable Video MLLMs
by: Yu, En, et al.
Published: (2025)
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)
MoPD: Mixture-of-Prompts Distillation for Vision-Language Models
by: Chen, Yang, et al.
Published: (2024)
Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
by: Yu, Tianyu, et al.
Published: (2023)
The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning
by: Chen, Renmiao, et al.
Published: (2026)
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
by: Jiang, Songtao, et al.
Published: (2024)
MoDification: Mixture of Depths Made Easy
by: Zhang, Chen, et al.
Published: (2024)
3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow
by: Ma, Yueen, et al.
Published: (2025)
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
by: Ma, Tianyi, et al.
Published: (2025)
D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
by: Chang, Shuochen, et al.
Published: (2025)
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
by: Wang, Junyang, et al.
Published: (2023)
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
by: Huang, Jen-Tse, et al.
Published: (2025)
MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
by: Su, Zhenpeng, et al.
Published: (2024)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)
CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
by: Zhang, Xi, et al.
Published: (2025)
The Instinctive Bias: Spurious Images lead to Illusion in MLLMs
by: Han, Tianyang, et al.
Published: (2024)
SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)
MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers
by: Li, Sijia, et al.
Published: (2023)
NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
by: Yang, Enneng, et al.
Published: (2024)
Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
by: Liang, Yiqing, et al.
Published: (2025)