:: Library Catalog

Bibliographic Details
Main Authors:	Luo, Yang; Zheng, Zangwei; Zhu, Zirui; You, Yang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.12866

Similar Items

What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
by: Qin, Libo, et al.
Published: (2024)

Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
by: Jia, Sihang, et al.
Published: (2026)

Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space
by: Verma, Gaurav, et al.
Published: (2024)

Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
by: You, Liangliang, et al.
Published: (2025)

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
by: Yue, Yang, et al.
Published: (2025)

Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
by: Zheng, Ge, et al.
Published: (2025)

MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
by: Luo, Yang, et al.
Published: (2025)

Model Composition for Multimodal Large Language Models
by: Chen, Chi, et al.
Published: (2024)

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
by: Yan, An, et al.
Published: (2024)

Nearest Neighbor Normalization Improves Multimodal Retrieval
by: Chowdhury, Neil, et al.
Published: (2024)

See It from My Perspective: How Language Affects Cultural Bias in Image Understanding
by: Ananthram, Amith, et al.
Published: (2024)

Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations
by: Li, Yanshu
Published: (2025)

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
by: Zhao, Zihui, et al.
Published: (2025)

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
by: Huang, Brandon, et al.
Published: (2024)

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
by: Li, Yan, et al.
Published: (2026)

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)

Progressive Multimodal Reasoning via Active Retrieval
by: Dong, Guanting, et al.
Published: (2024)

MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)

How to Train Your Long-Context Visual Document Model
by: Veselka, Austin
Published: (2026)

Many-Shot In-Context Learning in Multimodal Foundation Models
by: Jiang, Yixing, et al.
Published: (2024)

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
by: Hu, Wenbo, et al.
Published: (2024)

Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
by: Karim, A H M Rezaul, et al.
Published: (2025)

Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
by: Mei, Jingbiao, et al.
Published: (2025)

Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
by: Shalabi, Fatma, et al.
Published: (2024)

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
by: Du, Shian, et al.
Published: (2024)

Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)

MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)

E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models
by: Lu, Liming, et al.
Published: (2025)

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
by: Chen, Haonan, et al.
Published: (2025)

Recurrence Meets Transformers for Universal Multimodal Retrieval
by: Caffagni, Davide, et al.
Published: (2025)

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
by: Wang, Han, et al.
Published: (2026)

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
by: Zeng, Yu, et al.
Published: (2026)

Figuring out Figures: Using Textual References to Caption Scientific Figures
by: Cao, Stanley, et al.
Published: (2024)

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
by: Guo, Ziyu, et al.
Published: (2025)

UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)

Generative Universal Verifier as Multimodal Meta-Reasoner
by: Zhang, Xinchen, et al.
Published: (2025)

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)

MLLM-CL: Continual Learning for Multimodal Large Language Models
by: Zhao, Hongbo, et al.
Published: (2025)

Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
by: Liu, Shaonan, et al.
Published: (2026)