| Main Authors: | Luo, Yang, Zheng, Zangwei, Zhu, Zirui, You, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.12866 |
Similar Items
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
by: Qin, Libo, et al.
Published: (2024)
Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)
Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
by: Jia, Sihang, et al.
Published: (2026)
Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space
by: Verma, Gaurav, et al.
Published: (2024)
Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
by: You, Liangliang, et al.
Published: (2025)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
by: Yue, Yang, et al.
Published: (2025)
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
by: Zheng, Ge, et al.
Published: (2025)
MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
by: Luo, Yang, et al.
Published: (2025)
Model Composition for Multimodal Large Language Models
by: Chen, Chi, et al.
Published: (2024)
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
by: Yan, An, et al.
Published: (2024)
Nearest Neighbor Normalization Improves Multimodal Retrieval
by: Chowdhury, Neil, et al.
Published: (2024)
See It from My Perspective: How Language Affects Cultural Bias in Image Understanding
by: Ananthram, Amith, et al.
Published: (2024)
Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations
by: Li, Yanshu
Published: (2025)
LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
by: Zhao, Zihui, et al.
Published: (2025)
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
by: Huang, Brandon, et al.
Published: (2024)
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
by: Li, Yan, et al.
Published: (2026)
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)
Progressive Multimodal Reasoning via Active Retrieval
by: Dong, Guanting, et al.
Published: (2024)
MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)
How to Train Your Long-Context Visual Document Model
by: Veselka, Austin
Published: (2026)
Many-Shot In-Context Learning in Multimodal Foundation Models
by: Jiang, Yixing, et al.
Published: (2024)
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
by: Hu, Wenbo, et al.
Published: (2024)
Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
by: Karim, A H M Rezaul, et al.
Published: (2025)
Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
by: Mei, Jingbiao, et al.
Published: (2025)
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
by: Shalabi, Fatma, et al.
Published: (2024)
Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
by: Du, Shian, et al.
Published: (2024)
Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)
MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)
E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models
by: Lu, Liming, et al.
Published: (2025)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
by: Chen, Haonan, et al.
Published: (2025)
Recurrence Meets Transformers for Universal Multimodal Retrieval
by: Caffagni, Davide, et al.
Published: (2025)
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
by: Wang, Han, et al.
Published: (2026)
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
by: Zeng, Yu, et al.
Published: (2026)
Figuring out Figures: Using Textual References to Caption Scientific Figures
by: Cao, Stanley, et al.
Published: (2024)
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
by: Guo, Ziyu, et al.
Published: (2025)
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)
Generative Universal Verifier as Multimodal Meta-Reasoner
by: Zhang, Xinchen, et al.
Published: (2025)
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)
MLLM-CL: Continual Learning for Multimodal Large Language Models
by: Zhao, Hongbo, et al.
Published: (2025)
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
by: Liu, Shaonan, et al.
Published: (2026)