| Main Authors: | Lee, Sua; Shin, Kyubum; Park, Jung Ho |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.07147 |
Similar Items
Separated Inter/Intra-Modal Fusion Prompts for Compositional Zero-Shot Learning
by: Jung, Sua
Published: (2025)
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)
By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
by: Yoon, Hyungjun, et al.
Published: (2024)
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
by: Lan, Zhibin, et al.
Published: (2025)
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
by: Cai, Mu, et al.
Published: (2023)
Differentiable Prompt Learning for Vision Language Models
by: Huang, Zhenhan, et al.
Published: (2024)
MINR: Implicit Neural Representations with Masked Image Modelling
by: Lee, Sua, et al.
Published: (2025)
Semantically-Prompted Language Models Improve Visual Descriptions
by: Ogezi, Michael, et al.
Published: (2023)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
by: Lin, Yuanze, et al.
Published: (2024)
Bayesian Principles Improve Prompt Learning In Vision-Language Models
by: Kim, Mingyu, et al.
Published: (2025)
ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
by: Lee, Jewon, et al.
Published: (2025)
Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering
by: Guo, Danfeng, et al.
Published: (2024)
Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
by: Gupta, Akash, et al.
Published: (2025)
Large Language Models can Share Images, Too!
by: Lee, Young-Jun, et al.
Published: (2023)
Compositional Chain-of-Thought Prompting for Large Multimodal Models
by: Mitra, Chancharik, et al.
Published: (2023)
Promptception: How Sensitive Are Large Multimodal Models to Prompts?
by: Ismithdeen, Mohamed Insaf, et al.
Published: (2025)
REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring
by: Ho, Thanh Cong, et al.
Published: (2025)
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
by: Lee, Kang-il, et al.
Published: (2024)
Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
by: Min, Kyungmin, et al.
Published: (2024)
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
by: Fan, Ziyang, et al.
Published: (2026)
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
by: Zhou, Andy, et al.
Published: (2024)
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
by: Kim, Donghoon, et al.
Published: (2024)
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
by: Park, Jeongkyun, et al.
Published: (2023)
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)
Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
by: Abid, Abderrazek, et al.
Published: (2025)
Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
by: Cai, Mu, et al.
Published: (2023)
Large Body Language Models
by: Punjwani, Saif, et al.
Published: (2024)
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
by: Kenthapadi, Krishnaram, et al.
Published: (2024)
Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
by: He, Hulingxiao, et al.
Published: (2025)
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
by: Liu, Dongyang, et al.
Published: (2024)
Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
by: Kriz, Anita, et al.
Published: (2025)
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
by: Kim, Yoonshik, et al.
Published: (2025)
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
by: Huang, Chengyue, et al.
Published: (2025)
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
by: Kang, Haoqiang, et al.
Published: (2025)
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
by: Maracani, Andrea, et al.
Published: (2025)
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
by: Ye, Jiabo, et al.
Published: (2024)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
by: Kim, Sanghwan, et al.
Published: (2024)
A Survey on Multimodal Large Language Models
by: Yin, Shukang, et al.
Published: (2023)