Library Catalog

Bibliographic Details
Main Authors:	Lee, Sua; Shin, Kyubum; Park, Jung Ho
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.07147

Similar Items

Separated Inter/Intra-Modal Fusion Prompts for Compositional Zero-Shot Learning
by: Jung, Sua
Published: (2025)

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)

By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
by: Yoon, Hyungjun, et al.
Published: (2024)

LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
by: Lan, Zhibin, et al.
Published: (2025)

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
by: Cai, Mu, et al.
Published: (2023)

Differentiable Prompt Learning for Vision Language Models
by: Huang, Zhenhan, et al.
Published: (2024)

MINR: Implicit Neural Representations with Masked Image Modelling
by: Lee, Sua, et al.
Published: (2025)

Semantically-Prompted Language Models Improve Visual Descriptions
by: Ogezi, Michael, et al.
Published: (2023)

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
by: Lin, Yuanze, et al.
Published: (2024)

Bayesian Principles Improve Prompt Learning In Vision-Language Models
by: Kim, Mingyu, et al.
Published: (2025)

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
by: Lee, Jewon, et al.
Published: (2025)

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering
by: Guo, Danfeng, et al.
Published: (2024)

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
by: Gupta, Akash, et al.
Published: (2025)

Large Language Models can Share Images, Too!
by: Lee, Young-Jun, et al.
Published: (2023)

Compositional Chain-of-Thought Prompting for Large Multimodal Models
by: Mitra, Chancharik, et al.
Published: (2023)

Promptception: How Sensitive Are Large Multimodal Models to Prompts?
by: Ismithdeen, Mohamed Insaf, et al.
Published: (2025)

REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring
by: Ho, Thanh Cong, et al.
Published: (2025)

VLind-Bench: Measuring Language Priors in Large Vision-Language Models
by: Lee, Kang-il, et al.
Published: (2024)

Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
by: Min, Kyungmin, et al.
Published: (2024)

FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
by: Fan, Ziyang, et al.
Published: (2026)

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
by: Zhou, Andy, et al.
Published: (2024)

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)

Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
by: Kim, Donghoon, et al.
Published: (2024)

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
by: Park, Jeongkyun, et al.
Published: (2023)

Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)

Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
by: Abid, Abderrazek, et al.
Published: (2025)

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
by: Cai, Mu, et al.
Published: (2023)

Large Body Language Models
by: Punjwani, Saif, et al.
Published: (2024)

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
by: Kenthapadi, Krishnaram, et al.
Published: (2024)

Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
by: He, Hulingxiao, et al.
Published: (2025)

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
by: Liu, Dongyang, et al.
Published: (2024)

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
by: Kriz, Anita, et al.
Published: (2025)

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
by: Kim, Yoonshik, et al.
Published: (2025)

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
by: Huang, Chengyue, et al.
Published: (2025)

GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
by: Kang, Haoqiang, et al.
Published: (2025)

Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
by: Maracani, Andrea, et al.
Published: (2025)

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
by: Ye, Jiabo, et al.
Published: (2024)

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
by: Kim, Sanghwan, et al.
Published: (2024)

A Survey on Multimodal Large Language Models
by: Yin, Shukang, et al.
Published: (2023)