| Main Authors: | Wang, Xiaomeng, Larson, Martha, Zhao, Zhengyu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.27553 |
Similar Items
Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
by: Sun, Yanpeng, et al.
Published: (2024)
Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training
by: Liang, Mingliang, et al.
Published: (2024)
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
by: Atuhurra, Jesse, et al.
Published: (2024)
Attribute-based Visual Reprogramming for Vision-Language Models
by: Cai, Chengyi, et al.
Published: (2025)
AnyText2: Visual Text Generation and Editing With Customizable Attributes
by: Tuo, Yuxiang, et al.
Published: (2024)
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
by: Wu, Tong, et al.
Published: (2024)
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
by: Zhu, Lei, et al.
Published: (2024)
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
by: Li, Sijie, et al.
Published: (2026)
SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision
by: Li, Zhaoxu, et al.
Published: (2025)
Exploiting Text-Image Latent Spaces for the Description of Visual Concepts
by: Schmalwasser, Laines, et al.
Published: (2024)
Visually Descriptive Language Model for Vector Graphics Reasoning
by: Wang, Zhenhailong, et al.
Published: (2024)
Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
by: Nakayama, Aya, et al.
Published: (2025)
Visual Text Processing: A Comprehensive Review and Unified Evaluation
by: Shu, Yan, et al.
Published: (2025)
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
by: Meng, GuangHao, et al.
Published: (2025)
Tango: Taming Visual Signals for Efficient Video Large Language Models
by: Yin, Shukang, et al.
Published: (2026)
Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models
by: Wang, Ruiyu, et al.
Published: (2025)
Visual Perception by Large Language Model's Weights
by: Ma, Feipeng, et al.
Published: (2024)
LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
by: Jin, Yizhang, et al.
Published: (2024)
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
by: Xiong, Guangzhi, et al.
Published: (2026)
Visual Semantic Description Generation with MLLMs for Image-Text Matching
by: Chen, Junyu, et al.
Published: (2025)
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
by: Cai, Kaitong, et al.
Published: (2025)
ControlLoc: Physical-World Hijacking Attack on Visual Perception in Autonomous Driving
by: Ma, Chen, et al.
Published: (2024)
DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
by: Majeedi, Abrar, et al.
Published: (2026)
Towards Visual Text Grounding of Multimodal Large Language Model
by: Li, Ming, et al.
Published: (2025)
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
by: Chen, Zeyu, et al.
Published: (2026)
Large Language Models are Universal Reasoners for Visual Generation
by: Ren, Sucheng, et al.
Published: (2026)
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
by: Li, Xin, et al.
Published: (2024)
Semantically-Prompted Language Models Improve Visual Descriptions
by: Ogezi, Michael, et al.
Published: (2023)
Learning to Produce Semi-dense Correspondences for Visual Localization
by: Giang, Khang Truong, et al.
Published: (2024)
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
by: Chen, Tsai-Shien, et al.
Published: (2025)
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
by: Zhu, Xingyu, et al.
Published: (2026)
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
by: Wang, Yidan, et al.
Published: (2025)
Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
by: Xue, Jintang, et al.
Published: (2025)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
LLM-AD: Large Language Model based Audio Description System
by: Chu, Peng, et al.
Published: (2024)
Visual Style Prompting with Swapping Self-Attention
by: Jeong, Jaeseok, et al.
Published: (2024)
Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
by: Wang, Lehan, et al.
Published: (2025)
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
by: Wei, Cong, et al.
Published: (2024)
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
by: Li, Bohao, et al.
Published: (2024)
Enhancing Visual Representation for Text-based Person Searching
by: Shen, Wei, et al.
Published: (2024)