| Main Authors: | Lu, Yujie, Li, Xiujun, Fu, Tsu-Jui, Eckstein, Miguel, Wang, William Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2405.14213 |
Similar Items
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
by: Li, Xiujun, et al.
Published: (2023)
SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning
by: Fu, Tsu-Jui, et al.
Published: (2020)
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
by: Fu, Xingyu, et al.
Published: (2024)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
by: Fu, Tsu-Jui, et al.
Published: (2019)
Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
by: Ryan, Yuriel, et al.
Published: (2025)
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)
MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)
Exploring the Design Space of Visual Context Representation in Video MLLMs
by: Du, Yifan, et al.
Published: (2024)
TransPixeler: Advancing Text-to-Video Generation with Transparency
by: Wang, Luozhou, et al.
Published: (2025)
Autoregressive Pre-Training on Pixels and Texts
by: Chai, Yekun, et al.
Published: (2024)
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
by: Dong, Qihua, et al.
Published: (2026)
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
by: Yang, Qize, et al.
Published: (2025)
Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
by: Sun, Kaiser, et al.
Published: (2026)
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)
An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
by: Qian, Yusu, et al.
Published: (2025)
VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding
by: Pei, Rongcan, et al.
Published: (2026)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
by: Yilmaz, Nilay, et al.
Published: (2025)
Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
by: Huang, Kung-Hsiang, et al.
Published: (2024)
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)
Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
by: Zhao, Tiancheng, et al.
Published: (2024)
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
by: Miao, Ziqi, et al.
Published: (2025)
Guiding Instruction-based Image Editing via Multimodal Large Language Models
by: Fu, Tsu-Jui, et al.
Published: (2023)
From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models
by: Yang, Cheng, et al.
Published: (2026)
Pixel Sentence Representation Learning
by: Xiao, Chenghao, et al.
Published: (2024)
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
by: Wang, Jiapeng, et al.
Published: (2024)
Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)
From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)
Internalized Reasoning for Long-Context Visual Document Understanding
by: Veselka, Austin
Published: (2026)
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
by: Li, Jiachen, et al.
Published: (2024)
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
by: Chen, Zeren, et al.
Published: (2023)
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
by: Shang, Yuying, et al.
Published: (2024)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
by: Wang, Nan, et al.
Published: (2026)