:: Library Catalog

Bibliographic Details
Main Authors:	Lu, Yujie; Li, Xiujun; Fu, Tsu-Jui; Eckstein, Miguel; Wang, William Yang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2405.14213

Similar Items

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
by: Li, Xiujun, et al.
Published: (2023)

SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning
by: Fu, Tsu-Jui, et al.
Published: (2020)

Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
by: Fu, Xingyu, et al.
Published: (2024)

Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
by: Fu, Tsu-Jui, et al.
Published: (2019)

Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
by: Ryan, Yuriel, et al.
Published: (2025)

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)

MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)

Exploring the Design Space of Visual Context Representation in Video MLLMs
by: Du, Yifan, et al.
Published: (2024)

TransPixeler: Advancing Text-to-Video Generation with Transparency
by: Wang, Luozhou, et al.
Published: (2025)

Autoregressive Pre-Training on Pixels and Texts
by: Chai, Yekun, et al.
Published: (2024)

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
by: Dong, Qihua, et al.
Published: (2026)

HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
by: Yang, Qize, et al.
Published: (2025)

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
by: Sun, Kaiser, et al.
Published: (2026)

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)

An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
by: Qian, Yusu, et al.
Published: (2025)

VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding
by: Pei, Rongcan, et al.
Published: (2026)

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
by: Yilmaz, Nilay, et al.
Published: (2025)

Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
by: Huang, Kung-Hsiang, et al.
Published: (2024)

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)

Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
by: Zhao, Tiancheng, et al.
Published: (2024)

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
by: Miao, Ziqi, et al.
Published: (2025)

Guiding Instruction-based Image Editing via Multimodal Large Language Models
by: Fu, Tsu-Jui, et al.
Published: (2023)

From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models
by: Yang, Cheng, et al.
Published: (2026)

Pixel Sentence Representation Learning
by: Xiao, Chenghao, et al.
Published: (2024)

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
by: Wang, Jiapeng, et al.
Published: (2024)

Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)

Internalized Reasoning for Long-Context Visual Document Understanding
by: Veselka, Austin
Published: (2026)

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
by: Li, Jiachen, et al.
Published: (2024)

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
by: Chen, Zeren, et al.
Published: (2023)

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
by: Shang, Yuying, et al.
Published: (2024)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)

PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
by: Wang, Nan, et al.
Published: (2026)