| Main Authors: | Wang, Ziteng, He, Yujie, Li, Guanliang, Yang, Siqi, Xiong, Jiaqi, Liu, Songxiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.04897 |
Similar Items
Evaluating Text-to-Visual Generation with Image-to-Text Generation
by: Lin, Zhiqiu, et al.
Published: (2024)
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
by: Xu, Jiaqi, et al.
Published: (2023)
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
by: Li, Baiqi, et al.
Published: (2024)
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
by: Luo, Jianjie, et al.
Published: (2024)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
by: Han, Jiaming, et al.
Published: (2025)
DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling
by: Lan, Jing, et al.
Published: (2026)
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
by: Sun, Shuyang, et al.
Published: (2023)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
by: Wu, Qiong, et al.
Published: (2024)
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
by: Geng, Tiantian, et al.
Published: (2024)
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
by: Hendria, Willy Fitra
Published: (2023)
Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
by: Zhu, Xiangru, et al.
Published: (2024)
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
by: Zhong, Yiwu, et al.
Published: (2024)
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
by: Chen, Wenting, et al.
Published: (2025)
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)
Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?
by: Huang, Yuesheng, et al.
Published: (2025)
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
by: Qu, Leigang, et al.
Published: (2024)
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
by: Cai, Zhongang, et al.
Published: (2025)
EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation
by: Xu, Jiaqi, et al.
Published: (2024)
Beyond Coarse-Grained Matching in Video-Text Retrieval
by: Chen, Aozhu, et al.
Published: (2024)
LookAhead Tuning: Safer Language Models via Partial Answer Previews
by: Liu, Kangwei, et al.
Published: (2025)
MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models
by: Lin, Xiao, et al.
Published: (2025)
Contrastive Visual Data Augmentation
by: Zhou, Yu, et al.
Published: (2025)
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
by: Zeng, Ziyao, et al.
Published: (2024)
Language Models as Black-Box Optimizers for Vision-Language Models
by: Liu, Shihong, et al.
Published: (2023)
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
by: Duan, Chengqi, et al.
Published: (2025)
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
by: Zhang, Dongxu, et al.
Published: (2026)
Movie101v2: Improved Movie Narration Benchmark
by: Yue, Zihao, et al.
Published: (2024)
Benchmarking Large Multimodal Models against Common Corruptions
by: Zhang, Jiawei, et al.
Published: (2024)
MultiIoT: Benchmarking Machine Learning for the Internet of Things
by: Mo, Shentong, et al.
Published: (2023)
RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction
by: Liu, Shuhong, et al.
Published: (2025)
TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
by: Qu, Leigang, et al.
Published: (2024)
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
by: Deng, Ailin, et al.
Published: (2025)
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
by: Maeda, Koki, et al.
Published: (2024)
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
by: Wang, Xinran, et al.
Published: (2026)
DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter
by: Dong, Ziyi, et al.
Published: (2022)
Improving Gloss-free Sign Language Translation by Reducing Representation Density
by: Ye, Jinhui, et al.
Published: (2024)
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
by: Ma, Ziyang, et al.
Published: (2025)
Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
by: Zhao, Zhixian, et al.
Published: (2026)
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
by: Jiang, Yuechen, et al.
Published: (2026)