Library Catalog

Bibliographic Details
Main Authors:	Zhai, Albert J.; Shen, Yuan; Chen, Emily Y.; Wang, Gloria X.; Wang, Xinlei; Wang, Sheng; Guan, Kaiyu; Wang, Shenlong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2404.04242

Similar Items

CropCraft: Complete Structural Characterization of Crop Plants From Images
by: Zhai, Albert J., et al.
Published: (2024)

Demeter: A Parametric Model of Crop Plant Morphology from the Real World
by: Cheng, Tianhang, et al.
Published: (2025)

Human-like Navigation in a World Built for Humans
by: Chandaka, Bhargav, et al.
Published: (2025)

AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
by: Hsu, Hao-Yu, et al.
Published: (2024)

Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
by: Cheng, Tianhang, et al.
Published: (2024)

Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
by: Yao, David Yifan, et al.
Published: (2025)

Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)

SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
by: Xia, Haotian, et al.
Published: (2024)

EarthGen: Generating the World from Top-Down Views
by: Sharma, Ansh, et al.
Published: (2024)

Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)

Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
by: Ma, Boyi, et al.
Published: (2025)

VAEER: Visual Attention-Inspired Emotion Elicitation Reasoning
by: Man, Fanhang, et al.
Published: (2025)

Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
by: Wei, Xinyu, et al.
Published: (2025)

FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
by: Bourigault, Emmanuelle, et al.
Published: (2025)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

Improving Language Understanding from Screenshots
by: Gao, Tianyu, et al.
Published: (2024)

Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs
by: Hong, Weihao, et al.
Published: (2026)

NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)

Panoptic Vision-Language Feature Fields
by: Chen, Haoran, et al.
Published: (2023)

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
by: Xu, Runsen, et al.
Published: (2025)

ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
by: Tian, Xueyun, et al.
Published: (2026)

PointLLM: Empowering Large Language Models to Understand Point Clouds
by: Xu, Runsen, et al.
Published: (2023)

Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

VISaGE: Understanding Visual Generics and Exceptions
by: Frank, Stella, et al.
Published: (2025)

Inference Compute-Optimal Video Vision Language Models
by: Wang, Peiqi, et al.
Published: (2025)

PUMGPT: A Large Vision-Language Model for Product Understanding
by: Xue, Wei, et al.
Published: (2023)

Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)

Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
by: Cheng, Zihui, et al.
Published: (2025)

Enhancing Large Vision Language Models with Self-Training on Image Comprehension
by: Deng, Yihe, et al.
Published: (2024)

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
by: Wang, Yuxuan, et al.
Published: (2024)

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
by: Yang, Zongxin, et al.
Published: (2024)

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
by: Li, Ang, et al.
Published: (2025)

Understanding Museum Exhibits using Vision-Language Reasoning
by: Balauca, Ada-Astrid, et al.
Published: (2024)

Anatomical Structure-Guided Medical Vision-Language Pre-training
by: Li, Qingqiu, et al.
Published: (2024)

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
by: Wang, Zhecan, et al.
Published: (2023)

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)

Deciphering Oracle Bone Language with Diffusion Models
by: Guan, Haisu, et al.
Published: (2024)

Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)