| Main Authors: | Zhai, Albert J., Shen, Yuan, Chen, Emily Y., Wang, Gloria X., Wang, Xinlei, Wang, Sheng, Guan, Kaiyu, Wang, Shenlong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.04242 |
Similar Items
CropCraft: Complete Structural Characterization of Crop Plants From Images
by: Zhai, Albert J., et al.
Published: (2024)
Demeter: A Parametric Model of Crop Plant Morphology from the Real World
by: Cheng, Tianhang, et al.
Published: (2025)
Human-like Navigation in a World Built for Humans
by: Chandaka, Bhargav, et al.
Published: (2025)
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
by: Hsu, Hao-Yu, et al.
Published: (2024)
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
by: Cheng, Tianhang, et al.
Published: (2024)
Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
by: Yao, David Yifan, et al.
Published: (2025)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
by: Xia, Haotian, et al.
Published: (2024)
EarthGen: Generating the World from Top-Down Views
by: Sharma, Ansh, et al.
Published: (2024)
Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)
Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
by: Ma, Boyi, et al.
Published: (2025)
VAEER: Visual Attention-Inspired Emotion Elicitation Reasoning
by: Man, Fanhang, et al.
Published: (2025)
Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
by: Wei, Xinyu, et al.
Published: (2025)
FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
by: Bourigault, Emmanuelle, et al.
Published: (2025)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
Improving Language Understanding from Screenshots
by: Gao, Tianyu, et al.
Published: (2024)
Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs
by: Hong, Weihao, et al.
Published: (2026)
NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)
Panoptic Vision-Language Feature Fields
by: Chen, Haoran, et al.
Published: (2023)
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
by: Xu, Runsen, et al.
Published: (2025)
ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
by: Tian, Xueyun, et al.
Published: (2026)
PointLLM: Empowering Large Language Models to Understand Point Clouds
by: Xu, Runsen, et al.
Published: (2023)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
VISaGE: Understanding Visual Generics and Exceptions
by: Frank, Stella, et al.
Published: (2025)
Inference Compute-Optimal Video Vision Language Models
by: Wang, Peiqi, et al.
Published: (2025)
PUMGPT: A Large Vision-Language Model for Product Understanding
by: Xue, Wei, et al.
Published: (2023)
Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
by: Cheng, Zihui, et al.
Published: (2025)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
by: Deng, Yihe, et al.
Published: (2024)
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
by: Wang, Yuxuan, et al.
Published: (2024)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
by: Yang, Zongxin, et al.
Published: (2024)
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
by: Li, Ang, et al.
Published: (2025)
Understanding Museum Exhibits using Vision-Language Reasoning
by: Balauca, Ada-Astrid, et al.
Published: (2024)
Anatomical Structure-Guided Medical Vision-Language Pre-training
by: Li, Qingqiu, et al.
Published: (2024)
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
by: Wang, Zhecan, et al.
Published: (2023)
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)
Deciphering Oracle Bone Language with Diffusion Models
by: Guan, Haisu, et al.
Published: (2024)
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)