| Main Authors: | Gauba, Aruna, Pi, Irene, Man, Yunze, Pang, Ziqi, Adve, Vikram S., Wang, Yu-Xiong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.10568 |
Similar Items
PPTArena: A Benchmark for Agentic PowerPoint Editing
by: Ofengenden, Michael, et al.
Published: (2025)
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
by: Pang, Ziqi, et al.
Published: (2023)
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
by: Awal, Rabiul, et al.
Published: (2025)
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
by: Sun, Yuxuan, et al.
Published: (2024)
MR. Video: "MapReduce" is the Principle for Long Video Understanding
by: Pang, Ziqi, et al.
Published: (2025)
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
by: Dongre, Vardhan, et al.
Published: (2025)
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
by: Pang, Ziqi, et al.
Published: (2024)
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
by: Gupta, Vinayak, et al.
Published: (2024)
Generalized Open-World Semi-Supervised Object Detection
by: Allabadi, Garvita, et al.
Published: (2023)
SceneCraft: Layout-Guided 3D Scene Generation
by: Yang, Xiuyu, et al.
Published: (2024)
DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception
by: Man, Yunze, et al.
Published: (2023)
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
by: Pang, Ziqi, et al.
Published: (2025)
Situational Awareness Matters in 3D Vision Language Reasoning
by: Man, Yunze, et al.
Published: (2024)
Floating No More: Object-Ground Reconstruction from a Single Image
by: Man, Yunze, et al.
Published: (2024)
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
by: Zhang, Zheyu, et al.
Published: (2026)
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
by: Lin, Lang, et al.
Published: (2025)
Towards Energy-Efficiency by Navigating the Trilemma of Energy, Latency, and Accuracy
by: Tian, Boyuan, et al.
Published: (2024)
RMem: Restricted Memory Banks Improve Video Object Segmentation
by: Zhou, Junbao, et al.
Published: (2024)
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
by: Man, Yunze, et al.
Published: (2024)
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
by: Xie, Wulin, et al.
Published: (2025)
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
by: Xia, Haotian, et al.
Published: (2024)
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
by: Li, Qingmei, et al.
Published: (2025)
LiCAF: LiDAR-Camera Asymmetric Fusion for Gait Recognition
by: Deng, Yunze, et al.
Published: (2024)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)
MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding
by: Kou, Qian, et al.
Published: (2026)
FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion
by: Ruan, Jiacheng, et al.
Published: (2024)
CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding
by: Liu, Yunze, et al.
Published: (2024)
QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs
by: Denipitiyage, Dishanika, et al.
Published: (2026)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
by: Man, Yunze, et al.
Published: (2025)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
by: Ye, Zi, et al.
Published: (2026)
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
by: Man, Yunze, et al.
Published: (2025)
UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
by: Zhang, Da, et al.
Published: (2025)
GaitGS: Temporal Feature Learning in Granularity and Span Dimension for Gait Recognition
by: Xiong, Haijun, et al.
Published: (2023)
STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting
by: Deng, Yunze, et al.
Published: (2025)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
by: Huang, Ziqi, et al.
Published: (2024)
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
by: Amirloo, Elmira, et al.
Published: (2024)
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
by: Wu, Peiran, et al.
Published: (2025)
MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
by: Nie, Yiqi, et al.
Published: (2026)