| Main Authors: | Liu, Ye, Ma, Zongyang, Qi, Zhongang, Wu, Yang, Shan, Ying, Chen, Chang Wen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.18111 |
Similar Items
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
by: Liu, Ye, et al.
Published: (2025)
EA-VTR: Event-Aware Video-Text Retrieval
by: Ma, Zongyang, et al.
Published: (2024)
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
by: Yang, Tao, et al.
Published: (2024)
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
by: Xu, Weili, et al.
Published: (2025)
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
by: Chen, Yuxin, et al.
Published: (2024)
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)
DOGR: Towards Versatile Visual Document Grounding and Referring
by: Zhou, Yinan, et al.
Published: (2024)
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
by: Wu, Tao, et al.
Published: (2024)
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
by: Zheng, Guangcong, et al.
Published: (2023)
SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model
by: Wu, Tao, et al.
Published: (2024)
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
by: Qiu, Zongyang, et al.
Published: (2025)
OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes
by: Kurkova, Regina, et al.
Published: (2026)
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
by: Zhang, Zhixiong, et al.
Published: (2026)
VEU-Bench: Towards Comprehensive Understanding of Video Editing
by: Li, Bozheng, et al.
Published: (2025)
iMOVE: Instance-Motion-Aware Video Understanding
by: Li, Jiaze, et al.
Published: (2025)
StyleAdapter: A Unified Stylized Image Generation Model
by: Wang, Zhouxia, et al.
Published: (2023)
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
by: Li, Xuewei, et al.
Published: (2023)
Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders
by: Stevens, Samuel, et al.
Published: (2025)
InstructionBench: An Instructional Video Understanding Benchmark
by: Wei, Haiwan, et al.
Published: (2025)
Towards Event-oriented Long Video Understanding
by: Du, Yifan, et al.
Published: (2024)
Hawk: Learning to Understand Open-World Video Anomalies
by: Tang, Jiaqi, et al.
Published: (2024)
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
by: Liang, Tianming, et al.
Published: (2024)
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
by: Qi, Yukun, et al.
Published: (2025)
Taming Rectified Flow for Inversion and Editing
by: Wang, Jiangshan, et al.
Published: (2024)
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
by: Chen, Xiuyuan, et al.
Published: (2023)
Generative Region-Language Pretraining for Open-Ended Object Detection
by: Lin, Chuang, et al.
Published: (2024)
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)
Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning
by: Du, Jia-Run, et al.
Published: (2022)
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
by: Yu, Songsong, et al.
Published: (2025)
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
by: Huang, Wenhui, et al.
Published: (2026)
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
by: Liu, Hongbo, et al.
Published: (2025)
OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding
by: Wu, Yanmin, et al.
Published: (2024)
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)
TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
by: Liu, Zhaoyu, et al.
Published: (2026)
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
by: Liu, Shaoyu, et al.
Published: (2025)
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)
Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation
by: Zhao, Zhonghan, et al.
Published: (2024)
Open-Event Procedure Planning in Instructional Videos
by: Wu, Yilu, et al.
Published: (2024)
TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
by: Cheng, Ying, et al.
Published: (2025)