:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Ye; Ma, Zongyang; Qi, Zhongang; Wu, Yang; Shan, Ying; Chen, Chang Wen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.18111

Similar Items

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
by: Liu, Ye, et al.
Published: (2025)

EA-VTR: Event-Aware Video-Text Retrieval
by: Ma, Zongyang, et al.
Published: (2024)

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
by: Yang, Tao, et al.
Published: (2024)

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
by: Xu, Weili, et al.
Published: (2025)

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
by: Chen, Yuxin, et al.
Published: (2024)

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)

DOGR: Towards Versatile Visual Document Grounding and Referring
by: Zhou, Yinan, et al.
Published: (2024)

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
by: Wu, Tao, et al.
Published: (2024)

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
by: Zheng, Guangcong, et al.
Published: (2023)

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model
by: Wu, Tao, et al.
Published: (2024)

EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
by: Qiu, Zongyang, et al.
Published: (2025)

OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes
by: Kurkova, Regina, et al.
Published: (2026)

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
by: Zhang, Zhixiong, et al.
Published: (2026)

VEU-Bench: Towards Comprehensive Understanding of Video Editing
by: Li, Bozheng, et al.
Published: (2025)

iMOVE: Instance-Motion-Aware Video Understanding
by: Li, Jiaze, et al.
Published: (2025)

StyleAdapter: A Unified Stylized Image Generation Model
by: Wang, Zhouxia, et al.
Published: (2023)

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
by: Li, Xuewei, et al.
Published: (2023)

Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders
by: Stevens, Samuel, et al.
Published: (2025)

InstructionBench: An Instructional Video Understanding Benchmark
by: Wei, Haiwan, et al.
Published: (2025)

Towards Event-oriented Long Video Understanding
by: Du, Yifan, et al.
Published: (2024)

Hawk: Learning to Understand Open-World Video Anomalies
by: Tang, Jiaqi, et al.
Published: (2024)

Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
by: Liang, Tianming, et al.
Published: (2024)

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
by: Qi, Yukun, et al.
Published: (2025)

Taming Rectified Flow for Inversion and Editing
by: Wang, Jiangshan, et al.
Published: (2024)

AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
by: Chen, Xiuyuan, et al.
Published: (2023)

Generative Region-Language Pretraining for Open-Ended Object Detection
by: Lin, Chuang, et al.
Published: (2024)

EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning
by: Du, Jia-Run, et al.
Published: (2022)

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
by: Yu, Songsong, et al.
Published: (2025)

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
by: Huang, Wenhui, et al.
Published: (2026)

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
by: Liu, Hongbo, et al.
Published: (2025)

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding
by: Wu, Yanmin, et al.
Published: (2024)

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)

TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
by: Liu, Zhaoyu, et al.
Published: (2026)

EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
by: Liu, Shaoyu, et al.
Published: (2025)

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)

Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation
by: Zhao, Zhonghan, et al.
Published: (2024)

Open-Event Procedure Planning in Instructional Videos
by: Wu, Yilu, et al.
Published: (2024)

TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)

VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
by: Cheng, Ying, et al.
Published: (2025)