:: Library Catalog

Bibliographic Details
Main Authors:	Yu, Shan; Zhu, Zhenting; Chen, Yu; Xu, Hanchen; Zhao, Pengzhan; Wang, Yang; Padmanabhan, Arthi; Latapie, Hugo; Xu, Harry
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2311.01623

Similar Items

TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
by: Chen, Hanning, et al.
Published: (2024)

Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?
by: Xu, Qipan, et al.
Published: (2025)

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
by: Chen, Hanning, et al.
Published: (2024)

Sign Stitching: A Novel Approach to Sign Language Production
by: Walsh, Harry, et al.
Published: (2024)

Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
by: Zhang, Xueqiao, et al.
Published: (2025)

Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
by: Yin, Shukang, et al.
Published: (2024)

Mitigating Object Hallucination via Robust Local Perception Search
by: Gao, Zixian, et al.
Published: (2025)

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
by: Shang, Yuying, et al.
Published: (2024)

VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool
by: Wang, Yan, et al.
Published: (2024)

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
by: Qi, Ji, et al.
Published: (2023)

Unhackable Temporal Rewarding for Scalable Video MLLMs
by: Yu, En, et al.
Published: (2025)

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)

Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
by: Wang, Junyang, et al.
Published: (2025)

Actions and Objects Pathways for Domain Adaptation in Video Question Answering
by: Mohamud, Safaa Abdullahi Moallim, et al.
Published: (2024)

OSCBench: Benchmarking Object State Change in Text-to-Video Generation
by: Han, Xianjing, et al.
Published: (2026)

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
by: Fu, Chaoyou, et al.
Published: (2024)

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
by: Li, Yunxin, et al.
Published: (2024)

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
by: Ma, Guoqing, et al.
Published: (2025)

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
by: Miao, Yibo, et al.
Published: (2024)

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
by: Wang, Shanshan, et al.
Published: (2026)

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
by: Xu, Yichen, et al.
Published: (2026)

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
by: Wang, Yueqian, et al.
Published: (2025)

LVCHAT: Facilitating Long Video Comprehension
by: Wang, Yu, et al.
Published: (2024)

Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)

An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
by: Qi, Ji, et al.
Published: (2025)

VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

Inference Compute-Optimal Video Vision Language Models
by: Wang, Peiqi, et al.
Published: (2025)

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
by: Xiang, Sike, et al.
Published: (2026)

Rethinking Boundary Discontinuity Problem for Oriented Object Detection
by: Xu, Hang, et al.
Published: (2023)

Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)

Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
by: Baharlouei, Elaheh, et al.
Published: (2024)

CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation
by: Jiang, H., et al.
Published: (2026)

Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning
by: Xu, Chang, et al.
Published: (2024)

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
by: Yuan, Shenghai, et al.
Published: (2024)

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
by: Shlapentokh-Rothman, Michal, et al.
Published: (2026)

Allegory of the Cave: Measurement-Grounded Vision-Language Learning
by: Xu, Kepeng, et al.
Published: (2026)