| Main Authors: | Yu, Shan, Zhu, Zhenting, Chen, Yu, Xu, Hanchen, Zhao, Pengzhan, Wang, Yang, Padmanabhan, Arthi, Latapie, Hugo, Xu, Harry |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Online Access: | https://arxiv.org/abs/2311.01623 |
Similar Items
TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
by: Chen, Hanning, et al.
Published: (2024)
Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?
by: Xu, Qipan, et al.
Published: (2025)
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
by: Chen, Hanning, et al.
Published: (2024)
Sign Stitching: A Novel Approach to Sign Language Production
by: Walsh, Harry, et al.
Published: (2024)
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
by: Zhang, Xueqiao, et al.
Published: (2025)
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
by: Yin, Shukang, et al.
Published: (2024)
Mitigating Object Hallucination via Robust Local Perception Search
by: Gao, Zixian, et al.
Published: (2025)
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
by: Shang, Yuying, et al.
Published: (2024)
VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool
by: Wang, Yan, et al.
Published: (2024)
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
by: Qi, Ji, et al.
Published: (2023)
Unhackable Temporal Rewarding for Scalable Video MLLMs
by: Yu, En, et al.
Published: (2025)
From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
by: Wang, Junyang, et al.
Published: (2025)
Actions and Objects Pathways for Domain Adaptation in Video Question Answering
by: Mohamud, Safaa Abdullahi Moallim, et al.
Published: (2024)
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
by: Han, Xianjing, et al.
Published: (2026)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
by: Fu, Chaoyou, et al.
Published: (2024)
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
by: Li, Yunxin, et al.
Published: (2024)
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
by: Ma, Guoqing, et al.
Published: (2025)
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
by: Miao, Yibo, et al.
Published: (2024)
Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
by: Wang, Shanshan, et al.
Published: (2026)
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
by: Xu, Yichen, et al.
Published: (2026)
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
by: Wang, Yueqian, et al.
Published: (2025)
LVCHAT: Facilitating Long Video Comprehension
by: Wang, Yu, et al.
Published: (2024)
Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
by: Qi, Ji, et al.
Published: (2025)
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
Inference Compute-Optimal Video Vision Language Models
by: Wang, Peiqi, et al.
Published: (2025)
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
by: Xiang, Sike, et al.
Published: (2026)
Rethinking Boundary Discontinuity Problem for Oriented Object Detection
by: Xu, Hang, et al.
Published: (2023)
Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
by: Baharlouei, Elaheh, et al.
Published: (2024)
CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation
by: Jiang, H., et al.
Published: (2026)
Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning
by: Xu, Chang, et al.
Published: (2024)
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
by: Yuan, Shenghai, et al.
Published: (2024)
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
by: Shlapentokh-Rothman, Michal, et al.
Published: (2026)
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
by: Xu, Kepeng, et al.
Published: (2026)