:: Library Catalog

Bibliographic Details
Main Authors:	Li, Wei; Hu, Bing; Shao, Rui; Shen, Leyang; Nie, Liqiang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.03663

Similar Items

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
by: Shen, Leyang, et al.
Published: (2024)

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
by: Hu, Bing, et al.
Published: (2026)

DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer
by: Jiang, Junpeng, et al.
Published: (2025)

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
by: Cheng, Zixu, et al.
Published: (2026)

Slot-VLM: SlowFast Slots for Video-Language Modeling
by: Xu, Jiaqi, et al.
Published: (2024)

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
by: Li, Wei, et al.
Published: (2025)

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
by: Wang, Shijian, et al.
Published: (2025)

Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
by: Dedhia, Bhishma, et al.
Published: (2025)

Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)

ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
by: Wang, Kun, et al.
Published: (2026)

LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
by: Yang, Zhenyu, et al.
Published: (2025)

A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking
by: Li, Zixu, et al.
Published: (2026)

OneThinker: All-in-one Reasoning Model for Image and Video
by: Feng, Kaituo, et al.
Published: (2025)

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
by: Li, Chenglin, et al.
Published: (2026)

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
by: Yin, Tianwei, et al.
Published: (2024)

DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection
by: Shao, Rui, et al.
Published: (2023)

AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis
by: Yang, Zhiwei, et al.
Published: (2025)

LION: Implicit Vision Prompt Tuning
by: Wang, Haixin, et al.
Published: (2023)

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
by: Li, Zaijing, et al.
Published: (2026)

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)

Seeing Fast and Slow: Learning the Flow of Time in Videos
by: Wu, Yen-Siang, et al.
Published: (2026)

StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
by: Wang, Shaokun, et al.
Published: (2026)

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
by: Wang, Xiao, et al.
Published: (2024)

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)

Towards Harmless Multimodal Assistants with Blind Preference Optimization
by: Li, Yongqi, et al.
Published: (2025)

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
by: Shao, Rui, et al.
Published: (2025)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation
by: Gao, Junyu, et al.
Published: (2023)

Eliminating Warping Shakes for Unsupervised Online Video Stitching
by: Nie, Lang, et al.
Published: (2024)

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)

SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging
by: Zeng, Haijin, et al.
Published: (2025)

Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
by: Tan, Wenhui, et al.
Published: (2026)

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening
by: Nahin, Shahriar Kabir, et al.
Published: (2026)

FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
by: He, Zefeng, et al.
Published: (2025)

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
by: Chu, Xiangxiang, et al.
Published: (2023)

VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
by: Wang, Qunzhong, et al.
Published: (2025)

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
by: Lyu, Yibo, et al.
Published: (2025)

Object-Shot Enhanced Grounding Network for Egocentric Video
by: Feng, Yisen, et al.
Published: (2025)