| Main Authors: | Li, Wei; Hu, Bing; Shao, Rui; Shen, Leyang; Nie, Liqiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.03663 |
Similar Items
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
by: Shen, Leyang, et al.
Published: (2024)
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
by: Hu, Bing, et al.
Published: (2026)
DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer
by: Jiang, Junpeng, et al.
Published: (2025)
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
by: Cheng, Zixu, et al.
Published: (2026)
Slot-VLM: SlowFast Slots for Video-Language Modeling
by: Xu, Jiaqi, et al.
Published: (2024)
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
by: Li, Wei, et al.
Published: (2025)
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
by: Wang, Shijian, et al.
Published: (2025)
Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
by: Dedhia, Bhishma, et al.
Published: (2025)
Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)
ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
by: Wang, Kun, et al.
Published: (2026)
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
by: Yang, Zhenyu, et al.
Published: (2025)
A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)
R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking
by: Li, Zixu, et al.
Published: (2026)
OneThinker: All-in-one Reasoning Model for Image and Video
by: Feng, Kaituo, et al.
Published: (2025)
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
by: Li, Chenglin, et al.
Published: (2026)
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
by: Yin, Tianwei, et al.
Published: (2024)
DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection
by: Shao, Rui, et al.
Published: (2023)
AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis
by: Yang, Zhiwei, et al.
Published: (2025)
LION: Implicit Vision Prompt Tuning
by: Wang, Haixin, et al.
Published: (2023)
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
by: Li, Zaijing, et al.
Published: (2026)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)
Seeing Fast and Slow: Learning the Flow of Time in Videos
by: Wu, Yen-Siang, et al.
Published: (2026)
StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
by: Wang, Shaokun, et al.
Published: (2026)
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
by: Wang, Xiao, et al.
Published: (2024)
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)
Towards Harmless Multimodal Assistants with Blind Preference Optimization
by: Li, Yongqi, et al.
Published: (2025)
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
by: Shao, Rui, et al.
Published: (2025)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation
by: Gao, Junyu, et al.
Published: (2023)
Eliminating Warping Shakes for Unsupervised Online Video Stitching
by: Nie, Lang, et al.
Published: (2024)
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)
SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging
by: Zeng, Haijin, et al.
Published: (2025)
Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
by: Tan, Wenhui, et al.
Published: (2026)
SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening
by: Nahin, Shahriar Kabir, et al.
Published: (2026)
FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
by: He, Zefeng, et al.
Published: (2025)
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
by: Chu, Xiangxiang, et al.
Published: (2023)
VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
by: Wang, Qunzhong, et al.
Published: (2025)
PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
by: Lyu, Yibo, et al.
Published: (2025)
Object-Shot Enhanced Grounding Network for Egocentric Video
by: Feng, Yisen, et al.
Published: (2025)