| Main Authors: | Wu, Jianlong, Liu, Wei, Liu, Ye, Liu, Meng, Nie, Liqiang, Lin, Zhouchen, Chen, Chang Wen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.10922 |
Similar Items
MegaSR: Mining Customized Semantics and Expressive Guidance for Real-World Image Super-Resolution
by: Li, Xinrui, et al.
Published: (2025)
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
by: Liu, Ye, et al.
Published: (2025)
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
by: Wang, Xiao, et al.
Published: (2024)
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
by: Wang, Xiao, et al.
Published: (2024)
Object-Shot Enhanced Grounding Network for Egocentric Video
by: Feng, Yisen, et al.
Published: (2025)
$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
by: Liu, Ye, et al.
Published: (2024)
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
by: Shao, Rui, et al.
Published: (2025)
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
by: Wang, Yuxuan, et al.
Published: (2024)
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
by: Liu, Juntao, et al.
Published: (2025)
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
by: Shen, Leyang, et al.
Published: (2024)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
by: Qu, Mengxue, et al.
Published: (2024)
Expectation-Maximization Attention Networks for Semantic Segmentation
by: Li, Xia, et al.
Published: (2019)
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)
StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
by: Wang, Shaokun, et al.
Published: (2026)
TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs
by: Wang, Yunxiao, et al.
Published: (2025)
Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge
by: Zhang, Jinrong, et al.
Published: (2026)
The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation
by: He, Xusheng, et al.
Published: (2026)
GenView++: Unifying Adaptive Generative Augmentation and Quality-Driven Supervision for Contrastive Representation Learning
by: Li, Xiaojie, et al.
Published: (2025)
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
by: Wang, Xiao, et al.
Published: (2025)
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
by: Fan, Rong, et al.
Published: (2026)
LION: Implicit Vision Prompt Tuning
by: Wang, Haixin, et al.
Published: (2023)
Continuous Knowledge-Preserving Decomposition with Adaptive Layer Selection for Few-Shot Class-Incremental Learning
by: Li, Xiaojie, et al.
Published: (2025)
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
by: Wang, Xiao, et al.
Published: (2025)
Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency
by: Li, Zihan, et al.
Published: (2025)
Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation
by: Liu, Han, et al.
Published: (2025)
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding
by: Gao, Shida, et al.
Published: (2025)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)
F-LMM: Grounding Frozen Large Multimodal Models
by: Wu, Size, et al.
Published: (2024)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models
by: Zhan, Yu-Wei, et al.
Published: (2023)
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
by: Zhao, Fufangchen, et al.
Published: (2025)
A Refer-and-Ground Multimodal Large Language Model for Biomedicine
by: Huang, Xiaoshuang, et al.
Published: (2024)
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
by: Zheng, Minghang, et al.
Published: (2024)
VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
by: Wang, Hanqing, et al.
Published: (2026)
Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models
by: Zhang, Cheng, et al.
Published: (2026)
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
by: Li, Wei, et al.
Published: (2025)
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
by: Lin, Haoqiang, et al.
Published: (2025)
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
by: Zhang, Haoyu, et al.
Published: (2025)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
by: Xiong, Yuanhao, et al.
Published: (2023)