| Main Authors: | Zhu, Yuhan, Zeng, Xiangyu, Wang, Chenting, Li, Xinhao, Liu, Chunxu, Xu, Yicheng, Yan, Ziang, Wang, Yi, Wang, Limin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.24621 |
Similar Items
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
by: Wang, Chenting, et al.
Published: (2025)
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)
Make Your Training Flexible: Towards Deployment-Efficient Video Models
by: Wang, Chenting, et al.
Published: (2025)
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
by: Wang, Yi, et al.
Published: (2025)
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
by: Yan, Ziang, et al.
Published: (2024)
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
by: Li, Xinhao, et al.
Published: (2023)
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
by: Li, Xinhao, et al.
Published: (2024)
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)
Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
by: Liu, Chunxu, et al.
Published: (2025)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)
Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)
StreamForest: Efficient Online Video Understanding with Persistent Event Memory
by: Zeng, Xiangyu, et al.
Published: (2025)
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)
SORCE: Small Object Retrieval in Complex Environments
by: Liu, Chunxu, et al.
Published: (2025)
History-Aware Transformation of ReID Features for Multiple Object Tracking
by: Gao, Ruopeng, et al.
Published: (2025)
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
by: Jiang, Tianxiang, et al.
Published: (2025)
Training-Free Reasoning and Reflection in MLLMs
by: Wei, Hongchen, et al.
Published: (2025)
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)
Sparse Global Matching for Video Frame Interpolation with Large Motion
by: Liu, Chunxu, et al.
Published: (2024)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)
ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Redirection
by: Sun, Yi, et al.
Published: (2026)
Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models
by: Zhang, Zhiqi, et al.
Published: (2026)
VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations
by: Dong, Lu, et al.
Published: (2025)
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
by: Li, Xinhao, et al.
Published: (2024)
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
by: Xu, Yifan, et al.
Published: (2024)
VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
by: Das, Deepayan, et al.
Published: (2025)
RIVER: A Real-Time Interaction Benchmark for Video LLMs
by: Shi, Yansong, et al.
Published: (2026)
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
by: Han, Su Ho, et al.
Published: (2025)
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
by: Liu, Yuanxin, et al.
Published: (2025)
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
by: Xu, Yicheng, et al.
Published: (2025)
RetCompletion: High-Speed Inference Image Completion with Retentive Network
by: Cang, Yueyang, et al.
Published: (2024)
ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins
by: Liu, Xinhao, et al.
Published: (2026)
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
by: Wang, Jiahao, et al.
Published: (2025)
Training-Free Pretrained Model Merging
by: Xu, Zhengqi, et al.
Published: (2024)
TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction
by: Zheng, Zhijie, et al.
Published: (2026)
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
by: Shi, Fengyuan, et al.
Published: (2023)
Online Video Understanding: OVBench and VideoChat-Online
by: Huang, Zhenpeng, et al.
Published: (2024)
Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
by: Wang, Tong, et al.
Published: (2026)
M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
by: Liu, Che, et al.
Published: (2025)