| Main Authors: | Wang, Ziyi, Wu, Haoran, Rong, Yiming, Jiang, Deyang, Zhang, Yixin, Zhao, Yunlong, Xu, Shuang, Xu, Bo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.06835 |
Similar Items
Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
by: Rong, Yiming, et al.
Published: (2025)
Linear Scaling Video VLMs for Long Video Understanding
by: Eyzaguirre, Cristobal, et al.
Published: (2026)
LVC-LGMC: Joint Local and Global Motion Compensation for Learned Video Compression
by: Jiang, Wei, et al.
Published: (2024)
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
by: Liu, Xiangrui, et al.
Published: (2025)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
by: Shen, Xiaoqian, et al.
Published: (2024)
Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
by: Zhao, Xinkui, et al.
Published: (2025)
NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
by: Uricchio, Tiberio, et al.
Published: (2026)
Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression
by: Zhang, Yichi, et al.
Published: (2026)
Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
by: Cheng, Dingxin, et al.
Published: (2024)
LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs
by: Luo, Kun, et al.
Published: (2026)
FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
by: Guo, Yanan, et al.
Published: (2025)
Stateful Token Reduction for Long-Video Hybrid VLMs
by: Jiang, Jindong, et al.
Published: (2026)
Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
by: Wang, Zheng, et al.
Published: (2026)
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
by: Salazar, Israfel, et al.
Published: (2025)
EEA: Exploration-Exploitation Agent for Long Video Understanding
by: Yang, Te, et al.
Published: (2025)
LoViC: Efficient Long Video Generation with Context Compression
by: Jiang, Jiaxiu, et al.
Published: (2025)
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
by: Ma, Wentao, et al.
Published: (2025)
OmniVid: A Generative Framework for Universal Video Understanding
by: Wang, Junke, et al.
Published: (2024)
A Unified Framework for Human-centric Point Cloud Video Understanding
by: Xu, Yiteng, et al.
Published: (2024)
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
by: Xie, Yiweng, et al.
Published: (2026)
Towards Lossless Ultimate Vision Token Compression for VLMs
by: Zheng, Dehua, et al.
Published: (2025)
Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising
by: Yuan, Yunlong, et al.
Published: (2025)
Towards Event-oriented Long Video Understanding
by: Du, Yifan, et al.
Published: (2024)
UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
by: Wang, Ziyi, et al.
Published: (2026)
StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
by: Wang, Junxi, et al.
Published: (2026)
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
by: Yamao, Sosuke, et al.
Published: (2026)
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
by: Patel, Shrenik, et al.
Published: (2025)
Controllable Generative Video Compression
by: Ding, Ding, et al.
Published: (2026)
MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
by: Shu, Yan, et al.
Published: (2024)
Benchmarking and Enhancing VLM for Compressed Image Understanding
by: Zhang, Zifu, et al.
Published: (2025)
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
by: Lan, Xiaohan, et al.
Published: (2024)
VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
by: Zhang, Xuanyu, et al.
Published: (2025)
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
by: Wang, Jiaqi, et al.
Published: (2025)
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding
by: Guo, Chaohong, et al.
Published: (2025)
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
by: Wang, Mengyue, et al.
Published: (2025)
Generative Frame Sampler for Long Video Understanding
by: Yao, Linli, et al.
Published: (2025)
Extreme Video Compression with Pre-trained Diffusion Models
by: Li, Bohan, et al.
Published: (2024)
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context
by: Li, Jungang, et al.
Published: (2024)