| Main Authors: | Liang, Zhengyang, Shu, Yan, Liu, Xiangrui, Qin, Minghao, Liang, Kaixin, Sebe, Nicu, Liu, Zheng, Liao, Lizi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.23044 |
Similar Items
TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
by: Liu, Xiangrui, et al.
Published: (2025)
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
by: Qin, Minghao, et al.
Published: (2025)
VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
Memory-enhanced Retrieval Augmentation for Long Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
by: Shu, Yan, et al.
Published: (2024)
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
by: Liang, Zhengyang, et al.
Published: (2025)
VidText: Towards Comprehensive Evaluation for Video Text Understanding
by: Yang, Zhoufaran, et al.
Published: (2025)
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
by: Liu, Xiangrui, et al.
Published: (2025)
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
by: Li, Jinlong, et al.
Published: (2026)
Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
by: Zuo, Zhi, et al.
Published: (2025)
RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
by: Peruzzo, Elia, et al.
Published: (2025)
MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)
Transferable-guided Attention Is All You Need for Video Domain Adaptation
by: Sacilotti, André, et al.
Published: (2024)
Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
by: Dong, Jiahua, et al.
Published: (2025)
Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis
by: Tang, Hao, et al.
Published: (2025)
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
by: Li, Jinlong, et al.
Published: (2025)
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
by: Li, Wenhao, et al.
Published: (2025)
Vision+X: A Survey on Multimodal Learning in the Light of Data
by: Zhu, Ye, et al.
Published: (2022)
Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning
by: Zheng, Haiyang, et al.
Published: (2025)
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
by: Liu, Jiaqi, et al.
Published: (2025)
DVD: Deterministic Video Depth Estimation with Generative Priors
by: Zhang, Hongfei, et al.
Published: (2026)
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
by: Xing, Songlong, et al.
Published: (2025)
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
by: Peruzzo, Elia, et al.
Published: (2024)
Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation
by: Xue, Feng, et al.
Published: (2025)
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
by: Zhao, Dong, et al.
Published: (2026)
When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)
Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
by: Liu, Xiao, et al.
Published: (2025)
Reverse Personalization
by: Kung, Han-Wei, et al.
Published: (2025)
Asymmetric GANs for Image-to-Image Translation
by: Tang, Hao, et al.
Published: (2019)
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
by: Wang, Yidan, et al.
Published: (2025)
Prototypical Hash Encoding for On-the-Fly Fine-Grained Category Discovery
by: Zheng, Haiyang, et al.
Published: (2024)
Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
by: Zheng, Haiyang, et al.
Published: (2025)
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
by: Zheng, Haiyang, et al.
Published: (2024)
RankFeat&RankWeight: Rank-1 Feature/Weight Removal for Out-of-distribution Detection
by: Song, Yue, et al.
Published: (2023)
Beyond the Known: Enhancing Open Set Domain Adaptation with Unknown Exploration
by: Silva, Lucas Fernando Alvarenga e, et al.
Published: (2024)
Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)
Superpowering Open-Vocabulary Object Detectors for X-ray Vision
by: Garcia-Fernandez, Pablo, et al.
Published: (2025)
Hierarchical Cross-Attention Network for Virtual Try-On
by: Tang, Hao, et al.
Published: (2024)
Rethinking the Learning Paradigm for Facial Expression Recognition
by: Wang, Weijie, et al.
Published: (2022)
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)