Library Catalog

Bibliographic Details
Main Authors:	Liang, Zhengyang; Shu, Yan; Liu, Xiangrui; Qin, Minghao; Liang, Kaixin; Sebe, Nicu; Liu, Zheng; Liao, Lizi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.23044

Similar Items

TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
by: Liu, Xiangrui, et al.
Published: (2025)

Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification
by: Qin, Minghao, et al.
Published: (2025)

VideoExplorer: Think With Videos For Agentic Long-Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)

Memory-enhanced Retrieval Augmentation for Long Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
by: Shu, Yan, et al.
Published: (2024)

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
by: Liang, Zhengyang, et al.
Published: (2025)

VidText: Towards Comprehensive Evaluation for Video Text Understanding
by: Yang, Zhoufaran, et al.
Published: (2025)

Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
by: Liu, Xiangrui, et al.
Published: (2025)

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
by: Li, Jinlong, et al.
Published: (2026)

Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
by: Zuo, Zhi, et al.
Published: (2025)

RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
by: Peruzzo, Elia, et al.
Published: (2025)

MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)

Transferable-guided Attention Is All You Need for Video Domain Adaptation
by: Sacilotti, André, et al.
Published: (2024)

Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
by: Dong, Jiahua, et al.
Published: (2025)

Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis
by: Tang, Hao, et al.
Published: (2025)

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
by: Li, Jinlong, et al.
Published: (2025)

H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
by: Li, Wenhao, et al.
Published: (2025)

Vision+X: A Survey on Multimodal Learning in the Light of Data
by: Zhu, Ye, et al.
Published: (2022)

Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning
by: Zheng, Haiyang, et al.
Published: (2025)

Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
by: Liu, Jiaqi, et al.
Published: (2025)

DVD: Deterministic Video Depth Estimation with Generative Priors
by: Zhang, Hongfei, et al.
Published: (2026)

CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
by: Xing, Songlong, et al.
Published: (2025)

VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
by: Peruzzo, Elia, et al.
Published: (2024)

Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation
by: Xue, Feng, et al.
Published: (2025)

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
by: Zhao, Dong, et al.
Published: (2026)

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)

Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
by: Liu, Xiao, et al.
Published: (2025)

Reverse Personalization
by: Kung, Han-Wei, et al.
Published: (2025)

Asymmetric GANs for Image-to-Image Translation
by: Tang, Hao, et al.
Published: (2019)

AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
by: Wang, Yidan, et al.
Published: (2025)

Prototypical Hash Encoding for On-the-Fly Fine-Grained Category Discovery
by: Zheng, Haiyang, et al.
Published: (2024)

Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
by: Zheng, Haiyang, et al.
Published: (2025)

Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
by: Zheng, Haiyang, et al.
Published: (2024)

RankFeat&RankWeight: Rank-1 Feature/Weight Removal for Out-of-distribution Detection
by: Song, Yue, et al.
Published: (2023)

Beyond the Known: Enhancing Open Set Domain Adaptation with Unknown Exploration
by: Silva, Lucas Fernando Alvarenga e, et al.
Published: (2024)

Task-Aware KV Compression For Cost-Effective Long Video Understanding
by: Qin, Minghao, et al.
Published: (2025)

Superpowering Open-Vocabulary Object Detectors for X-ray Vision
by: Garcia-Fernandez, Pablo, et al.
Published: (2025)

Hierarchical Cross-Attention Network for Virtual Try-On
by: Tang, Hao, et al.
Published: (2024)

Rethinking the Learning Paradigm for Facial Expression Recognition
by: Wang, Weijie, et al.
Published: (2022)

VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)