:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Chen-Lin; Sui, Lin; Liu, Shuming; Mu, Fangzhou; Wang, Zhangcheng; Ghanem, Bernard
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2503.06526

Similar Items

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
by: Liu, Shuming, et al.
Published: (2023)

End-to-End Optimized Image Compression with the Frequency-Oriented Transform
by: Zhang, Yuefeng, et al.
Published: (2024)

Harnessing Temporal Causality for Advanced Temporal Action Detection
by: Liu, Shuming, et al.
Published: (2024)

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)

Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines
by: Liu, Junle, et al.
Published: (2025)

DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor
by: Zhao, Yan, et al.
Published: (2025)

Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development
by: Sheng, Xihua, et al.
Published: (2026)

End-to-end Semantic-centric Video-based Multimodal Affective Computing
by: Lin, Ronghao, et al.
Published: (2024)

Deep-JGAC: End-to-End Deep Joint Geometry and Attribute Compression for Dense Colored Point Clouds
by: Zhang, Yun, et al.
Published: (2025)

LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)

One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
by: Fang, Xinyu, et al.
Published: (2024)

VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
by: Gu, Jing, et al.
Published: (2024)

Bridging Your Imagination with Audio-Video Generation via a Unified Director
by: Zhang, Jiaxu, et al.
Published: (2025)

Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation
by: Tian, Huilin, et al.
Published: (2024)

Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)

Perceptual Learned Image Compression via End-to-End JND-Based Optimization
by: Pakdaman, Farhad, et al.
Published: (2024)

End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model
by: Wang, Haofeng, et al.
Published: (2025)

MotionPro: A Precise Motion Controller for Image-to-Video Generation
by: Zhang, Zhongwei, et al.
Published: (2025)

Generative Frame Sampler for Long Video Understanding
by: Yao, Linli, et al.
Published: (2025)

Joint End-to-End Image Compression and Denoising: Leveraging Contrastive Learning and Multi-Scale Self-ONNs
by: Xie, Yuxin, et al.
Published: (2024)

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)

Memory-enhanced Retrieval Augmentation for Long Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)

Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation
by: Wei, Jia, et al.
Published: (2025)

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
by: Wang, Shaoguang, et al.
Published: (2026)

SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment
by: Zhang, Hongyang, et al.
Published: (2025)

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
by: Mei, Yuting, et al.
Published: (2024)

Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)

VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
by: Gao, Shibo, et al.
Published: (2025)

VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability
by: Cohendet, Romain, et al.
Published: (2018)

Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
by: Xie, Zequn, et al.
Published: (2026)

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)

PG-Attack: A Precision-Guided Adversarial Attack Framework Against Vision Foundation Models for Autonomous Driving
by: Fu, Jiyuan, et al.
Published: (2024)

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
by: Lan, Xiaohan, et al.
Published: (2024)

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
by: Huang, Dawei, et al.
Published: (2025)