| Main Authors: | Zhang, Chen-Lin, Sui, Lin, Liu, Shuming, Mu, Fangzhou, Wang, Zhangcheng, Ghanem, Bernard |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.06526 |
Similar Items
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
by: Liu, Shuming, et al.
Published: (2023)
End-to-End Optimized Image Compression with the Frequency-Oriented Transform
by: Zhang, Yuefeng, et al.
Published: (2024)
Harnessing Temporal Causality for Advanced Temporal Action Detection
by: Liu, Shuming, et al.
Published: (2024)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines
by: Liu, Junle, et al.
Published: (2025)
DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor
by: Zhao, Yan, et al.
Published: (2025)
Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development
by: Sheng, Xihua, et al.
Published: (2026)
End-to-end Semantic-centric Video-based Multimodal Affective Computing
by: Lin, Ronghao, et al.
Published: (2024)
Deep-JGAC: End-to-End Deep Joint Geometry and Attribute Compression for Dense Colored Point Clouds
by: Zhang, Yun, et al.
Published: (2025)
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)
One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
by: Fang, Xinyu, et al.
Published: (2024)
VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
by: Gu, Jing, et al.
Published: (2024)
Bridging Your Imagination with Audio-Video Generation via a Unified Director
by: Zhang, Jiaxu, et al.
Published: (2025)
Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation
by: Tian, Huilin, et al.
Published: (2024)
Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)
Perceptual Learned Image Compression via End-to-End JND-Based Optimization
by: Pakdaman, Farhad, et al.
Published: (2024)
End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model
by: Wang, Haofeng, et al.
Published: (2025)
MotionPro: A Precise Motion Controller for Image-to-Video Generation
by: Zhang, Zhongwei, et al.
Published: (2025)
Generative Frame Sampler for Long Video Understanding
by: Yao, Linli, et al.
Published: (2025)
Joint End-to-End Image Compression and Denoising: Leveraging Contrastive Learning and Multi-Scale Self-ONNs
by: Xie, Yuxin, et al.
Published: (2024)
When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
Memory-enhanced Retrieval Augmentation for Long Video Understanding
by: Yuan, Huaying, et al.
Published: (2025)
Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation
by: Wei, Jia, et al.
Published: (2025)
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
by: Wang, Shaoguang, et al.
Published: (2026)
SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment
by: Zhang, Hongyang, et al.
Published: (2025)
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
by: Mei, Yuting, et al.
Published: (2024)
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)
VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
by: Gao, Shibo, et al.
Published: (2025)
VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability
by: Cohendet, Romain, et al.
Published: (2018)
Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
by: Xie, Zequn, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
PG-Attack: A Precision-Guided Adversarial Attack Framework Against Vision Foundation Models for Autonomous Driving
by: Fu, Jiyuan, et al.
Published: (2024)
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
by: Lan, Xiaohan, et al.
Published: (2024)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
by: Huang, Dawei, et al.
Published: (2025)