Library Catalog

Bibliographic Details
Main Authors:	Gao, Kaifeng; Chen, Siqi; Zhang, Hanwang; Xiao, Jun; Zhuang, Yueting; Sun, Qianru
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.12100

Similar Items

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models
by: Gao, Kaifeng, et al.
Published: (2024)

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
by: Gao, Kaifeng, et al.
Published: (2024)

Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models
by: Zhu, Beier, et al.
Published: (2023)

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
by: Zhao, Kesen, et al.
Published: (2025)

Few-shot Learner Parameterization by Diffusion Time-steps
by: Yue, Zhongqi, et al.
Published: (2024)

Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models
by: Cheng, Tianle, et al.
Published: (2025)

Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization
by: Qi, Jiaxin, et al.
Published: (2022)

Exploring Diffusion Time-steps for Unsupervised Representation Learning
by: Yue, Zhongqi, et al.
Published: (2024)

3D Question Answering via only 2D Vision-Language Models
by: Wang, Fengyun, et al.
Published: (2025)

Real-Time Motion-Controllable Autoregressive Video Diffusion
by: Zhao, Kesen, et al.
Published: (2025)

DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
by: Lyu, Hengye, et al.
Published: (2026)

Video Anomaly Detection and Explanation via Large Language Models
by: Lv, Hui, et al.
Published: (2024)

From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
by: Shi, Hanrong, et al.
Published: (2024)

Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
by: Chen, Zhaozheng, et al.
Published: (2023)

Physically Plausible Human-Object Rendering from Sparse Views via 3D Gaussian Splatting
by: Wang, Weiquan, et al.
Published: (2025)

Unified Generative and Discriminative Training for Multi-modal Large Language Models
by: Chow, Wei, et al.
Published: (2024)

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
by: Li, Juncheng, et al.
Published: (2023)

Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting
by: Wang, Weiquan, et al.
Published: (2026)

GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
by: Dai, Guangyu, et al.
Published: (2025)

FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
by: Jiang, Yilei, et al.
Published: (2025)

Auto-Encoding Morph-Tokens for Multimodal LLM
by: Pan, Kaihang, et al.
Published: (2024)

Reducing Class-Wise Performance Disparity via Margin Regularization
by: Zhu, Beier, et al.
Published: (2026)

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models
by: Meng, Chutian, et al.
Published: (2024)

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
by: Yu, Qifan, et al.
Published: (2024)

NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation
by: Li, Lin, et al.
Published: (2022)

Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
by: Zhu, Beier, et al.
Published: (2025)

IDPro: Flexible Interactive Video Object Segmentation by ID-queried Concurrent Propagation
by: Li, Kexin, et al.
Published: (2024)

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
by: Pan, Kaihang, et al.
Published: (2024)

Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization
by: Zhang, Yuxi, et al.
Published: (2025)

Non-confusing Generation of Customized Concepts in Diffusion Models
by: Lin, Wang, et al.
Published: (2024)

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
by: Pan, Kaihang, et al.
Published: (2025)

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
by: Hu, Zijing, et al.
Published: (2025)

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark
by: Wang, Wei, et al.
Published: (2026)

Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark
by: Miao, Bingchen, et al.
Published: (2024)

Two Causal Principles for Improving Visual Dialog
by: Qi, Jiaxin, et al.
Published: (2019)

SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models
by: Zheng, Haoyu, et al.
Published: (2025)

Diffusion Time-step Curriculum for One Image to 3D Generation
by: Yi, Xuanyu, et al.
Published: (2024)

Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
by: Wan, Siqi, et al.
Published: (2025)

Learning De-Biased Representations for Remote-Sensing Imagery
by: Tian, Zichen, et al.
Published: (2024)