| Main Authors: | Gao, Kaifeng, Chen, Siqi, Zhang, Hanwang, Xiao, Jun, Zhuang, Yueting, Sun, Qianru |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.12100 |
Similar Items
ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models
by: Gao, Kaifeng, et al.
Published: (2024)
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
by: Gao, Kaifeng, et al.
Published: (2024)
Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models
by: Zhu, Beier, et al.
Published: (2023)
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
by: Zhao, Kesen, et al.
Published: (2025)
Few-shot Learner Parameterization by Diffusion Time-steps
by: Yue, Zhongqi, et al.
Published: (2024)
Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models
by: Cheng, Tianle, et al.
Published: (2025)
Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization
by: Qi, Jiaxin, et al.
Published: (2022)
Exploring Diffusion Time-steps for Unsupervised Representation Learning
by: Yue, Zhongqi, et al.
Published: (2024)
3D Question Answering via only 2D Vision-Language Models
by: Wang, Fengyun, et al.
Published: (2025)
Real-Time Motion-Controllable Autoregressive Video Diffusion
by: Zhao, Kesen, et al.
Published: (2025)
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
by: Lyu, Hengye, et al.
Published: (2026)
Video Anomaly Detection and Explanation via Large Language Models
by: Lv, Hui, et al.
Published: (2024)
From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
by: Shi, Hanrong, et al.
Published: (2024)
Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
by: Chen, Zhaozheng, et al.
Published: (2023)
Physically Plausible Human-Object Rendering from Sparse Views via 3D Gaussian Splatting
by: Wang, Weiquan, et al.
Published: (2025)
Unified Generative and Discriminative Training for Multi-modal Large Language Models
by: Chow, Wei, et al.
Published: (2024)
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
by: Li, Juncheng, et al.
Published: (2023)
Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting
by: Wang, Weiquan, et al.
Published: (2026)
GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
by: Dai, Guangyu, et al.
Published: (2025)
FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
by: Jiang, Yilei, et al.
Published: (2025)
Auto-Encoding Morph-Tokens for Multimodal LLM
by: Pan, Kaihang, et al.
Published: (2024)
Reducing Class-Wise Performance Disparity via Margin Regularization
by: Zhu, Beier, et al.
Published: (2026)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models
by: Meng, Chutian, et al.
Published: (2024)
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
by: Yu, Qifan, et al.
Published: (2024)
NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation
by: Li, Lin, et al.
Published: (2022)
Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
by: Zhu, Beier, et al.
Published: (2025)
IDPro: Flexible Interactive Video Object Segmentation by ID-queried Concurrent Propagation
by: Li, Kexin, et al.
Published: (2024)
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)
Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
by: Pan, Kaihang, et al.
Published: (2024)
Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization
by: Zhang, Yuxi, et al.
Published: (2025)
Non-confusing Generation of Customized Concepts in Diffusion Models
by: Lin, Wang, et al.
Published: (2024)
Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
by: Pan, Kaihang, et al.
Published: (2025)
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
by: Hu, Zijing, et al.
Published: (2025)
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark
by: Wang, Wei, et al.
Published: (2026)
Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark
by: Miao, Bingchen, et al.
Published: (2024)
Two Causal Principles for Improving Visual Dialog
by: Qi, Jiaxin, et al.
Published: (2019)
SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models
by: Zheng, Haoyu, et al.
Published: (2025)
Diffusion Time-step Curriculum for One Image to 3D Generation
by: Yi, Xuanyu, et al.
Published: (2024)
Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
by: Wan, Siqi, et al.
Published: (2025)
Learning De-Biased Representations for Remote-Sensing Imagery
by: Tian, Zichen, et al.
Published: (2024)