| Main Authors: | Lin, Yuanze, Chen, Yi-Wen, Tsai, Yi-Hsuan, Clark, Ronald, Yang, Ming-Hsuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.03150 |
Similar Items
Text-Driven Image Editing via Learnable Regions
by: Lin, Yuanze, et al.
Published: (2023)
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning
by: Chen, Jiun-Man, et al.
Published: (2024)
A Sleep Monitoring System Based on Audio, Video and Depth Information
by: Chen, Lyn Chao-ling, et al.
Published: (2025)
Bridging Your Imagination with Audio-Video Generation via a Unified Director
by: Zhang, Jiaxu, et al.
Published: (2025)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)
Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
by: Liu, Kangwei, et al.
Published: (2025)
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
by: Yang, Haibo, et al.
Published: (2024)
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos
by: Zhang, Chen-Lin, et al.
Published: (2025)
ReCorD: Reasoning and Correcting Diffusion for HOI Generation
by: Jiang-Lin, Jian-Yu, et al.
Published: (2024)
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
by: Yu, Lijun, et al.
Published: (2023)
SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
by: Chen, Tong, et al.
Published: (2024)
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
by: Cai, Qi, et al.
Published: (2026)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
by: Zhao, Lei, et al.
Published: (2025)
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
by: Huang, Po-Hsuan, et al.
Published: (2024)
VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
by: Gao, Shibo, et al.
Published: (2025)
RemEdit: Efficient Diffusion Editing with Riemannian Geometry
by: Adhikarla, Eashan, et al.
Published: (2026)
UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
by: Chen, Yanzhe, et al.
Published: (2025)
MOC-3D: Manifold-Order Consistency for Text-to-3D Generation
by: Fan, Chenyang, et al.
Published: (2026)
When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)
Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
by: Zhang, Junbin, et al.
Published: (2022)
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
by: Wang, Zhouxia, et al.
Published: (2023)
Bernini: Latent Semantic Planning for Video Diffusion
by: Bernini Team, et al.
Published: (2026)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
by: Zhao, Shuai, et al.
Published: (2023)
JIGMARK: A Black-Box Approach for Enhancing Image Watermarks against Diffusion Model Edits
by: Pan, Minzhou, et al.
Published: (2024)
MotionPro: A Precise Motion Controller for Image-to-Video Generation
by: Zhang, Zhongwei, et al.
Published: (2025)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection
by: Li, Yuxuan, et al.
Published: (2024)
StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
by: Chen, Liyang, et al.
Published: (2025)
LayerT2V: A Unified Multi-Layer Video Generation Framework
by: Li, Guangzhao, et al.
Published: (2025)
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
by: Cai, Qi, et al.
Published: (2025)
Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
by: Taghipour, Ashkan, et al.
Published: (2026)
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
by: Guan, Jiazhi, et al.
Published: (2025)
ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision
by: Liang, Xie, et al.
Published: (2025)
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
by: He, Xu, et al.
Published: (2024)
Adaptive Low Light Enhancement via Joint Global-Local Illumination Adjustment
by: Wang, Haodian, et al.
Published: (2025)