:: Library Catalog

Bibliographic Details
Main Authors:	Lin, Yuanze; Chen, Yi-Wen; Tsai, Yi-Hsuan; Clark, Ronald; Yang, Ming-Hsuan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning; Multimedia
Online Access:	https://arxiv.org/abs/2506.03150

Similar Items

Text-Driven Image Editing via Learnable Regions
by: Lin, Yuanze, et al.
Published: (2023)

Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)

QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning
by: Chen, Jiun-Man, et al.
Published: (2024)

A Sleep Monitoring System Based on Audio, Video and Depth Information
by: Chen, Lyn Chao-ling, et al.
Published: (2025)

Bridging Your Imagination with Audio-Video Generation via a Unified Director
by: Zhang, Jiaxu, et al.
Published: (2025)

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)

Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)

Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
by: Liu, Kangwei, et al.
Published: (2025)

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
by: Yang, Haibo, et al.
Published: (2024)

TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos
by: Zhang, Chen-Lin, et al.
Published: (2025)

ReCorD: Reasoning and Correcting Diffusion for HOI Generation
by: Jiang-Lin, Jian-Yu, et al.
Published: (2024)

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
by: Yu, Lijun, et al.
Published: (2023)

SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
by: Chen, Tong, et al.
Published: (2024)

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
by: Cai, Qi, et al.
Published: (2026)

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)

UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
by: Zhao, Lei, et al.
Published: (2025)

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)

Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
by: Huang, Po-Hsuan, et al.
Published: (2024)

VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
by: Gao, Shibo, et al.
Published: (2025)

RemEdit: Efficient Diffusion Editing with Riemannian Geometry
by: Adhikarla, Eashan, et al.
Published: (2026)

UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
by: Chen, Yanzhe, et al.
Published: (2025)

MOC-3D: Manifold-Order Consistency for Text-to-3D Generation
by: Fan, Chenyang, et al.
Published: (2026)

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)

Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
by: Zhang, Junbin, et al.
Published: (2022)

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
by: Wang, Zhouxia, et al.
Published: (2023)

Bernini: Latent Semantic Planning for Video Diffusion
by: Bernini Team, et al.
Published: (2026)

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
by: Zhao, Shuai, et al.
Published: (2023)

JIGMARK: A Black-Box Approach for Enhancing Image Watermarks against Diffusion Model Edits
by: Pan, Minzhou, et al.
Published: (2024)

MotionPro: A Precise Motion Controller for Image-to-Video Generation
by: Zhang, Zhongwei, et al.
Published: (2025)

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection
by: Li, Yuxuan, et al.
Published: (2024)

StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
by: Chen, Liyang, et al.
Published: (2025)

LayerT2V: A Unified Multi-Layer Video Generation Framework
by: Li, Guangzhao, et al.
Published: (2025)

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
by: Cai, Qi, et al.
Published: (2025)

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
by: Taghipour, Ashkan, et al.
Published: (2026)

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
by: Guan, Jiazhi, et al.
Published: (2025)

ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision
by: Liang, Xie, et al.
Published: (2025)

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
by: He, Xu, et al.
Published: (2024)

Adaptive Low Light Enhancement via Joint Global-Local Illumination Adjustment
by: Wang, Haodian, et al.
Published: (2025)