:: Library Catalog

Bibliographic Details
Main Authors:	Dou, Huanzhang; Li, Ruixiang; Su, Wei; Li, Xi
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2407.01921

Similar Items

ScanFormer: Referring Expression Comprehension by Iteratively Scanning
by: Su, Wei, et al.
Published: (2024)

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation
by: Yuan, Yike, et al.
Published: (2024)

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)

CLASH: Complementary Learning with Neural Architecture Search for Gait Recognition
by: Dou, Huanzhang, et al.
Published: (2024)

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
by: Wang, Zhao, et al.
Published: (2024)

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
by: Zhang, Wenhu, et al.
Published: (2026)

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
by: Zhao, Ruixiang, et al.
Published: (2026)

Group Diffusion Transformers are Unsupervised Multitask Learners
by: Huang, Lianghua, et al.
Published: (2024)

Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
by: Lan, Bangxiang, et al.
Published: (2025)

In-Context LoRA for Diffusion Transformers
by: Huang, Lianghua, et al.
Published: (2024)

Decoupled Video Generation with Chain of Training-free Diffusion Model Experts
by: Li, Wenhao, et al.
Published: (2024)

IDEA-Bench: How Far are Generative Models from Professional Designing?
by: Liang, Chen, et al.
Published: (2024)

Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction
by: Yu, Li, et al.
Published: (2025)

Grid Diffusion Models for Text-to-Video Generation
by: Lee, Taegyeong, et al.
Published: (2024)

Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
by: Hassan, Mariam, et al.
Published: (2025)

ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
by: Huang, Lianghua, et al.
Published: (2024)

Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
by: Li, Hongyu, et al.
Published: (2024)

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
by: Zhang, Yabo, et al.
Published: (2024)

HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird's Eye View
by: Wu, Yiming, et al.
Published: (2023)

On Semiotic-Grounded Interpretive Evaluation of Generative Art
by: Jiang, Ruixiang, et al.
Published: (2026)

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
by: Wang, Chen, et al.
Published: (2025)

VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
by: Yang, Lehan, et al.
Published: (2025)

OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
by: Xi, Dianbing, et al.
Published: (2025)

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
by: Zhang, David Junhao, et al.
Published: (2023)

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)

TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
by: Zhang, Guofeng, et al.
Published: (2025)

T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models
by: Li, Changzhen, et al.
Published: (2025)

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
by: Zheng, Guangcong, et al.
Published: (2023)

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
by: Jeong, Hyeonho, et al.
Published: (2023)

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
by: Kara, Ozgur, et al.
Published: (2025)

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
by: Li, Xiaomin, et al.
Published: (2024)

Exploring Iterative Refinement with Diffusion Models for Video Grounding
by: Liang, Xiao, et al.
Published: (2023)

Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
by: Zhang, Ruixin, et al.
Published: (2025)

Multi-sentence Video Grounding for Long Video Generation
by: Feng, Wei, et al.
Published: (2024)

Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
by: Zhang, Chi, et al.
Published: (2026)

Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models
by: Jiang, Rui, et al.
Published: (2025)

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
by: Jagpal, Diljeet, et al.
Published: (2025)

CamI2V: Camera-Controlled Image-to-Video Diffusion Model
by: Zheng, Guangcong, et al.
Published: (2024)

Dual-Stream Diffusion Net for Text-to-Video Generation
by: Liu, Binhui, et al.
Published: (2023)

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
by: Wang, Wenjing, et al.
Published: (2023)