:: Library Catalog

Bibliographic Details
Main Authors:	Bao, Xiaoyi; Sun, Siyang; Ma, Shuailei; Zheng, Kecheng; Guo, Yuxin; Zhao, Guosheng; Zheng, Yun; Wang, Xingang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.05673

Similar Items

Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model
by: Ma, Shuailei, et al.
Published: (2023)

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
by: Zhao, Guosheng, et al.
Published: (2024)

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
by: Wang, Xiaofeng, et al.
Published: (2024)

LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
by: Wu, Wei, et al.
Published: (2024)

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
by: Bao, Xiaoyi, et al.
Published: (2025)

DreamLIP: Language-Image Pre-training with Long Captions
by: Zheng, Kecheng, et al.
Published: (2024)

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
by: Zhao, Guosheng, et al.
Published: (2025)

Learning Consistent Taxonomic Classification through Hierarchical Reasoning
by: Li, Zhenghong, et al.
Published: (2026)

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
by: Lu, Fan, et al.
Published: (2024)

Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
by: Guo, Yuxin, et al.
Published: (2024)

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
by: Qu, Zhen, et al.
Published: (2026)

Learning Visual Generative Priors without Text
by: Ma, Shuailei, et al.
Published: (2024)

Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection
by: Jia, Mingda, et al.
Published: (2024)

GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
by: Bao, Xiaoyi, et al.
Published: (2025)

Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search
by: Liang, Tianming, et al.
Published: (2026)

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation
by: Tan, Shuai, et al.
Published: (2026)

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
by: Liu, Shi, et al.
Published: (2024)

Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
by: Zhang, Yuchen, et al.
Published: (2026)

HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
by: Wang, Boyuan, et al.
Published: (2025)

UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
by: Tang, Hao, et al.
Published: (2025)

EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
by: Zheng, Xu, et al.
Published: (2024)

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
by: Zhao, Guosheng, et al.
Published: (2024)

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
by: Tang, Hao, et al.
Published: (2025)

HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
by: Wang, Boyuan, et al.
Published: (2025)

ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
by: Wang, Boyuan, et al.
Published: (2026)

DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds
by: Zhang, Jiaxu, et al.
Published: (2025)

Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning
by: Wan, Xixi, et al.
Published: (2025)

EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
by: Zheng, Xu, et al.
Published: (2024)

ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation
by: Lin, Jingzhong, et al.
Published: (2025)

PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval
by: Yan, Jincheng, et al.
Published: (2025)

Contextual AD Narration with Interleaved Multimodal Sequence
by: Wang, Hanlin, et al.
Published: (2024)

SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector
by: Ma, Shuailei, et al.
Published: (2023)

UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
by: Zhao, Guosheng, et al.
Published: (2026)

ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)

Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
by: Bao, Nan, et al.
Published: (2025)

Navigating Image Restoration with VAR's Distribution Alignment Prior
by: Wang, Siyang, et al.
Published: (2024)

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
by: Chang, Yifan, et al.
Published: (2025)

Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
by: Li, Haocheng, et al.
Published: (2026)

UKnow: A Unified Knowledge Protocol with Multimodal Knowledge Graph Datasets for Reasoning and Vision-Language Pre-Training
by: Gong, Biao, et al.
Published: (2023)