| Main Authors: | Bao, Xiaoyi, Sun, Siyang, Ma, Shuailei, Zheng, Kecheng, Guo, Yuxin, Zhao, Guosheng, Zheng, Yun, Wang, Xingang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.05673 |
Similar Items
Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)
Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model
by: Ma, Shuailei, et al.
Published: (2023)
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
by: Zhao, Guosheng, et al.
Published: (2024)
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
by: Wang, Xiaofeng, et al.
Published: (2024)
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
by: Wu, Wei, et al.
Published: (2024)
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
by: Bao, Xiaoyi, et al.
Published: (2025)
DreamLIP: Language-Image Pre-training with Long Captions
by: Zheng, Kecheng, et al.
Published: (2024)
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
by: Zhao, Guosheng, et al.
Published: (2025)
Learning Consistent Taxonomic Classification through Hierarchical Reasoning
by: Li, Zhenghong, et al.
Published: (2026)
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
by: Lu, Fan, et al.
Published: (2024)
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
by: Guo, Yuxin, et al.
Published: (2024)
AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
by: Qu, Zhen, et al.
Published: (2026)
Learning Visual Generative Priors without Text
by: Ma, Shuailei, et al.
Published: (2024)
Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection
by: Jia, Mingda, et al.
Published: (2024)
GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
by: Bao, Xiaoyi, et al.
Published: (2025)
Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search
by: Liang, Tianming, et al.
Published: (2026)
CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation
by: Tan, Shuai, et al.
Published: (2026)
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
by: Liu, Shi, et al.
Published: (2024)
Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
by: Zhang, Yuchen, et al.
Published: (2026)
HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
by: Wang, Boyuan, et al.
Published: (2025)
UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
by: Tang, Hao, et al.
Published: (2025)
EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
by: Zheng, Xu, et al.
Published: (2024)
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
by: Zhao, Guosheng, et al.
Published: (2024)
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
by: Tang, Hao, et al.
Published: (2025)
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
by: Wang, Boyuan, et al.
Published: (2025)
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
by: Wang, Boyuan, et al.
Published: (2026)
DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds
by: Zhang, Jiaxu, et al.
Published: (2025)
Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning
by: Wan, Xixi, et al.
Published: (2025)
EventDance++: Language-guided Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
by: Zheng, Xu, et al.
Published: (2024)
ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation
by: Lin, Jingzhong, et al.
Published: (2025)
PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval
by: Yan, Jincheng, et al.
Published: (2025)
Contextual AD Narration with Interleaved Multimodal Sequence
by: Wang, Hanlin, et al.
Published: (2024)
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector
by: Ma, Shuailei, et al.
Published: (2023)
UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
by: Zhao, Guosheng, et al.
Published: (2026)
ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)
Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
by: Bao, Nan, et al.
Published: (2025)
Navigating Image Restoration with VAR's Distribution Alignment Prior
by: Wang, Siyang, et al.
Published: (2024)
Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
by: Chang, Yifan, et al.
Published: (2025)
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
by: Li, Haocheng, et al.
Published: (2026)
UKnow: A Unified Knowledge Protocol with Multimodal Knowledge Graph Datasets for Reasoning and Vision-Language Pre-Training
by: Gong, Biao, et al.
Published: (2023)