Library Catalog

Bibliographic Details
Main Authors:	Ding, Yanbo; Zhuang, Shaobin; Li, Kunchang; Yue, Zhengrong; Qiao, Yu; Wang, Yali
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.10605

Similar Items

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
by: Yue, Zhengrong, et al.
Published: (2025)

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
by: Guo, Yiwei, et al.
Published: (2024)

Vlogger: Make Your Dream A Vlog
by: Zhuang, Shaobin, et al.
Published: (2024)

TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision
by: Zhuang, Shaobin, et al.
Published: (2025)

Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
by: Chen, Boyu, et al.
Published: (2025)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
by: Chen, Siran, et al.
Published: (2025)

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
by: Chen, Boyu, et al.
Published: (2025)

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)

3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
by: Zhang, Jusheng, et al.
Published: (2026)

Video-GPT via Next Clip Diffusion
by: Zhuang, Shaobin, et al.
Published: (2025)

BitDance: Scaling Autoregressive Generative Models with Binary Tokens
by: Ai, Yuang, et al.
Published: (2026)

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition
by: Chen, Boyu, et al.
Published: (2024)

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
by: Chen, Boyu, et al.
Published: (2025)

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
by: Guo, Yiwei, et al.
Published: (2026)

Harvest Video Foundation Models via Efficient Post-Pretraining
by: Li, Yizhuo, et al.
Published: (2023)

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
by: Wang, Zikang, et al.
Published: (2025)

MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
by: Hu, Xirui, et al.
Published: (2026)

VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)

Unmasked Teacher: Towards Training-Efficient Video Foundation Models
by: Li, Kunchang, et al.
Published: (2023)

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
by: Wang, Zun, et al.
Published: (2024)

MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
by: Brödermann, Tim, et al.
Published: (2024)

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
by: Yu, Yang, et al.
Published: (2026)

UniFormer: Unifying Convolution and Self-attention for Visual Recognition
by: Li, Kunchang, et al.
Published: (2022)

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
by: Wang, Ziyi, et al.
Published: (2024)

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration
by: Hu, Panwen, et al.
Published: (2024)

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
by: Wang, Yu, et al.
Published: (2026)

Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration
by: Hu, Yue, et al.
Published: (2025)

FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching
by: Zhu, Huayi, et al.
Published: (2025)

Feedforward 3D Editing via Text-Steerable Image-to-3D
by: Ma, Ziqi, et al.
Published: (2025)

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
by: Zhuang, Shaobin, et al.
Published: (2026)

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)

Any4D: Open-Prompt 4D Generation from Natural Language and Images
by: Li, Hao, et al.
Published: (2025)

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
by: Yang, Zhongyu, et al.
Published: (2026)

MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning
by: Yang, Pu, et al.
Published: (2025)

GEN3D: Generating Domain-Free 3D Scenes from a Single Image
by: Zhang, Yuxin, et al.
Published: (2025)

Collaborative Multi-Modal Coding for High-Quality 3D Generation
by: Cao, Ziang, et al.
Published: (2025)

MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models
by: Fang, Shaoheng, et al.
Published: (2025)

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
by: Baba, Kaito, et al.
Published: (2026)

MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents
by: Xing, Yun, et al.
Published: (2024)