

Bibliographic Details
Main Authors:	Wu, Shengqiong; Wu, Lanhu; Bao, Mingyang; Xu, Wenhao; Zhang, Hanwang; Yan, Shuicheng; Fei, Hao; Chua, Tat-Seng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.03564

Similar Items

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
by: Fei, Hao, et al.
Published: (2024)

Towards Semantic Equivalence of Tokenization in Multimodal LLM
by: Wu, Shengqiong, et al.
Published: (2024)

Universal Scene Graph Generation
by: Wu, Shengqiong, et al.
Published: (2025)

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
by: Fei, Hao, et al.
Published: (2023)

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)

Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
by: Wu, Shengqiong, et al.
Published: (2024)

Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation
by: Jin, Kaiming, et al.
Published: (2026)

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
by: Wu, Shengqiong, et al.
Published: (2025)

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
by: Wu, Shengqiong, et al.
Published: (2026)

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
by: Liu, Kai, et al.
Published: (2025)

Auto-Encoding Morph-Tokens for Multimodal LLM
by: Pan, Kaihang, et al.
Published: (2024)

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
by: Huang, Haojian, et al.
Published: (2025)

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
by: Wang, Yaoting, et al.
Published: (2025)

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
by: Qian, Long, et al.
Published: (2024)

A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)

Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
by: Jin, Zhe, et al.
Published: (2025)

Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models
by: Li, Shaotian, et al.
Published: (2026)

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
by: Yi, Xuanyu, et al.
Published: (2024)

LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection
by: Wu, Lanhu, et al.
Published: (2025)

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)

Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
by: Zhou, Yuchen, et al.
Published: (2025)

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
by: Fei, Hao, et al.
Published: (2024)

Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
by: Shi, Enyi, et al.
Published: (2026)

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
by: Ma, Weijian, et al.
Published: (2026)

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
by: Zhu, Fengbin, et al.
Published: (2024)

Discriminative Probing and Tuning for Text-to-Image Generation
by: Qu, Leigang, et al.
Published: (2024)

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models
by: Liang, Xiao, et al.
Published: (2025)

Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
by: He, Jinghan, et al.
Published: (2026)

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving
by: Zhang, Dapeng, et al.
Published: (2025)

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
by: Shi, Chuancheng, et al.
Published: (2026)

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
by: Liu, Kai, et al.
Published: (2025)

Non-confusing Generation of Customized Concepts in Diffusion Models
by: Lin, Wang, et al.
Published: (2024)

Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model
by: Shen, Fei, et al.
Published: (2025)

Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
by: Cui, Chenhang, et al.
Published: (2024)

TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
by: Qu, Leigang, et al.
Published: (2024)

FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models
by: Liao, Xinting, et al.
Published: (2025)

Understanding Long Videos via LLM-Powered Entity Relation Graphs
by: Chu, Meng, et al.
Published: (2025)