| Main authors: | Wu, Shengqiong, Wu, Lanhu, Bao, Mingyang, Xu, Wenhao, Zhang, Hanwang, Yan, Shuicheng, Fei, Hao, Chua, Tat-Seng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2603.03564 |
Similar items
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
by: Fei, Hao, et al.
Published: (2024)
Towards Semantic Equivalence of Tokenization in Multimodal LLM
by: Wu, Shengqiong, et al.
Published: (2024)
Universal Scene Graph Generation
by: Wu, Shengqiong, et al.
Published: (2025)
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
by: Fei, Hao, et al.
Published: (2023)
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
by: Wu, Shengqiong, et al.
Published: (2024)
Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation
by: Jin, Kaiming, et al.
Published: (2026)
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
by: Wu, Shengqiong, et al.
Published: (2025)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
by: Wu, Shengqiong, et al.
Published: (2026)
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
by: Liu, Kai, et al.
Published: (2025)
Auto-Encoding Morph-Tokens for Multimodal LLM
by: Pan, Kaihang, et al.
Published: (2024)
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
by: Huang, Haojian, et al.
Published: (2025)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
by: Wang, Yaoting, et al.
Published: (2025)
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
by: Qian, Long, et al.
Published: (2024)
A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)
Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
by: Jin, Zhe, et al.
Published: (2025)
Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models
by: Li, Shaotian, et al.
Published: (2026)
MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
by: Yi, Xuanyu, et al.
Published: (2024)
LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection
by: Wu, Lanhu, et al.
Published: (2025)
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
by: Zhou, Yuchen, et al.
Published: (2025)
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
by: Fei, Hao, et al.
Published: (2024)
Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
by: Shi, Enyi, et al.
Published: (2026)
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
by: Ma, Weijian, et al.
Published: (2026)
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
by: Zhu, Fengbin, et al.
Published: (2024)
Discriminative Probing and Tuning for Text-to-Image Generation
by: Qu, Leigang, et al.
Published: (2024)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models
by: Liang, Xiao, et al.
Published: (2025)
Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
by: He, Jinghan, et al.
Published: (2026)
Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving
by: Zhang, Dapeng, et al.
Published: (2025)
TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
by: Shi, Chuancheng, et al.
Published: (2026)
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
by: Liu, Kai, et al.
Published: (2025)
Non-confusing Generation of Customized Concepts in Diffusion Models
by: Lin, Wang, et al.
Published: (2024)
Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model
by: Shen, Fei, et al.
Published: (2025)
Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
by: Cui, Chenhang, et al.
Published: (2024)
TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
by: Qu, Leigang, et al.
Published: (2024)
FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models
by: Liao, Xinting, et al.
Published: (2025)
Understanding Long Videos via LLM-Powered Entity Relation Graphs
by: Chu, Meng, et al.
Published: (2025)