| Main Authors: | Wang, Chunwei, Lu, Guansong, Yang, Junwei, Huang, Runhui, Han, Jianhua, Hou, Lu, Zhang, Wei, Xu, Hang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.06673 |
Similar Items
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
by: Huang, Runhui, et al.
Published: (2025)
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
by: Huang, Runhui, et al.
Published: (2024)
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
by: Huang, Runhui, et al.
Published: (2024)
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
by: Lu, Guansong, et al.
Published: (2023)
UNIT: Unifying Image and Text Recognition in One Vision Encoder
by: Zhu, Yi, et al.
Published: (2024)
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
by: Chen, Zisheng, et al.
Published: (2025)
Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
by: Nie, Ming, et al.
Published: (2026)
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)
KFFocus: Highlighting Keyframes for Enhanced Video Understanding
by: Nie, Ming, et al.
Published: (2025)
RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
by: Jiang, Zutao, et al.
Published: (2023)
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
by: Chen, Kai, et al.
Published: (2024)
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
by: Zhang, Zihao, et al.
Published: (2025)
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
by: Nie, Ming, et al.
Published: (2023)
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving
by: Ding, Xinpeng, et al.
Published: (2023)
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)
See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning
by: Wei, Yuxi, et al.
Published: (2026)
Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data
by: Wang, Haonan, et al.
Published: (2023)
KAN See Your Face
by: Han, Dong, et al.
Published: (2024)
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
by: Qin, Feiwei, et al.
Published: (2025)
Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
by: Xiang, Kun, et al.
Published: (2025)
Seeing the World through Your Eyes
by: Alzayer, Hadi, et al.
Published: (2023)
See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
by: Wu, Yuan, et al.
Published: (2025)
Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising
by: Yuan, Yunlong, et al.
Published: (2025)
You Only Speak Once to See
by: Yang, Wenhao, et al.
Published: (2024)
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
by: Liu, Zhe, et al.
Published: (2025)
DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
by: Zhao, Haoyu, et al.
Published: (2025)
Does YOLO Really Need to See Every Training Image in Every Epoch?
by: Xie, Xingxing, et al.
Published: (2026)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
by: Li, Rong, et al.
Published: (2024)
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
by: Zhang, Jinglei, et al.
Published: (2025)
Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
by: Zhang, Baoheng, et al.
Published: (2026)
Federated Out-of-Distribution Generalization: A Causal Augmentation View
by: Zhang, Runhui, et al.
Published: (2025)
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
by: Huang, Runhui, et al.
Published: (2026)
Paintings and Drawings Aesthetics Assessment with Rich Attributes for Various Artistic Categories
by: Jin, Xin, et al.
Published: (2024)
Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis
by: Gao, Xuehao, et al.
Published: (2025)
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
by: Yuan, Qianhao, et al.
Published: (2026)
Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts
by: Zhu, Jiawen, et al.
Published: (2024)
Improving Out-of-Distribution Detection with Disentangled Foreground and Background Features
by: Ding, Choubo, et al.
Published: (2023)
Zero-Shot Out-of-Distribution Detection with Outlier Label Exposure
by: Ding, Choubo, et al.
Published: (2024)
Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
by: Liu, Xianlin, et al.
Published: (2025)
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
by: Li, Honglin, et al.
Published: (2024)