:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Chunwei; Lu, Guansong; Yang, Junwei; Huang, Runhui; Han, Jianhua; Hou, Lu; Zhang, Wei; Xu, Hang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2412.06673

Similar Items

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
by: Huang, Runhui, et al.
Published: (2025)

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
by: Huang, Runhui, et al.
Published: (2024)

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
by: Huang, Runhui, et al.
Published: (2024)

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
by: Lu, Guansong, et al.
Published: (2023)

UNIT: Unifying Image and Text Recognition in One Vision Encoder
by: Zhu, Yi, et al.
Published: (2024)

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
by: Chen, Zisheng, et al.
Published: (2025)

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
by: Nie, Ming, et al.
Published: (2026)

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)

KFFocus: Highlighting Keyframes for Enhanced Video Understanding
by: Nie, Ming, et al.
Published: (2025)

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
by: Jiang, Zutao, et al.
Published: (2023)

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
by: Chen, Kai, et al.
Published: (2024)

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
by: Zhang, Zihao, et al.
Published: (2025)

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
by: Nie, Ming, et al.
Published: (2023)

HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving
by: Ding, Xinpeng, et al.
Published: (2023)

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)

See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning
by: Wei, Yuxi, et al.
Published: (2026)

Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data
by: Wang, Haonan, et al.
Published: (2023)

KAN See Your Face
by: Han, Dong, et al.
Published: (2024)

Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
by: Qin, Feiwei, et al.
Published: (2025)

Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
by: Xiang, Kun, et al.
Published: (2025)

Seeing the World through Your Eyes
by: Alzayer, Hadi, et al.
Published: (2023)

See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
by: Wu, Yuan, et al.
Published: (2025)

Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising
by: Yuan, Yunlong, et al.
Published: (2025)

You Only Speak Once to See
by: Yang, Wenhao, et al.
Published: (2024)

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
by: Liu, Zhe, et al.
Published: (2025)

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
by: Zhao, Haoyu, et al.
Published: (2025)

Does YOLO Really Need to See Every Training Image in Every Epoch?
by: Xie, Xingxing, et al.
Published: (2026)

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
by: Li, Rong, et al.
Published: (2024)

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
by: Zhang, Jinglei, et al.
Published: (2025)

Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
by: Zhang, Baoheng, et al.
Published: (2026)

Federated Out-of-Distribution Generalization: A Causal Augmentation View
by: Zhang, Runhui, et al.
Published: (2025)

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
by: Huang, Runhui, et al.
Published: (2026)

Paintings and Drawings Aesthetics Assessment with Rich Attributes for Various Artistic Categories
by: Jin, Xin, et al.
Published: (2024)

Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis
by: Gao, Xuehao, et al.
Published: (2025)

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
by: Yuan, Qianhao, et al.
Published: (2026)

Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts
by: Zhu, Jiawen, et al.
Published: (2024)

Improving Out-of-Distribution Detection with Disentangled Foreground and Background Features
by: Ding, Choubo, et al.
Published: (2023)

Zero-Shot Out-of-Distribution Detection with Outlier Label Exposure
by: Ding, Choubo, et al.
Published: (2024)

Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
by: Liu, Xianlin, et al.
Published: (2025)

Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
by: Li, Honglin, et al.
Published: (2024)