:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Qihao; Mao, Chengzhi; Liu, Yaojie; Yuille, Alan; Chu, Wen-Sheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.16921

Similar Items

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
by: Wang, Xingrui, et al.
Published: (2025)

Mull-Tokens: Modality-Agnostic Latent Thinking
by: Ray, Arijit, et al.
Published: (2025)

A Bayesian Approach to OOD Robustness in Image Classification
by: Kaushik, Prakhar, et al.
Published: (2024)

Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
by: Wen, Jiawen, et al.
Published: (2026)

Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation
by: Kaushik, Prakhar, et al.
Published: (2024)

DREAM: Diffusion Rectification and Estimation-Adaptive Models
by: Zhou, Jinxin, et al.
Published: (2023)

Fake it till You Make it: Reward Modeling as Discriminative Prediction
by: Liu, Runtao, et al.
Published: (2025)

GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
by: Guan, Yaohan, et al.
Published: (2026)

Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?
by: Zhang, Tiezheng, et al.
Published: (2024)

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
by: Zhang, Zirui, et al.
Published: (2026)

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
by: Wang, Xingrui, et al.
Published: (2025)

RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification
by: Yang, Zhen, et al.
Published: (2025)

Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
by: Wang, Xingrui, et al.
Published: (2024)

Acquiring Weak Annotations for Tumor Localization in Temporal and Volumetric Data
by: Chou, Yu-Cheng, et al.
Published: (2023)

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
by: Liu, Chengzhi, et al.
Published: (2025)

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
by: Zhao, Xuanpu, et al.
Published: (2026)

Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
by: Zhang, Chenshuang, et al.
Published: (2025)

ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)

Auditing Gender Presentation Differences in Text-to-Image Models
by: Zhang, Yanzhe, et al.
Published: (2023)

Accurate and Fast Compressed Video Captioning
by: Shen, Yaojie, et al.
Published: (2023)

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
by: Sun, Han, et al.
Published: (2026)

Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
by: Liu, Qihao, et al.
Published: (2024)

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
by: Liu, Qihao, et al.
Published: (2024)

Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
by: Jiang, Yibo, et al.
Published: (2026)

The Universal Weight Subspace Hypothesis
by: Kaushik, Prakhar, et al.
Published: (2025)

Shared LoRA Subspaces for almost Strict Continual Learning
by: Kaushik, Prakhar, et al.
Published: (2026)

The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment
by: Liu, Hongyuan, et al.
Published: (2026)

Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation
by: Wang, Yanyan, et al.
Published: (2025)

Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
by: Yang, Peiyu, et al.
Published: (2026)

Bridging the Gap Between Sparsity and Redundancy: A Dual-Decoding Framework with Global Context for Map Inference
by: Shen, Yudong, et al.
Published: (2025)

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
by: Pan, Chenbin, et al.
Published: (2025)

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
by: Wen, Zichen, et al.
Published: (2026)

SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing
by: Jiang, Lifan, et al.
Published: (2026)

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
by: Zhang, Chenshuang, et al.
Published: (2024)

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
by: Han, Yuhang, et al.
Published: (2026)

DiffSeg: A Segmentation Model for Skin Lesions Based on Diffusion Difference
by: Shuai, Zhihao, et al.
Published: (2024)

Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
by: Jeon, Yerim, et al.
Published: (2025)

Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
by: Gandhi, Kahaan, et al.
Published: (2025)

Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
by: Liu, Ziyi, et al.
Published: (2025)

Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation
by: Huang, Qiming, et al.
Published: (2025)