Library Catalog

Bibliographic Details
Main Authors:	Zhang, Shimin, Chen, Xianwei, Shen, Yufan, Ye, Ziyuan, Wu, Jibin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.07558

Similar Items

Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
by: Yang, Qi, et al.
Published: (2025)

Look-Back: Implicit Visual Re-focusing in MLLM Reasoning
by: Yang, Shuo, et al.
Published: (2025)

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
by: Li, Ling, et al.
Published: (2024)

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
by: Jin, Jiachun, et al.
Published: (2026)

DeepLatent: Think with Images via Parallel Latent Visual Reasoning
by: Lu, Dongchen, et al.
Published: (2026)

LanteRn: Latent Visual Structured Reasoning
by: Viveiros, André G., et al.
Published: (2026)

FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models
by: Huang, Chenyu, et al.
Published: (2026)

ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models
by: Tian, Jiaxu, et al.
Published: (2025)

Text-to-Scene with Large Reasoning Models
by: Berdoz, Frédéric, et al.
Published: (2025)

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models
by: Sun, Mengyu, et al.
Published: (2026)

Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information
by: Chu, Xu, et al.
Published: (2025)

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval
by: Alavi, Ali
Published: (2026)

VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
by: Li, Lingxiao, et al.
Published: (2025)

LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
by: Xie, Mingyang, et al.
Published: (2026)

LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models
by: Zhu, Mengdan, et al.
Published: (2024)

CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence
by: Yu, Tianjiao, et al.
Published: (2025)

Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model
by: Chen, Yufan, et al.
Published: (2025)

Towards Sparse Video Understanding and Reasoning
by: Xu, Chenwei, et al.
Published: (2026)

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
by: Zheng, Xiaoji, et al.
Published: (2025)

ShaLa: Multimodal Shared Latent Space Modelling
by: Cui, Jiali, et al.
Published: (2025)

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
by: Wang, Junchi, et al.
Published: (2024)

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
by: Min, Juhong, et al.
Published: (2024)

DragGANSpace: Latent Space Exploration and Control for GANs
by: Odendaal, Kirsten, et al.
Published: (2025)

Spatial Reasoning with Denoising Models
by: Wewer, Christopher, et al.
Published: (2025)

Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
by: You, Weiqiu, et al.
Published: (2026)

What's Holding Back Latent Visual Reasoning?
by: Viveiros, André G., et al.
Published: (2026)

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery
by: Gao, Ziyuan
Published: (2026)

ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment
by: Wang, Xinyi, et al.
Published: (2024)

Gradient-Guided Exploration of Generative Model's Latent Space for Controlled Iris Image Augmentations
by: Mitcheff, Mahsa, et al.
Published: (2025)

LaRe: Latent Refocusing for Multimodal Reasoning
by: Ma, Jizheng, et al.
Published: (2025)

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
by: Chen, Ziyuan, et al.
Published: (2026)

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
by: Huang, Chi-Pin, et al.
Published: (2025)

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
by: Shen, Junhao, et al.
Published: (2025)

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
by: Su, Hung-Ting, et al.
Published: (2024)

Scaling Supervised Local Learning with Augmented Auxiliary Networks
by: Ma, Chenxiang, et al.
Published: (2024)

DC-Merge: Improving Model Merging with Directional Consistency
by: Zhang, Han-Chen, et al.
Published: (2026)

Interleaving Reasoning for Better Text-to-Image Generation
by: Huang, Wenxuan, et al.
Published: (2025)

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
by: Murphy, Alexander, et al.
Published: (2025)

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
by: Bigverdi, Mahtab, et al.
Published: (2024)

InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem
by: Hong, Yeobin, et al.
Published: (2025)