:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Yuqi; Zhang, Xiaohan; Arrabi, Ahmad; Sultani, Waqas; Chen, Chen; Wshah, Safwan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.10721

Similar Items

Cross-View Meets Diffusion: Aerial Image Synthesis with Geometry and Text Guidance
by: Arrabi, Ahmad, et al.
Published: (2024)

GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
by: Lehyeh, Ayesh Abu, et al.
Published: (2026)

GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement
by: Zhang, Xiaohan, et al.
Published: (2023)

Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
by: Jung, Jay, et al.
Published: (2026)

Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
by: Zhang, Yancheng, et al.
Published: (2026)

Automated C-Arm Positioning via Conformal Landmark Localization
by: Arrabi, Ahmad, et al.
Published: (2025)

VICI: VLM-Instructed Cross-view Image-localisation
by: Zhang, Xiaohan, et al.
Published: (2025)

C-arm Guidance: A Self-supervised Approach To Automated Positioning During Stroke Thrombectomy
by: Arrabi, Ahmad, et al.
Published: (2025)

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
by: Zhang, Yi, et al.
Published: (2025)

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
by: Ling, Chen, et al.
Published: (2026)

L2P: Unlocking Latent Potential for Pixel Generation
by: Chen, Zhennan, et al.
Published: (2026)

Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
by: Xi, Suyang, et al.
Published: (2025)

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
by: Yin, Hao, et al.
Published: (2025)

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
by: Zhou, Jiazhou, et al.
Published: (2026)

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)

Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
by: Zuo, Rui, et al.
Published: (2025)

Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
by: Liu, Shuliang, et al.
Published: (2026)

RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
by: Chang, Yue, et al.
Published: (2026)

RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations
by: He, Xingqi, et al.
Published: (2025)

Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation
by: Niu, Yakun, et al.
Published: (2024)

Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
by: Wang, Jingchao, et al.
Published: (2025)

MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
by: Gao, Yufei, et al.
Published: (2025)

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
by: Chen, Shimin, et al.
Published: (2024)

Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
by: Jiang, Kai, et al.
Published: (2025)

PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval
by: Pan, Jiancheng, et al.
Published: (2024)

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)

Retrieval-based Disentangled Representation Learning with Natural Language Supervision
by: Zhou, Jiawei, et al.
Published: (2022)

MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
by: Wang, Rongsheng, et al.
Published: (2025)

Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models
by: Yu, Bo, et al.
Published: (2026)

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
by: Yuan, Jiakang, et al.
Published: (2025)

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
by: Liu, Anglin, et al.
Published: (2026)

Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning
by: Chen, Harold Haodong, et al.
Published: (2024)

Find Them All: Unveiling MLLMs for Versatile Person Re-identification
by: Li, Jinhao, et al.
Published: (2025)

Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework
by: Han, Xiao, et al.
Published: (2024)

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
by: Luo, Bingjun, et al.
Published: (2026)

From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
by: Jiao, Qirui, et al.
Published: (2024)

Dense Connector for MLLMs
by: Yao, Huanjin, et al.
Published: (2024)

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
by: Li, Kaiyuan, et al.
Published: (2025)

Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
by: Guo, Zirun, et al.
Published: (2025)

Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
by: Ma, Yanbiao, et al.
Published: (2025)