| Main Authors: | Chen, Yuqi; Zhang, Xiaohan; Arrabi, Ahmad; Sultani, Waqas; Chen, Chen; Wshah, Safwan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.10721 |
Similar Items
Cross-View Meets Diffusion: Aerial Image Synthesis with Geometry and Text Guidance
by: Arrabi, Ahmad, et al.
Published: (2024)
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
by: Lehyeh, Ayesh Abu, et al.
Published: (2026)
GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement
by: Zhang, Xiaohan, et al.
Published: (2023)
Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
by: Jung, Jay, et al.
Published: (2026)
Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
by: Zhang, Yancheng, et al.
Published: (2026)
Automated C-Arm Positioning via Conformal Landmark Localization
by: Arrabi, Ahmad, et al.
Published: (2025)
VICI: VLM-Instructed Cross-view Image-localisation
by: Zhang, Xiaohan, et al.
Published: (2025)
C-arm Guidance: A Self-supervised Approach To Automated Positioning During Stroke Thrombectomy
by: Arrabi, Ahmad, et al.
Published: (2025)
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
by: Zhang, Yi, et al.
Published: (2025)
Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
by: Ling, Chen, et al.
Published: (2026)
L2P: Unlocking Latent Potential for Pixel Generation
by: Chen, Zhennan, et al.
Published: (2026)
Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
by: Xi, Suyang, et al.
Published: (2025)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
by: Yin, Hao, et al.
Published: (2025)
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
by: Zhou, Jiazhou, et al.
Published: (2026)
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)
Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
by: Zuo, Rui, et al.
Published: (2025)
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
by: Liu, Shuliang, et al.
Published: (2026)
RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
by: Chang, Yue, et al.
Published: (2026)
RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations
by: He, Xingqi, et al.
Published: (2025)
Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation
by: Niu, Yakun, et al.
Published: (2024)
Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
by: Wang, Jingchao, et al.
Published: (2025)
MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
by: Gao, Yufei, et al.
Published: (2025)
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
by: Chen, Shimin, et al.
Published: (2024)
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
by: Jiang, Kai, et al.
Published: (2025)
PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval
by: Pan, Jiancheng, et al.
Published: (2024)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)
Retrieval-based Disentangled Representation Learning with Natural Language Supervision
by: Zhou, Jiawei, et al.
Published: (2022)
MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
by: Wang, Rongsheng, et al.
Published: (2025)
Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models
by: Yu, Bo, et al.
Published: (2026)
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
by: Yuan, Jiakang, et al.
Published: (2025)
Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
by: Liu, Anglin, et al.
Published: (2026)
Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning
by: Chen, Harold Haodong, et al.
Published: (2024)
Find Them All: Unveiling MLLMs for Versatile Person Re-identification
by: Li, Jinhao, et al.
Published: (2025)
Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework
by: Han, Xiao, et al.
Published: (2024)
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
by: Luo, Bingjun, et al.
Published: (2026)
From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
by: Jiao, Qirui, et al.
Published: (2024)
Dense Connector for MLLMs
by: Yao, Huanjin, et al.
Published: (2024)
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
by: Li, Kaiyuan, et al.
Published: (2025)
Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
by: Guo, Zirun, et al.
Published: (2025)
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
by: Ma, Yanbiao, et al.
Published: (2025)