| Main Authors: | Yang, Jiacheng; Chen, Anqi; Dang, Yunkai; Fan, Qi; Wang, Cong; Li, Wenbin; Miao, Feng; Gao, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.23615 |
Similar Items
A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
by: Dang, Yunkai, et al.
Published: (2025)
Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models
by: Dang, Yunkai, et al.
Published: (2026)
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
by: Dang, Yunkai, et al.
Published: (2026)
CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models
by: Dang, Yunkai, et al.
Published: (2026)
FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
by: Dang, Yunkai, et al.
Published: (2025)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
by: Huang, Xinyu, et al.
Published: (2025)
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
by: Jiang, Zheng, et al.
Published: (2026)
Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
by: Hua, Jiacheng, et al.
Published: (2026)
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
by: Wang, Wenbin, et al.
Published: (2024)
Prompt-Free Universal Region Proposal Network
by: Tang, Qihong, et al.
Published: (2026)
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
by: Zhan, Yufei, et al.
Published: (2025)
Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
by: Luo, Jingnan, et al.
Published: (2026)
EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models
by: Zhang, Jiacheng, et al.
Published: (2024)
VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models
by: Sima, Bingrui, et al.
Published: (2025)
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
by: Zhang, Yue, et al.
Published: (2024)
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models
by: Gao, Kuofeng, et al.
Published: (2024)
Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning
by: Hao, Pengfei, et al.
Published: (2025)
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
by: Zhang, Renshan, et al.
Published: (2025)
Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
by: Ma, Junyuan, et al.
Published: (2026)
Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation
by: Zhang, Anqi, et al.
Published: (2024)
Towards Arbitrary-Scale Spacecraft Image Super-Resolution via Salient Region-Guidance
by: Yang, Jingfan, et al.
Published: (2025)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models
by: Meng, Chutian, et al.
Published: (2024)
From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
by: Sharif, Omar, et al.
Published: (2026)
AviationLMM: A Large Multimodal Foundation Model for Civil Aviation
by: Li, Wenbin, et al.
Published: (2026)
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
by: Fang, Yiyang, et al.
Published: (2026)
Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model
by: Wang, Hanqing, et al.
Published: (2025)
QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning
by: Tong, Xinyang, et al.
Published: (2024)
Learning Accurate and Enriched Features for Stereo Image Super-Resolution
by: Gao, Hu, et al.
Published: (2024)
CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
by: Xie, Yuxin, et al.
Published: (2026)
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
by: Ni, Minheng, et al.
Published: (2025)
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
by: Miao, Yanting, et al.
Published: (2026)
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
by: Zhan, Yufei, et al.
Published: (2024)
MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions
by: Zhang, Haoyu, et al.
Published: (2026)
DM$^3$Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
by: Guan, Cong, et al.
Published: (2025)
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)
Annotation-Efficient Polyp Segmentation via Active Learning
by: Huang, Duojun, et al.
Published: (2024)
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
by: Sun, Yiming, et al.
Published: (2024)
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
by: Dong, Yuhao, et al.
Published: (2024)
VRSO: Visual-Centric Reconstruction for Static Object Annotation
by: Yu, Chenyao, et al.
Published: (2024)