| Main Authors: | Yang, Yi; He, Xiaoxuan; Pan, Hongkun; Jiang, Xiyan; Deng, Yan; Yang, Xingtao; Lu, Haoyu; Yin, Dacheng; Rao, Fengyun; Zhu, Minfeng; Zhang, Bo; Chen, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.10615 |
Similar Items
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
by: He, Xiaoxuan, et al.
Published: (2025)
SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
by: He, Xiaoxuan, et al.
Published: (2026)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
by: Yang, Jian, et al.
Published: (2024)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
by: Yang, Jie, et al.
Published: (2025)
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
by: Wang, Han, et al.
Published: (2026)
MMhops-R1: Multimodal Multi-hop Reasoning
by: Zhang, Tao, et al.
Published: (2025)
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
by: Suo, Yucheng, et al.
Published: (2025)
REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
by: Li, Yong, et al.
Published: (2026)
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
by: Wang, Zitian, et al.
Published: (2025)
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
by: Pan, Hongkun, et al.
Published: (2026)
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
by: Zhong, Qihuang, et al.
Published: (2026)
Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions
by: Wei, Jingxuan, et al.
Published: (2025)
CC-Time: Cross-Model and Cross-Modality Time Series Forecasting
by: Chen, Peng, et al.
Published: (2025)
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
by: Yan, Shannan, et al.
Published: (2026)
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
by: Zhang, Yunzhu, et al.
Published: (2025)
Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
by: He, Haibin, et al.
Published: (2025)
Spatial-Semantic Collaborative Cropping for User Generated Content
by: Su, Yukun, et al.
Published: (2024)
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
by: Chen, Cong, et al.
Published: (2025)
Robust 3D Object Detection from LiDAR-Radar Point Clouds via Cross-Modal Feature Augmentation
by: Deng, Jianning, et al.
Published: (2023)
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
by: Yao, Huanjin, et al.
Published: (2025)
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
by: Liu, Haoyu, et al.
Published: (2026)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
by: Zhu, Xingtao Ling Yingying
Published: (2025)
The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning
by: Fu, Xiyan, et al.
Published: (2024)
ObjEmbed: Towards Universal Multimodal Object Embeddings
by: Fu, Shenghao, et al.
Published: (2026)
Mars-PO: Multi-Agent Reasoning System Preference Optimization
by: Lou, Xiaoxuan, et al.
Published: (2024)
CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
by: Du, Yexing, et al.
Published: (2025)
Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
by: He, Xiang, et al.
Published: (2025)
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
by: Yang, Cheng, et al.
Published: (2024)
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
by: Yue, Xinli, et al.
Published: (2025)
InterDeepResearch: Enabling Human-Agent Collaborative Information Seeking through Interactive Deep Research
by: Pan, Bo, et al.
Published: (2026)
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
by: Sun, Jiayin, et al.
Published: (2026)
Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection
by: Ning, Kanglin, et al.
Published: (2026)
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
by: Wu, Peixi, et al.
Published: (2026)
Cross-Phase Mutual Learning Framework for Pulmonary Embolism Identification on Non-Contrast CT Scans
by: Bai, Bizhe, et al.
Published: (2024)
Safety Reasoning with Guidelines
by: Wang, Haoyu, et al.
Published: (2025)
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
by: Zhang, Zirui, et al.
Published: (2026)
Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association
by: Ling, Xingtao, et al.
Published: (2025)
Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
by: Zhang, Yi, et al.
Published: (2025)