| Main Authors: | Yang, Xinglong, Peng, Zhilin, Liu, Zhanzhan, Shi, Haochen, Huang, Sheng-Jun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04413 |
Similar Items
Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
by: Yang, Xinglong, et al.
Published: (2025)
GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View
by: Cheng, Fenghua, et al.
Published: (2025)
LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
by: Zhou, Qianrui, et al.
Published: (2025)
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)
Interpretable Multimodal Misinformation Detection with Logic Reasoning
by: Liu, Hui, et al.
Published: (2023)
OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
by: Chen, Lichang, et al.
Published: (2024)
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
by: Lu, Jinghui, et al.
Published: (2025)
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)
Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion
by: Lv, Zheqi, et al.
Published: (2025)
Knowledge-Guided Dynamic Modality Attention Fusion Framework for Multimodal Sentiment Analysis
by: Feng, Xinyu, et al.
Published: (2024)
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
by: Lu, Jinghui, et al.
Published: (2024)
SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
by: Ren, Yiming, et al.
Published: (2026)
MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
by: Wen, Shengtao, et al.
Published: (2025)
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
by: Compagnoni, Alberto, et al.
Published: (2025)
Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction
by: Li, Yuanchao, et al.
Published: (2024)
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
by: Duan, Chengqi, et al.
Published: (2025)
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
by: Zhong, Yiwu, et al.
Published: (2024)
Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing
by: Wu, Zichen, et al.
Published: (2025)
EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge
by: Hu, Congcong, et al.
Published: (2026)
Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
by: Caffagni, Davide, et al.
Published: (2025)
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
by: Cocchi, Federico, et al.
Published: (2024)
AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives
by: Chen, Yanxi, et al.
Published: (2025)
PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment
by: Song, Shezheng, et al.
Published: (2024)
Towards Better Text-to-Image Generation Alignment via Attention Modulation
by: Wu, Yihang, et al.
Published: (2024)
Reasoning LLMs are Wandering Solution Explorers
by: Lu, Jiahao, et al.
Published: (2025)
FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection
by: Zhou, Ziyi, et al.
Published: (2024)
Traj-MLLM: Can Multimodal Large Language Models Reform Trajectory Data Mining?
by: Liu, Shuo, et al.
Published: (2025)
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
by: Lu, Xingyu, et al.
Published: (2026)
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
by: Yan, Qianqi, et al.
Published: (2026)
Retrieval-Augmented Generation for Electrocardiogram-Language Models
by: Song, Xiaoyu, et al.
Published: (2025)
A Survey on Image-text Multimodal Models
by: Guo, Ruifeng, et al.
Published: (2023)
HeGTa: Leveraging Heterogeneous Graph-enhanced Large Language Models for Few-shot Complex Table Understanding
by: Jin, Rihui, et al.
Published: (2024)
Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning
by: Liang, Dayong, et al.
Published: (2025)
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
by: Zheng, Changmeng, et al.
Published: (2024)
Contrastive Visual Data Augmentation
by: Zhou, Yu, et al.
Published: (2025)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
by: Han, Jiaming, et al.
Published: (2025)
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
by: Saha, Anisha, et al.
Published: (2026)
Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion Outcome
by: Zeng, Donghuo, et al.
Published: (2024)
Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis
by: Meng, Chunlei, et al.
Published: (2026)
SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
by: Zeng, Wenzheng, et al.
Published: (2025)