| Main Authors: | Wu, Lingyan, Zheng, Xiang, Zhai, Weiqi, Wang, Wei, Ren, Xuan, Zhang, Zifan, Wei, Hu, Zhao, Bing |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.17282 |
Similar Items
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
by: Song, Mingyang, et al.
Published: (2025)
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
by: Li, Xiang, et al.
Published: (2025)
Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
by: Zheng, Xiang, et al.
Published: (2026)
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards
by: Yun, Jaehoon, et al.
Published: (2025)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)
MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain
by: Jiang, Chao, et al.
Published: (2024)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
by: Sun, Zhouhao, et al.
Published: (2026)
ClinConsensus: A Physician-Calibrated Benchmark for Evaluating Clinical Rubric Coverage in Chinese Medical LLMs
by: Zheng, Xiang, et al.
Published: (2026)
SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
by: Wei, Hu, et al.
Published: (2025)
Training Language Models to Generate Text with Citations via Fine-grained Rewards
by: Huang, Chengyu, et al.
Published: (2024)
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
by: Hwang, Hyeonbin, et al.
Published: (2024)
From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models
by: Liu, Jinyi, et al.
Published: (2025)
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis
by: Qi, Hongzhi, et al.
Published: (2024)
MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs
by: Wei, Jianhui, et al.
Published: (2025)
Reward Reasoning Model
by: Guo, Jiaxin, et al.
Published: (2025)
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
by: Wu, Juncheng, et al.
Published: (2025)
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
by: Zhang, Xuan, et al.
Published: (2024)
ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
by: Li, Dawei, et al.
Published: (2026)
From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks
by: Ren, Qingyu, et al.
Published: (2026)
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
by: Wang, Weiqi, et al.
Published: (2024)
HiMed: Incentivizing Hindi Reasoning in Medical LLMs
by: Jiang, Dingfeng, et al.
Published: (2026)
MedConsultBench: A Full-Cycle, Fine-Grained, Process-Aware Benchmark for Medical Consultation Agents
by: Qiao, Chuhan, et al.
Published: (2026)
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning
by: Jin, Congyun, et al.
Published: (2024)
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
by: Bao, Zhijie, et al.
Published: (2026)
AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation
by: Wang, Chuyue, et al.
Published: (2025)
CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
by: Xu, Xinzhe, et al.
Published: (2025)
MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
by: Han, Taolin, et al.
Published: (2026)
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
by: Yu, Suhao, et al.
Published: (2025)
ControlMed: Adding Reasoning Control to Medical Language Model
by: Lee, Sung-Min, et al.
Published: (2025)
The Lessons of Developing Process Reward Models in Mathematical Reasoning
by: Zhang, Zhenru, et al.
Published: (2025)
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
by: Ding, Jinru, et al.
Published: (2025)
GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
by: Zhang, Jianghangfan, et al.
Published: (2025)
MedExQA: Medical Question Answering Benchmark with Multiple Explanations
by: Kim, Yunsoo, et al.
Published: (2024)
MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
by: Iwase, Naoto, et al.
Published: (2025)
ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models
by: Chen, Bin, et al.
Published: (2025)
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
by: Zuo, Yuxin, et al.
Published: (2025)
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)
SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
by: Chen, Sirry, et al.
Published: (2026)
The Bidirectional Process Reward Model
by: Zhang, Lingyin, et al.
Published: (2025)