| Main Authors: | Iwase, Naoto, Okuyama, Hiroki, Iwasawa, Junichiro |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.00421 |
Similar Items
Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization
by: Kawakami, Wataru, et al.
Published: (2025)
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
by: Abacha, Asma Ben, et al.
Published: (2024)
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
by: Minegishi, Gouki, et al.
Published: (2025)
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
by: Zuo, Yuxin, et al.
Published: (2025)
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
by: Pandit, Shrey, et al.
Published: (2025)
MedVAL: Toward Expert-Level Medical Text Validation with Language Models
by: Aali, Asad, et al.
Published: (2025)
PromptMind Team at MEDIQA-CORR 2024: Improving Clinical Text Correction with Error Categorization and LLM Ensembles
by: Gundabathula, Satya Kesav, et al.
Published: (2024)
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
by: Fujisawa, Ippei, et al.
Published: (2024)
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
by: Oh, Jungwoo, et al.
Published: (2026)
The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation
by: Niimi, Junichiro
Published: (2026)
MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
by: Wen, Shengtao, et al.
Published: (2025)
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
by: Lin, Zicheng, et al.
Published: (2024)
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction
by: Li, Xiaoyuan, et al.
Published: (2024)
IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning
by: Joshi, Abhinav, et al.
Published: (2024)
uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization
by: Nagar, Aishik, et al.
Published: (2024)
Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks
by: Gambardella, Andrew, et al.
Published: (2024)
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
by: Zhu, Yinghao, et al.
Published: (2025)
MedLM: Exploring Language Models for Medical Question Answering Systems
by: Yagnik, Niraj, et al.
Published: (2024)
Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
by: AlDahoul, Nouar, et al.
Published: (2025)
MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare
by: Nigam, Shubham Kumar, et al.
Published: (2026)
MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models
by: Kim, Junmo, et al.
Published: (2025)
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
by: Gambardella, Andrew, et al.
Published: (2025)
Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning
by: Yu, Erxin, et al.
Published: (2025)
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain
by: García-Ferrero, Iker, et al.
Published: (2024)
MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
by: Kim, Soo Yong, et al.
Published: (2025)
The CLRS-Text Algorithmic Reasoning Language Benchmark
by: Markeeva, Larisa, et al.
Published: (2024)
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
by: Yoa, Seungdong, et al.
Published: (2026)
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
by: Xu, Ran, et al.
Published: (2025)
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
by: Potamitis, Nearchos, et al.
Published: (2025)
Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction
by: Kovalchuk, Roman, et al.
Published: (2025)
Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
by: Cao, Qi, et al.
Published: (2026)
Temporal Consistency for LLM Reasoning Process Error Identification
by: Guo, Jiacheng, et al.
Published: (2025)
ProcessBench: Identifying Process Errors in Mathematical Reasoning
by: Zheng, Chujie, et al.
Published: (2024)
Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
by: Gu, Xiaojie, et al.
Published: (2026)
GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification
by: Khamis, Ahmed Khaled
Published: (2026)
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
by: Zhao, Zehua, et al.
Published: (2025)
A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration
by: Cui, Yingqian, et al.
Published: (2024)
Contextual Drag: How Errors in the Context Affect LLM Reasoning
by: Cheng, Yun, et al.
Published: (2026)
Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding
by: Xiao, Feng, et al.
Published: (2025)
Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
by: Kozlova, Anna, et al.
Published: (2026)