| Main Authors: | Hisada, Shohei; Sunao, Endo; Yamato, Himi; Wakamiya, Shoko; Aramaki, Eiji |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.17444 |
Similar Items
Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction
by: Ketir, Si-Belkacem Yamine, et al.
Published: (2026)
A Case Study of HealthBench in the Japanese Medical Context
by: HISADA, Shohei
Published: (2025)
HealthBench: Evaluating Large Language Models Towards Improved Human Health
by: Arora, Rahul K., et al.
Published: (2025)
HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats
by: Hicks, Rebecca Soskin, et al.
Published: (2026)
EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels
by: She, Wan Jou, et al.
Published: (2026)
NAIST Academic Travelogue Dataset
by: Ouchi, Hiroki, et al.
Published: (2023)
Investigating Neurons and Heads in Transformer-based LLMs for Typographical Errors
by: Tsuji, Kohei, et al.
Published: (2025)
AMR-RE: Abstract Meaning Representations for Retrieval-Based In-Context Learning in Relation Extraction
by: Han, Peitao, et al.
Published: (2024)
Decomposing Physician Disagreement in HealthBench
by: Borgohain, Satya, et al.
Published: (2026)
Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark
by: Oka, Shoko
Published: (2025)
JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
by: Jiang, Junfeng, et al.
Published: (2024)
CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
by: Higashiyama, Shohei, et al.
Published: (2026)
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
by: Inoue, Yuichi, et al.
Published: (2024)
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
by: Raithel, Lisa, et al.
Published: (2024)
CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports
by: Zhang, Xiao Yu Cindy, et al.
Published: (2025)
Speakers Fill Lexical Semantic Gaps with Context
by: Pimentel, Tiago, et al.
Published: (2020)
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
by: Higashiyama, Shohei, et al.
Published: (2026)
BenchBench: Benchmarking Automated Benchmark Generation
by: Zheng, Yandan, et al.
Published: (2026)
Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
by: Mamasaidov, Mukhammadsaid, et al.
Published: (2025)
Gap-Filling Prompting Enhances Code-Assisted Mathematical Reasoning
by: Mohammadkhani, Mohammad Ghiasvand
Published: (2024)
GLEN-Bench: A Graph-Language based Benchmark for Nutritional Health
by: Huang, Jiatan, et al.
Published: (2026)
Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench
by: Mutisya, Fred, et al.
Published: (2025)
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)
Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models
by: Feng, Shangbin, et al.
Published: (2023)
Fill In The Gaps: Model Calibration and Generalization with Synthetic Data
by: Ba, Yang, et al.
Published: (2024)
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
by: Arai, Yamato, et al.
Published: (2026)
Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?
by: Jayaweera, Chathuri, et al.
Published: (2025)
JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
by: Liu, Junyu, et al.
Published: (2026)
FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
by: Xiong, Betty, et al.
Published: (2026)
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
by: Liu, Weixin, et al.
Published: (2026)
CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
by: Grundmann, Paul, et al.
Published: (2025)
70B-parameter large language models in Japanese medical question-answering
by: Sukeda, Issey, et al.
Published: (2024)
Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap
by: Yang, Chun-Hao, et al.
Published: (2025)
Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?
by: Desai, Atrey, et al.
Published: (2026)
Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions
by: Wang, Ziyue, et al.
Published: (2023)
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents
by: Li, Bobo, et al.
Published: (2025)
LongHealth: A Question Answering Benchmark with Long Clinical Documents
by: Adams, Lisa, et al.
Published: (2024)
ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
by: Li, Yuchong, et al.
Published: (2025)
LegalRikai: Open Benchmark -- Benchmark for Complex Japanese Corporate Legal Tasks
by: Fujita, Shogo, et al.
Published: (2025)
Ebisu: Benchmarking Large Language Models in Japanese Finance
by: Peng, Xueqing, et al.
Published: (2026)