Bibliographic Details
Main Authors: Zhou, Junting; Miao, Tingjia; Liao, Yiyan; Wang, Qichao; Wen, Zhoufutu; Wang, Yanqin; Huang, Yunjie; Yan, Ge; Wang, Leqi; Xia, Yucheng; Gao, Hongwan; Zeng, Yuansong; Zheng, Renjie; Dun, Chen; Liang, Yitao; Yang, Tong; Huang, Wenhao; Zhang, Ge
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2506.12909
author Zhou, Junting
Miao, Tingjia
Liao, Yiyan
Wang, Qichao
Wen, Zhoufutu
Wang, Yanqin
Huang, Yunjie
Yan, Ge
Wang, Leqi
Xia, Yucheng
Gao, Hongwan
Zeng, Yuansong
Zheng, Renjie
Dun, Chen
Liang, Yitao
Yang, Tong
Huang, Wenhao
Zhang, Ge
contents Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing benchmarks either risk data contamination or cover too few disciplines. Specifically, because the data sources used for LLM training overlap with those of static benchmarks, the answer keys or numerical patterns of the answers are inadvertently memorized (i.e., data contamination), which leads to systematic overestimation of the models' reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympiad-level numerical computation problems and allows randomized numerical initializations for each inference round, so that models cannot rely on fixed numerical patterns. We conduct a series of experiments with top-performing closed-source and open-source LLMs and observe that their performance drops significantly under random numerical initialization. SciDA thus provides truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at https://huggingface.co/datasets/m-a-p/SciDA
format Preprint
id arxiv_https___arxiv_org_abs_2506_12909
institution arXiv
publishDate 2025
record_format arxiv
title SciDA: Scientific Dynamic Assessor of LLMs
topic Computation and Language
url https://arxiv.org/abs/2506.12909
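
The randomized numerical initialization described in the abstract can be illustrated with a minimal sketch. This is not the SciDA implementation: the `ParamProblem` class, the `is_correct` helper, the 1% relative tolerance, and the free-fall problem are all hypothetical and chosen only for illustration. The idea shown is that each inference round draws fresh values for the variables in a problem and recomputes the reference answer, so a memorized answer to a fixed instance no longer matches.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ParamProblem:
    """Hypothetical parameterized problem: text template, value ranges, reference solver."""
    template: str
    ranges: Dict[str, Tuple[float, float]]
    solver: Callable[..., float]

    def instantiate(self, rng: random.Random) -> Tuple[str, float]:
        # Draw fresh values for every variable, then recompute the reference answer.
        values = {
            name: round(rng.uniform(lo, hi), 2)
            for name, (lo, hi) in self.ranges.items()
        }
        return self.template.format(**values), self.solver(**values)


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    # Compare numerical answers within a relative tolerance (assumed value).
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# Illustrative problem (not taken from the dataset): free fall from rest.
free_fall = ParamProblem(
    template=(
        "An object is dropped from rest at a height of {h} m. "
        "Taking g = {g} m/s^2, how many seconds does it take to reach the ground?"
    ),
    ranges={"h": (1.0, 100.0), "g": (9.7, 9.9)},
    solver=lambda h, g: math.sqrt(2 * h / g),
)

# A new seed per inference round yields a new instance of the same problem.
question, reference = free_fall.instantiate(random.Random(0))
```

Seeding the generator per round keeps each instance reproducible for later scoring while still varying the numbers between rounds.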