| Main Authors: | Zhang, Jie, Petrui, Cezara, Nikolić, Kristina, Tramèr, Florian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.12575 |
Similar Items
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
by: Nikolić, Kristina, et al.
Published: (2025)
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
by: Lyu, Zicheng, et al.
Published: (2026)
DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models
by: Chen, Xiaoyang, et al.
Published: (2025)
Pitfalls in Evaluating Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2025)
Learning to Inject: Automated Prompt Injection via Reinforcement Learning
by: Chen, Xin, et al.
Published: (2026)
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
by: Foerster, Hanna, et al.
Published: (2026)
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models
by: Liu, Yan, et al.
Published: (2024)
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
by: Wu, Yanan, et al.
Published: (2024)
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
by: Zou, Chengke, et al.
Published: (2024)
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
by: Fang, Meng, et al.
Published: (2024)
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving
by: Colle, Vincenzo, et al.
Published: (2025)
RoMath: A Mathematical Reasoning Benchmark in Romanian
by: Cosma, Adrian, et al.
Published: (2024)
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions
by: Liang, Zhenwen, et al.
Published: (2024)
Universal Jailbreak Backdoors from Poisoned Human Feedback
by: Rando, Javier, et al.
Published: (2023)
VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
by: Li, Can, et al.
Published: (2025)
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
by: Yang, Zhen, et al.
Published: (2024)
MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
by: O'Brien, Dayyán, et al.
Published: (2025)
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
by: Yang, Shu-Xun, et al.
Published: (2025)
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
by: Yu, Longhui, et al.
Published: (2023)
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
by: Peng, Shuai, et al.
Published: (2024)
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
by: Carlini, Nicholas, et al.
Published: (2025)
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
by: Lu, Zimu, et al.
Published: (2024)
Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
by: Aerni, Michael, et al.
Published: (2025)
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by: Shao, Zhihong, et al.
Published: (2024)
EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
by: Wang, Shengbo, et al.
Published: (2025)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
Gradient-based Jailbreak Images for Multimodal Fusion Models
by: Rando, Javier, et al.
Published: (2024)
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
by: Chernyshev, Konstantin, et al.
Published: (2024)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
by: Zolkowski, Artur, et al.
Published: (2025)
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
by: Li, Zhong-Zhi, et al.
Published: (2024)
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning
by: Ma, Jingkun, et al.
Published: (2024)
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
by: Alshammari, Shaden, et al.
Published: (2026)
Consistency Checks for Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2024)
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
by: Qiao, Runqi, et al.
Published: (2025)
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
by: Lu, Pan, et al.
Published: (2023)
PARAMANU-GANITA: Can Small Math Language Models Rival with Large Language Models on Mathematical Reasoning?
by: Niyogi, Mitodru, et al.
Published: (2024)
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
by: Luo, Haipeng, et al.
Published: (2023)
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
by: Amjad, Husnain, et al.
Published: (2026)
MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers
by: Zhou, Chenyue, et al.
Published: (2026)