Library Catalog

Bibliographic Details
Main Authors:	Yousefzadeh, Roozbeh; Cao, Xuenan; Ospanov, Azim
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2411.18872

Similar Items

Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs
by: Yousefzadeh, Roozbeh, et al.
Published: (2025)

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
by: Ospanov, Azim, et al.
Published: (2025)

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward
by: Ospanov, Azim, et al.
Published: (2025)

Towards a Scalable Reference-Free Evaluation of Generative Models
by: Ospanov, Azim, et al.
Published: (2024)

Do Vendi Scores Converge with Finite Samples? Truncated Vendi Score for Finite-Sample Convergence Guarantees
by: Ospanov, Azim, et al.
Published: (2024)

Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
by: Mahdavi, Sadegh, et al.
Published: (2025)

Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo
by: Feng, Shengyu, et al.
Published: (2024)

MathWriting: A Dataset For Handwritten Mathematical Expression Recognition
by: Gervais, Philippe, et al.
Published: (2024)

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
by: Ravi, Nikil, et al.
Published: (2026)

MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
by: Opedal, Andreas, et al.
Published: (2024)

MathChat: Converse to Tackle Challenging Math Problems with LLM Agents
by: Wu, Yiran, et al.
Published: (2023)

Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error
by: Farnia, Farzan, et al.
Published: (2026)

Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
by: Pandit, Shrey, et al.
Published: (2025)

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
by: Toshniwal, Shubham, et al.
Published: (2024)

Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
by: Mahabadi, Rabeeh Karimi, et al.
Published: (2025)

Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
by: Albalak, Alon, et al.
Published: (2025)

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
by: Petrov, Ivo, et al.
Published: (2025)

Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models
by: Jalali, Mohammad, et al.
Published: (2024)

MegaMath: Pushing the Limits of Open Math Corpora
by: Zhou, Fan, et al.
Published: (2025)

MathPile: A Billion-Token-Scale Pretraining Corpus for Math
by: Wang, Zengzhi, et al.
Published: (2023)

TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
by: Ouyang, Jialin
Published: (2025)

ControlMath: Controllable Data Generation Promotes Math Generalist Models
by: Chen, Nuo, et al.
Published: (2024)

MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning
by: Li, Chengpeng, et al.
Published: (2023)

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
by: Zhang, Renrui, et al.
Published: (2024)

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
by: Huang, Kaixuan, et al.
Published: (2025)

Guiding Through Complexity: What Makes Good Supervision for Hard Math Reasoning Tasks?
by: He, Xuan, et al.
Published: (2024)

Solving Formal Math Problems by Decomposition and Iterative Reflection
by: Zhou, Yichi, et al.
Published: (2025)

EasyMath: A 0-shot Math Benchmark for SLMs
by: Karki, Drishya, et al.
Published: (2025)

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
by: Liu, Zihan, et al.
Published: (2024)

ProofSketcher: Hybrid LLM + Lightweight Proof Checker for Reliable Math/Logic Reasoning
by: Kommuru, Kranthi, et al.
Published: (2026)

DOoM: Difficult Olympiads of Math
by: Kuleshov, Ilya, et al.
Published: (2025)

Augmenting Math Word Problems via Iterative Question Composing
by: Liu, Haoxiong, et al.
Published: (2024)

†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
by: Nazi, Zabir Al, et al.
Published: (2026)

Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
by: Li, Junsong, et al.
Published: (2025)

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
by: Wang, Peiyi, et al.
Published: (2023)

Enhancing Math Reasoning in Small-sized LLMs via Preview Difficulty-Aware Intervention
by: Di, Xinhan, et al.
Published: (2025)

SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance
by: Singh, Kunal, et al.
Published: (2025)

A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture
by: Rahman, Roussel, et al.
Published: (2025)

Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
by: Qin, Tian, et al.
Published: (2025)

Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems
by: Khan, Zaid, et al.
Published: (2025)