| Main Authors: | Nathani, Deepak, Madaan, Lovish, Roberts, Nicholas, Bashlykov, Nikolay, Menon, Ajay, Moens, Vincent, Budhiraja, Amar, Magka, Despoina, Vorotilov, Vladislav, Chaurasia, Gaurav, Hupkes, Dieuwke, Cabral, Ricardo Silveira, Shavrina, Tatiana, Foerster, Jakob, Bachrach, Yoram, Wang, William Yang, Raileanu, Roberta |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.14499 |
Similar Items
Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
by: Maiti, Shalini, et al.
Published: (2025)
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
by: Madaan, Lovish, et al.
Published: (2024)
APRES: An Agentic Paper Revision and Evaluation System
by: Zhao, Bingchen, et al.
Published: (2026)
Quantifying Variance in Evaluation Benchmarks
by: Madaan, Lovish, et al.
Published: (2024)
MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
by: Hupkes, Dieuwke, et al.
Published: (2025)
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
by: Pepe, Alberto, et al.
Published: (2026)
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
by: Ohmer, Xenia, et al.
Published: (2024)
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
by: Audran-Reiss, Alexis, et al.
Published: (2025)
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
by: Zhao, Bingchen, et al.
Published: (2025)
Interpretability of Language Models via Task Spaces
by: Weber, Lucas, et al.
Published: (2024)
AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
by: Lupidi, Alisia, et al.
Published: (2026)
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
by: Mondorf, Philipp, et al.
Published: (2026)
AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
by: Toledo, Edan, et al.
Published: (2025)
Bootstrapping Task Spaces for Self-Improvement
by: Jiang, Minqi, et al.
Published: (2025)
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
by: Schaeffer, Rylan, et al.
Published: (2025)
Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
by: Tang, Yunhao, et al.
Published: (2025)
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
by: Roberts, Nicholas, et al.
Published: (2025)
Asking the Right Questions: Improving Reasoning with Generated Stepping Stones
by: Hu, Hengyuan, et al.
Published: (2026)
Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
by: Thöni, Anna C. M., et al.
Published: (2025)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
Scaling Small Agents Through Strategy Auctions
by: Alazraki, Lisa, et al.
Published: (2026)
AIRA_2: Overcoming Bottlenecks in AI Research Agents
by: Hambardzumyan, Karen, et al.
Published: (2026)
HARP: A challenging human-annotated math reasoning benchmark
by: Yue, Albert S., et al.
Published: (2024)
Adversarial Training for Process Reward Models
by: Juneja, Gurusha, et al.
Published: (2025)
Crowd IQ -- Aggregating Opinions to Boost Performance
by: Kosinski, Michal, et al.
Published: (2024)
Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
by: Singh, Aaditya K., et al.
Published: (2024)
A detailed study of the variations found in the chrysalises of Aglais caschmirensis Kollar, 1844 (Lepidoptera: Papilionoidea, Nymphalidae)
by: Lovish Garlani
Published: (2023)
Annotated Checklist of Rhopalocera of Himachal Pradesh, India (Insecta: Lepidoptera)
by: Lovish Garlani
Published: (2024)
First record of Celaenorrhinus ratna daphne Evans, 1949 from Himachal Pradesh and its first photographic record from the Western Himalayas (Lepidoptera: Hesperiidae, Pyrginae)
by: Lovish Garlani
Published: (2022)
Unveiling the Hidden Gem: An Observational Report, Taxonomic Insights and First Photographic Evidence of Pseudochazara baldiva Moore, 1865, from India (Lepidoptera: Nymphalidae)
by: Lovish Garlani
Published: (2024)
Epistemic Dissonance and Modal Boundaries
by: Raileanu, Dragos
Published: (2025)
From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars
by: Kornilov, Albert, et al.
Published: (2024)
A Comparative Study of Transfer Learning for Emotion Recognition using CNN and Modified VGG16 Models
by: Nathani, Samay
Published: (2024)
Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples
by: Jiralerspong, Marco, et al.
Published: (2023)
Modelling Chemical Reaction Networks using Neural Ordinary Differential Equations
by: Thöni, Anna C. M., et al.
Published: (2025)
Understanding the Effects of Domain Finetuning on LLMs
by: Tanwar, Eshaan, et al.
Published: (2025)
On Some Extensions of the Boué-Dupuis Variational Formula
by: Budhiraja, A.
Published: (2024)
Hyperagents
by: Zhang, Jenny, et al.
Published: (2026)
DUAS FACES DO PODER
by: Peter Bachrach
Published: (2011)
Rethinking Thinking Tokens: LLMs as Improvement Operators
by: Madaan, Lovish, et al.
Published: (2025)