| Main Authors: | Feng, Kaiyue, Zhao, Yilun, Liu, Yixin, Yang, Tianyu, Zhao, Chen, Sous, John, Cohan, Arman |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.21821 |
Similar Items
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
by: Li, Chuhan, et al.
Published: (2024)
SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing
by: Liu, Hongjun, et al.
Published: (2025)
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
by: Long, Yitao, et al.
Published: (2025)
Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)
ANCHOR: Branch-Point Data Generation for GUI Agents
by: Wei, Jinbiao, et al.
Published: (2026)
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
by: Zhao, Yilun, et al.
Published: (2025)
On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)
A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning
by: Yang, Tianyu, et al.
Published: (2026)
Step-level Optimization for Efficient Computer-use Agents
by: Wei, Jinbiao, et al.
Published: (2026)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)
Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
by: Zheng, Shunfeng, et al.
Published: (2025)
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)
SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)
ReIFE: Re-evaluating Instruction-Following Evaluation
by: Liu, Yixin, et al.
Published: (2024)
Calibrating Long-form Generations from Large Language Models
by: Huang, Yukun, et al.
Published: (2024)
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
by: Wei, Jinbiao, et al.
Published: (2026)
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
by: Wu, Sihong, et al.
Published: (2026)
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)
COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
by: Liu, Yixin, et al.
Published: (2024)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
by: Garikaparthi, Aniketh, et al.
Published: (2026)
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
by: Dunefsky, Jacob, et al.
Published: (2025)
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
by: Wu, Sihong, et al.
Published: (2026)
MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise
by: Deng, Chunyuan, et al.
Published: (2024)
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
by: Gao, Mingqi, et al.
Published: (2024)
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
by: Ma, Chiyu, et al.
Published: (2025)
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
by: Zhou, Xiyuan, et al.
Published: (2025)
AlphaResearch: Accelerating New Algorithm Discovery with Language Models
by: Yu, Zhaojian, et al.
Published: (2025)
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
by: Zhao, Yilun, et al.
Published: (2025)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
by: Tang, Xiangru, et al.
Published: (2023)
On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2024)
Step-Back Profiling: Distilling User History for Personalized Scientific Writing
by: Tang, Xiangru, et al.
Published: (2024)
References Improve LLM Alignment in Non-Verifiable Domains
by: Shi, Kejian, et al.
Published: (2026)
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
by: Zhao, Yilun, et al.
Published: (2025)
Survey on Evaluation of LLM-based Agents
by: Yehudai, Asaf, et al.
Published: (2025)
MIR: Methodology Inspiration Retrieval for Scientific Research Problems
by: Garikaparthi, Aniketh, et al.
Published: (2025)
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
by: Hu, Tiansheng, et al.
Published: (2026)
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
by: Li, Alan, et al.
Published: (2025)
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
by: Gao, Yicheng, et al.
Published: (2024)