| Main Authors: | Yang, Zheyuan; Chen, Lyuhao; Cohan, Arman; Zhao, Yilun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.23621 |
Similar Items
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
by: Zhao, Yilun, et al.
Published: (2023)
SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing
by: Liu, Hongjun, et al.
Published: (2025)
LimRank: Less is More for Reasoning-Intensive Information Reranking
by: Song, Tingyu, et al.
Published: (2025)
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
by: Hu, Yunhai, et al.
Published: (2025)
Z1: Efficient Test-time Scaling with Code
by: Yu, Zhaojian, et al.
Published: (2025)
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
by: Shen, Yifei, et al.
Published: (2026)
FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains
by: Zhao, Yilun, et al.
Published: (2023)
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
by: Zhao, Yilun, et al.
Published: (2026)
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
by: Long, Yitao, et al.
Published: (2025)
SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
by: Zhao, Yilun, et al.
Published: (2025)
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
by: Hu, Tiansheng, et al.
Published: (2026)
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
by: Yu, Zhaojian, et al.
Published: (2024)
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
by: Xu, Zhijian, et al.
Published: (2025)
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
by: Wang, Chengye, et al.
Published: (2025)
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
by: Yang, Zheyuan, et al.
Published: (2026)
SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
by: Ding, Hang, et al.
Published: (2025)
Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
by: Hu, Tiansheng, et al.
Published: (2025)
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
by: Zhang, Siyue, et al.
Published: (2025)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
by: Long, Yitao, et al.
Published: (2025)
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)
AlphaResearch: Accelerating New Algorithm Discovery with Language Models
by: Yu, Zhaojian, et al.
Published: (2025)
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
by: Li, Chuhan, et al.
Published: (2024)
Observable Propagation: Uncovering Feature Vectors in Transformers
by: Dunefsky, Jacob, et al.
Published: (2023)
TESS 2: A Large-Scale Generalist Diffusion Language Model
by: Tae, Jaesung, et al.
Published: (2025)
MSRS: Evaluating Multi-Source Retrieval-Augmented Generation
by: Phanse, Rohan, et al.
Published: (2025)
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
by: Deng, Chunyuan, et al.
Published: (2024)
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
by: Tang, Xiangru, et al.
Published: (2023)
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
by: Wu, Sihong, et al.
Published: (2026)
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
by: Li, Alan, et al.
Published: (2025)
HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
by: Han, Simeng, et al.
Published: (2024)
YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization
by: Jang, Dongsuk, et al.
Published: (2025)
SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers
by: Singh, Shruti, et al.
Published: (2024)
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
by: Zhao, Yilun, et al.
Published: (2025)
Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
by: Yang, Zheyuan, et al.
Published: (2025)
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
by: Lee, Jinu, et al.
Published: (2025)
Understanding Reference Policies in Direct Preference Optimization
by: Liu, Yixin, et al.
Published: (2024)
On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2024)