| Main Authors: | Zhuang, Tianyi; Kuang, Chuqiao; Li, Xiaoguang; Teng, Yihua; Wu, Jihao; Wang, Yasheng; Shang, Lifeng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2502.17807 |
| _version_ | 1866915171555868672 |
|---|---|
| author | Zhuang, Tianyi; Kuang, Chuqiao; Li, Xiaoguang; Teng, Yihua; Wu, Jihao; Wang, Yasheng; Shang, Lifeng |
| author_facet | Zhuang, Tianyi; Kuang, Chuqiao; Li, Xiaoguang; Teng, Yihua; Wu, Jihao; Wang, Yasheng; Shang, Lifeng |
| contents | We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) Advanced slow-thinking reasoning models like o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models like Claude 3.5 Sonnet (57.7%); 2) Distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind the teacher model, suggesting that it is challenging to maintain the generalization of reasoning capabilities when relying solely on distillation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_17807 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2502.17807 |