| Main Authors: | Ghoshal, Subha, Al-Bustami, Ali |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.02663 |
Similar Items
Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
by: Ding, Bowen, et al.
Published: (2025)
Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow
by: Du, Y., et al.
Published: (2025)
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
by: Khan, Sheraz, et al.
Published: (2025)
When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
by: Zheng, Mingqian, et al.
Published: (2023)
The Cost of Thinking: Increased Jailbreak Risk in Large Language Models
by: Yang, Fan
Published: (2025)
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
by: Zhang, Hanlei, et al.
Published: (2025)
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
by: Zhang, Yuxiang, et al.
Published: (2024)
Learning When to Think While Listening in Large Audio-Language Models
by: Song, Zhiyuan, et al.
Published: (2026)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models
by: Ning, Kangyun, et al.
Published: (2024)
When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
by: Zheng, Yingming, et al.
Published: (2025)
Modeling Hierarchical Thinking in Large Reasoning Models
by: Shahariar, G M, et al.
Published: (2025)
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
by: Shi, Zhengliang, et al.
Published: (2025)
BALSAM: A Platform for Benchmarking Arabic Large Language Models
by: Al-Matham, Rawan, et al.
Published: (2025)
Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models
by: Chen, Qiguang, et al.
Published: (2025)
Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?
by: Zhong, Chengzhi, et al.
Published: (2024)
Exploring and Benchmarking the Planning Capabilities of Large Language Models
by: Bohnet, Bernd, et al.
Published: (2024)
Chain of Thought Still Thinks Fast: APriCoT Helps with Thinking Slow
by: Moore, Kyle, et al.
Published: (2024)
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models
by: Hijazi, Faris, et al.
Published: (2024)
Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning
by: Xu, Ningning, et al.
Published: (2025)
HonestLLM: Toward an Honest and Helpful Large Language Model
by: Gao, Chujie, et al.
Published: (2024)
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by: Alzahrani, Norah, et al.
Published: (2024)
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
by: Ye, Junjie, et al.
Published: (2024)
Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
by: Zheng, Tianpeng, et al.
Published: (2026)
EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models
by: Lee, Donggyu, et al.
Published: (2025)
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)
Do Retrieval Augmented Language Models Know When They Don't Know?
by: Zhou, Youchao, et al.
Published: (2025)
Do Language Models Know When They're Hallucinating References?
by: Agrawal, Ayush, et al.
Published: (2023)
CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models
by: Sun, Guangzhi, et al.
Published: (2025)
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time
by: Herel, David, et al.
Published: (2024)
TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
by: Li, Xiang, et al.
Published: (2024)
When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning
by: Zhu, Rongzhi, et al.
Published: (2025)
Can Large Language Models Predict the Outcome of Judicial Decisions?
by: Kmainasi, Mohamed Bayan, et al.
Published: (2025)
A Survey of Large Language Models for Arabic Language and its Dialects
by: Mashaabi, Malak, et al.
Published: (2024)
Importance Weighting Can Help Large Language Models Self-Improve
by: Jiang, Chunyang, et al.
Published: (2024)
AdaptThink: Reasoning Models Can Learn When to Think
by: Zhang, Jiajie, et al.
Published: (2025)
Tool Learning with Large Language Models: A Survey
by: Qu, Changle, et al.
Published: (2024)
Large Language Models Align with the Human Brain during Creative Thinking
by: Ismayilzada, Mete, et al.
Published: (2026)
Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models
by: Ghosal, Soumya Suvra, et al.
Published: (2025)
Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths
by: Nair, Inderjeet, et al.
Published: (2025)