| Main Authors: | Matinez, Yago Romano, Roberts, Jesse |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.09867 |
Similar Items
Multiplayer Nash Preference Optimization
by: Wu, Fang, et al.
Published: (2025)
Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information
by: Yim, Yauwai, et al.
Published: (2024)
The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
by: Pinhanez, Claudio, et al.
Published: (2025)
Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models
by: Moore, Kyle, et al.
Published: (2025)
Chain of Thought Still Thinks Fast: APriCoT Helps with Thinking Slow
by: Moore, Kyle, et al.
Published: (2024)
Are LLMs complicated ethical dilemma analyzers?
by: Jiashen, et al.
Published: (2025)
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
by: Zhang, Guibin, et al.
Published: (2025)
Computer Environments Elicit General Agentic Intelligence in LLMs
by: Cheng, Daixuan, et al.
Published: (2026)
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
by: Moore, Kyle, et al.
Published: (2024)
Large Language Model Recall Uncertainty is Modulated by the Fan Effect
by: Roberts, Jesse, et al.
Published: (2024)
GEM: A Gym for Agentic LLMs
by: Liu, Zichen, et al.
Published: (2025)
Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
by: Cavalin, Paulo, et al.
Published: (2025)
Player-Driven Emergence in LLM-Driven Game Narrative
by: Peng, Xiangyu, et al.
Published: (2024)
TxGemma: Efficient and Agentic LLMs for Therapeutics
by: Wang, Eric, et al.
Published: (2025)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
by: Li, Yangning, et al.
Published: (2025)
Agentic Adversarial QA for Improving Domain-Specific LLMs
by: Grari, Vincent, et al.
Published: (2026)
Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
by: Zhao, Minda, et al.
Published: (2026)
Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs
by: Barone, Mariano, et al.
Published: (2026)
Collaborative Quest Completion with LLM-driven Non-Player Characters in Minecraft
by: Rao, Sudha, et al.
Published: (2024)
PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering
by: Nahid, Md Mahadi Hasan, et al.
Published: (2025)
Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset
by: Henkel, Owen, et al.
Published: (2023)
Tool Preferences in Agentic LLMs are Unreliable
by: Faghih, Kazem, et al.
Published: (2025)
Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
by: Li, Yihan, et al.
Published: (2025)
Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning
by: Fan, Wei, et al.
Published: (2026)
Targeted Visualization of the Backbone of Encoder LLMs
by: Roberts, Isaac, et al.
Published: (2024)
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education
by: Henkel, Owen, et al.
Published: (2024)
Toward Optimal LLM Alignments Using Two-Player Games
by: Zheng, Rui, et al.
Published: (2024)
SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
by: Wang, Sijia, et al.
Published: (2026)
Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs
by: Dai, Hankun, et al.
Published: (2025)
UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
by: Duan, Jinhao, et al.
Published: (2025)
Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework
by: Guan, Zihan, et al.
Published: (2026)
Agentic Confidence Calibration
by: Zhang, Jiaxin, et al.
Published: (2026)
Agentic Uncertainty Quantification
by: Zhang, Jiaxin, et al.
Published: (2026)
UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
by: Chen, Chen, et al.
Published: (2025)
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
by: Wu, Junde, et al.
Published: (2025)
AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
by: Piya, Fahmida Liza, et al.
Published: (2026)
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
by: Yadav, Advait, et al.
Published: (2026)
Are Large Vision Language Models Good Game Players?
by: Wang, Xinyu, et al.
Published: (2025)
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
by: Xu, Qinwu, et al.
Published: (2026)
AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
by: Liu, Xianyang, et al.
Published: (2025)