| Main Authors: | Nathan, Varun, Guha, Shreyas, Kumar, Ayush |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.14955 |
Similar Items
Planning-Aware Code Infilling via Horizon-Length Prediction
by: Ding, Yifeng, et al.
Published: (2024)
ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
by: Liu, Marianne Menglin, et al.
Published: (2025)
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs
by: Iskander, Shadi, et al.
Published: (2024)
PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation
by: Yoo, Jaeseok, et al.
Published: (2024)
Evaluation of Code LLMs on Geospatial Code Generation
by: Gramacki, Piotr, et al.
Published: (2024)
LeDex: Training LLMs to Better Self-Debug and Explain Code
by: Jiang, Nan, et al.
Published: (2024)
MathDuels: Evaluating LLMs as Problem Posers and Solvers
by: Xu, Zhiqiu, et al.
Published: (2026)
AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans
by: Zi, Yangtian, et al.
Published: (2025)
FairCoder: Evaluating Social Bias of LLMs in Code Generation
by: Du, Yongkang, et al.
Published: (2025)
The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
by: Xu, Haoyuan, et al.
Published: (2026)
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
by: Huang, Shiting, et al.
Published: (2025)
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
by: Wang, Zihan, et al.
Published: (2025)
Using Large Language Models for Student-Code Guided Test Case Generation in Computer Science Education
by: Kumar, Nischal Ashok, et al.
Published: (2024)
CodeScout: Contextual Problem Statement Enhancement for Software Agents
by: Suri, Manan, et al.
Published: (2026)
Privacy Policy Analysis through Prompt Engineering for LLMs
by: Goknil, Arda, et al.
Published: (2024)
MATCH: Task-Driven Code Evaluation through Contrastive Learning
by: Ghoummaid, Marah, et al.
Published: (2025)
Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
by: Gong, Linyuan, et al.
Published: (2024)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
by: Huang, Yue, et al.
Published: (2023)
EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning
by: Huang, Dong, et al.
Published: (2024)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination
by: Zheng, Jiasheng, et al.
Published: (2026)
Benchmarking Failures in Tool-Augmented Language Models
by: Treviño, Eduardo, et al.
Published: (2025)
From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
by: Ahmad, Wasi Uddin, et al.
Published: (2025)
Evaluating Plan Compliance in Autonomous Programming Agents
by: Liu, Shuyang, et al.
Published: (2026)
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision
by: Lu, Yifei, et al.
Published: (2025)
Sanskrit Knowledge-based Systems: Annotation and Computational Tools
by: Terdalkar, Hrishikesh
Published: (2024)
ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs
by: Ding, Peng, et al.
Published: (2025)
A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning
by: Li, Xinzhe
Published: (2024)
Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
by: Yang, Zheyuan, et al.
Published: (2025)
Dont Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination
by: Choubey, Prafulla Kumar, et al.
Published: (2026)
JEDI: Java Evaluation of Declarative and Imperative Queries
by: Schiavio, Filippo, et al.
Published: (2026)
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
by: Chen, Kexin, et al.
Published: (2024)
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
by: Chen, Shiqi, et al.
Published: (2026)
Multi-Programming Language Sandbox for LLMs
by: Dou, Shihan, et al.
Published: (2024)
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval
by: Wang, Jiexin, et al.
Published: (2024)
LLMs in Mobile Apps: Practices, Challenges, and Opportunities
by: Hau, Kimberly, et al.
Published: (2025)
Learning Code Preference via Synthetic Evolution
by: Liu, Jiawei, et al.
Published: (2024)
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
by: Wang, Shuaiqi, et al.
Published: (2026)
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
by: Lu, Yuxuan, et al.
Published: (2026)
Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents
by: Wang, Qihao, et al.
Published: (2026)