:: Library Catalog

Bibliographic Details
Main Authors:	Gao, Yicheng; Zhou, Xiaolin; Li, Yahan; Zhao, Yue; Liu, Ruishan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.07058

Similar Items

Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)

MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills
by: Yao, Zonghai, et al.
Published: (2024)

From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness
by: Cao, Linbo, et al.
Published: (2026)

Learning to Ask: When LLM Agents Meet Unclear Instruction
by: Wang, Wenxuan, et al.
Published: (2024)

MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
by: Xu, Ran, et al.
Published: (2025)

MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph
by: Zhang, Duzhen, et al.
Published: (2025)

KLong: Training LLM Agent for Extremely Long-horizon Tasks
by: Liu, Yue, et al.
Published: (2026)

From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
by: Wang, Xinyue, et al.
Published: (2026)

Ask-before-Plan: Proactive Language Agents for Real-World Planning
by: Zhang, Xuan, et al.
Published: (2024)

Training Proactive and Personalized LLM Agents
by: Sun, Weiwei, et al.
Published: (2025)

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)

Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
by: Da, Jeff, et al.
Published: (2025)

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
by: Su, Zhe, et al.
Published: (2024)

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
by: Xi, Zhiheng, et al.
Published: (2024)

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
by: Ma, Zerun, et al.
Published: (2026)

Training Versatile Coding Agents in Synthetic Environments
by: Zhu, Yiqi, et al.
Published: (2025)

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
by: Liu, Youwei, et al.
Published: (2026)

ExpeL: LLM Agents Are Experiential Learners
by: Zhao, Andrew, et al.
Published: (2023)

Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents
by: Xia, Fanzeng, et al.
Published: (2024)

Medchain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence
by: Liu, Jie, et al.
Published: (2024)

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
by: Liu, Yixin, et al.
Published: (2026)

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
by: Grand, Gabriel, et al.
Published: (2025)

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
by: Tang, Wenjie, et al.
Published: (2025)

AgentAsk: Multi-Agent Systems Need to Ask
by: Lin, Bohan, et al.
Published: (2025)

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
by: Dong, Guanting, et al.
Published: (2026)

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
by: Yan, Weixiang, et al.
Published: (2024)

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
by: He, Lanshan, et al.
Published: (2026)

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
by: Mazumder, Aritra, et al.
Published: (2026)

Stance Detection with Collaborative Role-Infused LLM-Based Agents
by: Lan, Xiaochong, et al.
Published: (2023)

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
by: Li, Xirui, et al.
Published: (2026)

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
by: Zhong, Tianle, et al.
Published: (2026)

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
by: Cheng, Zihao, et al.
Published: (2026)

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
by: Song, Yueqi, et al.
Published: (2025)

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
by: Zhu, Yinghao, et al.
Published: (2025)

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
by: Yang, Zonghan, et al.
Published: (2025)

Long-term Task-oriented Agent: Proactive Long-term Intent Maintenance in Dynamic Environments
by: Shi, Qinglong, et al.
Published: (2026)

On the Structural Memory of LLM Agents
by: Zeng, Ruihong, et al.
Published: (2024)

Scalable Environments Drive Generalizable Agents
by: Zhang, Jiayi, et al.
Published: (2026)