:: Library Catalog

Bibliographic Details
Main Authors:	Lichkovski, Ilija; Müller, Alexander; Ibrahim, Mariam; Mhundwa, Tiwai
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.21524

Similar Items

AI Agents Under EU Law
by: Nannini, Luca, et al.
Published: (2026)

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
by: Ferrao, Jeremias, et al.
Published: (2025)

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
by: Zheng, Tianshi, et al.
Published: (2025)

AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition
by: Wang, Ruipeng, et al.
Published: (2026)

When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
by: Mehta, Aman
Published: (2026)

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)

LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
by: Zheng, Junhao, et al.
Published: (2025)

Generative AI in EU Law: Liability, Privacy, Intellectual Property, and Cybersecurity
by: Novelli, Claudio, et al.
Published: (2024)

Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
by: Prandi, Matteo, et al.
Published: (2025)

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
by: Guo, Zhengkang, et al.
Published: (2026)

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
by: Wiedermann-Möller, Jonas, et al.
Published: (2026)

Formally Specifying the High-Level Behavior of LLM-Based Agents
by: Crouse, Maxwell, et al.
Published: (2023)

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
by: Shang, Yu, et al.
Published: (2025)

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents
by: Nandi, Subhrangshu, et al.
Published: (2025)

MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
by: Zhang, Weichen, et al.
Published: (2025)

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?
by: Chen, Wanyi, et al.
Published: (2026)

The Scaling Laws of Skills in LLM Agent Systems
by: Chen, Charles, et al.
Published: (2026)

LLM Agents in Law: Taxonomy, Applications, and Challenges
by: Liu, Shuang, et al.
Published: (2026)

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
by: Wang, Luyuan, et al.
Published: (2024)

FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
by: Zhang, Hanrong, et al.
Published: (2024)

Sequential Behavioral Watermarking for LLM Agents
by: An, Hyeseon, et al.
Published: (2026)

"My Kind of Woman": Analysing Gender Stereotypes in AI through The Averageness Theory and EU Law
by: Doh, Miriam, et al.
Published: (2024)

EU Trade-Related Measures against Illegal Fishing
by: Kadfak, Alin, et al.
Published: (2023)

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
by: Wu, Yunze, et al.
Published: (2025)

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
by: Liu, Ruoqi, et al.
Published: (2026)

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
by: Zhou, Yifan, et al.
Published: (2026)

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
by: Yin, Sheng, et al.
Published: (2024)

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
by: Costarelli, Anthony, et al.
Published: (2024)

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
by: Yao, Yilun, et al.
Published: (2026)

BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments
by: Li, Yuxuan, et al.
Published: (2026)

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
by: Müller, Robert, et al.
Published: (2026)

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
by: Fa, Dionizije, et al.
Published: (2026)

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?
by: Hua, Dongdong, et al.
Published: (2026)

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
by: Li, Xiangyi, et al.
Published: (2026)

Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure
by: Butt, Talal Ashraf, et al.
Published: (2026)

AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)