| Main Authors: | Lichkovski, Ilija; Müller, Alexander; Ibrahim, Mariam; Mhundwa, Tiwai |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.21524 |
Similar Items
AI Agents Under EU Law
by: Nannini, Luca, et al.
Published: (2026)
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
by: Ferrao, Jeremias, et al.
Published: (2025)
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
by: Zheng, Tianshi, et al.
Published: (2025)
AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition
by: Wang, Ruipeng, et al.
Published: (2026)
When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
by: Mehta, Aman
Published: (2026)
ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
by: Zheng, Junhao, et al.
Published: (2025)
Generative AI in EU Law: Liability, Privacy, Intellectual Property, and Cybersecurity
by: Novelli, Claudio, et al.
Published: (2024)
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
by: Prandi, Matteo, et al.
Published: (2025)
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
by: Guo, Zhengkang, et al.
Published: (2026)
Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
by: Wiedermann-Möller, Jonas, et al.
Published: (2026)
Formally Specifying the High-Level Behavior of LLM-Based Agents
by: Crouse, Maxwell, et al.
Published: (2023)
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
by: Shang, Yu, et al.
Published: (2025)
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents
by: Nandi, Subhrangshu, et al.
Published: (2025)
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
by: Zhang, Weichen, et al.
Published: (2025)
Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?
by: Chen, Wanyi, et al.
Published: (2026)
The Scaling Laws of Skills in LLM Agent Systems
by: Chen, Charles, et al.
Published: (2026)
LLM Agents in Law: Taxonomy, Applications, and Challenges
by: Liu, Shuang, et al.
Published: (2026)
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
by: Wang, Luyuan, et al.
Published: (2024)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
by: Zhang, Hanrong, et al.
Published: (2024)
Sequential Behavioral Watermarking for LLM Agents
by: An, Hyeseon, et al.
Published: (2026)
"My Kind of Woman": Analysing Gender Stereotypes in AI through The Averageness Theory and EU Law
by: Doh, Miriam, et al.
Published: (2024)
EU Trade-Related Measures against Illegal Fishing
by: Kadfak, Alin, et al.
Published: (2023)
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
by: Wu, Yunze, et al.
Published: (2025)
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
by: Liu, Ruoqi, et al.
Published: (2026)
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
by: Zhou, Yifan, et al.
Published: (2026)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
by: Yin, Sheng, et al.
Published: (2024)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
by: Costarelli, Anthony, et al.
Published: (2024)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)
Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
by: Yao, Yilun, et al.
Published: (2026)
BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments
by: Li, Yuxuan, et al.
Published: (2026)
Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
by: Müller, Robert, et al.
Published: (2026)
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
by: Fa, Dionizije, et al.
Published: (2026)
PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?
by: Hua, Dongdong, et al.
Published: (2026)
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
by: Li, Xiangyi, et al.
Published: (2026)
Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure
by: Butt, Talal Ashraf, et al.
Published: (2026)
AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)