:: Library Catalog

Bibliographic Details
Main Authors:	Filali, Ali El; Bedar, Inès
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.18029

Similar Items

Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
by: AlShikh, Waseem, et al.
Published: (2025)

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
by: Tan, Haoran, et al.
Published: (2025)

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
by: Abdelnasser, Omar, et al.
Published: (2026)

Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language
by: Paul, Ronny, et al.
Published: (2024)

SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
by: Li, Zekun, et al.
Published: (2025)

Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
by: Shu, Raphael, et al.
Published: (2024)

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
by: Röttger, Paul, et al.
Published: (2024)

OLMES: A Standard for Language Model Evaluations
by: Gu, Yuling, et al.
Published: (2024)

Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
by: Qian, Cheng, et al.
Published: (2024)

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
by: Liu, Zhiwei, et al.
Published: (2025)

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)

Holistic Evaluation and Failure Diagnosis of AI Agents
by: Madvil, Netta, et al.
Published: (2026)

Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
by: Komoravolu, Sameer, et al.
Published: (2025)

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
by: Al-Tawaha, Ahmad, et al.
Published: (2026)

Evaluating Multimodal Generative AI with Korean Educational Standards
by: Park, Sanghee, et al.
Published: (2025)

Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires
by: Münker, Simon
Published: (2025)

From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM
by: Luo, Chanyong, et al.
Published: (2026)

Towards stable AI systems for Evaluating Arabic Pronunciations
by: Zaatiti, Hadi, et al.
Published: (2025)

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
by: Shu, Yiheng, et al.
Published: (2026)

AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
by: Kartik, NVJK, et al.
Published: (2025)

HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment
by: Mekky, Ali, et al.
Published: (2025)

Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
by: Ding, Bowen, et al.
Published: (2025)

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
by: Chen, Haorui, et al.
Published: (2025)

From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes
by: Garzón, Rubén, et al.
Published: (2026)

Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models
by: Iravani, Sahar, et al.
Published: (2024)

Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation
by: Wang, Xinyi, et al.
Published: (2026)

SMATCH++: Standardized and Extended Evaluation of Semantic Graphs
by: Opitz, Juri
Published: (2023)

More Agents Is All You Need
by: Li, Junyou, et al.
Published: (2024)

RoToR: Towards More Reliable Responses for Order-Invariant Inputs
by: Yoon, Soyoung, et al.
Published: (2025)

Role-Playing Evaluation for Large Language Models
by: Boudouri, Yassine El, et al.
Published: (2025)

Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
by: Zhu, Jiachen, et al.
Published: (2025)

Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To Word--Definition Alignment
by: ElBakry, Ahmed, et al.
Published: (2023)

AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
by: Carro, María Victoria, et al.
Published: (2025)

More Agents Improve Math Problem Solving but Adversarial Robustness Gap Persists
by: Alavi, Khashayar, et al.
Published: (2025)

Characteristic AI Agents via Large Language Models
by: Wang, Xi, et al.
Published: (2024)

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
by: Zhou, Karen, et al.
Published: (2025)

ACIArena: Toward Unified Evaluation for Agent Cascading Injection
by: An, Hengyu, et al.
Published: (2026)

StaICC: Standardized Evaluation for Classification Task in In-context Learning
by: Cho, Hakaze, et al.
Published: (2025)

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
by: Testini, Irene, et al.
Published: (2025)

Giving AI Personalities Leads to More Human-Like Reasoning
by: Nighojkar, Animesh, et al.
Published: (2025)