| Main Authors: | Filali, Ali El, Bedar, Inès |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.18029 |
Similar Items
Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
by: AlShikh, Waseem, et al.
Published: (2025)
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
by: Tan, Haoran, et al.
Published: (2025)
SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
by: Abdelnasser, Omar, et al.
Published: (2026)
Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language
by: Paul, Ronny, et al.
Published: (2024)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
by: Li, Zekun, et al.
Published: (2025)
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
by: Shu, Raphael, et al.
Published: (2024)
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
by: Röttger, Paul, et al.
Published: (2024)
OLMES: A Standard for Language Model Evaluations
by: Gu, Yuling, et al.
Published: (2024)
Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
by: Qian, Cheng, et al.
Published: (2024)
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
by: Liu, Zhiwei, et al.
Published: (2025)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)
Holistic Evaluation and Failure Diagnosis of AI Agents
by: Madvil, Netta, et al.
Published: (2026)
Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
by: Komoravolu, Sameer, et al.
Published: (2025)
Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
by: Al-Tawaha, Ahmad, et al.
Published: (2026)
Evaluating Multimodal Generative AI with Korean Educational Standards
by: Park, Sanghee, et al.
Published: (2025)
Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires
by: Münker, Simon
Published: (2025)
From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM
by: Luo, Chanyong, et al.
Published: (2026)
Towards stable AI systems for Evaluating Arabic Pronunciations
by: Zaatiti, Hadi, et al.
Published: (2025)
AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
by: Shu, Yiheng, et al.
Published: (2026)
AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
by: Kartik, NVJK, et al.
Published: (2025)
HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment
by: Mekky, Ali, et al.
Published: (2025)
Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
by: Ding, Bowen, et al.
Published: (2025)
FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
by: Chen, Haorui, et al.
Published: (2025)
From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes
by: Garzón, Rubén, et al.
Published: (2026)
Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models
by: Iravani, Sahar, et al.
Published: (2024)
Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation
by: Wang, Xinyi, et al.
Published: (2026)
SMATCH++: Standardized and Extended Evaluation of Semantic Graphs
by: Opitz, Juri
Published: (2023)
More Agents Is All You Need
by: Li, Junyou, et al.
Published: (2024)
RoToR: Towards More Reliable Responses for Order-Invariant Inputs
by: Yoon, Soyoung, et al.
Published: (2025)
Role-Playing Evaluation for Large Language Models
by: Boudouri, Yassine El, et al.
Published: (2025)
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
by: Zhu, Jiachen, et al.
Published: (2025)
Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To Word--Definition Alignment
by: ElBakry, Ahmed, et al.
Published: (2023)
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
by: Carro, María Victoria, et al.
Published: (2025)
More Agents Improve Math Problem Solving but Adversarial Robustness Gap Persists
by: Alavi, Khashayar, et al.
Published: (2025)
Characteristic AI Agents via Large Language Models
by: Wang, Xi, et al.
Published: (2024)
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
by: Zhou, Karen, et al.
Published: (2025)
ACIArena: Toward Unified Evaluation for Agent Cascading Injection
by: An, Hengyu, et al.
Published: (2026)
StaICC: Standardized Evaluation for Classification Task in In-context Learning
by: Cho, Hakaze, et al.
Published: (2025)
Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
by: Testini, Irene, et al.
Published: (2025)
Giving AI Personalities Leads to More Human-Like Reasoning
by: Nighojkar, Animesh, et al.
Published: (2025)