:: Library Catalog

Bibliographic Details
Main Authors:	Vu, Thanh; Nayak, Richi; Balasubramaniam, Thiru
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.08273

Similar Items

ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN
by: Bashar, Md Abul, et al.
Published: (2023)

Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
by: Uniyal, Deepak, et al.
Published: (2026)

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
by: Ramesh, Guruprasad Viswanathan, et al.
Published: (2026)

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
by: Hong, Sirui, et al.
Published: (2026)

MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
by: Song, Seyoung, et al.
Published: (2025)

Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel
by: Maddi, Aadyaa, et al.
Published: (2026)

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
by: Hathidara, Ashutosh, et al.
Published: (2026)

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
by: Ye, Bowen, et al.
Published: (2026)

ReviewEval: An Evaluation Framework for AI-Generated Reviews
by: Garg, Madhav Krishan, et al.
Published: (2025)

General Agent Evaluation
by: Bandel, Elron, et al.
Published: (2026)

AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
by: Lu, Qinghua, et al.
Published: (2025)

GDPR Auto-Formalization with AI Agents and Human Verification
by: Nguyen, Ha Thanh, et al.
Published: (2026)

AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning
by: Fu, Suzhong, et al.
Published: (2026)

AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
by: Guo, Dongxin, et al.
Published: (2026)

Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents
by: Kar, Rishabh
Published: (2026)

Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models
by: Vishwakarma, Sanjay, et al.
Published: (2024)

AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
by: Sun, Jiahui, et al.
Published: (2025)

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
by: Raj, Harsh, et al.
Published: (2026)

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
by: Isbarov, Jafar, et al.
Published: (2026)

Propensity-to-Pay: Machine Learning for Estimating Prediction Uncertainty
by: Bashar, Md Abul, et al.
Published: (2020)

SIDiffAgent: Self-Improving Diffusion Agent
by: Garg, Shivank, et al.
Published: (2026)

Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation
by: Her, Lim Chien, et al.
Published: (2025)

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
by: Chen, Yurun, et al.
Published: (2025)

Multi-Agent Framework for Controllable and Protected Generative Content Creation: Addressing Copyright and Provenance in AI-Generated Media
by: Khan, Haris, et al.
Published: (2026)

EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation
by: Han, Shuhao, et al.
Published: (2024)

PerfGuard: A Performance-Aware Agent for Visual Content Generation
by: Chen, Zhipeng, et al.
Published: (2026)

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
by: Jiang, Sihang, et al.
Published: (2026)

Motion-to-Response Content Generation via Multi-Agent AI System with Real-Time Safety Verification
by: Lee, HyeYoung
Published: (2026)

SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs
by: Ye, Anbang, et al.
Published: (2025)

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
by: Huang, Yizheng, et al.
Published: (2026)

EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment
by: Wen, Xin-Cheng, et al.
Published: (2024)

TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
by: Khatun, Aisha, et al.
Published: (2024)

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
by: Yueh-Han, Chen, et al.
Published: (2025)

Agent Identity Evals: Measuring Agentic Identity
by: Perrier, Elija, et al.
Published: (2025)

Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework
by: Liu, Jung-Hua
Published: (2025)

Towards a Science of AI Agent Reliability
by: Rabanser, Stephan, et al.
Published: (2026)

WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
by: Qian, Yaoyao, et al.
Published: (2025)

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents
by: Chen, Weiyi, et al.
Published: (2026)

VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series
by: Xu, Pengyu, et al.
Published: (2025)

Times2D: Multi-Period Decomposition and Derivative Mapping for General Time Series Forecasting
by: Nematirad, Reza, et al.
Published: (2025)