| Main Authors: | Vu, Thanh; Nayak, Richi; Balasubramaniam, Thiru |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.08273 |
Similar Items
ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN
by: Bashar, Md Abul, et al.
Published: (2023)
Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
by: Uniyal, Deepak, et al.
Published: (2026)
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
by: Ramesh, Guruprasad Viswanathan, et al.
Published: (2026)
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
by: Hong, Sirui, et al.
Published: (2026)
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
by: Song, Seyoung, et al.
Published: (2025)
Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel
by: Maddi, Aadyaa, et al.
Published: (2026)
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
by: Hathidara, Ashutosh, et al.
Published: (2026)
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
by: Ye, Bowen, et al.
Published: (2026)
ReviewEval: An Evaluation Framework for AI-Generated Reviews
by: Garg, Madhav Krishan, et al.
Published: (2025)
General Agent Evaluation
by: Bandel, Elron, et al.
Published: (2026)
AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
by: Lu, Qinghua, et al.
Published: (2025)
GDPR Auto-Formalization with AI Agents and Human Verification
by: Nguyen, Ha Thanh, et al.
Published: (2026)
AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning
by: Fu, Suzhong, et al.
Published: (2026)
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
by: Guo, Dongxin, et al.
Published: (2026)
Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents
by: Kar, Rishabh
Published: (2026)
Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models
by: Vishwakarma, Sanjay, et al.
Published: (2024)
AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
by: Sun, Jiahui, et al.
Published: (2025)
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
by: Raj, Harsh, et al.
Published: (2026)
Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
by: Isbarov, Jafar, et al.
Published: (2026)
Propensity-to-Pay: Machine Learning for Estimating Prediction Uncertainty
by: Bashar, Md Abul, et al.
Published: (2020)
SIDiffAgent: Self-Improving Diffusion Agent
by: Garg, Shivank, et al.
Published: (2026)
Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation
by: Her, Lim Chien, et al.
Published: (2025)
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
by: Chen, Yurun, et al.
Published: (2025)
Multi-Agent Framework for Controllable and Protected Generative Content Creation: Addressing Copyright and Provenance in AI-Generated Media
by: Khan, Haris, et al.
Published: (2026)
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation
by: Han, Shuhao, et al.
Published: (2024)
PerfGuard: A Performance-Aware Agent for Visual Content Generation
by: Chen, Zhipeng, et al.
Published: (2026)
SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
by: Jiang, Sihang, et al.
Published: (2026)
Motion-to-Response Content Generation via Multi-Agent AI System with Real-Time Safety Verification
by: Lee, HyeYoung
Published: (2026)
SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs
by: Ye, Anbang, et al.
Published: (2025)
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
by: Huang, Yizheng, et al.
Published: (2026)
EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment
by: Wen, Xin-Cheng, et al.
Published: (2024)
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
by: Khatun, Aisha, et al.
Published: (2024)
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
by: Yueh-Han, Chen, et al.
Published: (2025)
Agent Identity Evals: Measuring Agentic Identity
by: Perrier, Elija, et al.
Published: (2025)
Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework
by: Liu, Jung-Hua
Published: (2025)
Towards a Science of AI Agent Reliability
by: Rabanser, Stephan, et al.
Published: (2026)
WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
by: Qian, Yaoyao, et al.
Published: (2025)
TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents
by: Chen, Weiyi, et al.
Published: (2026)
VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series
by: Xu, Pengyu, et al.
Published: (2025)
Times2D: Multi-Period Decomposition and Derivative Mapping for General Time Series Forecasting
by: Nematirad, Reza, et al.
Published: (2025)