Bibliographic Details
Main Authors:	Lin, Lanbo; Liu, Jiayao; Yang, Tianyuan; Cai, Li; Xu, Yuanwu; Wei, Lei; Xie, Sicong; Zhang, Guannan
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.06486

_version_	1866918325812985856
author	Lin, Lanbo; Liu, Jiayao; Yang, Tianyuan; Cai, Li; Xu, Yuanwu; Wei, Lei; Xie, Sicong; Zhang, Guannan
author_facet	Lin, Lanbo; Liu, Jiayao; Yang, Tianyuan; Cai, Li; Xu, Yuanwu; Wei, Lei; Xie, Sicong; Zhang, Guannan
contents	Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling-world/JADE.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_06486
institution	arXiv
publishDate	2026
record_format	arxiv
title	JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.06486
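
The contents field above mentions "evidence-dependency gating to invalidate conclusions built on refuted claims". The following is a minimal sketch of one way such gating could work. It is not the authors' implementation (see the linked repository for that). The `Claim` class, the `"supported"`/`"refuted"` verdict labels, and the `gate_claims` function are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A single claim extracted from a report (hypothetical structure)."""
    claim_id: str
    verdict: str  # assumed labels: "supported" or "refuted"
    depends_on: list[str] = field(default_factory=list)


def gate_claims(claims: list[Claim]) -> dict[str, bool]:
    """Map each claim id to True if the claim survives gating.

    A claim survives only if it is itself supported and every claim it
    depends on, directly or transitively, also survives.
    """
    by_id = {c.claim_id: c for c in claims}
    cache: dict[str, bool] = {}

    def is_valid(claim_id: str, visiting: frozenset[str]) -> bool:
        if claim_id in cache:
            return cache[claim_id]
        # Assumption: circular or unknown dependencies count as invalid.
        if claim_id in visiting or claim_id not in by_id:
            return False
        claim = by_id[claim_id]
        ok = claim.verdict == "supported" and all(
            is_valid(dep, visiting | {claim_id}) for dep in claim.depends_on
        )
        cache[claim_id] = ok
        return ok

    return {c.claim_id: is_valid(c.claim_id, frozenset()) for c in claims}


if __name__ == "__main__":
    report = [
        Claim("c1", "supported"),
        Claim("c2", "refuted"),
        Claim("c3", "supported", depends_on=["c1"]),
        Claim("c4", "supported", depends_on=["c2"]),  # built on a refuted claim
        Claim("c5", "supported", depends_on=["c4"]),  # inherits the invalidation
    ]
    for claim_id, survives in gate_claims(report).items():
        print(claim_id, "valid" if survives else "invalidated")
```

In this sketch, a conclusion judged "supported" in isolation (`c4`, `c5`) is still invalidated when its chain of evidence passes through a refuted claim, which is the behavior the abstract attributes to the gating step.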
