:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Liang; Wang, Junpeng; Yeh, Chin-chia Michael; Zheng, Yan; Sun, Jiarui; Fan, Xiran; Dai, Xin; Fan, Yujie; Cai, Yiwei
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence I.2.7
Online Access:	https://arxiv.org/abs/2602.05110

Similar Items

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework
by: Wang, Xiaohua, et al.
Published: (2026)

GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
by: Fostiropoulos, Iordanis, et al.
Published: (2026)

Toward Architecture-Aware Evaluation Metrics for LLM Agents
by: Souza, Débora, et al.
Published: (2026)

LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
by: Karinshak, Elise, et al.
Published: (2024)

Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
by: Wu, Dekun, et al.
Published: (2023)

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
by: He, Kaifeng, et al.
Published: (2025)

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications
by: Dai, Dasen, et al.
Published: (2026)

SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions
by: Wang, Tianyu, et al.
Published: (2026)

Evaluating the efficacy of LLM Safety Solutions: The Palit Benchmark Dataset
by: Palit, Sayon, et al.
Published: (2025)

CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening
by: Lorenzoni, Giuliano, et al.
Published: (2026)

Evaluating LLM Metrics Through Real-World Capabilities
by: Miller, Justin K, et al.
Published: (2025)

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
by: Sandan, Isik Baran, et al.
Published: (2025)

The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
by: Shah, Aakriti, et al.
Published: (2025)

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
by: Lucas, Tom, et al.
Published: (2026)

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
by: Nguyen, Huyen, et al.
Published: (2026)

ATANT: An Evaluation Framework for AI Continuity
by: Tanguturi, Samuel Sameer
Published: (2026)

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering
by: Ghosh, Shubhra, et al.
Published: (2025)

LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
by: Lai, Junyu, et al.
Published: (2025)

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
by: Iqbal, Hasan, et al.
Published: (2024)

Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
by: Thorne, William, et al.
Published: (2026)

AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
by: Loi, Dario, et al.
Published: (2025)

Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures
by: Shah, Siddharth, et al.
Published: (2025)

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
by: Hashemi, Helia, et al.
Published: (2024)

Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning
by: Lin, Xinxin, et al.
Published: (2026)

Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations
by: Hinterleitner, Lukas, et al.
Published: (2026)

Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
by: Hu, Junyi, et al.
Published: (2024)

Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
by: Li, Bowen, et al.
Published: (2026)

Evaluating Input Feature Explanations through a Unified Diagnostic Evaluation Framework
by: Sun, Jingyi, et al.
Published: (2024)

Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview
by: Han, Lifeng, et al.
Published: (2016)

Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
by: Zhang, Li, et al.
Published: (2025)

FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
by: Liu, Fangxin, et al.
Published: (2025)

Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)

EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
by: Guo, Pengze, et al.
Published: (2026)

Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text
by: Oketunji, Abiodun Finbarrs
Published: (2023)

Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
by: Sun, Jiachen, et al.
Published: (2024)

Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias
by: Wu, Shuai, et al.
Published: (2026)

Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
by: Da, Longchao, et al.
Published: (2025)

TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification
by: Yeh, Chin-Chia Michael, et al.
Published: (2025)

Comprehensive Evaluation and Insights into the Use of Large Language Models in the Automation of Behavior-Driven Development Acceptance Test Formulation
by: Karpurapu, Shanthi, et al.
Published: (2024)

Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation
by: Mehta, Manisha, et al.
Published: (2025)