| Main Authors: | Wang, Liang, Wang, Junpeng, Yeh, Chin-chia Michael, Zheng, Yan, Sun, Jiarui, Fan, Xiran, Dai, Xin, Fan, Yujie, Cai, Yiwei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.05110 |
Similar Items
Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework
by: Wang, Xiaohua, et al.
Published: (2026)
GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
by: Fostiropoulos, Iordanis, et al.
Published: (2026)
Toward Architecture-Aware Evaluation Metrics for LLM Agents
by: Souza, Débora, et al.
Published: (2026)
LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
by: Karinshak, Elise, et al.
Published: (2024)
Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
by: Wu, Dekun, et al.
Published: (2023)
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
by: He, Kaifeng, et al.
Published: (2025)
I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications
by: Dai, Dasen, et al.
Published: (2026)
SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions
by: Wang, Tianyu, et al.
Published: (2026)
Evaluating the efficacy of LLM Safety Solutions: The Palit Benchmark Dataset
by: Palit, Sayon, et al.
Published: (2025)
CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening
by: Lorenzoni, Giuliano, et al.
Published: (2026)
Evaluating LLM Metrics Through Real-World Capabilities
by: Miller, Justin K, et al.
Published: (2025)
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
by: Sandan, Isik Baran, et al.
Published: (2025)
The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
by: Shah, Aakriti, et al.
Published: (2025)
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
by: Lucas, Tom, et al.
Published: (2026)
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
by: Nguyen, Huyen, et al.
Published: (2026)
ATANT: An Evaluation Framework for AI Continuity
by: Tanguturi, Samuel Sameer
Published: (2026)
ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering
by: Ghosh, Shubhra, et al.
Published: (2025)
LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
by: Lai, Junyu, et al.
Published: (2025)
OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
by: Iqbal, Hasan, et al.
Published: (2024)
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
by: Thorne, William, et al.
Published: (2026)
AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
by: Loi, Dario, et al.
Published: (2025)
Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures
by: Shah, Siddharth, et al.
Published: (2025)
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
by: Hashemi, Helia, et al.
Published: (2024)
Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning
by: Lin, Xinxin, et al.
Published: (2026)
Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations
by: Hinterleitner, Lukas, et al.
Published: (2026)
Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
by: Hu, Junyi, et al.
Published: (2024)
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
by: Li, Bowen, et al.
Published: (2026)
Evaluating Input Feature Explanations through a Unified Diagnostic Evaluation Framework
by: Sun, Jingyi, et al.
Published: (2024)
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview
by: Han, Lifeng, et al.
Published: (2016)
Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
by: Zhang, Li, et al.
Published: (2025)
FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
by: Liu, Fangxin, et al.
Published: (2025)
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
by: Guo, Pengze, et al.
Published: (2026)
Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text
by: Oketunji, Abiodun Finbarrs
Published: (2023)
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
by: Sun, Jiachen, et al.
Published: (2024)
Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias
by: Wu, Shuai, et al.
Published: (2026)
Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
by: Da, Longchao, et al.
Published: (2025)
TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification
by: Yeh, Chin-Chia Michael, et al.
Published: (2025)
Comprehensive Evaluation and Insights into the Use of Large Language Models in the Automation of Behavior-Driven Development Acceptance Test Formulation
by: Karpurapu, Shanthi, et al.
Published: (2024)
Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation
by: Mehta, Manisha, et al.
Published: (2025)