:: Library Catalog

Bibliographic Details
Main Authors:	Hu, Taojun; Zhou, Xiao-Hua
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2404.09135

Similar Items

LLM-based NLG Evaluation: Current Status and Challenges
by: Gao, Mingqi, et al.
Published: (2024)

Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
by: Hua, Yilun, et al.
Published: (2026)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
by: Long, Lingkun, et al.
Published: (2026)

LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
by: Han, Jieun, et al.
Published: (2023)

Faithful Model Evaluation for Model-Based Metrics
by: Goyal, Palash, et al.
Published: (2023)

Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
by: Shin, Hyungyu, et al.
Published: (2025)

Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation
by: Wang, Yuhao, et al.
Published: (2025)

Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
by: Guo, Jiaxin, et al.
Published: (2025)

An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
by: Zhou, Xin, et al.
Published: (2025)

PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation
by: Hu, Jinpeng, et al.
Published: (2024)

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
by: Eigler, Lukáš, et al.
Published: (2026)

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)

Challenging the Evaluator: LLM Sycophancy Under User Rebuttal
by: Kim, Sungwon, et al.
Published: (2025)

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities
by: Imamura, Kenji, et al.
Published: (2026)

GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics
by: Filandrianos, Giorgos, et al.
Published: (2025)

A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
by: Xi, Yunjia, et al.
Published: (2025)

How to Get Your LLM to Generate Challenging Problems for Evaluation
by: Patel, Arkil, et al.
Published: (2025)

IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
by: Lin, Fan, et al.
Published: (2024)

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
by: Rao, Delip, et al.
Published: (2026)

LLM Inference Unveiled: Survey and Roofline Model Insights
by: Yuan, Zhihang, et al.
Published: (2024)

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
by: Deviyani, Athiya, et al.
Published: (2025)

PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics
by: Larionov, Daniil, et al.
Published: (2024)

Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
by: Tang, Bangsheng, et al.
Published: (2025)

Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
by: Cho, Yousang, et al.
Published: (2025)

NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark
by: Campos, Ricardo, et al.
Published: (2026)

Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong's Polylingual Dynamics
by: Xie, Peng, et al.
Published: (2023)

Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
by: Luo, Zhifan, et al.
Published: (2025)

The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
by: Abeysinghe, Bhashithe, et al.
Published: (2024)

Evaluating Compositional Approaches for Focus and Sentiment Analysis
by: Kellert, Olga, et al.
Published: (2025)

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
by: Wang, Xiao, et al.
Published: (2025)

Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
by: Duan, Xufeng, et al.
Published: (2024)

Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
by: Park, Jin Hyun, et al.
Published: (2025)

APPLS: Evaluating Evaluation Metrics for Plain Language Summarization
by: Guo, Yue, et al.
Published: (2023)

A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case
by: Meyer, Sonia, et al.
Published: (2024)

Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators
by: Chang, Jiayi, et al.
Published: (2025)

Evaluating Metrics for Bias in Word Embeddings
by: Schröder, Sarah, et al.
Published: (2021)

ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding
by: Azime, Israel Abebe, et al.
Published: (2024)

A likelihood-based sensitivity analysis for addressing publication bias in meta-analysis of diagnostic studies using exact likelihood
by: Hu, Taojun, et al.
Published: (2024)