| Main Authors: | Hu, Taojun, Zhou, Xiao-Hua |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.09135 |
Similar Items
LLM-based NLG Evaluation: Current Status and Challenges
by: Gao, Mingqi, et al.
Published: (2024)
Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
by: Hua, Yilun, et al.
Published: (2026)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
by: Long, Lingkun, et al.
Published: (2026)
LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
by: Han, Jieun, et al.
Published: (2023)
Faithful Model Evaluation for Model-Based Metrics
by: Goyal, Palash, et al.
Published: (2023)
Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
by: Shin, Hyungyu, et al.
Published: (2025)
Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation
by: Wang, Yuhao, et al.
Published: (2025)
Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
by: GUO, Jiaxin, et al.
Published: (2025)
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
by: Zhou, Xin, et al.
Published: (2025)
PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation
by: Hu, Jinpeng, et al.
Published: (2024)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)
LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
by: Eigler, Lukáš, et al.
Published: (2026)
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)
Challenging the Evaluator: LLM Sycophancy Under User Rebuttal
by: Kim, Sungwon, et al.
Published: (2025)
A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities
by: Imamura, Kenji, et al.
Published: (2026)
GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics
by: Filandrianos, Giorgos, et al.
Published: (2025)
A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
by: Xi, Yunjia, et al.
Published: (2025)
How to Get Your LLM to Generate Challenging Problems for Evaluation
by: Patel, Arkil, et al.
Published: (2025)
IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
by: Lin, Fan, et al.
Published: (2024)
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
by: Rao, Delip, et al.
Published: (2026)
LLM Inference Unveiled: Survey and Roofline Model Insights
by: Yuan, Zhihang, et al.
Published: (2024)
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
by: Deviyani, Athiya, et al.
Published: (2025)
PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics
by: Larionov, Daniil, et al.
Published: (2024)
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
by: Tang, Bangsheng, et al.
Published: (2025)
Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
by: Cho, Yousang, et al.
Published: (2025)
NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark
by: Campos, Ricardo, et al.
Published: (2026)
Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong's Polylingual Dynamics
by: Xie, Peng, et al.
Published: (2023)
Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
by: Luo, Zhifan, et al.
Published: (2025)
The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
by: Abeysinghe, Bhashithe, et al.
Published: (2024)
Evaluating Compositional Approaches for Focus and Sentiment Analysis
by: Kellert, Olga, et al.
Published: (2025)
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
by: Wang, Xiao, et al.
Published: (2025)
Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
by: Duan, Xufeng, et al.
Published: (2024)
Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
by: Park, Jin Hyun, et al.
Published: (2025)
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization
by: Guo, Yue, et al.
Published: (2023)
A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case
by: Meyer, Sonia, et al.
Published: (2024)
Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators
by: Chang, Jiayi, et al.
Published: (2025)
Evaluating Metrics for Bias in Word Embeddings
by: Schröder, Sarah, et al.
Published: (2021)
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding
by: Azime, Israel Abebe, et al.
Published: (2024)
A likelihood-based sensitivity analysis for addressing publication bias in meta-analysis of diagnostic studies using exact likelihood
by: Hu, Taojun, et al.
Published: (2024)