| Main Authors: | Wu, Kevin; Wu, Eric; Zou, James |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.10198 |
Similar Items
FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?
by: Wu, Eric, et al.
Published: (2024)
TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy
by: McCammon, James
Published: (2025)
AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data
by: Wu, JiaRu, et al.
Published: (2025)
System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection
by: Wu, Binglin, et al.
Published: (2025)
How well do LLMs cite relevant medical references? An evaluation framework and analyses
by: Wu, Kevin, et al.
Published: (2024)
Tug-of-war between idioms' figurative and literal interpretations in LLMs
by: Oh, Soyoung, et al.
Published: (2025)
PatentEval: Understanding Errors in Patent Generation
by: Zuo, You, et al.
Published: (2024)
Measuring all the noises of LLM Evals
by: Wang, Sida
Published: (2025)
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
by: Khatun, Aisha, et al.
Published: (2024)
Data Compressibility Quantifies LLM Memorization
by: Huang, Yizhan, et al.
Published: (2025)
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
by: D'Souza, Jennifer, et al.
Published: (2025)
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)
A Single Character can Make or Break Your LLM Evals
by: Su, Jingtong, et al.
Published: (2025)
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
by: Khan, Haidar, et al.
Published: (2025)
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
by: Zhao, Jiahao, et al.
Published: (2025)
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
by: Kleidermacher, Hannah Calzi, et al.
Published: (2025)
BriLLM: Brain-inspired Large Language Model
by: Zhao, Hai, et al.
Published: (2025)
ReasonOps: Operator Segmentation for LLM Reasoning Traces
by: Lee, Daniel, et al.
Published: (2026)
An evaluation of LLMs for political bias in Western media: Israel-Hamas and Ukraine-Russia wars
by: Chandra, Rohitash, et al.
Published: (2026)
CausalEval: Towards Better Causal Reasoning in Language Models
by: Yu, Longxuan, et al.
Published: (2024)
AcademicEval: Live Long-Context LLM Benchmark
by: Zhang, Haozhen, et al.
Published: (2025)
Disentangling Reasoning and Knowledge in Medical Large Language Models
by: Thapa, Rahul, et al.
Published: (2025)
ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
by: Nguyen, Trong-Hieu, et al.
Published: (2024)
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
by: Alyahya, Hisham A., et al.
Published: (2025)
Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM
by: Chi, Xiao, et al.
Published: (2025)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
by: Zhang, Ziqian, et al.
Published: (2026)
UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
by: Markchom, Thanet, et al.
Published: (2025)
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)
RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
by: Huang, Zhongzhan, et al.
Published: (2025)
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
Quantifying Geospatial in the Common Crawl Corpus
by: Ilyankou, Ilya, et al.
Published: (2024)
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
by: Ni, Jinjie, et al.
Published: (2024)
keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
by: Vemula, Saketh Reddy, et al.
Published: (2025)
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
by: Zhao, Jingqian, et al.
Published: (2025)
GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework
by: Sansford, Hannah, et al.
Published: (2024)
Unlocking LLM Creativity in Science through Analogical Reasoning
by: Shen, Andrew, et al.
Published: (2026)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation
by: Ma, Zi-Ao, et al.
Published: (2025)
SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection
by: Allen, Bradley P., et al.
Published: (2024)
To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis
by: Bianchi, Federico, et al.
Published: (2025)