| Main Authors: | Liang, Yuanzhi, Zhu, Linchao, Yang, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2401.06509 |
Similar Items
FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models
by: Yue, Xihang, et al.
Published: (2024)
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
by: Lu, Yu, et al.
Published: (2024)
DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
by: Liang, Chumeng, et al.
Published: (2025)
SocialEval: Evaluating Social Intelligence of Large Language Models
by: Zhou, Jinfeng, et al.
Published: (2025)
FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback
by: Chu, Seongyeub, et al.
Published: (2026)
RepEval: Effective Text Evaluation with LLM Representation
by: Sheng, Shuqian, et al.
Published: (2024)
AlphaEval: Evaluating Agents in Production
by: Lu, Pengrui, et al.
Published: (2026)
Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
by: Zhao, Shuai, et al.
Published: (2025)
SpecEval: Evaluating Model Adherence to Behavior Specifications
by: Ahmed, Ahmed, et al.
Published: (2025)
BotEval: Facilitating Interactive Human Evaluation
by: Cho, Hyundong, et al.
Published: (2024)
One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
by: Shen, Chengyu, et al.
Published: (2026)
Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training
by: Zhao, Shuai, et al.
Published: (2024)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
by: Cheng, Zhili, et al.
Published: (2025)
HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
by: Zhu, Zhiying, et al.
Published: (2024)
Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
by: Wang, Jiashuo, et al.
Published: (2025)
PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
by: Tang, Yixuan, et al.
Published: (2025)
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
by: Wadhwa, Manya, et al.
Published: (2025)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents
by: Gao, Wenbo, et al.
Published: (2026)
Evaluating Cultural and Social Awareness of LLM Web Agents
by: Qiu, Haoyi, et al.
Published: (2024)
SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
by: Zhou, Yixi, et al.
Published: (2026)
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
by: Zhou, Xuhui, et al.
Published: (2023)
CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation
by: DeLorenzo, Matthew, et al.
Published: (2024)
RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
by: Wei, Tianjun, et al.
Published: (2025)
EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
by: Dejl, Adam, et al.
Published: (2026)
PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
by: Zhou, Lingfeng, et al.
Published: (2025)
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
by: Mou, Xinyi, et al.
Published: (2024)
PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games
by: Zhu, Qinglin, et al.
Published: (2024)
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
by: Khatun, Aisha, et al.
Published: (2024)
Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction
by: Kamoi, Ryo, et al.
Published: (2026)
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding
by: Azime, Israel Abebe, et al.
Published: (2024)
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
by: Lior, Gili, et al.
Published: (2025)
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches
by: Zhou, Yuhang, et al.
Published: (2025)
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
by: Tu, Quan, et al.
Published: (2024)
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
by: Zhang, Xiechi, et al.
Published: (2025)
CiteEval: Principle-Driven Citation Evaluation for Source Attribution
by: Xu, Yumo, et al.
Published: (2025)
MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
by: Grolleau, François, et al.
Published: (2025)
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
by: Khan, Haidar, et al.
Published: (2025)
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
by: Zhao, Jiahao, et al.
Published: (2025)
ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
by: Liu, Tianjian, et al.
Published: (2025)