| Main Authors: | Xia, Yinghui, Chen, Yuyan, Shi, Tianyu, Wang, Jun, Yang, Jinsong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.04712 |
Similar Items
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval
by: Wang, Jiexin, et al.
Published: (2024)
CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
by: Liang, Xuechen, et al.
Published: (2024)
ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
by: Kammakomati, Mehant, et al.
Published: (2024)
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models
by: Guo, Xin, et al.
Published: (2023)
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
by: Chen, Yeheng, et al.
Published: (2026)
VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation
by: Vijayaraghavan, Prashanth, et al.
Published: (2024)
LanguaShrink: Reducing Token Overhead with Psycholinguistics
by: Liang, Xuechen, et al.
Published: (2024)
FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning
by: Dou, Shaoyu, et al.
Published: (2025)
Recent Advancement of Emotion Cognition in Large Language Models
by: Chen, Yuyan, et al.
Published: (2024)
Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System
by: Feng, Yuheng, et al.
Published: (2024)
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
by: Raihan, Nishat, et al.
Published: (2024)
MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic
by: Zhou, Yuyan, et al.
Published: (2024)
R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)
SocialEval: Evaluating Social Intelligence of Large Language Models
by: Zhou, Jinfeng, et al.
Published: (2025)
AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
by: Afzal, Anum, et al.
Published: (2024)
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models
by: Shi, Haoxiang, et al.
Published: (2024)
Improving Natural Language Capability of Code Large Language Model
by: Li, Wei, et al.
Published: (2024)
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
by: Yang, Pei, et al.
Published: (2026)
SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
by: Zhang, Taolin, et al.
Published: (2026)
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
by: Zhao, Yuwei, et al.
Published: (2024)
LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models
by: Sun, Chongyan, et al.
Published: (2024)
Leveraging Print Debugging to Improve Code Generation in Large Language Models
by: Hu, Xueyu, et al.
Published: (2024)
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
by: Yu, Zhaojian, et al.
Published: (2024)
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
by: Ye, Xi, et al.
Published: (2025)
AI Safety in Generative AI Large Language Models: A Survey
by: Chua, Jaymari, et al.
Published: (2024)
Improving Retrieval Augmented Language Model with Self-Reasoning
by: Xia, Yuan, et al.
Published: (2024)
Self-evolving Agents with reflective and memory-augmented abilities
by: Liang, Xuechen, et al.
Published: (2024)
OJBench: A Competition Level Code Benchmark For Large Language Models
by: Wang, Zhexu, et al.
Published: (2025)
ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
by: Zhao, Haiquan, et al.
Published: (2024)
CriticEval: Evaluating Large Language Model as Critic
by: Lan, Tian, et al.
Published: (2024)
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
by: Yang, Haote, et al.
Published: (2025)
VisEval: A Benchmark for Data Visualization in the Era of Large Language Models
by: Chen, Nan, et al.
Published: (2024)
Embracing Ambiguity: Improving Similarity-oriented Tasks with Contextual Synonym Knowledge
by: Li, Yangning, et al.
Published: (2022)
LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis
by: Cui, Tianyu, et al.
Published: (2024)
MdEval: Massively Multilingual Code Debugging
by: Liu, Shukai, et al.
Published: (2024)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
by: Huang, Siming, et al.
Published: (2024)
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
by: Yuan, Xiaohan, et al.
Published: (2024)
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning
by: Hu, Tianxiang, et al.
Published: (2024)
CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
by: Shi, Ling, et al.
Published: (2024)
UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
by: Wu, Jiajun, et al.
Published: (2025)