| Main Authors: | Jia, Qi, Yue, Xiang, Zheng, Tianyu, Huang, Jie, Lin, Bill Yuchen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.07641 |
Similar Items
ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
by: Jia, Qi, et al.
Published: (2024)
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
by: Huang, Chengsong, et al.
Published: (2023)
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
by: Li, Lei, et al.
Published: (2024)
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
by: Zheng, Tianyu, et al.
Published: (2024)
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
by: Hu, Xiaomeng, et al.
Published: (2026)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
by: Liu, Junpeng, et al.
Published: (2024)
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
by: Leng, Jixuan, et al.
Published: (2025)
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
by: Song, Yifan, et al.
Published: (2024)
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
by: Wang, Zilong, et al.
Published: (2024)
Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
by: Yue, Shengbin, et al.
Published: (2025)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
by: Guo, Jiawei, et al.
Published: (2024)
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
by: Lin, Bill Yuchen, et al.
Published: (2024)
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
by: Lin, Liuhao, et al.
Published: (2025)
SafetyBench: Evaluating the Safety of Large Language Models
by: Zhang, Zhexin, et al.
Published: (2023)
CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation
by: Wang, Danqing, et al.
Published: (2026)
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)
CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
by: Feng, Jie, et al.
Published: (2024)
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
by: Lu, Yujie, et al.
Published: (2024)
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
by: Jiang, Dongfu, et al.
Published: (2023)
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
by: Kim, Seungone, et al.
Published: (2024)
MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching
by: Yue, Liang, et al.
Published: (2025)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
DuetSim: Building User Simulator with Dual Large Language Models for Task-Oriented Dialogues
by: Luo, Xiang, et al.
Published: (2024)
CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity
by: Hou, Zhaoyi Joey, et al.
Published: (2025)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
by: Tang, Xiangru, et al.
Published: (2023)
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
by: Song, Yifan, et al.
Published: (2024)
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
by: Kim, Seungone, et al.
Published: (2024)
From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents
by: Mou, Xinyi, et al.
Published: (2024)
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
by: Jiang, Fengqing, et al.
Published: (2025)
On Memorization of Large Language Models in Logical Reasoning
by: Xie, Chulin, et al.
Published: (2024)
Vision Language Models Cannot Plan, but Can They Formalize?
by: He, Muyu, et al.
Published: (2025)
RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
by: Yan, Jianhao, et al.
Published: (2024)
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
by: Hu, Tiancheng, et al.
Published: (2025)
Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models
by: Li, Haoran, et al.
Published: (2024)
Representation Bias in Political Sample Simulations with Large Language Models
by: Qi, Weihong, et al.
Published: (2024)
ElectionSim: Massive Population Election Simulation Powered by Large Language Model Driven Agents
by: Zhang, Xinnong, et al.
Published: (2024)
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
by: Qian, Cheng, et al.
Published: (2026)
Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models
by: Nakajima, Kumiko, et al.
Published: (2026)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models
by: Samarinas, Chris, et al.
Published: (2024)