| Main Authors: | Xu, Shuhang, Deng, Weijian, Zhou, Yixuan, Zhong, Fangwei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.17512 |
Similar Items
CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games
by: Xu, Shuhang, et al.
Published: (2025)
Reinforced Context Order Recovery for Adaptive Reasoning and Planning
by: Ma, Long, et al.
Published: (2025)
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
by: Yang, Wenkai, et al.
Published: (2024)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents
by: Lei, Yuzhen, et al.
Published: (2025)
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)
Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
by: Liu, Xiaoze, et al.
Published: (2026)
How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
by: Choi, Hyeong Kyu, et al.
Published: (2025)
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)
DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
by: Tang, Wenjie, et al.
Published: (2025)
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
by: Lu, Yi, et al.
Published: (2025)
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
by: Zhou, Zhihan, et al.
Published: (2023)
Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena
by: Chen, Jiangjie, et al.
Published: (2023)
Chain-of-Thought Tokens are Computer Program Variables
by: Zhu, Fangwei, et al.
Published: (2025)
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)
MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction
by: Li, Yixuan, et al.
Published: (2026)
Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification
by: Demir, M. Mikail, et al.
Published: (2026)
How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve
by: Reda, Waleed, et al.
Published: (2025)
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
by: Gioacchini, Luca, et al.
Published: (2024)
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
by: Shao, Shuai, et al.
Published: (2025)
ConceptPsy:A Benchmark Suite with Conceptual Comprehensiveness in Psychology
by: Zhang, Junlei, et al.
Published: (2023)
KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework
by: Li, Cheng, et al.
Published: (2025)
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
by: Wu, Yanan, et al.
Published: (2024)
Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)
Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics
by: Zhou, Yuan, et al.
Published: (2024)
Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
by: Pei, Jiahuan, et al.
Published: (2025)
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
by: Shi, Zhichao, et al.
Published: (2025)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)
How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
by: Wu, Yusen, et al.
Published: (2025)
Robotouille: An Asynchronous Planning Benchmark for LLM Agents
by: Gonzalez-Pumariega, Gonzalo, et al.
Published: (2025)
Benchmarking and Improving LLM Robustness for Personalized Generation
by: Okite, Chimaobi, et al.
Published: (2025)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
What Really is Commonsense Knowledge?
by: Do, Quyet V., et al.
Published: (2024)
DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs
by: Hu, Ziyu, et al.
Published: (2025)
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
by: Yang, Bufang, et al.
Published: (2025)
MIRIX: Multi-Agent Memory System for LLM-Based Agents
by: Wang, Yu, et al.
Published: (2025)
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
by: Liu, Fan, et al.
Published: (2024)
Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
by: Liu, Yuliang, et al.
Published: (2026)
ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)
Exemplar-Guided Planing: Enhanced LLM Agent for KGQA
by: Xu, Jingao, et al.
Published: (2025)