:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Zuhao; Yu, Chengyue; Li, Yuante; Zhuang, Chenyi; Mo, Linjian; Li, Shuai
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.09652

Similar Items

CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
by: Yu, Chengyue, et al.
Published: (2024)

Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld
by: Xie, Zhitian, et al.
Published: (2025)

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)

LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
by: Li, Hao, et al.
Published: (2026)

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
by: Zheng, Xiangqing, et al.
Published: (2025)

VoiceBench: Benchmarking LLM-Based Voice Assistants
by: Chen, Yiming, et al.
Published: (2024)

$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
by: Zhang, Haoran, et al.
Published: (2026)

AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment
by: Sun, Nan, et al.
Published: (2024)

Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions
by: Huang, Ji, et al.
Published: (2025)

AppForge: From Assistant to Independent Developer -- Are GPTs Ready for Software Development?
by: Ran, Dezhi, et al.
Published: (2025)

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)

HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
by: Wu, Jiajun, et al.
Published: (2026)

Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors
by: Geng, Binzong, et al.
Published: (2024)

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
by: Long, Wentao, et al.
Published: (2026)

Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach
by: Gao, Ziqi, et al.
Published: (2026)

LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
by: Zheng, Junhao, et al.
Published: (2025)

Alleviating LLM-based Generative Retrieval Hallucination in Alipay Search
by: Shen, Yedan, et al.
Published: (2025)

AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
by: Lu, Xiaotian, et al.
Published: (2024)

PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM
by: Fu, Kelin, et al.
Published: (2024)

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
by: Yang, Haoyue, et al.
Published: (2026)

SQLCritic: Correcting Text-to-SQL Generation via Clause-wise Critic
by: Chen, Jikai, et al.
Published: (2025)

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
by: Li, Lingfeng, et al.
Published: (2026)

OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries
by: Ravichandran, Sandhanakrishnan, et al.
Published: (2025)

ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants
by: Wang, Pei, et al.
Published: (2026)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
by: Chen, Tongbo, et al.
Published: (2026)

Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree
by: Johnson, Sam, et al.
Published: (2025)

Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
by: He, Kaiwen, et al.
Published: (2025)

PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
by: Wang, Wei, et al.
Published: (2026)

MiniLLM: On-Policy Distillation of Large Language Models
by: Gu, Yuxian, et al.
Published: (2023)

TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
by: Niu, Chenxu, et al.
Published: (2025)

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
by: Xu, Zelai, et al.
Published: (2025)

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
by: Dejl, Adam, et al.
Published: (2026)

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
by: Li, Hanyu, et al.
Published: (2025)

TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
by: Gui, Rui, et al.
Published: (2025)

LLM App Squatting and Cloning
by: Xie, Yinglin, et al.
Published: (2024)

On the (In)Security of LLM App Stores
by: Hou, Xinyi, et al.
Published: (2024)

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
by: Guo, Zhengkang, et al.
Published: (2026)

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
by: Li, Xiangyi, et al.
Published: (2026)

WaLLM -- Insights from an LLM-Powered Chatbot deployment via WhatsApp
by: Eltigani, Hiba, et al.
Published: (2025)