| Main Authors: | Zhang, Zuhao; Yu, Chengyue; Li, Yuante; Zhuang, Chenyi; Mo, Linjian; Li, Shuai |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.09652 |
Similar Items
CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
by: Yu, Chengyue, et al.
Published: (2024)
Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld
by: Xie, Zhitian, et al.
Published: (2025)
MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)
LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
by: Li, Hao, et al.
Published: (2026)
LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
by: Zheng, Xiangqing, et al.
Published: (2025)
VoiceBench: Benchmarking LLM-Based Voice Assistants
by: Chen, Yiming, et al.
Published: (2024)
$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
by: Zhang, Haoran, et al.
Published: (2026)
AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment
by: Sun, Nan, et al.
Published: (2024)
Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions
by: Huang, Ji, et al.
Published: (2025)
AppForge: From Assistant to Independent Developer -- Are GPTs Ready for Software Development?
by: Ran, Dezhi, et al.
Published: (2025)
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)
HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
by: Wu, Jiajun, et al.
Published: (2026)
Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors
by: Geng, Binzong, et al.
Published: (2024)
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
by: Long, Wentao, et al.
Published: (2026)
Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach
by: Gao, Ziqi, et al.
Published: (2026)
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
by: Zheng, Junhao, et al.
Published: (2025)
Alleviating LLM-based Generative Retrieval Hallucination in Alipay Search
by: Shen, Yedan, et al.
Published: (2025)
AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
by: Lu, Xiaotian, et al.
Published: (2024)
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM
by: Fu, Kelin, et al.
Published: (2024)
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
by: Yang, Haoyue, et al.
Published: (2026)
SQLCritic: Correcting Text-to-SQL Generation via Clause-wise Critic
by: Chen, Jikai, et al.
Published: (2025)
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
by: Li, Lingfeng, et al.
Published: (2026)
OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries
by: Ravichandran, Sandhanakrishnan, et al.
Published: (2025)
ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants
by: Wang, Pei, et al.
Published: (2026)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
by: Chen, Tongbo, et al.
Published: (2026)
Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree
by: Johnson, Sam, et al.
Published: (2025)
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
by: He, Kaiwen, et al.
Published: (2025)
PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
by: Wang, Wei, et al.
Published: (2026)
MiniLLM: On-Policy Distillation of Large Language Models
by: Gu, Yuxian, et al.
Published: (2023)
TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
by: Niu, Chenxu, et al.
Published: (2025)
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
by: Xu, Zelai, et al.
Published: (2025)
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
by: Dejl, Adam, et al.
Published: (2026)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
by: Li, Hanyu, et al.
Published: (2025)
TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
by: Gui, Rui, et al.
Published: (2025)
LLM App Squatting and Cloning
by: Xie, Yinglin, et al.
Published: (2024)
On the (In)Security of LLM App Stores
by: Hou, Xinyi, et al.
Published: (2024)
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
by: Guo, Zhengkang, et al.
Published: (2026)
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
by: Li, Xiangyi, et al.
Published: (2026)
WaLLM -- Insights from an LLM-Powered Chatbot deployment via WhatsApp
by: Eltigani, Hiba, et al.
Published: (2025)