:: Library Catalog

Bibliographic Details
Main Authors:	Xue, Tianci; Liao, Zeyi; Shi, Tianneng; Wang, Zilu; Zhang, Kai; Song, Dawn; Su, Yu; Sun, Huan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.10356

Similar Items

An Illusion of Progress? Assessing the Current State of Web Agents
by: Xue, Tianci, et al.
Published: (2025)

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
by: Liao, Zeyi, et al.
Published: (2025)

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
by: Cai, Will, et al.
Published: (2025)

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
by: Liao, Zeyi, et al.
Published: (2024)

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
by: Mo, Lingbo, et al.
Published: (2024)

WebGuard: Building a Generalizable Guardrail for Web Agents
by: Zheng, Boyuan, et al.
Published: (2025)

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
by: Jones, Jaylen, et al.
Published: (2026)

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
by: Xie, Jingxu, et al.
Published: (2025)

Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)

Can LLMs Ask Good Questions?
by: Zhang, Yueheng, et al.
Published: (2025)

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
by: Xie, Jian, et al.
Published: (2026)

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
by: Kumar, Vishal, et al.
Published: (2024)

AdvAgent: Controllable Blackbox Red-teaming on Web Agents
by: Xu, Chejian, et al.
Published: (2024)

AttributionBench: How Hard is Automatic Attribution Evaluation?
by: Li, Yifei, et al.
Published: (2024)

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
by: Shu, Yiheng, et al.
Published: (2026)

Agent Learning via Early Experience
by: Zhang, Kai, et al.
Published: (2025)

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
by: Gou, Boyu, et al.
Published: (2025)

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
by: Wang, Zhun, et al.
Published: (2025)

SafePred: A Predictive Guardrail for Computer-Using Agents via World Models
by: Chen, Yurun, et al.
Published: (2026)

EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
by: Liao, Zeyi, et al.
Published: (2024)

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
by: Chen, Ziru, et al.
Published: (2024)

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
by: Ning, Yuting, et al.
Published: (2026)

Progent: Securing AI Agents with Privilege Control
by: Shi, Tianneng, et al.
Published: (2025)

CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
by: Liu, Jiayu, et al.
Published: (2025)

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
by: Li, Xu, et al.
Published: (2026)

Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)

Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents
by: Wu, Bin, et al.
Published: (2026)

Multi-Agent Computer Use
by: Koh, Jing Yu, et al.
Published: (2026)

DeServe: Towards Affordable Offline LLM Inference via Decentralization
by: Wu, Linyu, et al.
Published: (2025)

DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction
by: Inoue, Yoshitaka, et al.
Published: (2024)

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)

Adaptive Vision-Language Model Routing for Computer Use Agents
by: Liu, Xunzhuo, et al.
Published: (2026)

HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology
by: Zhou, Shuang, et al.
Published: (2026)

MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
by: Kong, Quyu, et al.
Published: (2025)

Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation
by: Su, Xuanbo, et al.
Published: (2025)

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
by: Zhang, Kai, et al.
Published: (2023)

AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization
by: Liao, Yusheng, et al.
Published: (2026)

MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
by: Shi, Yucheng, et al.
Published: (2025)

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
by: Zhang, Jiayi, et al.
Published: (2025)