Bibliographic Details
Main Authors: Chen, Valerie; Malhotra, Rohit; Wang, Xingyao; Michelini, Juan; Zhou, Xuhui; Soni, Aditya Bharat; Tran, Hoang H.; Smith, Calvin; Talwalkar, Ameet; Neubig, Graham
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2510.09801
author Chen, Valerie
Malhotra, Rohit
Wang, Xingyao
Michelini, Juan
Zhou, Xuhui
Soni, Aditya Bharat
Tran, Hoang H.
Smith, Calvin
Talwalkar, Ameet
Neubig, Graham
contents LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy the framework on a large-scale web platform built around the open-source software agent OpenHands, collecting in-the-wild usage data across over 15k users. We conduct case studies around how three agent design decisions -- choice of LLM backbone, planning strategy, and memory mechanisms -- impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between results comparing claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.
format Preprint
id arxiv_https___arxiv_org_abs_2510_09801
institution arXiv
publishDate 2025
record_format arxiv
title How can we assess human-agent interactions? Case studies in software agent design
topic Artificial Intelligence
url https://arxiv.org/abs/2510.09801
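The abstract describes combining human satisfaction ratings with model-generated pseudo-labels but does not state the estimator. The sketch below shows one common way to do this, a prediction-powered-inference-style mean estimate; treating PULSE as using this exact form is an assumption. The function name and its parameters are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def combined_satisfaction_estimate(human_labels, preds_on_labeled,
                                   preds_on_unlabeled, z=1.96):
    """Estimate a mean satisfaction rate from a few human ratings plus
    many model pseudo-labels (hypothetical helper, not the paper's code).

    human_labels:       ratings from users who left feedback (e.g. 0 or 1)
    preds_on_labeled:   model predictions for those same sessions
    preds_on_unlabeled: model predictions for sessions without feedback
    z:                  normal quantile for the interval (1.96 ~ 95%)
    """
    if len(human_labels) != len(preds_on_labeled):
        raise ValueError("labels and predictions on labeled data must align")
    if len(human_labels) < 2 or len(preds_on_unlabeled) < 2:
        raise ValueError("need at least two points in each sample")

    n = len(human_labels)
    big_n = len(preds_on_unlabeled)

    # How far the model is off, measured where human ratings exist.
    residuals = [y - p for y, p in zip(human_labels, preds_on_labeled)]

    # Pseudo-label mean, corrected by the average residual.
    estimate = mean(preds_on_unlabeled) + mean(residuals)

    # Standard error from the two independent samples.
    std_err = sqrt(variance(preds_on_unlabeled) / big_n
                   + variance(residuals) / n)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    est, (low, high) = combined_satisfaction_estimate(
        human_labels=[1, 0, 1, 1],
        preds_on_labeled=[0.9, 0.2, 0.8, 0.7],
        preds_on_unlabeled=[0.5, 0.7, 0.9, 0.3, 0.6],
    )
    print(f"estimate={est:.3f}, interval=({low:.3f}, {high:.3f})")
```

When the predictor tracks the human ratings closely, the residual variance is small and the interval is driven mostly by the larger pseudo-labeled sample. That is the general mechanism by which such estimators can give tighter intervals than using the human ratings alone.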