Staff View: How to Train Your LLM Web Agent: A Statistical Diagnosis :: Library Catalog

Bibliographic Details
Main Authors:	Vattikonda, Dheeraj; Ravichandran, Santhoshi; Penaloza, Emiliano; Nekoei, Hadi; Thakkar, Megh; de Chezelles, Thibault Le Sellier; Gontier, Nicolas; Muñoz-Mármol, Miguel; Shayegan, Sahar Omidi; Raimondo, Stefania; Liu, Xue; Drouin, Alexandre; Charlin, Laurent; Piché, Alexandre; Lacoste, Alexandre; Caccia, Massimo
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2507.04103

_version_	1866915796701151232
author	Vattikonda, Dheeraj; Ravichandran, Santhoshi; Penaloza, Emiliano; Nekoei, Hadi; Thakkar, Megh; de Chezelles, Thibault Le Sellier; Gontier, Nicolas; Muñoz-Mármol, Miguel; Shayegan, Sahar Omidi; Raimondo, Stefania; Liu, Xue; Drouin, Alexandre; Charlin, Laurent; Piché, Alexandre; Lacoste, Alexandre; Caccia, Massimo
author_facet	Vattikonda, Dheeraj; Ravichandran, Santhoshi; Penaloza, Emiliano; Nekoei, Hadi; Thakkar, Megh; de Chezelles, Thibault Le Sellier; Gontier, Nicolas; Muñoz-Mármol, Miguel; Shayegan, Sahar Omidi; Raimondo, Stefania; Liu, Xue; Drouin, Alexandre; Charlin, Laurent; Piché, Alexandre; Lacoste, Alexandre; Caccia, Massimo
contents	LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_04103
institution	arXiv
publishDate	2025
record_format	arxiv
title	How to Train Your LLM Web Agent: A Statistical Diagnosis
topic	Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2507.04103

Similar Items