Bibliographic Details
Main Author: Song, Meng
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2503.15693
author Song, Meng
author_facet Song, Meng
contents Supervised learning (SL) and reinforcement learning (RL) are both widely used to train general-purpose agents for complex tasks, yet their generalization capabilities and underlying mechanisms are not yet fully understood. In this paper, we provide a direct comparison between SL and RL in terms of zero-shot generalization. Using the Habitat visual navigation task as a testbed, we evaluate Proximal Policy Optimization (PPO) and Behavior Cloning (BC) agents across two levels of generalization: state-goal pair generalization within seen environments and generalization to unseen environments. Our experiments show that PPO consistently outperforms BC across both zero-shot settings and both performance metrics: success rate and SPL. Interestingly, even though additional optimal training data enables BC to match PPO's zero-shot performance in SPL, it still falls significantly behind in success rate. We attribute this to a fundamental difference in how models trained by these algorithms generalize: BC-trained models generalize by imitating successful trajectories, whereas TD-based RL-trained models generalize through combinatorial experience stitching, leveraging fragments of past trajectories (mostly failed ones) to construct solutions for new tasks. This allows RL to efficiently find solutions in vast state space and discover novel strategies beyond the scope of human knowledge. Besides providing empirical evidence and understanding, we also propose practical guidelines for improving the generalization capabilities of RL and SL through algorithm design.
format Preprint
id arxiv_https___arxiv_org_abs_2503_15693
institution arXiv
publishDate 2025
record_format arxiv
title Good Actions Succeed, Bad Actions Generalize: A Case Study on Why RL Generalizes Better
topic Machine Learning
url https://arxiv.org/abs/2503.15693