Bibliographic Details
Main Authors: Harwood, Alfred, Faustino, Jose, Altair, Alex
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.12963
author Harwood, Alfred
Faustino, Jose
Altair, Alex
contents An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.
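A rough way to read the $n \log m$ figure: a deterministic policy picks one of $m$ actions in each of $n$ states, so there are $m^n$ such policies, and a uniform distribution over them has entropy $\log(m^n) = n \log m$. The sketch below only checks that counting arithmetic. It assumes base-2 logarithms, the function names are hypothetical, and it is not the paper's proof, which concerns the mutual information between the environment and the optimal policy.

```python
import math
from itertools import product


def policy_information_bits(n_states: int, n_actions: int) -> float:
    """Return n * log2(m), the entropy in bits of a uniform
    distribution over all deterministic policies (assumed base 2)."""
    return n_states * math.log2(n_actions)


def count_deterministic_policies(n_states: int, n_actions: int) -> int:
    """Count deterministic policies by brute-force enumeration:
    each policy is a tuple assigning one action to every state."""
    return sum(1 for _ in product(range(n_actions), repeat=n_states))


if __name__ == "__main__":
    n, m = 3, 4  # illustrative sizes, not taken from the paper
    count = count_deterministic_policies(n, m)
    print(count, math.log2(count), policy_information_bits(n, m))
```

For small $n$ and $m$ the enumerated count equals $m^n$, and its base-2 logarithm matches $n \log_2 m$.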
format Preprint
id arxiv_https___arxiv_org_abs_2602_12963
institution arXiv
publishDate 2026
record_format arxiv
title Information-theoretic analysis of world models in optimal reward maximizers
topic Artificial Intelligence
url https://arxiv.org/abs/2602.12963