Bibliographic Details
Main Author: Eriksson, Edward
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.08769
_version_ 1866911656249917440
author Eriksson, Edward
author_facet Eriksson, Edward
contents Given $n$ i.i.d. samples from an unknown discrete distribution over an unknown set, the unseen species problem is to predict how many new outcomes would be observed in $m$ additional samples. For small $m$ we show that the Good-Toulmin estimator is the unique estimator that both respects the symmetries of the problem and has a non-trivial rate. We resolve the open problem of constructing principled prediction intervals for it. For intermediate $m$ we propose a new estimator with a vastly improved worst-case MSE compared to competing methods, and we expect that our method can be applied to other species sampling problems. For large $m$ we follow previous authors in assuming a power-law tail and show that a simple estimator achieves the same rate as, and better empirical performance than, a recent sophisticated method. Moreover, we give pre-asymptotic guarantees. We extend the rate guarantees to incidence data, without further independence assumptions, provided that the sets are of bounded size. In the process we use Stein's method to obtain concentration inequalities for some natural functionals of sequences of i.i.d. discrete-set-valued random variables, which are of independent interest.
format Preprint
id arxiv_https___arxiv_org_abs_2602_08769
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle The Unseen Species Problem Revisited
Eriksson, Edward
Statistics Theory
62G05, 62G15
title The Unseen Species Problem Revisited
topic Statistics Theory
62G05, 62G15
url https://arxiv.org/abs/2602.08769
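
The Good-Toulmin estimator named in the abstract can be illustrated with a minimal sketch. This is the classical textbook form, U = -sum_{i>=1} (-t)^i * Phi_i with t = m/n and Phi_i the number of species seen exactly i times; it is not code from the preprint, and the function name `good_toulmin` and the toy sample are hypothetical. The plain series is generally only well behaved for t <= 1, which matches the "small m" regime the abstract refers to.

```python
from collections import Counter


def good_toulmin(samples, m):
    """Classical Good-Toulmin estimate of the number of new species
    expected in m additional samples, given the observed samples.

    Illustrative sketch only; names are hypothetical.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one observed sample")

    t = m / n  # extrapolation ratio; the series is reliable for t <= 1

    # species -> number of times it was observed
    species_counts = Counter(samples)
    # i -> Phi_i, the number of species observed exactly i times
    prevalences = Counter(species_counts.values())

    # U = -sum_{i>=1} (-t)^i * Phi_i
    return -sum((-t) ** i * phi for i, phi in prevalences.items())


# Toy data: "b" and "d" are seen once, "a" twice, "c" three times.
samples = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_toulmin(samples, m=7)  # t = 1: 2 - 1 + 1
print(estimate)
```

With t = 1 the estimate reduces to the alternating sum Phi_1 - Phi_2 + Phi_3 - ..., so singletons push the prediction up and doubletons pull it down.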