Bibliographic Details
Main Authors: Kolberg, Christopher, Kreuer, Jules, Huurdeman, Jonas, Ouaari, Sofiane, Eggensperger, Katharina, Pfeifer, Nico
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2510.06162
_version_ 1866910080617676800
author Kolberg, Christopher
Kreuer, Jules
Huurdeman, Jonas
Ouaari, Sofiane
Eggensperger, Katharina
Pfeifer, Nico
contents Revealing novel insights from the relationship between molecular measurements and pathology remains a high-impact application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional tabular machine learning approaches. Although prior-data fitted networks are emerging as foundation models for predictive tasks on tabular data, they currently cannot handle large feature counts (>500). Feature reduction enables their application, but it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance while exhibiting improved robustness to noise. It scales beyond 30,000 categorical and continuous features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results demonstrate that prior-informed adaptation is a suitable way to enhance the capability of foundation models for high-dimensional data. On real-world omics datasets, we show that many of the most relevant features identified by the model overlap with previous biological findings, while others suggest potential starting points for future studies.
format Preprint
id arxiv_https___arxiv_org_abs_2510_06162
institution arXiv
publishDate 2025
record_format arxiv
title TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
topic Machine Learning
url https://arxiv.org/abs/2510.06162