Staff View: TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts :: Library Catalog


Bibliographic Details
Main Authors:	Kolberg, Christopher; Kreuer, Jules; Huurdeman, Jonas; Ouaari, Sofiane; Eggensperger, Katharina; Pfeifer, Nico
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.06162

_version_	1866910080617676800
author	Kolberg, Christopher; Kreuer, Jules; Huurdeman, Jonas; Ouaari, Sofiane; Eggensperger, Katharina; Pfeifer, Nico
author_facet	Kolberg, Christopher; Kreuer, Jules; Huurdeman, Jonas; Ouaari, Sofiane; Eggensperger, Katharina; Pfeifer, Nico
contents	Revealing novel insights from the relationship between molecular measurements and pathology remains a very impactful application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional tabular machine learning approaches. While prior-data fitted networks emerge as foundation models for predictive tabular data tasks, they are currently not suited to handle large feature counts (>500). Although feature reduction enables their application, it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance, while exhibiting improved robustness to noise. It seamlessly scales beyond 30,000 categorical and continuous features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results demonstrate that prior-informed adaptation is suitable to enhance the capability of foundation models for high-dimensional data. On real-world omics datasets, we show that many of the most relevant features identified by the model overlap with previous biological findings, while others propose potential starting points for future studies.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_06162
institution	arXiv
publishDate	2025
record_format	arxiv
title	TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
topic	Machine Learning
url	https://arxiv.org/abs/2510.06162
