Staff View: Reverse Distillation: Consistently Scaling Protein Language Model Representations :: Library Catalog

Bibliographic Details
Main Authors:	Catrina, Darius; Bepler, Christian; Sledzieski, Samuel; Singh, Rohit
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Biomolecules
Online Access:	https://arxiv.org/abs/2603.07710

_version_	1866918379469668352
author	Catrina, Darius; Bepler, Christian; Sledzieski, Samuel; Singh, Rohit
author_facet	Catrina, Darius; Bepler, Christian; Sledzieski, Samuel; Singh, Rohit
contents	Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse-distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_07710
institution	arXiv
publishDate	2026
record_format	arxiv
title	Reverse Distillation: Consistently Scaling Protein Language Model Representations
topic	Machine Learning; Biomolecules
url	https://arxiv.org/abs/2603.07710