Bibliographic Details
Main Authors: Glaser, Pierre; Paul, Steffanie; Hummer, Alissa M.; Deane, Charlotte M.; Marks, Debora S.; Amin, Alan N.
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2510.15601
author Glaser, Pierre
Paul, Steffanie
Hummer, Alissa M.
Deane, Charlotte M.
Marks, Debora S.
Amin, Alan N.
contents We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
format Preprint
id arxiv_https___arxiv_org_abs_2510_15601
institution arXiv
publishDate 2025
record_format arxiv
title Kernel-Based Evaluation of Conditional Biological Sequence Models
topic Machine Learning
url https://arxiv.org/abs/2510.15601