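Staff View: Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM :: Library Catalog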

Bibliographic Details
Main Authors:	Suriyakumar, Vinith M.; Sekhari, Ayush; Stempfle, Lena; Wang, Robertson; Simpson, Michael; Portnoff, Rebecca; Ghassemi, Marzyeh; Wilson, Ashia C.
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computers and Society
Online Access:	https://arxiv.org/abs/2604.25119

_version_	1866908997151358976
author	Suriyakumar, Vinith M.; Sekhari, Ayush; Stempfle, Lena; Wang, Robertson; Simpson, Michael; Portnoff, Rebecca; Ghassemi, Marzyeh; Wilson, Ashia C.
author_facet	Suriyakumar, Vinith M.; Sekhari, Ayush; Stempfle, Lena; Wang, Robertson; Simpson, Michael; Portnoff, Rebecca; Ghassemi, Marzyeh; Wilson, Ashia C.
contents	Auditing the fine-tunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, generative evaluation via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains like CSAM where generation is legally constrained. This motivates the Evaluation without Generation problem: assessing model capabilities without producing outputs. We argue that in such settings, capability must be inferred from the model's state, either its parameters or internal representations, rather than its outputs. We introduce Gaussian probing, a method that characterizes how LoRA adaptors perturb a model's internal representations by measuring responses to Gaussian latent ensembles. Unlike raw-weight baselines, Gaussian probing reliably distinguishes benign from harmful specialization without sampling outputs. We demonstrate effectiveness in high-risk domains, including detecting models specialized for child sexual abuse material (CSAM), where output-based evaluation is legally and ethically constrained. Our results show that Gaussian probing provides a scalable non-generative alternative for evaluating high-risk generative systems and remains robust to weight rescaling, a representative adversarial manipulation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_25119
institution	arXiv
publishDate	2026
record_format	arxiv
title	Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
topic	Machine Learning; Computers and Society
url	https://arxiv.org/abs/2604.25119

Similar Items