MARC View: Factor(U,T): Controlling Untrusted AI by Monitoring their Plans :: Library Catalog

Bibliographic Details
Main Authors:	Lip, Edward Lue Chee; Channg, Anthony; Kim, Diana; Sandoval, Aaron; Zhu, Kevin
Format:	Preprint
Published:	2025
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.14745

_version_	1866911323852374016
author	Lip, Edward Lue Chee; Channg, Anthony; Kim, Diana; Sandoval, Aaron; Zhu, Kevin
author_facet	Lip, Edward Lue Chee; Channg, Anthony; Kim, Diana; Sandoval, Aaron; Zhu, Kevin
contents	As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks, but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_14745
institution	arXiv
publishDate	2025
record_format	arxiv
title	Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2512.14745
