Enregistré dans:
Détails bibliographiques
Auteurs principaux: Lip, Edward Lue Chee, Channg, Anthony, Kim, Diana, Sandoval, Aaron, Zhu, Kevin
Format: Preprint
Publié: 2025
Sujets:
Accès en ligne:https://arxiv.org/abs/2512.14745
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866911323852374016
author Lip, Edward Lue Chee
Channg, Anthony
Kim, Diana
Sandoval, Aaron
Zhu, Kevin
author_facet Lip, Edward Lue Chee
Channg, Anthony
Kim, Diana
Sandoval, Aaron
Zhu, Kevin
contents As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks $\unicode{x2013}$ but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
format Preprint
id arxiv_https___arxiv_org_abs_2512_14745
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
Lip, Edward Lue Chee
Channg, Anthony
Kim, Diana
Sandoval, Aaron
Zhu, Kevin
Cryptography and Security
Artificial Intelligence
As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks $\unicode{x2013}$ but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
title Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
topic Cryptography and Security
Artificial Intelligence
url https://arxiv.org/abs/2512.14745