Bibliographic Details
Main author: Leung, Wan Sheng
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects:
Online access: https://doi.org/10.5281/zenodo.19990741
Summary:
  • I investigate whether inspecting a model's internal activations can catch poisoned MCP tool descriptions better than text scanning. On a dataset where safe and malicious descriptions cover the same topics with heavily overlapping vocabulary, text classifiers top out at 72-79%. A simple logistic regression trained on GPT-2's internal activations reaches 97-98.5% and stays at 97% even after removing the effect of text length. The result is statistically significant (p=0.005). Two caveats apply: the model is GPT-2, not Claude, and the data are 200 LLM-generated samples, not production data. The next step is sparse autoencoder (SAE) analysis on a real model.