

Bibliographic Details
Main Author:	Leung, Wan Sheng
Format:	Digital resource
Language:
Published:	Zenodo 2026
Subjects:	activation probes; MCP security; tool poisoning; model internals; AI safety; mechanistic interpretability
Online Access:	https://doi.org/10.5281/zenodo.19990741

Summary:

I investigate whether inspecting a model's internal activations can detect poisoned MCP tool descriptions better than text scanning can. On a dataset in which safe and malicious descriptions cover the same topics with heavily overlapping vocabulary, text classifiers top out at 72-79%. A simple logistic regression probe trained on GPT-2's internal activations reaches 97-98.5% and stays at 97% even after the effect of text length is removed. The result is statistically significant (p=0.005). Two caveats apply: the model is GPT-2, not Claude, and the data are 200 LLM-generated samples, not production data. The next step is SAE (sparse autoencoder) analysis on a real model.
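
As an illustration only, the probing idea in the summary can be sketched in a few lines of standard-library Python. Everything here is an assumption and not taken from the record: the vectors are synthetic stand-ins for GPT-2 activations, "removing the effect of text length" is interpreted as regressing length out of each feature by least squares, and the names `residualize`, `train_probe`, and `predict` are hypothetical. The numbers it prints say nothing about the reported results.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def residualize(features, lengths):
    """Subtract each column's mean and its least-squares fit on length."""
    n = len(lengths)
    mean_len = sum(lengths) / n
    centered_len = [l - mean_len for l in lengths]
    var_len = sum(c * c for c in centered_len)
    out = [[0.0] * len(features[0]) for _ in range(n)]
    for d in range(len(features[0])):
        col = [row[d] for row in features]
        mean_col = sum(col) / n
        slope = sum(cl * (c - mean_col) for cl, c in zip(centered_len, col)) / var_len
        for i in range(n):
            out[i][d] = col[i] - mean_col - slope * centered_len[i]
    return out

def train_probe(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted by full-batch gradient descent."""
    dims, n = len(X[0]), len(X)
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dims, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(dims):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    # Class 1 (assumed "malicious") when the linear score is positive.
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Synthetic stand-in data (NOT real activations): 200 samples, 8 dimensions.
N, DIMS = 200, 8
labels = [i % 2 for i in range(N)]                     # 0 = safe, 1 = malicious
lengths = [50 + 5 * y + random.gauss(0, 10) for y in labels]
features = []
for y, length in zip(labels, lengths):
    row = [random.gauss(0, 1) for _ in range(DIMS)]
    for d in range(3):
        row[d] = 2.0 * y + random.gauss(0, 0.5)        # label-carrying dimensions
    row[3] = 0.05 * length + random.gauss(0, 0.5)      # length-driven dimension
    features.append(row)

# Length removal uses no labels, so it is applied to all samples before the split.
X_resid = residualize(features, lengths)
w, b = train_probe(X_resid[:150], labels[:150])
test_acc = sum(predict(w, b, x) == y
               for x, y in zip(X_resid[150:], labels[150:])) / 50
print(f"held-out accuracy after length removal: {test_acc:.2f}")
```

With real data, the synthetic `features` would be replaced by activation vectors extracted from the model, and `lengths` by the token or character count of each tool description; a library implementation of logistic regression would normally replace the hand-written loop.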
