| Main author: | |
|---|---|
| Type: | Digital resource |
| Language: | |
| Publication: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.19990741 |
Summary:
- I investigate whether inspecting a model's internal activations can catch poisoned MCP tool descriptions better than text scanning can. On a dataset where safe and malicious descriptions cover the same topics with heavily overlapping vocabulary, text classifiers top out at 72-79%. A simple logistic regression trained on GPT-2's internal activations reaches 97-98.5% and stays at 97% even after the effect of text length is removed. The result is statistically significant (p=0.005). Two caveats apply: the model is GPT-2, not Claude, and the data are 200 LLM-generated samples, not production data. The next step is sparse autoencoder (SAE) analysis on a real model.
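
The two steps named in the summary, a logistic-regression probe and a length control, can be illustrated with a minimal sketch. It is not the author's pipeline: the feature vectors are synthetic stand-ins for GPT-2 activations, the length control is assumed to be a per-feature linear regression on text length (the record does not say which method was used), and every name and constant below is hypothetical.

```python
import math
import random

random.seed(0)

def residualize(column, lengths):
    """Return OLS residuals of one feature column regressed on text length."""
    n = len(column)
    mean_x = sum(lengths) / n
    mean_y = sum(column) / n
    var_x = sum((x - mean_x) ** 2 for x in lengths)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, column))
    slope = cov_xy / var_x
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(lengths, column)]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den

def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Fit logistic regression with plain batch gradient descent."""
    dim, n = len(rows[0]), len(rows)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))          # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Classify as 1 (malicious) when the linear score is positive."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Synthetic stand-in data (assumption): 200 samples, label 1 = malicious.
labels = [i % 2 for i in range(200)]
lengths = [random.gauss(60 + 5 * y, 10) for y in labels]  # length mildly confounded with label
raw = [
    [0.05 * n + random.gauss(0, 1),  # feature driven by length only
     3.0 * y + random.gauss(0, 1),   # feature carrying the label signal
     random.gauss(0, 1)]             # pure noise feature
    for y, n in zip(labels, lengths)
]

# Length control: remove the linear effect of length from every feature column.
# For brevity this uses all 200 samples; a stricter setup would fit it on the training split only.
columns = [residualize([row[j] for row in raw], lengths) for j in range(3)]
features = [list(row) for row in zip(*columns)]

# Train the probe on the first 150 samples and evaluate on the remaining 50.
train_x, test_x = features[:150], features[150:]
train_y, test_y = labels[:150], labels[150:]
w, b = fit_logistic(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
max_corr = max(abs(pearson(col, lengths)) for col in columns)

print(f"held-out accuracy after length control: {accuracy:.2f}")
print(f"largest |correlation| between a feature and length: {max_corr:.1e}")
```

If the probe still separates the classes after residualization, length alone cannot explain its accuracy, which is the kind of check the summary describes. With a real model, the synthetic `raw` rows would be replaced by activations extracted from a chosen layer.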