Bibliographic Details
Main Authors: Lin, Xiao; Acharya, Manoj; Roy, Anirban; Jha, Susmit
Format: Preprint
Published: 2025
Subjects: Machine Learning; Computation and Language
Online Access: https://arxiv.org/abs/2503.20228
author Lin, Xiao
Acharya, Manoj
Roy, Anirban
Jha, Susmit
contents Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM specific, as different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that synergizes model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs without alignment data. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. This generator is designed to be permutation symmetric to generalize across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to learn with large-scale LLMs with minimal computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.
format Preprint
id arxiv_https___arxiv_org_abs_2503_20228
institution arXiv
publishDate 2025
record_format arxiv
title TeleLoRA: Teleporting Model-Specific Alignment Across LLMs
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2503.20228