| Main Authors: | Patel, Oam; Wang, Jason; Nayak, Nikhil Shivakumar; Srinivas, Suraj; Lakkaraju, Himabindu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2504.02144 |
| _version_ | 1866917975602233344 |
|---|---|
| author | Patel, Oam; Wang, Jason; Nayak, Nikhil Shivakumar; Srinivas, Suraj; Lakkaraju, Himabindu |
| author_facet | Patel, Oam; Wang, Jason; Nayak, Nikhil Shivakumar; Srinivas, Suraj; Lakkaraju, Himabindu |
| contents | Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_02144 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Towards Interpretable Soft Prompts |
| topic | Machine Learning; Artificial Intelligence; Computation and Language; 68T50; I.2.0; G.3 |
| url | https://arxiv.org/abs/2504.02144 |