Staff View: Dialz: A Python Toolkit for Steering Vectors :: Library Catalog

Bibliographic Details
Main Authors:	Siddique, Zara; Turner, Liam D.; Espinosa-Anke, Luis
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.06262

_version_	1866910981962072064
author	Siddique, Zara; Turner, Liam D.; Espinosa-Anke, Luis
author_facet	Siddique, Zara; Turner, Liam D.; Espinosa-Anke, Luis
contents	We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable language generation. Dialz enables faster research cycles and facilitates insights into model interpretability, paving the way for safer, more transparent, and more reliable AI systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_06262
institution	arXiv
publishDate	2025
record_format	arxiv
title	Dialz: A Python Toolkit for Steering Vectors
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2505.06262
