Staff View: Representation Engineering for Large-Language Models: Survey and Research Challenges :: Library Catalog

Bibliographic Details
Main Authors:	Bartoszcze, Lukasz; Munshi, Sarthak; Sukidi, Bryan; Yen, Jennifer; Yang, Zejia; Williams-King, David; Le, Linh; Asuzu, Kosi; Maple, Carsten
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.17601

_version_	1866910843167309824
author	Bartoszcze, Lukasz; Munshi, Sarthak; Sukidi, Bryan; Yen, Jennifer; Yang, Zejia; Williams-King, David; Le, Linh; Asuzu, Kosi; Maple, Carsten
author_facet	Bartoszcze, Lukasz; Munshi, Sarthak; Sukidi, Bryan; Yen, Jennifer; Yang, Zejia; Williams-King, David; Le, Linh; Asuzu, Kosi; Maple, Carsten
contents	Large-language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as performance decrease, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_17601
institution	arXiv
publishDate	2025
record_format	arxiv
title	Representation Engineering for Large-Language Models: Survey and Research Challenges
topic	Artificial Intelligence
url	https://arxiv.org/abs/2502.17601

Similar Items