Bibliographic Details
Main Authors: Bartoszcze, Lukasz, Munshi, Sarthak, Sukidi, Bryan, Yen, Jennifer, Yang, Zejia, Williams-King, David, Le, Linh, Asuzu, Kosi, Maple, Carsten
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.17601
author Bartoszcze, Lukasz
Munshi, Sarthak
Sukidi, Bryan
Yen, Jennifer
Yang, Zejia
Williams-King, David
Le, Linh
Asuzu, Kosi
Maple, Carsten
contents Large-language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as performance decrease, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2502_17601
institution arXiv
publishDate 2025
record_format arxiv
title Representation Engineering for Large-Language Models: Survey and Research Challenges
topic Artificial Intelligence
url https://arxiv.org/abs/2502.17601