Staff View: Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations :: Library Catalog

Bibliographic Details
Main Authors:	Kim, Su-Hyeon; Han, Yo-Sub
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.09875

_version_	1866909031037140992
author	Kim, Su-Hyeon; Han, Yo-Sub
author_facet	Kim, Su-Hyeon; Han, Yo-Sub
contents	Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_09875
institution	arXiv
publishDate	2026
record_format	arxiv
title	Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.09875
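
Illustrative note: the anchor-projection pipeline summarized in the contents field (project source directions into ACS, average them into a canonical direction, reconstruct in a target model from anchor activations alone) might be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not specify the projection or reconstruction operators, so the sketch assumes ACS coordinates are dot products with anchor activations and that reconstruction uses a pseudoinverse. The function names, the synthetic "models", and all dimensions are hypothetical.

```python
import numpy as np

def to_acs(anchor_acts, direction):
    """Assumed ACS map: one coordinate per anchor (dot product), unit-normalized."""
    coords = anchor_acts @ direction
    return coords / np.linalg.norm(coords)

def canonical_direction(acs_dirs):
    """Average the source directions in ACS and renormalize."""
    mean = np.mean(acs_dirs, axis=0)
    return mean / np.linalg.norm(mean)

def reconstruct(anchor_acts, canonical):
    """Assumed reconstruction: minimum-norm native direction whose anchor
    projections match the canonical ACS direction (pseudoinverse).
    An explicit rcond keeps numerically zero singular values from being inverted."""
    native = np.linalg.pinv(anchor_acts, rcond=1e-8) @ canonical
    return native / np.linalg.norm(native)

def demo(seed=0):
    """Synthetic check: toy 'models' are random linear views of one shared
    latent space, so a shared behavioral direction exists by construction."""
    rng = np.random.default_rng(seed)
    n_anchors, latent_dim = 64, 8
    latent_anchors = rng.normal(size=(n_anchors, latent_dim))
    latent_dir = rng.normal(size=latent_dim)

    def make_model(hidden_dim):
        # Hypothetical model: hidden state = latent state @ mixing matrix.
        mixing = rng.normal(size=(latent_dim, hidden_dim))
        anchor_acts = latent_anchors @ mixing
        true_dir = np.linalg.pinv(mixing) @ latent_dir
        return anchor_acts, true_dir

    sources = [make_model(d) for d in (32, 48)]   # two source models
    target_acts, target_true = make_model(40)     # held-out target model

    canon = canonical_direction([to_acs(a, v) for a, v in sources])
    rebuilt = reconstruct(target_acts, canon)
    # Cosine similarity between the reconstructed and true target directions.
    return float(rebuilt @ target_true / np.linalg.norm(target_true))
```

In this toy setting the source models share their latent structure exactly, so the reconstructed direction should align closely with the target's true direction. Real model families will deviate from this idealization, which is what the reported detection and steering evaluations measure.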

Similar Items