Bibliographic Details
Main Authors: Kim, Su-Hyeon, Han, Yo-Sub
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.09875
_version_ 1866909031037140992
author Kim, Su-Hyeon
Han, Yo-Sub
author_facet Kim, Su-Hyeon
Han, Yo-Sub
contents Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
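The abstract describes the pipeline only at a high level (project into ACS, average, reconstruct in the target's native space). The following is a minimal NumPy sketch of one way such a pipeline could work, not the authors' implementation. It assumes that a direction's ACS coordinates are its inner products with mean-centered anchor activations, and that reconstruction is a minimum-norm least-squares solve. The function names (`to_acs`, `canonical_direction`, `reconstruct`, `demo`) and the synthetic data are hypothetical.

```python
import numpy as np


def _center(anchor_acts):
    # Subtract the mean anchor activation (an assumption of this sketch).
    return anchor_acts - anchor_acts.mean(axis=0, keepdims=True)


def to_acs(anchor_acts, direction):
    """Map a native-space direction (d_model,) to unit-norm anchor coordinates (n_anchors,)."""
    coords = _center(anchor_acts) @ direction
    return coords / np.linalg.norm(coords)


def canonical_direction(acs_dirs):
    """Average per-model ACS directions and renormalize."""
    mean = np.mean(acs_dirs, axis=0)
    return mean / np.linalg.norm(mean)


def reconstruct(anchor_acts, canonical):
    """Find a native-space direction whose anchor coordinates best match `canonical`.

    Uses the minimum-norm least-squares solution; needs only the target's
    anchor activations, not a direction extracted from the target itself.
    """
    w, *_ = np.linalg.lstsq(_center(anchor_acts), canonical, rcond=1e-10)
    return w / np.linalg.norm(w)


def demo(seed=0):
    """Synthetic check: models share a latent space but differ in hidden size."""
    rng = np.random.default_rng(seed)
    n_anchors, k = 32, 8
    latent = rng.normal(size=(n_anchors, k))  # shared latent anchor features
    axis = rng.normal(size=k)                 # shared behavioral axis in latent space

    def make_model(d_model):
        mix = rng.normal(size=(k, d_model))   # model-specific linear embedding
        acts = latent @ mix                   # anchor activations for this model
        direction = np.linalg.pinv(mix) @ axis  # native direction for the shared axis
        return acts, direction

    sources = [make_model(d) for d in (48, 64)]
    target_acts, target_dir = make_model(40)  # held-out target model

    canon = canonical_direction([to_acs(a, v) for a, v in sources])
    recon = reconstruct(target_acts, canon)
    # Cosine similarity between reconstructed and true target direction.
    return float(recon @ target_dir / np.linalg.norm(target_dir))
```

In this idealized linear setting, `demo()` returns a cosine similarity close to 1, because every synthetic model embeds the same latent axis exactly. Real models would not satisfy that assumption, so the abstract's reported accuracies are the relevant measure of how well the idea holds in practice.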
format Preprint
id arxiv_https___arxiv_org_abs_2605_09875
institution arXiv
publishDate 2026
record_format arxiv
title Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
topic Artificial Intelligence
url https://arxiv.org/abs/2605.09875