:: Library Catalog

Bibliographic Details
Main Authors:	Treutlein, Johannes; Choi, Dami; Betley, Jan; Marks, Samuel; Anil, Cem; Grosse, Roger; Evans, Owain
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2406.14546

Similar Items

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
by: Taylor, Mia, et al.
Published: (2025)

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
by: Chua, James, et al.
Published: (2026)

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)

Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
by: Cloud, Alex, et al.
Published: (2025)

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
by: Betley, Jan, et al.
Published: (2025)

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
by: Chua, James, et al.
Published: (2025)

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
by: Laine, Rudolf, et al.
Published: (2024)

Lessons from Studying Two-Hop Latent Reasoning
by: Balesni, Mikita, et al.
Published: (2024)

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
by: Karvonen, Adam, et al.
Published: (2025)

Are DeepSeek R1 And Other Reasoning Models More Faithful?
by: Chua, James, et al.
Published: (2025)

Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
by: Xia, Yuxi, et al.
Published: (2026)

Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
by: Huang, Vincent, et al.
Published: (2025)

Can Language Models Explain Their Own Classification Behavior?
by: Sherburn, Dane, et al.
Published: (2024)

Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs
by: Peng, Zhuoyi, et al.
Published: (2024)

Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations
by: Sun, Yiyou, et al.
Published: (2025)

Inferring the presence and abundance of rare waterbirds species from scarce data
by: Bricout, Barbara, et al.
Published: (2026)

Can We Infer Confidential Properties of Training Data from LLMs?
by: Huang, Pengrun, et al.
Published: (2025)

Cylinder decompositions on geometric armadillo tails
by: Lee, Dami, et al.
Published: (2024)

Estrutura de propriedade no Brasil: Evidências empíricas no grau de concentração acionária
by: Anamélia Borges Tannus Dami
Published: (2023)

ESTRUTURA DE PROPRIEDADE NO BRASIL: EVIDÊNCIAS EMPÍRICAS NO GRAU DE CONCENTRAÇÃO ACIONÁRIA
by: Anamélia Borges Tannus Dami
Published: (2007)

Training Data Attribution via Approximate Unrolled Differentiation
by: Bae, Juhan, et al.
Published: (2024)

Connecting the Dots in News Analysis: Bridging the Cross-Disciplinary Disparities in Media Bias and Framing
by: Vallejo, Gisela, et al.
Published: (2023)

Discrete Vector Bundles with Connection
by: Berwick-Evans, Daniel, et al.
Published: (2021)

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
by: Marks, Samuel, et al.
Published: (2023)

Effective Faraday interaction between light and nuclear spins of Helium-3 in its ground state: a semiclassical study
by: Fadel, Matteo, et al.
Published: (2024)

Kinetic analysis of phase transformations during continuous heating: Crystallization of glass-forming liquids
by: Houghton, Owain S.
Published: (2025)

ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
by: Mousavi, Pooneh, et al.
Published: (2025)

Resonant Structures in $p{}^7\mathrm{Be}$ Scattering and Their Connection to the Astrophysical $S$-Factor
by: Khachi, Anil
Published: (2025)

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
by: Berglund, Lukas, et al.
Published: (2023)

On Verbalized Confidence Scores for LLMs
by: Yang, Daniel, et al.
Published: (2024)

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
by: Shilov, Igor, et al.
Published: (2025)

Bayesian Tensor Decomposition for Clustering Latent Symptom Profiles for Verbal Autopsy Data
by: Yu Zhu, et al.
Published: (2026)

Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)

The growth of the mussel Mytilus californianus
by: Richards, Owain Westmacott
Published: (1928)

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language
by: Requeima, James, et al.
Published: (2024)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by: Hubinger, Evan, et al.
Published: (2024)

Persona Vectors: Monitoring and Controlling Character Traits in Language Models
by: Chen, Runjin, et al.
Published: (2025)

Put Aside Your Pencil: How Talk Becomes Writing Through Verbal Rehearsal
by: Kristen I. Evans
Published: (2026)