| Main Authors: | Treutlein, Johannes, Choi, Dami, Betley, Jan, Marks, Samuel, Anil, Cem, Grosse, Roger, Evans, Owain |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2406.14546 |
Similar Items
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
by: Taylor, Mia, et al.
Published: (2025)
The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
by: Chua, James, et al.
Published: (2026)
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)
Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
by: Cloud, Alex, et al.
Published: (2025)
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
by: Betley, Jan, et al.
Published: (2025)
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
by: Chua, James, et al.
Published: (2025)
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
by: Laine, Rudolf, et al.
Published: (2024)
Lessons from Studying Two-Hop Latent Reasoning
by: Balesni, Mikita, et al.
Published: (2024)
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
by: Karvonen, Adam, et al.
Published: (2025)
Are DeepSeek R1 And Other Reasoning Models More Faithful?
by: Chua, James, et al.
Published: (2025)
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
by: Xia, Yuxi, et al.
Published: (2026)
Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
by: Huang, Vincent, et al.
Published: (2025)
Can Language Models Explain Their Own Classification Behavior?
by: Sherburn, Dane, et al.
Published: (2024)
Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs
by: Peng, Zhuoyi, et al.
Published: (2024)
Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations
by: Sun, Yiyou, et al.
Published: (2025)
Inferring the presence and abundance of rare waterbirds species from scarce data
by: Bricout, Barbara, et al.
Published: (2026)
Can We Infer Confidential Properties of Training Data from LLMs?
by: Huang, Pengrun, et al.
Published: (2025)
Cylinder decompositions on geometric armadillo tails
by: Lee, Dami, et al.
Published: (2024)
Estrutura de propriedade no Brasil: Evidências empíricas no grau de concentração acionária
by: Anamélia Borges Tannus Dami
Published: (2023)
ESTRUTURA DE PROPRIEDADE NO BRASIL: EVIDÊNCIAS EMPÍRICAS NO GRAU DE CONCENTRAÇÃO ACIONÁRIA
by: Anamélia Borges Tannus Dami
Published: (2007)
Training Data Attribution via Approximate Unrolled Differentiation
by: Bae, Juhan, et al.
Published: (2024)
Connecting the Dots in News Analysis: Bridging the Cross-Disciplinary Disparities in Media Bias and Framing
by: Vallejo, Gisela, et al.
Published: (2023)
Discrete Vector Bundles with Connection
by: Berwick-Evans, Daniel, et al.
Published: (2021)
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
by: Marks, Samuel, et al.
Published: (2023)
Effective Faraday interaction between light and nuclear spins of Helium-3 in its ground state: a semiclassical study
by: Fadel, Matteo, et al.
Published: (2024)
Kinetic analysis of phase transformations during continuous heating: Crystallization of glass-forming liquids
by: Houghton, Owain S.
Published: (2025)
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
by: Mousavi, Pooneh, et al.
Published: (2025)
Resonant Structures in $p{}^7\mathrm{Be}$ Scattering and Their Connection to the Astrophysical $S$-Factor
by: Khachi, Anil
Published: (2025)
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
by: Berglund, Lukas, et al.
Published: (2023)
On Verbalized Confidence Scores for LLMs
by: Yang, Daniel, et al.
Published: (2024)
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
by: Shilov, Igor, et al.
Published: (2025)
Bayesian Tensor Decomposition for Clustering Latent Symptom Profiles for Verbal Autopsy Data
by: Yu Zhu, et al.
Published: (2026)
Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)
The growth of the mussel Mytilus californianus
by: Richards, Owain Westmacott
Published: (1928)
LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language
by: Requeima, James, et al.
Published: (2024)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by: Hubinger, Evan, et al.
Published: (2024)
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
by: Chen, Runjin, et al.
Published: (2025)
Put Aside Your Pencil: How Talk Becomes Writing Through Verbal Rehearsal
by: Kristen I. Evans
Published: (2026)