| Main Authors: | Jain, Samyak, Kirk, Robert, Lubana, Ekdeep Singh, Dick, Robert P., Tanaka, Hidenori, Grefenstette, Edward, Rocktäschel, Tim, Krueger, David Scott |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2311.12786 |
Similar Items
A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language
by: Lubana, Ekdeep Singh, et al.
Published: (2024)
Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
by: Okawa, Maya, et al.
Published: (2023)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
by: Jain, Samyak, et al.
Published: (2024)
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
by: Ramesh, Rahul, et al.
Published: (2023)
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
by: Jaipersaud, Brandon, et al.
Published: (2025)
In-Context Learning Dynamics with Random Binary Sequences
by: Bigelow, Eric J., et al.
Published: (2023)
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
by: Park, Core Francisco, et al.
Published: (2024)
Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
by: Pres, Itamar, et al.
Published: (2024)
Analyzing (In)Abilities of SAEs via Formal Languages
by: Menon, Abhinav, et al.
Published: (2024)
Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model
by: Khona, Mikail, et al.
Published: (2024)
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
by: Park, Core Francisco, et al.
Published: (2024)
minimax: Efficient Baselines for Autocurricula in JAX
by: Jiang, Minqi, et al.
Published: (2023)
Representation Shattering in Transformers: A Synthetic Study with Knowledge Editing
by: Nishi, Kento, et al.
Published: (2024)
Swing-by Dynamics in Concept Learning and Compositional Generalization
by: Yang, Yongyi, et al.
Published: (2024)
In-Context Learning Strategies Emerge Rationally
by: Wurgaft, Daniel, et al.
Published: (2025)
Abrupt Learning in Transformers: A Case Study on Matrix Completion
by: Gopalani, Pulkit, et al.
Published: (2024)
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
by: Bigelow, Eric, et al.
Published: (2025)
Emergence of Hierarchical Emotion Organization in Large Language Models
by: Zhao, Bo, et al.
Published: (2025)
Investigating Non-Transitivity in LLM-as-a-Judge
by: Xu, Yi, et al.
Published: (2025)
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
by: Ruis, Laura, et al.
Published: (2024)
ICLR: In-Context Learning of Representations
by: Park, Core Francisco, et al.
Published: (2024)
Detecting High-Stakes Interactions with Activation Probes
by: McKenzie, Alex, et al.
Published: (2025)
Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
by: Zur, Amir, et al.
Published: (2025)
Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders
by: Bohacek, Matyas, et al.
Published: (2025)
Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
by: Rosser, J, et al.
Published: (2026)
Scaling Opponent Shaping to High Dimensional Games
by: Khan, Akbir, et al.
Published: (2023)
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
by: Hindupur, Sai Sumedh R., et al.
Published: (2025)
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
by: Costa, Valérie, et al.
Published: (2025)
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
by: Costa, Valérie, et al.
Published: (2025)
The Impact of Off-Policy Training Data on Probe Generalisation
by: Kirch, Nathalie, et al.
Published: (2025)
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
by: Mueller, Aaron, et al.
Published: (2025)
Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
by: Paglieri, Davide, et al.
Published: (2025)
Understanding the Effects of RLHF on LLM Generalisation and Diversity
by: Kirk, Robert, et al.
Published: (2023)
Reward Model Ensembles Help Mitigate Overoptimization
by: Coste, Thomas, et al.
Published: (2023)
Force Dipole Interactions in Tubular Fluid Membranes
by: Jain, Samyak, et al.
Published: (2023)
Nuclear stability and the Fold Catastrophe
by: Jain, Samyak, et al.
Published: (2023)
Tunneling half-lives in macroscopic-microscopic picture
by: Jain, Samyak, et al.
Published: (2024)
Catastrophe theoretic approach to the Higgs Mechanism
by: Jain, Samyak, et al.
Published: (2023)
Debating with More Persuasive LLMs Leads to More Truthful Answers
by: Khan, Akbir, et al.
Published: (2024)
Interaction Dynamics as a Reward Signal for LLMs
by: Gooding, Sian, et al.
Published: (2025)