| Main Authors: | Lo, Michelle, Cohen, Shay B., Barez, Fazl |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.01814 |
Similar Items
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
by: Neo, Clement, et al.
Published: (2024)
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
by: Fu, Tingchen, et al.
Published: (2024)
Query Circuits: Explaining How Language Models Answer User Prompts
by: Wu, Tung-Yu, et al.
Published: (2025)
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
by: Simhi, Adi, et al.
Published: (2026)
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
by: Lan, Michael, et al.
Published: (2023)
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
by: Chaudhary, Maheep, et al.
Published: (2025)
Understanding Addition in Transformers
by: Quirke, Philip, et al.
Published: (2023)
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
by: Fu, Tingchen, et al.
Published: (2025)
VAL-Bench: Belief Consistency as a measure for Value Alignment in Language Models
by: Gupta, Aman, et al.
Published: (2025)
Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains
by: Fonseca, Marcio, et al.
Published: (2023)
Rethinking AI Cultural Alignment
by: Bravansky, Michal, et al.
Published: (2025)
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
by: Lan, Michael, et al.
Published: (2024)
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
by: Oozeer, Narmeen, et al.
Published: (2025)
Visualizing Neural Network Imagination
by: Wichers, Nevan, et al.
Published: (2024)
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
by: Schrodi, Simon, et al.
Published: (2025)
Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?
by: Fonseca, Marcio, et al.
Published: (2024)
Chain-of-Thought Hijacking
by: Zhao, Jianli, et al.
Published: (2025)
AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
by: Li, Changyi, et al.
Published: (2026)
Embodied AI: Emerging Risks and Opportunities for Policy Action
by: Perlo, Jared, et al.
Published: (2025)
Scaling sparse feature circuit finding for in-context learning
by: Kharlapenko, Dmitrii, et al.
Published: (2025)
What can Large Language Models Capture about Code Functional Equivalence?
by: Maveli, Nickil, et al.
Published: (2024)
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
by: Zhou, Yi, et al.
Published: (2025)
Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning
by: Min, Nay Myat, et al.
Published: (2024)
Precise In-Parameter Concept Erasure in Large Language Models
by: Gur-Arieh, Yoav, et al.
Published: (2025)
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
by: Denison, Carson, et al.
Published: (2024)
Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour
by: Gyevnár, Bálint, et al.
Published: (2025)
Spectral Editing of Activations for Large Language Model Alignment
by: Qiu, Yifu, et al.
Published: (2024)
Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
by: Qiu, Yifu, et al.
Published: (2025)
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
by: Kim, Minseon, et al.
Published: (2025)
'Keep it Together': Enforcing Cohesion in Extractive Summaries by Simulating Human Memory
by: Cardenas, Ronald, et al.
Published: (2024)
Curveball Steering: The Right Direction To Steer Isn't Always Linear
by: Raval, Shivam, et al.
Published: (2026)
Modeling News Interactions and Influence for Financial Market Prediction
by: Wang, Mengyu, et al.
Published: (2024)
Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
by: Deng, Ken, et al.
Published: (2026)
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures
by: Yoon, Sangyeon, et al.
Published: (2026)
Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility
by: Maveli, Nickil, et al.
Published: (2026)
Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
by: Qiu, Yifu, et al.
Published: (2025)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
by: Wollschläger, Tom, et al.
Published: (2025)
Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models
by: Leang, Joshua Ong Jun, et al.
Published: (2026)
Unlearn to Relearn Backdoors: Deferred Backdoor Functionality Attacks on Deep Learning Models
by: Shin, Jeongjin, et al.
Published: (2024)
Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation
by: Qiu, Yifu, et al.
Published: (2023)