| Main Authors: | Afonin, Nikita, Andriianov, Nikita, Hovhannisyan, Vahagn, Bageshpura, Nikhil, Liu, Kyle, Zhu, Kevin, Dev, Sunishchal, Panda, Ashwinee, Rogov, Oleg, Tutubalina, Elena, Panchenko, Alexander, Seleznyov, Mikhail |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.11288 |
Similar Items
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures
by: Somov, Oleg, et al.
Published: (2026)
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
by: Seleznyov, Mikhail, et al.
Published: (2025)
Evolutionary Search for Automated Design of Uncertainty Quantification Methods
by: Seleznyov, Mikhail, et al.
Published: (2026)
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
by: Moskovskiy, Daniil, et al.
Published: (2025)
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
by: Arturi, Daniel Aarao Reis, et al.
Published: (2025)
Modeling and Predicting Multi-Turn Answer Instability in Large Language Models
by: He, Jiahang, et al.
Published: (2025)
Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)
Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
by: Chaturvedi, Isha, et al.
Published: (2025)
The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
by: Braslavski, Pavel, et al.
Published: (2026)
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
by: Batra, Shourya, et al.
Published: (2025)
FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
by: Swaroop, Anand, et al.
Published: (2025)
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
by: O'Brien, Claire, et al.
Published: (2026)
Harnessing non-adversarial robustness in large language models
by: Zhou, Qinghua, et al.
Published: (2026)
xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics
by: Larionov, Daniil, et al.
Published: (2024)
Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
by: Borisiuk, Anna, et al.
Published: (2026)
Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
by: Su, Chris, et al.
Published: (2025)
CoRoVA: Compressed Representations for Vector-Augmented Code Completion
by: Cherniuk, Daria, et al.
Published: (2025)
Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
by: Chang, Vincent, et al.
Published: (2025)
Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
by: Mishra, Abhishek, et al.
Published: (2026)
From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
by: Mushtaq, Erum, et al.
Published: (2025)
Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
by: Vazhentsev, Artem, et al.
Published: (2026)
Probabilistically Robust Watermarking of Neural Networks
by: Pautov, Mikhail, et al.
Published: (2024)
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
by: Salnikov, Mikhail, et al.
Published: (2025)
Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
by: Giordani, Jeremiah
Published: (2025)
The experimental observation of $a_0(1710)$: Long awaited from Regge approach
by: Afonin, S. S.
Published: (2025)
The Born rule for quantum probabilities from Newton's third law
by: Afonin, S. S.
Published: (2024)
Arbitrarily long strings of consecutive primes in special sets
by: Balakrishnan, Sai Sanjeev, et al.
Published: (2023)
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
by: Korznikov, Anton, et al.
Published: (2025)
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
by: Korznikov, Anton, et al.
Published: (2026)
The Rogue Scalpel: Activation Steering Compromises LLM Safety
by: Korznikov, Anton, et al.
Published: (2025)
Solving adversarial examples requires solving exponential misalignment
by: Salvatore, Alessandro, et al.
Published: (2026)
The benefits of query-based KGQA systems for complex and temporal questions in LLM era
by: Alekseev, Artem, et al.
Published: (2025)
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
by: Zhang, Juzheng, et al.
Published: (2025)
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
by: Patel, Dev, et al.
Published: (2025)
OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
by: Savkin, Maksim, et al.
Published: (2026)
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)
Sumudu Neural Operator for ODEs and PDEs
by: Zelenskiy, Ben, et al.
Published: (2025)
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
by: Dev, Sunishchal, et al.
Published: (2026)