Library Catalog

Bibliographic Details
Main Authors:	Afonin, Nikita; Andriianov, Nikita; Hovhannisyan, Vahagn; Bageshpura, Nikhil; Liu, Kyle; Zhu, Kevin; Dev, Sunishchal; Panda, Ashwinee; Rogov, Oleg; Tutubalina, Elena; Panchenko, Alexander; Seleznyov, Mikhail
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.11288

Similar Items

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)

Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures
by: Somov, Oleg, et al.
Published: (2026)

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
by: Seleznyov, Mikhail, et al.
Published: (2025)

Evolutionary Search for Automated Design of Uncertainty Quantification Methods
by: Seleznyov, Mikhail, et al.
Published: (2026)

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
by: Moskovskiy, Daniil, et al.
Published: (2025)

Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
by: Arturi, Daniel Aarao Reis, et al.
Published: (2025)

Modeling and Predicting Multi-Turn Answer Instability in Large Language Models
by: He, Jiahang, et al.
Published: (2025)

Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)

Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
by: Chaturvedi, Isha, et al.
Published: (2025)

The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
by: Braslavski, Pavel, et al.
Published: (2026)

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
by: Batra, Shourya, et al.
Published: (2025)

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
by: Swaroop, Anand, et al.
Published: (2025)

A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
by: O'Brien, Claire, et al.
Published: (2026)

Harnessing non-adversarial robustness in large language models
by: Zhou, Qinghua, et al.
Published: (2026)

xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics
by: Larionov, Daniil, et al.
Published: (2024)

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
by: Borisiuk, Anna, et al.
Published: (2026)

Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
by: Su, Chris, et al.
Published: (2025)

CoRoVA: Compressed Representations for Vector-Augmented Code Completion
by: Cherniuk, Daria, et al.
Published: (2025)

Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
by: Chang, Vincent, et al.
Published: (2025)

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
by: Mishra, Abhishek, et al.
Published: (2026)

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
by: Mushtaq, Erum, et al.
Published: (2025)

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
by: Vazhentsev, Artem, et al.
Published: (2026)

Probabilistically Robust Watermarking of Neural Networks
by: Pautov, Mikhail, et al.
Published: (2024)

Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
by: Salnikov, Mikhail, et al.
Published: (2025)

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
by: Giordani, Jeremiah
Published: (2025)

The experimental observation of $a_0(1710)$: Long awaited from Regge approach
by: Afonin, S. S.
Published: (2025)

The Born rule for quantum probabilities from Newton's third law
by: Afonin, S. S.
Published: (2024)

Arbitrarily long strings of consecutive primes in special sets
by: Balakrishnan, Sai Sanjeev, et al.
Published: (2023)

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
by: Korznikov, Anton, et al.
Published: (2025)

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
by: Korznikov, Anton, et al.
Published: (2026)

The Rogue Scalpel: Activation Steering Compromises LLM Safety
by: Korznikov, Anton, et al.
Published: (2025)

Solving adversarial examples requires solving exponential misalignment
by: Salvatore, Alessandro, et al.
Published: (2026)

The benefits of query-based KGQA systems for complex and temporal questions in LLM era
by: Alekseev, Artem, et al.
Published: (2025)

LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
by: Zhang, Juzheng, et al.
Published: (2025)

Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
by: Patel, Dev, et al.
Published: (2025)

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
by: Savkin, Maksim, et al.
Published: (2026)

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)

Sumudu Neural Operator for ODEs and PDEs
by: Zelenskiy, Ben, et al.
Published: (2025)

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
by: Dev, Sunishchal, et al.
Published: (2026)