:: Library Catalog

Bibliographic Details
Main Authors:	Petrova, Nora; Burden, John
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.20813

Similar Items

The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries
by: Petrova, Nora, et al.
Published: (2026)

Evaluating AI Evaluation: Perils and Prospects
by: Burden, John
Published: (2024)

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
by: Petrova, Nora, et al.
Published: (2026)

I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
by: Burden, John, et al.
Published: (2025)

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture
by: Burden, John, et al.
Published: (2025)

Framing the Game: How Context Shapes LLM Decision-Making
by: Robinson, Isaac, et al.
Published: (2025)

Empirical Evaluation of the Implicit Hitting Set Approach for Weighted CSPs
by: Petrova, Aleksandra, et al.
Published: (2025)

Stress-Testing Model Specs Reveals Character Differences among Language Models
by: Zhang, Jifan, et al.
Published: (2025)

Conversational Complexity for Assessing Risk in Large Language Models
by: Burden, John, et al.
Published: (2024)

Token Alignment via Character Matching for Subword Completion
by: Athiwaratkun, Ben, et al.
Published: (2024)

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)

Behavioural Analysis of Alignment Faking
by: Hadida, Nathaniel Mitrani, et al.
Published: (2026)

From Abstract to Actionable: Pairwise Shapley Values for Explainable AI
by: Xu, Jiaxin, et al.
Published: (2025)

Inferring Capabilities from Task Performance with Bayesian Triangulation
by: Burden, John, et al.
Published: (2023)

BehAVE: Behaviour Alignment of Video Game Encodings
by: Rašajski, Nemanja, et al.
Published: (2024)

Anytime Cooperative Implicit Hitting Set Solving
by: Rollón, Emma, et al.
Published: (2025)

Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment
by: Gavenski, Nathan, et al.
Published: (2026)

Evaluating Stability of Unreflective Alignment
by: Lucassen, James, et al.
Published: (2024)

From Stories to Cities to Games: A Qualitative Evaluation of Behaviour Planning
by: Abdelwahed, Mustafa F., et al.
Published: (2026)

Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition
by: Luo, Qiuming, et al.
Published: (2026)

Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?
by: Gu, Zhuojun, et al.
Published: (2025)

A Revealed Preference Framework for AI Alignment
by: Suleymanov, Elchin
Published: (2026)

CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds
by: Wang, Lei, et al.
Published: (2024)

Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment
by: Haas, Nathanaël, et al.
Published: (2026)

Geometric Analysis of Token Selection in Multi-Head Attention
by: Mudarisov, Timur, et al.
Published: (2026)

Interactive AI Alignment: Specification, Process, and Evaluation Alignment
by: Terry, Michael, et al.
Published: (2023)

Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior
by: Ali, Dalia, et al.
Published: (2025)

Evaluating Cognitive Age Alignment in Interactive AI Agents
by: Shen, Yifan, et al.
Published: (2026)

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
by: Nair, Inderjeet, et al.
Published: (2026)

Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition
by: Kaliosis, Panagiotis, et al.
Published: (2025)

Behaviour Distillation
by: Lupu, Andrei, et al.
Published: (2024)

An Evaluation of Cultural Value Alignment in LLM
by: Sukiennik, Nicholas, et al.
Published: (2025)

Pluralistic Off-policy Evaluation and Alignment
by: Huang, Chengkai, et al.
Published: (2025)

Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
by: Ma, Yuhang, et al.
Published: (2024)

StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
by: Wu, Yerong, et al.
Published: (2026)

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
by: Fan, Siqi, et al.
Published: (2025)

From Language Models over Tokens to Language Models over Characters
by: Vieira, Tim, et al.
Published: (2024)

Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants
by: Galatolo, Alessio, et al.
Published: (2025)

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
by: Mishra, Abhishek, et al.
Published: (2026)

Can Brain Signals Reveal Inner Alignment with Human Languages?
by: Han, William, et al.
Published: (2022)