Library Catalog

Bibliographic Details
Main Authors:	Lucassen, James; Henry, Mark; Wright, Philippa; Yeung, Owen
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.15116

Similar Items

Retrying vs Resampling in AI Control
by: Lucassen, James, et al.
Published: (2026)

Unreflected Acceptance -- Investigating the Negative Consequences of ChatGPT-Assisted Problem Solving in Physics Education
by: Krupp, Lars, et al.
Published: (2023)

BashArena: A Control Setting for Highly Privileged AI Agents
by: Kaufman, Adam, et al.
Published: (2025)

Cluster-norm for Unsupervised Probing of Knowledge
by: Laurito, Walter, et al.
Published: (2024)

Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt
by: de Mijolla, Damien, et al.
Published: (2024)

Evaluating LLMs with Multiple Problems at once
by: Wang, Zhengxiang, et al.
Published: (2024)

Correctness, Artificial Intelligence, and the Epistemic Value of Mathematical Proof
by: Weatherall, James Owen, et al.
Published: (2026)

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
by: Queen, Owen, et al.
Published: (2025)

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
by: Henry, Jazmia
Published: (2026)

Alternative Fairness and Accuracy Optimization in Criminal Justice
by: Wu, Shaolong, et al.
Published: (2025)

Interactive AI Alignment: Specification, Process, and Evaluation Alignment
by: Terry, Michael, et al.
Published: (2023)

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
by: Nie, Fan, et al.
Published: (2026)

ReasonOps: Operator Segmentation for LLM Reasoning Traces
by: Lee, Daniel, et al.
Published: (2026)

Evaluating Cognitive Age Alignment in Interactive AI Agents
by: Shen, Yifan, et al.
Published: (2026)

Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
by: Petrova, Nora, et al.
Published: (2026)

Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models
by: Wu, Hui, et al.
Published: (2026)

An Evaluation of Cultural Value Alignment in LLM
by: Sukiennik, Nicholas, et al.
Published: (2025)

Pluralistic Off-policy Evaluation and Alignment
by: Huang, Chengkai, et al.
Published: (2025)

Here's Charlie! Realising the Semantic Web vision of Agents in the age of LLMs
by: Wright, Jesse
Published: (2024)

The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings
by: Susanto, Hengky, et al.
Published: (2026)

FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
by: Ye, Seonghyeon, et al.
Published: (2023)

Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants
by: Galatolo, Alessio, et al.
Published: (2025)

CORE -- A Cell-Level Coarse-to-Fine Image Registration Engine for Multi-stain Image Alignment
by: Nasir, Esha Sadia, et al.
Published: (2025)

AI Alignment via Incentives and Correction
by: Agarwal, Rohit, et al.
Published: (2026)

The Goofus & Gallant Story Corpus for Practical Value Alignment
by: Nahian, Md Sultan Al, et al.
Published: (2025)

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
by: Luo, Mingyu, et al.
Published: (2026)

On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)

Evaluating LLM Alignment With Human Trust Models
by: Debnath, Anushka, et al.
Published: (2026)

UK AISI Alignment Evaluation Case-Study
by: Souly, Alexandra, et al.
Published: (2026)

How predictable is language model benchmark performance?
by: Owen, David
Published: (2024)

Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment
by: Salgado, Henry, et al.
Published: (2025)

Exploring the use of AI authors and reviewers at Agents4Science
by: Bianchi, Federico, et al.
Published: (2025)

Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
by: Osooli, Hamid, et al.
Published: (2026)

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
by: Xu, Haofeng, et al.
Published: (2026)

The Reward Model Selection Crisis in Personalized Alignment
by: Rezk, Fady, et al.
Published: (2025)

From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)

GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding
by: Dernbach, Stefan, et al.
Published: (2024)

Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence
by: Rau, Anita, et al.
Published: (2025)

Evaluating Human Alignment and Model Faithfulness of LLM Rationale
by: Fayyaz, Mohsen, et al.
Published: (2024)

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)