:: Library Catalog

Bibliographic Details
Main Authors:	Kim, Been; Hewitt, John; Nanda, Neel; Fiedel, Noah; Tafjord, Oyvind
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.12152

Similar Items

Neologism Learning for Controllability and Self-Verbalization
by: Hewitt, John, et al.
Published: (2025)

We Can't Understand AI Using our Existing Vocabulary
by: Hewitt, John, et al.
Published: (2025)

Digital Socrates: Evaluating LLMs through Explanation Critiques
by: Gu, Yuling, et al.
Published: (2023)

BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
by: Clark, Peter, et al.
Published: (2023)

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs
by: Smit, Andries, et al.
Published: (2023)

SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
by: Gu, Yuling, et al.
Published: (2024)

Can we forget how we learned? Doxastic redundancy in iterated belief revision
by: Liberatore, Paolo
Published: (2024)

Features have life history. And we should care
by: Stecher, Philipp, et al.
Published: (2026)

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
by: Wiegreffe, Sarah, et al.
Published: (2024)

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
by: Li, Belinda Z., et al.
Published: (2025)

A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going?
by: Liu, Xinyu, et al.
Published: (2024)

Can we Evaluate RAGs with Synthetic Data?
by: van Elburg, Jonas, et al.
Published: (2025)

OLMES: A Standard for Language Model Evaluations
by: Gu, Yuling, et al.
Published: (2024)

Thought Branches: Interpreting LLM Reasoning Requires Resampling
by: Macar, Uzay, et al.
Published: (2025)

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
by: Jiang, Nick, et al.
Published: (2025)

When AI reviews science: Can we trust the referee?
by: Wang, Jialiang, et al.
Published: (2026)

Can we trust the evaluation on ChatGPT?
by: Aiyappa, Rachith, et al.
Published: (2023)

Explorations of Self-Repair in Language Models
by: Rushing, Cody, et al.
Published: (2024)

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
by: Zhang, Fred, et al.
Published: (2023)

Can we automatize scientific discovery in the cognitive sciences?
by: Jagadish, Akshay K., et al.
Published: (2026)

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
by: Minder, Julian, et al.
Published: (2025)

How Well Do Models Follow Their Constitutions?
by: Jakkli, Arya, et al.
Published: (2026)

LitLLMs, LLMs for Literature Review: Are we there yet?
by: Agarwal, Shubham, et al.
Published: (2024)

Can we only use guideline instead of shot in prompt?
by: Chen, Jiaxiang, et al.
Published: (2024)

Because we're here
by: Susskind, Leonard
Published: (2005)

BatchTopK Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2024)

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering
by: Maar, Jim, et al.
Published: (2026)

Guided Persona-based AI Surveys: Can we replicate personal mobility preferences at scale using LLMs?
by: Tzachristas, Ioannis, et al.
Published: (2025)

Can we ease the Injectivity Bottleneck on Lorentzian Manifolds for Graph Neural Networks?
by: Srinivasan, Srinitish, et al.
Published: (2025)

Good things come in small packages: Should we build AI clusters with Lite-GPUs?
by: Canakci, Burcu, et al.
Published: (2025)

Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change
by: Albers, Nele, et al.
Published: (2025)

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
by: Casademunt, Helena, et al.
Published: (2026)

Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)

Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)

Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)

Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in
by: Agarwal, Utkarsh, et al.
Published: (2024)

Subliminal Learning Is Steering Vector Distillation
by: Blank, Camila, et al.
Published: (2026)

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
by: Hua, Tim Tian, et al.
Published: (2025)

DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
by: Jansen, Peter, et al.
Published: (2024)

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
by: Ferrando, Javier, et al.
Published: (2024)