| Main Authors: | Thurnherr, Hannes, Scheurer, Jérémy |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.13714 |
Similar Items
Neural Decompiling of Tracr Transformers
by: Thurnherr, Hannes, et al.
Published: (2024)
Large Language Models can Strategically Deceive their Users when Put Under Pressure
by: Scheurer, Jérémy, et al.
Published: (2023)
Training Language Models with Language Feedback at Scale
by: Scheurer, Jérémy, et al.
Published: (2023)
AfroBench: How Good are Large Language Models on African Languages?
by: Ojo, Jessica, et al.
Published: (2023)
Generalization of RLVR Using Causal Reasoning as a Testbed
by: Lu, Brian, et al.
Published: (2025)
$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model
by: Zhang, Yufeng, et al.
Published: (2024)
StyleBench: Evaluating thinking styles in Large Language Models
by: Guo, Junyu, et al.
Published: (2025)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
Rethinking Interpretability in the Era of Large Language Models
by: Singh, Chandan, et al.
Published: (2024)
Improving Code Generation by Training with Natural Language Feedback
by: Chen, Angelica, et al.
Published: (2023)
CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
by: Feng, Jie, et al.
Published: (2024)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
by: Majumder, Bodhisattwa Prasad, et al.
Published: (2024)
PromptBench: A Unified Library for Evaluation of Large Language Models
by: Zhu, Kaijie, et al.
Published: (2023)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
by: Cho, Hakaze, et al.
Published: (2025)
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
by: Lee, Jaeho, et al.
Published: (2025)
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
by: Muhamed, Aashiq, et al.
Published: (2025)
SelfIE: Self-Interpretation of Large Language Model Embeddings
by: Chen, Haozhe, et al.
Published: (2024)
Fine-Grained Interpretation of Political Opinions in Large Language Models
by: Hu, Jingyu, et al.
Published: (2025)
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
by: Khalifa, Muhammad, et al.
Published: (2026)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)
Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models
by: Şakiroğlu, Mehmet Can, et al.
Published: (2026)
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
by: Song, Xiaoshuai, et al.
Published: (2024)
CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making
by: Ma, Mingyu Derek, et al.
Published: (2024)
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
by: Wang, Xiaoxuan, et al.
Published: (2023)
Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models
by: Tsui, Ken
Published: (2025)
TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation
by: Ezzakri, Anas, et al.
Published: (2025)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
by: Lan, Michael, et al.
Published: (2023)
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
by: Hu, Tiancheng, et al.
Published: (2025)
Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales
by: Nirmal, Ayushi, et al.
Published: (2024)
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations
by: Wang, Qianli, et al.
Published: (2024)
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
by: Xu, Xin, et al.
Published: (2025)
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
by: Casademunt, Helena, et al.
Published: (2026)
FCoReBench: Can Large Language Models Solve Challenging First-Order Combinatorial Reasoning Problems?
by: Mittal, Chinmay, et al.
Published: (2024)
AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
by: Zbeeb, Mohammad, et al.
Published: (2025)
Variational Language Concepts for Interpreting Foundation Language Models
by: Wang, Hengyi, et al.
Published: (2024)
Interpreting the Repeated Token Phenomenon in Large Language Models
by: Yona, Itay, et al.
Published: (2025)
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
by: Laine, Rudolf, et al.
Published: (2024)
BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
by: Srivastava, Gaurav, et al.
Published: (2025)
Counterfactual Token Generation in Large Language Models
by: Chatzi, Ivi, et al.
Published: (2024)