:: Library Catalog

Bibliographic Details
Main Authors:	Pres, Itamar; Ruis, Laura; Lubana, Ekdeep Singh; Krueger, David
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2410.17245

Similar Items

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
by: Jaipersaud, Brandon, et al.
Published: (2025)

Competition Dynamics Shape Algorithmic Phases of In-Context Learning
by: Park, Core Francisco, et al.
Published: (2024)

Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
by: Zur, Amir, et al.
Published: (2025)

Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
by: Bigelow, Eric, et al.
Published: (2025)

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
by: Mueller, Aaron, et al.
Published: (2025)

In-Context Learning Dynamics with Random Binary Sequences
by: Bigelow, Eric J., et al.
Published: (2023)

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
by: Lee, Andrew, et al.
Published: (2024)

Emergence of Hierarchical Emotion Organization in Large Language Models
by: Zhao, Bo, et al.
Published: (2025)

ICLR: In-Context Learning of Representations
by: Park, Core Francisco, et al.
Published: (2024)

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
by: Bigelow, Eric, et al.
Published: (2026)

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
by: Xu, Yi, et al.
Published: (2026)

Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders
by: Bohacek, Matyas, et al.
Published: (2025)

Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
by: Banayeeanzade, Amin, et al.
Published: (2025)

Investigating Non-Transitivity in LLM-as-a-Judge
by: Xu, Yi, et al.
Published: (2025)

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
by: Cook, Jonathan, et al.
Published: (2025)

Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)

Debating with More Persuasive LLMs Leads to More Truthful Answers
by: Khan, Akbir, et al.
Published: (2024)

LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs
by: Guo, Pei-Fu, et al.
Published: (2025)

Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
by: Nadeem, Afrozah, et al.
Published: (2026)

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
by: Lunardi, Riccardo, et al.
Published: (2025)

CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
by: Wang, Xintong, et al.
Published: (2024)

Steering Language Models Before They Speak: Logit-Level Interventions
by: An, Hyeseon, et al.
Published: (2026)

How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
by: Doostmohammadi, Ehsan, et al.
Published: (2024)

Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
by: Piot, Paloma, et al.
Published: (2025)

KV Cache Steering for Controlling Frozen LLMs
by: Belitsky, Max, et al.
Published: (2025)

XplainLLM: A Knowledge-Augmented Dataset for Reliable Grounded Explanations in LLMs
by: Chen, Zichen, et al.
Published: (2023)

Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games
by: Yadav, Neemesh, et al.
Published: (2025)

Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators
by: Bajpai, Prasoon, et al.
Published: (2024)

Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
by: Chopra, Muskaan, et al.
Published: (2026)

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence
by: Lu, Yuyin, et al.
Published: (2026)

Steering Awareness: Detecting Activation Steering from Within
by: Rivera, Joshua Fonseca, et al.
Published: (2025)

AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
by: Kartik, NVJK, et al.
Published: (2025)

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
by: Kang, Diancheng, et al.
Published: (2026)

A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language
by: Lubana, Ekdeep Singh, et al.
Published: (2024)

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
by: Feucht, Sheridan, et al.
Published: (2026)

Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs
by: Ma, Huanhuan, et al.
Published: (2025)

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
by: Chen, Kexin, et al.
Published: (2024)

Exploring the Personality Traits of LLMs through Latent Features Steering
by: Yang, Shu, et al.
Published: (2024)

Steering LLMs for Formal Theorem Proving
by: Kirtania, Shashank, et al.
Published: (2025)