:: Library Catalog

Bibliographic Details
Main authors:	Janiak, Jett; Karwowski, Jacek; Mangat, Chatrik Singh; Giglemiani, Giorgi; Petrova, Nora; Heimersheim, Stefan
Type:	Preprint
Published:	2024
Subjects:	Machine Learning
Online access:	https://arxiv.org/abs/2409.17113

Similar documents

Evaluating Synthetic Activations composed of SAE Latents in GPT-2
by: Giglemiani, Giorgi, et al.
Published: (2024)

Boundary Point Jailbreaking of Black-Box LLMs
by: Davies, Xander, et al.
Published: (2026)

Incoherence in Goal-Conditioned Autoregressive Models
by: Karwowski, Jacek, et al.
Published: (2025)

Hilbert geometry of the symmetric positive-definite bicone: Application to the geometry of the extended Gaussian family
by: Karwowski, Jacek, et al.
Published: (2025)

Geometric structures and deviations on James' symmetric positive-definite matrix bicone domain
by: Karwowski, Jacek, et al.
Published: (2026)

An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
by: Janiak, Jett, et al.
Published: (2023)

You can remove GPT2's LayerNorm by fine-tuning
by: Heimersheim, Stefan
Published: (2024)

From Stability to Inconsistency: A Study of Moral Preferences in LLMs
by: Jotautaite, Monika, et al.
Published: (2025)

How to use and interpret activation patching
by: Heimersheim, Stefan, et al.
Published: (2024)

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
by: Arcuschin, Iván, et al.
Published: (2025)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
by: Lee, Daniel J., et al.
Published: (2024)

Evolution of SAE Features Across Layers in LLMs
by: Balcells, Daniel, et al.
Published: (2024)

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
by: Recchia, Gabriel, et al.
Published: (2025)

Likelihood hacking in probabilistic program synthesis
by: Karwowski, Jacek, et al.
Published: (2026)

SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
by: Fillingham, Sean P., et al.
Published: (2025)

Detecting Strategic Deception Using Linear Probes
by: Goldowsky-Dill, Nicholas, et al.
Published: (2025)

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
by: Taufeeque, Mohammad, et al.
Published: (2026)

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
by: Baroni, Luca, et al.
Published: (2025)

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
by: Braun, Dan, et al.
Published: (2025)

Benchmarking Deception Probes via Black-to-White Performance Boosts
by: Parrack, Avi, et al.
Published: (2025)

Transformers represent belief state geometry in their residual stream
by: Shai, Adam S., et al.
Published: (2024)

Hallucination Detection in LLMs Using Spectral Features of Attention Maps
by: Binkowski, Jakub, et al.
Published: (2025)

FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs
by: Sawczyn, Albert, et al.
Published: (2025)

Latent Adversarial Training Improves the Representation of Refusal
by: Abbas, Alexandra, et al.
Published: (2025)

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
by: Janiak, Denis, et al.
Published: (2025)

A Geometry-Based View of Mahalanobis OOD Detection
by: Janiak, Denis, et al.
Published: (2025)

Erfonium: A Hooke Atom with Soft Interaction Potential
by: Karwowski, Jacek, et al.
Published: (2023)

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by: Bushnaq, Lucius, et al.
Published: (2024)

Confirmation bias: A challenge for scalable oversight
by: Recchia, Gabriel, et al.
Published: (2025)

Refusal in LLMs is an Affine Function
by: Marshall, Thomas, et al.
Published: (2024)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
by: Bushnaq, Lucius, et al.
Published: (2024)

Thermal Robustness of Retrieval in Dense Associative Memories: LSE vs LSR Kernels
by: Petrova, Tatiana
Published: (2026)

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
by: Kowal, Matthew, et al.
Published: (2026)

Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
by: Duszenko, Jacek
Published: (2026)

COMBINEX: A Unified Counterfactual Explainer for Graph Neural Networks via Node Feature and Structural Perturbations
by: Giorgi, Flavio, et al.
Published: (2025)

Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
by: Kaźmierczak, Stanisław, et al.
Published: (2024)

Interpretable Multi-task Learning with Shared Variable Embeddings
by: Żelaszczyk, Maciej, et al.
Published: (2024)

A-PETE: Adaptive Prototype Explanations of Tree Ensembles
by: Karolczak, Jacek, et al.
Published: (2024)

Towards Unbiased Calibration using Meta-Regularization
by: Wang, Cheng, et al.
Published: (2023)

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations
by: Karolczak, Jacek, et al.
Published: (2026)