:: Library Catalog

Bibliographic Details
Main Author:	Dubey, Shivam
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.09019

Similar Items

Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
by: Siddique, Zara, et al.
Published: (2025)

SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
by: Cao, Zouying, et al.
Published: (2024)

Activation Scaling for Steering and Interpreting Language Models
by: Stoehr, Niklas, et al.
Published: (2024)

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering
by: Sui, Yuan, et al.
Published: (2026)

Investigating Bias Representations in Llama 2 Chat via Activation Steering
by: Lu, Dawn, et al.
Published: (2024)

KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement
by: Pan, Jinhao, et al.
Published: (2026)

Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
by: Nadeem, Afrozah, et al.
Published: (2026)

Extracting Unlearned Information from LLMs with Activation Steering
by: Seyitoğlu, Atakan, et al.
Published: (2024)

Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering
by: Bo, Jessica Y., et al.
Published: (2025)

Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
by: Chen, Boxu, et al.
Published: (2025)

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
by: Chen, Hongyu, et al.
Published: (2025)

Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
by: Aldahoul, Nouar, et al.
Published: (2025)

Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
by: Kang, Minjae, et al.
Published: (2026)

Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
by: Yamabe, Shojiro, et al.
Published: (2025)

Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
by: Valentino, Marco, et al.
Published: (2025)

Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
by: Chandna, Bhavik, et al.
Published: (2025)

Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
by: Yin, Jianghao, et al.
Published: (2026)

Steering Awareness: Detecting Activation Steering from Within
by: Rivera, Joshua Fonseca, et al.
Published: (2025)

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music
by: Chauhan, Shivam, et al.
Published: (2026)

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
by: Pearman, Edie, et al.
Published: (2026)

HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)

Dynamically Scaled Activation Steering
by: Ferrando, Alex, et al.
Published: (2025)

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
by: Cheng, Ruoxi, et al.
Published: (2024)

Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
by: Raimondi, Bianca, et al.
Published: (2025)

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
by: Wang, Zijun, et al.
Published: (2025)

Raw Pointer Rewriting with LLMs for Translating C to Safer Rust
by: Gao, Yifei, et al.
Published: (2025)

Steer Like the LLM: Activation Steering that Mimics Prompting
by: Heyman, Geert, et al.
Published: (2026)

Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training
by: Kumar, Rajeev, et al.
Published: (2025)

Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
by: Liu, Shuliang, et al.
Published: (2026)

Activation Steering for Chain-of-Thought Compression
by: Azizi, Seyedarmin, et al.
Published: (2025)

Minimizing Collateral Damage in Activation Steering
by: Nguyen, Tam, et al.
Published: (2026)

Steered LLM Activations are Non-Surjective
by: Mishra, Aayush, et al.
Published: (2026)

Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs
by: Wei, Kangda, et al.
Published: (2025)

AGR: Age Group fairness Reward for Bias Mitigation in LLMs
by: Cao, Shuirong, et al.
Published: (2024)

Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications
by: Paul, Debdas, et al.
Published: (2026)

A Variational Approach for Mitigating Entity Bias in Relation Extraction
by: Mensah, Samuel, et al.
Published: (2025)

Mitigating Interpretation Bias in Rock Records with Large Language Models: Insights from Paleoenvironmental Analysis
by: Wang, Luoqi, et al.
Published: (2024)

Surrogate Interpretable Graph for Random Decision Forests
by: Dubey, Akshat, et al.
Published: (2025)