| Main Author: | Dubey, Shivam |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.09019 |
Similar Items
Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
by: Siddique, Zara, et al.
Published: (2025)
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
by: Cao, Zouying, et al.
Published: (2024)
Activation Scaling for Steering and Interpreting Language Models
by: Stoehr, Niklas, et al.
Published: (2024)
TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering
by: Sui, Yuan, et al.
Published: (2026)
Investigating Bias Representations in Llama 2 Chat via Activation Steering
by: Lu, Dawn, et al.
Published: (2024)
KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement
by: Pan, Jinhao, et al.
Published: (2026)
Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
by: Nadeem, Afrozah, et al.
Published: (2026)
Extracting Unlearned Information from LLMs with Activation Steering
by: Seyitoğlu, Atakan, et al.
Published: (2024)
Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering
by: Bo, Jessica Y., et al.
Published: (2025)
Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
by: Chen, Boxu, et al.
Published: (2025)
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
by: Chen, Hongyu, et al.
Published: (2025)
Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
by: Aldahoul, Nouar, et al.
Published: (2025)
Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
by: Kang, Minjae, et al.
Published: (2026)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)
Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
by: Yamabe, Shojiro, et al.
Published: (2025)
Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
by: Valentino, Marco, et al.
Published: (2025)
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
by: Chandna, Bhavik, et al.
Published: (2025)
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
by: Yin, Jianghao, et al.
Published: (2026)
Steering Awareness: Detecting Activation Steering from Within
by: Rivera, Joshua Fonseca, et al.
Published: (2025)
Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music
by: Chauhan, Shivam, et al.
Published: (2026)
Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
by: Pearman, Edie, et al.
Published: (2026)
HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)
Dynamically Scaled Activation Steering
by: Ferrando, Alex, et al.
Published: (2025)
Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
by: Cheng, Ruoxi, et al.
Published: (2024)
Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
by: Raimondi, Bianca, et al.
Published: (2025)
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
by: Wang, Zijun, et al.
Published: (2025)
Raw Pointer Rewriting with LLMs for Translating C to Safer Rust
by: Gao, Yifei, et al.
Published: (2025)
Steer Like the LLM: Activation Steering that Mimics Prompting
by: Heyman, Geert, et al.
Published: (2026)
Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training
by: Kumar, Rajeev, et al.
Published: (2025)
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
by: Liu, Shuliang, et al.
Published: (2026)
Activation Steering for Chain-of-Thought Compression
by: Azizi, Seyedarmin, et al.
Published: (2025)
Minimizing Collateral Damage in Activation Steering
by: Nguyen, Tam, et al.
Published: (2026)
Steered LLM Activations are Non-Surjective
by: Mishra, Aayush, et al.
Published: (2026)
Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs
by: Wei, Kangda, et al.
Published: (2025)
AGR: Age Group fairness Reward for Bias Mitigation in LLMs
by: Cao, Shuirong, et al.
Published: (2024)
Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications
by: Paul, Debdas, et al.
Published: (2026)
A Variational Approach for Mitigating Entity Bias in Relation Extraction
by: Mensah, Samuel, et al.
Published: (2025)
Mitigating Interpretation Bias in Rock Records with Large Language Models: Insights from Paleoenvironmental Analysis
by: Wang, Luoqi, et al.
Published: (2024)
Surrogate Interpretable Graph for Random Decision Forests
by: Dubey, Akshat, et al.
Published: (2025)