Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.20796 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866908730791034880 |
|---|---|
| author | Shan, Zhengyang Mueller, Aaron |
| author_facet | Shan, Zhengyang Mueller, Aaron |
| contents | We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces ``prior collapse'', thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2512_20796 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics? Shan, Zhengyang Mueller, Aaron Computation and Language We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces ``prior collapse'', thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities. |
| title | Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics? |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2512.20796 |