Liu, S., Chen, X., Urcelay, B. M., & Croce, F. (2026). Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders.
Chicago Style (17th ed.) Citation: Liu, Shunchang, Xin Chen, Belen Martin Urcelay, and Francesco Croce. Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders. 2026.
MLA (9th ed.) Citation: Liu, Shunchang, et al. Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders. 2026.