Staff View: Language Models That Walk the Talk: A Framework for Formal Fairness Certificates :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Danqing; Ladner, Tobias; Mhadhbi, Ahmed Rayen; Althoff, Matthias
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.12767

_version_	1866912860409430016
author	Chen, Danqing; Ladner, Tobias; Mhadhbi, Ahmed Rayen; Althoff, Matthias
author_facet	Chen, Danqing; Ladner, Tobias; Mhadhbi, Ahmed Rayen; Althoff, Matthias
contents	As large language models become integral to high-stakes applications, ensuring their robustness and fairness is critical. Despite their success, large language models remain vulnerable to adversarial attacks, where small perturbations, such as synonym substitutions, can alter model predictions, posing risks in fairness-critical areas, such as gender bias mitigation, and safety-critical areas, such as toxicity detection. While formal verification has been explored for neural networks, its application to large language models remains limited. This work presents a holistic verification framework to certify the robustness of transformer-based language models, with a focus on ensuring gender fairness and consistent outputs across different gender-related terms. Furthermore, we extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic inputs are consistently detected and appropriately censored, thereby ensuring the reliability of moderation systems. By formalizing robustness within the embedding space, this work strengthens the reliability of language models in ethical AI deployment and content moderation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_12767
institution	arXiv
publishDate	2025
record_format	arxiv
title	Language Models That Walk the Talk: A Framework for Formal Fairness Certificates
topic	Artificial Intelligence
url	https://arxiv.org/abs/2505.12767
