Staff View: Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle :: Library Catalog

Bibliographic Details
Main Authors:	Haider, Emman; Perez-Becker, Daniel; Portet, Thomas; Madan, Piyush; Garg, Amit; Ashfaq, Atabak; Majercak, David; Wen, Wen; Kim, Dongwoo; Yang, Ziyi; Zhang, Jianwen; Sharma, Hiteshi; Bullwinkel, Blake; Pouliot, Martin; Minnich, Amanda; Chawla, Shiven; Herrera, Solianna; Warreth, Shahed; Engler, Maggie; Lopez, Gary; Chikanov, Nina; Dheekonda, Raja Sekhar Rao; Jagdagdorj, Bolor-Erdene; Lutz, Roman; Lundeen, Richard; Westerhoff, Tori; Bryan, Pete; Seifert, Christian; Kumar, Ram Shankar Siva; Berkley, Andrew; Kessler, Alex
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2407.13833

_version_	1866911999864078336
author	Haider, Emman; Perez-Becker, Daniel; Portet, Thomas; Madan, Piyush; Garg, Amit; Ashfaq, Atabak; Majercak, David; Wen, Wen; Kim, Dongwoo; Yang, Ziyi; Zhang, Jianwen; Sharma, Hiteshi; Bullwinkel, Blake; Pouliot, Martin; Minnich, Amanda; Chawla, Shiven; Herrera, Solianna; Warreth, Shahed; Engler, Maggie; Lopez, Gary; Chikanov, Nina; Dheekonda, Raja Sekhar Rao; Jagdagdorj, Bolor-Erdene; Lutz, Roman; Lundeen, Richard; Westerhoff, Tori; Bryan, Pete; Seifert, Christian; Kumar, Ram Shankar Siva; Berkley, Andrew; Kessler, Alex
author_facet	Haider, Emman; Perez-Becker, Daniel; Portet, Thomas; Madan, Piyush; Garg, Amit; Ashfaq, Atabak; Majercak, David; Wen, Wen; Kim, Dongwoo; Yang, Ziyi; Zhang, Jianwen; Sharma, Hiteshi; Bullwinkel, Blake; Pouliot, Martin; Minnich, Amanda; Chawla, Shiven; Herrera, Solianna; Warreth, Shahed; Engler, Maggie; Lopez, Gary; Chikanov, Nina; Dheekonda, Raja Sekhar Rao; Jagdagdorj, Bolor-Erdene; Lutz, Roman; Lundeen, Richard; Westerhoff, Tori; Bryan, Pete; Seifert, Christian; Kumar, Ram Shankar Siva; Berkley, Andrew; Kessler, Alex
contents	Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks. Finally, we include additional red teaming strategies and evaluations that were used to test the safety behavior of Phi-3.5-mini and Phi-3.5-MoE, which were optimized for multilingual capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_13833
institution	arXiv
publishDate	2024
record_format	arxiv
title	Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2407.13833

Similar Items