author Haider, Emman
Perez-Becker, Daniel
Portet, Thomas
Madan, Piyush
Garg, Amit
Ashfaq, Atabak
Majercak, David
Wen, Wen
Kim, Dongwoo
Yang, Ziyi
Zhang, Jianwen
Sharma, Hiteshi
Bullwinkel, Blake
Pouliot, Martin
Minnich, Amanda
Chawla, Shiven
Herrera, Solianna
Warreth, Shahed
Engler, Maggie
Lopez, Gary
Chikanov, Nina
Dheekonda, Raja Sekhar Rao
Jagdagdorj, Bolor-Erdene
Lutz, Roman
Lundeen, Richard
Westerhoff, Tori
Bryan, Pete
Seifert, Christian
Kumar, Ram Shankar Siva
Berkley, Andrew
Kessler, Alex
contents Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety-aligning the Phi-3 series of language models. We used a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single- and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks. Finally, we include additional red teaming strategies and evaluations that were used to test the safety behavior of Phi-3.5-mini and Phi-3.5-MoE, which were optimized for multilingual capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2407_13833
institution arXiv
publishDate 2024
record_format arxiv
title Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2407.13833