| Main Authors: | Gershon, Talia; Seelam, Seetharami; Belgodere, Brian; et al. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Distributed, Parallel, and Cluster Computing; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2407.05467 |
| _version_ | 1866916565393342464 |
|---|---|
| author | Gershon, Talia Seelam, Seetharami Belgodere, Brian Bonilla, Milton Hoang, Lan Barnett, Danny Chung, I-Hsin Mohan, Apoorve Chen, Ming-Hung Luo, Lixiang Walkup, Robert Evangelinos, Constantinos Salaria, Shweta Dombrowa, Marc Park, Yoonho Kayi, Apo Schour, Liran Alim, Alim Sydney, Ali Maniotis, Pavlos Schares, Laurent Metzler, Bernard Karacali-Akyamac, Bengi Wen, Sophia Chiba, Tatsuhiro Choochotkaew, Sunyanan Yoshimura, Takeshi Misale, Claudia Elengikal, Tonia Connor, Kevin O Liu, Zhuoran Molina, Richard Schneidenbach, Lars Caden, James Laibinis, Christopher Fonseca, Carlos Tarasov, Vasily Sundararaman, Swaminathan Schmuck, Frank Guthridge, Scott Cohn, Jeremy Eshel, Marc Muench, Paul Liu, Runyu Pointer, William Wyskida, Drew Krull, Bob Rose, Ray Wolfe, Brent Cornejo, William Walter, John Malone, Colm Perucci, Clifford Franco, Frank Hinds, Nigel Calio, Bob Druyan, Pavel Kilduff, Robert Kienle, John McStay, Connor Figueroa, Andrew Connolly, Matthew Fost, Edie Roma, Gina Fonseca, Jake Levy, Ido Payne, Michele Schenkel, Ryan Malki, Amir Schneider, Lion Narkhede, Aniruddha Moshref, Shekeba Kisin, Alexandra Dodin, Olga Rippon, Bill Wrieth, Henry Ganci, John Colino, Johnny Habeger-Rose, Donna Pandey, Rakesh Gidh, Aditya Gaur, Aditya Patterson, Dennis Salmani, Samsuddin Varma, Rambilas Rumana, Rumana Sharma, Shubham Gaur, Aditya Mishra, Mayank Panda, Rameswar Prasad, Aditya Stallone, Matt Zhang, Gaoyuan Shen, Yikang Cox, David Puri, Ruchir Agrawal, Dakshi Thorstensen, Drew Belog, Joel Tang, Brent Gupta, Saurabh Kumar Biswas, Amitabha Maheshwari, Anup Gampel, Eran Van Patten, Jason Runion, Matthew Kaki, Sai Bogin, Yigal Reitz, Brian Pritko, Steve Najam, Shahan Nambala, Surya Chirra, Radhika Welp, Rick DiMitri, Frank Telles, Felipe Arvelo, Amilcar Chu, King Seminaro, Ed Schram, Andrew Eickhoff, Felix Hanson, William Mckeever, Eric Light, Michael Joseph, Dinakaran Chaudhary, Piyush Shivam, Piyush Chaudhary, Puneet Jones, Wesley Guthrie, Robert Bostic, Chris Islam, Rezaul Duersch, Steve Sawdon, Wayne Lewars, John Klos, Matthew Spriggs, Michael McMillan, Bill Gao, George Kamra, Ashish Singh, Gaurav Curry, Marc Katarki, Tushar Talerico, Joe Shi, Zenghui Malleni, Sai Sindhur Gallen, Erwan |
| contents | AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2407_05467 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | The infrastructure powering IBM's Gen AI model development |
| topic | Distributed, Parallel, and Cluster Computing; Artificial Intelligence |
| url | https://arxiv.org/abs/2407.05467 |