This weblog was written in collaboration with the DeepSpeed workforce, the Azure ML workforce, and the Azure HPC workforce at Microsoft.
Giant-scale transformer-based deep studying fashions educated on giant quantities of knowledge have proven nice outcomes in recent times in a number of cognitive duties and are behind new merchandise and options that increase human capabilities. These fashions have grown a number of orders of magnitude in measurement over the past 5 years. Ranging from a number of million parameters of the unique transformer mannequin all the way in which to the most recent 530 billion-parameter Megatron-Turing (MT-NLG 530B) mannequin as proven in Determine 1. There’s a rising want for purchasers to coach and fine-tune giant fashions at an unprecedented scale.
Determine 1: Panorama of huge fashions and hardware capabilities.
Azure Machine Studying (AzureML) brings giant fleets of the most recent GPUs powered by the InfiniBand interconnect to sort out large-scale AI coaching. We already practice a few of the largest fashions together with Megatron/Turing and GPT-Three on Azure. Beforehand, to coach these fashions, customers wanted to arrange and keep a posh distributed coaching infrastructure that normally required a number of guide and error-prone steps. This led to a subpar expertise each when it comes to usability and efficiency.
At the moment, we’re proud to announce a breakthrough in our software program stack, utilizing DeepSpeed and 1024 A100s to scale the coaching of a 2T parameter mannequin with a streamlined person expertise at 1K+ GPU scale. We’re bringing these software program improvements to you thru AzureML (together with a completely optimized PyTorch surroundings) that provides nice efficiency and an easy-to-use interface for large-scale coaching.
Prospects can now use DeepSpeed on Azure with simple-to-use coaching pipelines that make the most of both the really helpful AzureML recipes or by way of bash scripts for VMSS-based environments. As proven in Determine 2, Microsoft is taking a full stack optimization method the place all the mandatory items together with the hardware, the OS, the VM picture, the Docker picture (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and different Python packages), and the user-facing Azure ML APIs have been optimized, built-in, and well-tested for glorious efficiency and scalability with out pointless complexity.
Determine 2: Microsoft full-stack optimizations for scalable distributed coaching on Azure.
This optimized stack enabled us to effectively scale coaching of huge fashions utilizing DeepSpeed on Azure. We’re completely satisfied to share our efficiency outcomes supporting 2x bigger mannequin sizes (2 trillion vs. 1 trillion parameters), scaling to 2x extra GPUs (1024 vs. 512), and as much as 1.8x larger compute throughput/GPU (150 TFLOPs vs. 81 TFLOPs) in comparison with these revealed on different cloud suppliers.
We provide near-linear scalability each when it comes to an improve in mannequin measurement in addition to improve in variety of GPUs. As proven in Determine 3a, along with the DeepSpeed ZeRO-Three, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we had been capable of keep an environment friendly throughput/GPU (>157 TFLOPs) in a near-linear style because the mannequin measurement elevated from 175 billion parameters to 2 trillion parameters. However, for a given mannequin measurement, for instance, 175B, we obtain near-linear scaling as we improve the variety of GPUs from 128 all the way in which to 1024 as proven in Determine 3b. The important thing takeaway from the outcomes offered on this weblog is that Azure and DeepSpeed collectively are breaking the GPU reminiscence wall and enabling our clients to simply and effectively practice trillion-parameter fashions at scale.
Determine Three: (a) Close to-perfect throughput/GPU as we improve the mannequin measurement from 175 billion to 2 trillion parameters (BS/GPU=eight), (b) Close to-perfect efficiency scaling with the rise in variety of GPU gadgets for the 175B mannequin (BS/GPU=16). The sequence size is 1024 for each instances.
To be taught extra in regards to the optimizations, applied sciences, and detailed efficiency traits offered above, please seek advice from our prolonged technical weblog.
- Study extra about DeepSpeed, which is a part of Microsoft’s AI at Scale initiative.
- Study extra about Azure HPC + AI.
- To get began with DeepSpeed on Azure, please comply with our getting began tutorial.
- The outcomes offered on this weblog had been produced on Azure by following the recipes and scripts revealed as a part of the Megatron-DeepSpeed repository. The really helpful and most easy-to-use technique to run the coaching experiments is to make the most of the AzureML recipe.
- In case you are working experiments on a customized surroundings constructed utilizing Azure VMs or VMSS, please seek advice from the bash scripts we offer in Megatron-DeepSpeed.