In this three-part series we explore the performance debugging ecosystem of PyTorch/XLA on Google Cloud TPU VM. TPU VM was released earlier this year (2021). The TPU VM architecture allows ML practitioners to work directly on the host where the TPU is attached. With the TPU profiler launched earlier this year, debugging your PyTorch training on TPU VM is simpler than ever before. While the process for analyzing performance has changed, the fundamentals of PyTorch/XLA that you acquired with the network-attached TPU architecture (aka TPU Node architecture) still apply.
In this (first) part we will briefly lay out the conceptual framework for PyTorch/XLA in the context of training performance. Please note that training performance in the current scope refers to training throughput, i.e. samples/sec, images/sec or equivalent. We use a case study to make sense of preliminary profiler logs and identify the corrective actions. The solution to the performance bottleneck is left as an exercise to the reader.
In part II of this series we will discuss the solution left as an exercise in part I and introduce further performance analysis to identify other opportunities for improvement.
Finally, in part III, we introduce user-defined code annotations. We will see how to visualize these annotations in the form of a trace and introduce some basic concepts needed to understand the trace.
By the end of this series, we aim to give you a better understanding of how to analyze the performance of your PyTorch code on Cloud TPUs and of the things to consider when working with Cloud TPUs.
An understanding of the inner workings of XLA Tensors can make the following content more accessible and useful. We encourage you to review this talk from PyTorch Developers Day 2020 and this talk from Google Cloud Next for a quick primer on XLA Tensors. You may also find this article helpful if you are new to PyTorch/XLA. This article also assumes that the reader is familiar with the Google Cloud Platform SDK and has access to a Google Cloud project with permissions to create resources such as virtual machines and Cloud TPU instances. Most of the profiler concepts will be explained here; however, an introductory reading of the TPU VM Profiler documentation is also recommended.
Client-Server Terminology for PyTorch/XLA
As in the TPU Node architecture (before TPU VM), PyTorch/XLA still uses the lazy tensor paradigm: when you are using XLA Tensors, any operations performed on a tensor are simply recorded in an intermediate representation (IR) graph. When a step is marked (an xm.mark_step() call), this graph is converted to XLA's HLO (High Level Operations) format and dispatched for execution to the TPU runtime (server).
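The record-then-dispatch behavior described above can be illustrated with a toy model. The sketch below is not the PyTorch/XLA implementation and does not use torch_xla; the names `Node`, `LazyTensor`, `ToyRuntime` and the stand-alone `mark_step` function are hypothetical and exist only for illustration. It shows arithmetic on the "client" merely recording graph nodes, with evaluation happening only when the graph is handed to the "server".

```python
# Toy model of the lazy tensor idea. This is NOT how PyTorch/XLA is implemented;
# all names here are hypothetical and chosen only for illustration.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """One recorded operation in the toy IR graph."""
    op: str                          # "const", "add" or "mul"
    inputs: Tuple["Node", ...] = ()  # upstream nodes this op depends on
    value: float = 0.0               # only meaningful for "const" nodes


class ToyRuntime:
    """Stands in for the server side: it executes graphs handed to it."""

    def __init__(self) -> None:
        self.executions = 0  # how many graphs have been dispatched

    def run(self, node: Node) -> float:
        self.executions += 1
        return self._eval(node)

    def _eval(self, node: Node) -> float:
        if node.op == "const":
            return node.value
        args = [self._eval(n) for n in node.inputs]
        if node.op == "add":
            return args[0] + args[1]
        if node.op == "mul":
            return args[0] * args[1]
        raise ValueError(f"unknown op: {node.op}")


class LazyTensor:
    """Stands in for the client side: operators only record nodes."""

    def __init__(self, node: Node) -> None:
        self.node = node

    @classmethod
    def const(cls, value: float) -> "LazyTensor":
        return cls(Node("const", value=float(value)))

    def __add__(self, other: "LazyTensor") -> "LazyTensor":
        return LazyTensor(Node("add", (self.node, other.node)))

    def __mul__(self, other: "LazyTensor") -> "LazyTensor":
        return LazyTensor(Node("mul", (self.node, other.node)))


def mark_step(runtime: ToyRuntime, tensor: LazyTensor) -> float:
    """Hypothetical analogue of a step marker: dispatch the recorded graph."""
    return runtime.run(tensor.node)


runtime = ToyRuntime()
a = LazyTensor.const(2)
b = LazyTensor.const(3)
c = (a + b) * b                        # only records nodes; nothing is computed yet
executions_before = runtime.executions
result = mark_step(runtime, c)         # the whole graph is evaluated here
print(executions_before, runtime.executions, result)
```

In this toy, the runtime's execution counter stays at zero until the step is marked, which mirrors why client-side graph recording and server-side execution are worth reasoning about separately when reading profiler output.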
Note that the TPU runtime is part of the TPU server-side functionality, and all the work done up to the generation of the HLO graph is part of (and henceforth referred to as) the client-side functionality. Unlike the previous generation, where the TPU runtime (server) was automatically started when you created a TPU instance, in the case of TPU VM the PyTorch/XLA library takes care of starting the server when you submit a training job. You can also start the XRT (XLA Runtime) server manually on the desired port. The XRT_TPU_CONFIG set in the code snippets later in the post refers to the default port where PyTorch/XLA starts the XRT server. Also unlike the previous generation, the client and server run on the same host; however, the abstractions still hold and are helpful for understanding performance (more details here).
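As a minimal sketch of what that setting looks like from Python, the snippet below sets XRT_TPU_CONFIG before any training code runs. The value `localservice;0;localhost:51011` is an assumption based on the commonly documented single-host TPU VM default; verify it against your PyTorch/XLA version. The field names used when splitting the string are illustrative labels, not official terminology.

```python
import os

# Assumed default for a single TPU VM host; check your PyTorch/XLA release notes.
ASSUMED_XRT_TPU_CONFIG = "localservice;0;localhost:51011"

# setdefault keeps any value already exported in the shell.
os.environ.setdefault("XRT_TPU_CONFIG", ASSUMED_XRT_TPU_CONFIG)

# Split the setting into its parts (labels here are illustrative only).
worker_name, ordinal, address = os.environ["XRT_TPU_CONFIG"].split(";")
host, port = address.rsplit(":", 1)
print(f"XRT server expected at {host}:{port} (worker={worker_name}, ordinal={ordinal})")
```

Using `setdefault` rather than plain assignment is a design choice so that a manually started XRT server on a different port, exported in the environment, is not silently overridden.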
We will examine UniT (Unified Transformer) training on the GLUE/QNLI task using the MMF framework for multi-modal learning from Facebook Research. We will uncover an interesting aspect of the Multihead Attention implementation (observed in PyTorch 1.8) that incidentally results in sub-optimal training performance with PyTorch/XLA, and discuss a potential corrective action.
The case study uses TPU VM. In the following steps we create a TPU VM. The following commands can be run from Google Cloud Shell or from any machine with the Google Cloud SDK installed and the correct credentials provisioned. (For more details please refer to the TPU VM user guide.)