Don’t get overwhelmed by all the information on this page! There are three key numbers that can tell you a lot: Device Compute Time, TF Op Placement, and Device Compute Precision.
The device compute time tells you how much of the step time comes from actual device execution. In other words, how much time did your device(s) spend computing the forward and backward passes, versus sitting idle waiting for batches of data to be prepared? In an ideal world, most of the step time should be spent executing the training computation instead of waiting around.
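As a rough illustration of what this number means, the sketch below splits a step into device compute and everything else. The function name and the timings are hypothetical, chosen only for the arithmetic; they are not values or APIs from the Profiler.

```python
def step_time_breakdown(step_ms, device_compute_ms):
    """Return the fraction of a step spent on device compute vs. everything else.

    Both arguments are hypothetical measurements in milliseconds.
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    if not 0 <= device_compute_ms <= step_ms:
        raise ValueError("device_compute_ms must lie between 0 and step_ms")
    compute_fraction = device_compute_ms / step_ms
    return {
        "device_compute": compute_fraction,
        # Everything that is not device compute: waiting for input,
        # host-side work, and other overhead.
        "not_device_compute": 1.0 - compute_fraction,
    }


# Assumed example: a 100 ms step of which only 20 ms is device compute.
breakdown = step_time_breakdown(step_ms=100.0, device_compute_ms=20.0)
print(breakdown)
```

A low device compute fraction like this one suggests looking at what the device is waiting on before tuning the model itself.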
The TF op placement tells you the percentage of ops placed on the device (e.g., GPU) versus the host (CPU). In general, you want more ops on the device because they will typically run faster there.
Finally, the device compute precision shows you the percentage of computations that were 16-bit versus 32-bit. Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, which take 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes. If reduced accuracy is acceptable for your use case, you can consider using mixed precision by replacing more of the 32-bit ops with 16-bit ops to speed up training.
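To make the size and accuracy trade-off concrete, here is a minimal sketch using only Python’s standard `struct` module, which supports IEEE half precision (`e`) and single precision (`f`). It does not cover bfloat16, which `struct` has no format code for, and the helper name `roundtrip` is an assumption made for this example.

```python
import struct


def roundtrip(value, fmt):
    """Pack `value` with the given struct format and unpack it again.

    Returns (size_in_bytes, value_after_roundtrip).
    """
    packed = struct.pack(fmt, value)
    return len(packed), struct.unpack(fmt, packed)[0]


# "<e" is IEEE 754 half precision (float16), "<f" is single precision (float32).
size16, as_float16 = roundtrip(0.1, "<e")
size32, as_float32 = roundtrip(0.1, "<f")

print(f"float16: {size16} bytes, error {abs(as_float16 - 0.1):.2e}")
print(f"float32: {size32} bytes, error {abs(as_float32 - 0.1):.2e}")
```

The 16-bit value uses half the memory but carries a larger rounding error, which is the trade-off to weigh before moving ops to lower precision. In TensorFlow itself, mixed precision is typically enabled through the Keras mixed precision API, for example `tf.keras.mixed_precision.set_global_policy("mixed_float16")`.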
You’ll notice that the summary section also provides some recommendations for next steps. In the following sections, we’ll take a look at some more specialized profiler features that can help you debug.
Deep dive into the performance of your input pipeline
After taking a look at the overview page, a great next step is to evaluate the performance of your input pipeline, which generally includes reading the data, preprocessing the data, and then transferring data from the host (CPU) to the device (GPU/TPU).
GPUs and TPUs can reduce the time required to execute a single training step. But achieving high accelerator utilization depends on an efficient input pipeline that delivers data for the next step before the current step has finished. You don’t want your accelerators sitting idle as the host prepares batches of data!
The TensorFlow Profiler provides an Input-pipeline analyzer that can help you determine if your program is input bound. For example, the profile shown here indicates that the training job is highly input bound. Over 80% of the step time is spent waiting for training data. By preparing the batches of data before the next step is finished, you can reduce the amount of time each step takes, thus reducing total training time overall.
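The idea of preparing the next batch while the current step is still running can be sketched in plain Python with a background thread and a bounded queue. This is an illustration of the overlap only, not how TensorFlow implements it; `prefetch`, `make_batches`, and the toy "training step" are hypothetical names for this example. In TensorFlow, the same effect is usually obtained with `tf.data.Dataset.prefetch`.

```python
import queue
import threading


def prefetch(batches, buffer_size=1):
    """Yield items from `batches`, preparing up to `buffer_size` ahead.

    A background thread fills a bounded queue, so the consumer can work on
    one batch while the next one is being prepared.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(done)  # always signal the end, even on error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def make_batches(n):
    """Stand-in for reading and preprocessing data on the host."""
    for i in range(n):
        yield [i, i + 1]


# Stand-in for a training step: consume each batch as it becomes available.
results = [sum(batch) for batch in prefetch(make_batches(4), buffer_size=2)]
print(results)
```

The bounded queue keeps memory use predictable: the producer stays at most `buffer_size` batches ahead of the consumer instead of preparing the whole dataset up front.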