July 27, 2024

Organizations using AI and ML for purposes such as product recommendations, scientific computing, and gaming often turn to NVIDIA GPUs on Google Cloud for the required compute performance. To understand their workload’s behavior and optimize the ML development process, they need to monitor GPU performance metrics. To help, we’re excited to announce that Ops Agent now collects metrics from NVIDIA GPUs on Compute Engine VMs.

Cloud Ops Agent is the Google-recommended telemetry solution for Compute Engine that offers a curated experience for monitoring VM instances. With essential metrics from the NVIDIA Management Library (NVML) and advanced profiling metrics from the NVIDIA Data Center GPU Manager (DCGM), you can now get improved visibility into your NVIDIA GPUs and accelerated workloads.

With Ops Agent, you can:

  • Visualize the health of your GPU fleet with GPU metrics and out-of-the-box dashboards

  • Optimize costs by identifying underutilized GPUs and consolidating workloads

  • Plan scaling by examining trends to decide when to expand GPU capacity or upgrade existing GPUs

  • Identify which GPU processes (the ML models) are consuming utilization and memory

  • Use DCGM profiling metrics to identify bottlenecks and performance issues within the GPU

  • Alert on metrics from your GPUs

Get essential GPU metrics right out of the box

If you use NVIDIA GPUs, you’re probably familiar with the nvidia-smi command, which provides an overview of all GPU devices and the processes running on them. Leveraging the same underlying API in NVML, Ops Agent can collect these essential metrics without additional configuration. This includes metrics for:

The process metrics track which workloads are running on the GPUs.
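To get a feel for the kind of per-device data that nvidia-smi reports, here is a minimal Python sketch that reads its CSV query output. It is not how Ops Agent collects metrics (the agent uses NVML directly); the function names, the chosen fields, and the sample text are assumptions made for illustration, and the sample stands in for real output so the parsing can be tried on a machine without a GPU.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; this selection is an illustrative assumption.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_float(value):
    """Convert a numeric field, returning None for values such as "[N/A]"."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse text produced by nvidia-smi with --format=csv,noheader,nounits."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = dict(zip(QUERY_FIELDS, (v.strip() for v in record)))
        gpus.append({
            "index": int(row["index"]),
            "name": row["name"],
            "utilization_percent": _to_float(row["utilization.gpu"]),
            "memory_used_mib": _to_float(row["memory.used"]),
            "memory_total_mib": _to_float(row["memory.total"]),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi on the local machine; requires the NVIDIA driver."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Made-up sample output, used so the sketch runs without a GPU.
SAMPLE = "0, Tesla T4, 37, 1024, 15360\n1, Tesla T4, 0, 3, 15360\n"

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(gpu)
```

On a VM with the NVIDIA driver installed, calling `query_gpus()` instead of parsing `SAMPLE` would return the live values for each device.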
