July 27, 2024

Google Cloud TPU v5e is a purpose-built AI accelerator that brings the cost-efficiency and performance required for large-scale model training and inference. With Cloud TPUs on Google Kubernetes Engine (GKE), the leading Kubernetes service in the industry, customers can orchestrate AI workloads efficiently and cost-effectively with best-in-class training and inference capabilities. GKE has long been a leader in supporting GPUs for AI workloads, and we’re excited to extend our support to include TPU v5e for large-scale inference.

MLPerf™ 3.1 results

As we announced in September, Google Cloud submitted results for the MLPerf™ Inference 3.1 benchmark, with Cloud TPU v5e achieving 2.7x higher performance per dollar compared to TPU v4.

Our MLPerf™ Inference 3.1 submission demonstrated running the 6-billion-parameter GPT-J LLM benchmark using Saxml, a high-performance inference system, and XLA, Google’s AI compiler. Some of the key optimizations used include the following (two of them are sketched after this list):

  • XLA optimizations and fusions of Transformer operators
  • Post-training weight quantization with INT8 precision
  • High-performance sharding across the 2×2 TPU node pool topology using GSPMD
  • Bucketized execution of batches of prefix computation and decoding in Saxml
  • Dynamic batching in Saxml
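
To make the quantization and sharding bullets concrete, here is a minimal JAX sketch of post-training INT8 weight quantization combined with GSPMD-style sharding over a 2×2 device mesh. The matrix sizes, axis names, and the `matmul` helper are illustrative assumptions, not the actual Saxml implementation, and the sketch assumes a 4-chip TPU v5e slice:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative weight matrix standing in for one Transformer projection.
w = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096), dtype=jnp.float32)

# Post-training INT8 weight quantization: store INT8 values plus a
# per-output-channel float scale, and dequantize at matmul time.
scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0
w_int8 = jnp.round(w / scale).astype(jnp.int8)

# GSPMD-style sharding: arrange the 4 chips of a 2x2 TPU v5e slice as a
# 2x2 mesh and shard the weight's output dimension across the "model" axis.
mesh = Mesh(np.array(jax.devices()).reshape(2, 2), ("data", "model"))
w_int8 = jax.device_put(w_int8, NamedSharding(mesh, P(None, "model")))

@jax.jit
def matmul(x, w_q, s):
    # Dequantization happens inside the jitted function, so XLA can fuse
    # it with the matmul instead of materializing a full-precision copy.
    return x @ (w_q.astype(jnp.bfloat16) * s.astype(jnp.bfloat16))

x = jnp.ones((8, 4096), dtype=jnp.bfloat16)
y = matmul(x, w_int8, scale)
```

With the computation expressed this way, XLA propagates the sharding through the jitted function under GSPMD, so the matmul runs as a partitioned operation across the slice without per-device code.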

We achieved the same performance when running Cloud TPU v5e on GKE clusters, demonstrating that Cloud TPUs on GKE give you the scalability, orchestration, and operational advantages of GKE while maintaining the price-performance of TPUs.

Maximizing cost-efficiency with GKE and TPUs

When building a production-ready, highly scalable, and fault-tolerant managed application, GKE brings additional value by reducing your total cost of ownership (TCO) for inference on TPUs:

  • Manage and deploy your AI workloads on a standard Kubernetes platform (a Pod-spec sketch follows this list).
  • Lower costs with autoscaling, which ensures resources automatically adjust to workload needs. GKE can automatically scale TPU node pools up and down based on traffic, increasing cost efficiency and automation for inference.
  • Provision the compute resources your workloads need: TPU node pools can be automatically provisioned based on TPU workload requirements with GKE’s node auto-provisioning capabilities.
  • Ensure high availability of your applications with built-in health monitoring for TPU VM node pools on GKE. If TPU nodes become unavailable, GKE performs node auto-repair to avoid disruptions.
  • Minimize disruption from updates and failures with GKE’s proactive handling of maintenance events and graceful termination of workloads.
  • Gain full visibility into your TPU applications with GKE’s mature and reliable metrics and logging capabilities.
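
As a concrete example of deploying a TPU workload on GKE, here is a minimal sketch using the official Kubernetes Python client to request TPU v5e chips for a model-server Pod. The node-selector values, chip count, and the container image path are illustrative assumptions for a 2×2 v5e node pool, not a definitive configuration:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="saxml-model-server"),
    spec=client.V1PodSpec(
        # Steer the Pod onto a TPU v5e node pool; these GKE node labels
        # select the accelerator family and the slice topology (assumed
        # values for a 2x2 v5e slice).
        node_selector={
            "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
            "cloud.google.com/gke-tpu-topology": "2x2",
        },
        containers=[
            client.V1Container(
                name="saxml",
                # Hypothetical image path for a Saxml model server.
                image="us-docker.pkg.dev/example/sax/model-server:latest",
                resources=client.V1ResourceRequirements(
                    # google.com/tpu is the extended resource GKE exposes
                    # for TPU chips attached to the node.
                    requests={"google.com/tpu": "4"},
                    limits={"google.com/tpu": "4"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because the TPU request is an ordinary Kubernetes resource request, the autoscaling and node auto-provisioning behaviors described above apply to it with no extra plumbing.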

GKE TPU inference reference architecture

To take advantage of all the benefits above, we created a proof of concept that demonstrates TPU inference using the GPT-J 6B LLM model with a single-host Saxml model server; a sketch of querying such a server follows.
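
For illustration, here is a minimal sketch of querying a single-host Saxml server from Python, assuming the Saxml client library is installed and that a GPT-J 6B model has been published under the (illustrative) path /sax/test/gptj6b:

```python
import sax

# Attach to the published model by its Sax cell path (assumed name).
model = sax.Model("/sax/test/gptj6b")
lm = model.LM()

# Generate returns candidate completions paired with their scores.
completions = lm.Generate("Q: What is Cloud TPU v5e? A: ")
for text, score in completions:
    print(score, text)
```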
