May 25, 2024


When developers are innovating quickly, security can be an afterthought. That's true even for AI/ML workloads, where the stakes are high for organizations trying to protect valuable models and data.

When you deploy an AI workload on Google Kubernetes Engine (GKE), you can benefit from the many security tools available in Google Cloud infrastructure. In this blog, we share security insights and hardening techniques for training AI/ML workloads on one framework in particular: Ray.

Ray needs security hardening

As a distributed compute framework for AI applications, Ray has grown in popularity in recent years, and deploying it on GKE is a popular choice that provides flexibility and configurable orchestration. You can read more on why we recommend GKE for Ray.

However, Ray lacks built-in authentication and authorization, which means that if you can successfully send a request to the Ray cluster head, it will execute arbitrary code on your behalf.
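To make the point concrete, here is a minimal sketch of what a job submission to a Ray head node looks like on the wire. It assumes the Ray Jobs REST API is served by the dashboard on its default port 8265 and accepts a JSON body with an `entrypoint` field; the host name and the `build_job_request` helper are hypothetical. The request is only built, never sent.

```python
import json
import urllib.request

# Assumption: the Ray dashboard, which serves the Jobs REST API, listens on
# port 8265 (its documented default). The host name is hypothetical.
HEAD_URL = "http://ray-head.example.internal:8265"


def build_job_request(head_url: str, entrypoint: str) -> urllib.request.Request:
    """Build (but do not send) a job submission request for a Ray head node."""
    body = json.dumps({"entrypoint": entrypoint}).encode("utf-8")
    return urllib.request.Request(
        url=f"{head_url}/api/jobs/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request(HEAD_URL, "python -c \"print('hello')\"")

# Nothing in the request identifies or authenticates the caller: there is no
# Authorization header, token, or client certificate.
print(request.get_method(), request.full_url)
print("Has Authorization header:", request.has_header("Authorization"))
```

Because the request carries no credentials, network reachability to the head node is effectively the only access control, which is why the controls have to come from outside the Ray cluster.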

So how do you secure Ray? The authors state that security should be enforced outside of the Ray cluster, but how do you actually harden it? Running Ray on GKE can help you achieve a more secure, scalable, and reliable Ray deployment by taking advantage of existing global Google infrastructure components, including Identity-Aware Proxy (IAP).

We're also making strides in the Ray community to make safer defaults for running Ray with Kubernetes using KubeRay. One focus area has been improving Ray component compliance with the restricted Pod Security Standards profile and adding security best practices, such as running the operator as non-root to help prevent privilege escalation.
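As an illustration of what the restricted profile asks of a workload, the sketch below builds a container `securityContext` with the main container-level fields that profile requires, plus a namespace labeled to enforce it. This is not KubeRay's actual configuration: the helper names, the namespace name, and the UID are hypothetical, and the manifests are plain Python dictionaries printed as JSON.

```python
import json


def restricted_security_context(uid: int = 1000) -> dict:
    """Container securityContext fields expected by the restricted profile.

    The UID is an arbitrary non-zero value chosen for illustration.
    """
    return {
        "runAsNonRoot": True,               # refuse to start as UID 0
        "runAsUser": uid,
        "allowPrivilegeEscalation": False,  # block setuid-style escalation
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    }


def namespace_enforcing_restricted(name: str) -> dict:
    """Namespace whose Pod Security Admission label enforces the profile."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


# "team-a-ray" is a hypothetical namespace name.
print(json.dumps(namespace_enforcing_restricted("team-a-ray"), indent=2))
print(json.dumps(restricted_security_context(), indent=2))
```

With the label in place, pods in that namespace that omit these fields are rejected at admission, so a non-compliant Ray component fails fast instead of running with extra privileges.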

Security separation supports multi-cluster operation

One key advantage of running Ray inside Kubernetes is the ability to run multiple Ray clusters, with diverse workloads, managed by multiple teams, inside a single Kubernetes cluster. This gives you better resource sharing and utilization because nodes with accelerators can be used by several teams, and spinning up Ray on an existing GKE cluster avoids waiting on VM provisioning before workloads can begin execution.

Security plays a supporting role in landing these multi-cluster advantages by using Kubernetes security features to help keep Ray clusters separate. The goal is to avoid accidental denial of service or accidental cross-tenant access. Note that the security separation here is not "hard" multitenancy: it is only sufficient for clusters running trusted code and teams that trust each other with their data. If further isolation is required, consider using separate GKE clusters.
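One way to picture this separation is a per-team namespace carrying a ResourceQuota (against accidental denial of service) and a NetworkPolicy that only admits traffic from pods in the same namespace (against accidental cross-tenant access). The sketch below is an assumption about how such objects could look, not the configuration used in this post: the helper names, object names, and quota values are hypothetical.

```python
import json


def resource_quota(namespace: str, cpu: str, memory: str, gpus: str) -> dict:
    """Cap what one team's Ray clusters can request in their namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "ray-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "requests.nvidia.com/gpu": gpus,
            }
        },
    }


def same_namespace_only_policy(namespace: str) -> dict:
    """Allow ingress only from pods that live in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},           # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # An empty podSelector in "from" matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


# Hypothetical team namespace and limits.
manifests = [
    resource_quota("team-a-ray", cpu="64", memory="256Gi", gpus="8"),
    same_namespace_only_policy("team-a-ray"),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

A real deployment would also need an ingress rule for whatever fronts the Ray head (for example, a load balancer behind IAP), since this policy on its own blocks all traffic that originates outside the namespace.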

The architecture is shown in the following diagram. Different Ray clusters are separated by namespaces within the GKE cluster, allowing authorized users to make calls to their assigned Ray cluster without accessing others.

