From product recommendations, to fraud detection, to route optimization, low latency predictions are important for many machine learning tasks. That's why we're excited to announce a public preview of a new runtime that optimizes serving TensorFlow models on the Vertex AI Prediction service. This optimized TensorFlow runtime leverages technologies and model optimization techniques that are used internally at Google, and it can be incorporated into your serving workflows without any changes to your training or model saving code. The result is faster predictions at a lower cost compared to the open source based pre-built TensorFlow Serving containers.
This post is a high-level overview of the optimized TensorFlow runtime that reviews some of its features, shows how to use it, and then provides benchmark data that demonstrates how it performs. For detailed information about how to use the runtime, see Optimized TensorFlow runtime in the Vertex AI User Guide.
Optimized TensorFlow runtime overview
The optimized TensorFlow runtime uses model optimizations and proprietary Google technologies to serve trained models faster and at a lower cost than open source TensorFlow. This runtime leverages both the TensorFlow runtime (TFRT) and Google's internal stack.
To take advantage of this runtime, all you need to do is select a specific container when deploying the model and optionally set a few flags. After doing this, you get the following benefits:
- Improved tabular model performance on GPUs: This runtime can serve tabular models faster and at a lower cost by running the computationally expensive parts of the model on GPUs. The rest of the model runs on CPUs, and communication between the host and the accelerator is minimized. The runtime automatically determines which operations are best run on GPUs, and which ones are best run on CPUs. This optimization is available by default and doesn't require any changes to your model code or any flags to use it.
- Model precompilation: To eliminate the overhead caused by running all operations individually, this runtime can precompile some, or all, of the TensorFlow graph. Precompilation is optional and can be enabled during model deployment.
- Optimizations that may affect precision: This flag provides optional optimizations that can deliver significant improvements in latency and throughput, but may incur a small (usually a fraction of a percent) drop in model precision or accuracy. Because of this, it's recommended that you test the precision or accuracy of the optimized model on a holdout validation set before you deploy a model with these optimizations.
These are the optimizations available at the time of public preview. More optimizations and improvements will be added in upcoming releases.
To further lower your latency, you can use a private endpoint with the optimized TensorFlow runtime. For more information, see Use private endpoints for online prediction in the Vertex AI Predictions User Guide.
Note that the impact of the above optimizations depends on the operators used in the model and the model architecture. The latency and throughput (i.e. cost) improvements observed vary between models. The benchmarks in the later section give a rough estimate of what you can expect.
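As a minimal sketch (not from the original post) of how such a private endpoint could be created with the google-cloud-aiplatform SDK, assuming a VPC network that is already peered with Google services; the project, region, and network names are placeholders:

```python
# Hedged sketch: create a Vertex AI private endpoint for low-latency serving.
# Project, region, and VPC network names below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="<your PROJECT_ID>", location="<your REGION>")

# The VPC network must already be peered with Google services; see the
# Vertex AI private endpoints guide for the prerequisites.
private_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="tf-opt-private-endpoint",
    network="projects/<your PROJECT_NUMBER>/global/networks/<your VPC_NETWORK>",
)
```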
How to use the optimized TensorFlow runtime
You can use the optimized TensorFlow runtime almost exactly the way you use the open source based pre-built TensorFlow Serving containers. Instead of using a pre-built container based on an open source TensorFlow build, all you need to do is choose an optimized TensorFlow runtime container.
There are two types of containers available: nightly and stable. Nightly containers have the most current updates and optimizations and are best suited for experimentation. Stable containers are based on stable TensorFlow releases and are best suited for production deployments. To see a list of containers with the optimized TensorFlow runtime, see Available container images.
When you configure your deployment, you can enable the two optional optimizations mentioned earlier in this article, model precompilation and optimizations that affect precision. For details about how to enable these optimizations, see Model optimization flags in the Vertex AI Prediction User Guide; a short sketch after the code sample below shows where such flags could be passed.
The following code sample demonstrates how you can create a model with a pre-built optimized TensorFlow runtime container. The key difference is the use of the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest container in the container_spec. For more details, see Deploy a model using the optimized TensorFlow runtime in the Vertex AI Predictions User Guide.
```python
from google.cloud.aiplatform import gapic as aip

PROJECT_ID = "<your PROJECT_ID>"
REGION = "<your REGION>"
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

client_options = {"api_endpoint": API_ENDPOINT}
model_service_client = aip.ModelServiceClient(client_options=client_options)

# The display name and artifact URI below are placeholders; the container image
# is what selects the optimized TensorFlow runtime.
tf_opt_model_dict = {
    "display_name": "tf_opt_model",
    "artifact_uri": "<your SavedModel GCS URI>",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
    },
}
tf_opt_model = model_service_client.upload_model(
    parent=PARENT,
    model=tf_opt_model_dict,
).result(timeout=180).model
```
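To enable the optional optimizations discussed above, you pass flags to the serving container. The snippet below is a hedged sketch; the flag names follow the Model optimization flags page of the Vertex AI Prediction User Guide at the time of writing, so verify them against the current documentation before relying on them.

```python
# Hedged sketch: pass the optimization flags as container arguments. Flag names
# are taken from the Vertex AI "Model optimization flags" documentation and may
# change; confirm them before use.
tf_opt_model_dict["container_spec"]["args"] = [
    "--allow_precompilation=true",                     # model precompilation
    "--allow_precision_affecting_optimizations=true",  # optimizations that may affect precision
]
```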
Evaluating performance
To showcase the benefits of using Vertex AI Prediction's optimized TensorFlow runtime, we conducted a side-by-side performance comparison for the tabular Criteo and BERT base classification models deployed on Vertex AI Prediction. For the comparison, we used the stock TensorFlow 2.7 and the optimized TensorFlow runtime containers.
To assess the performance, we used MLPerf loadgen for Vertex AI in the "Server" scenario. MLPerf loadgen sends requests to Vertex Prediction endpoints using the same distribution as the official MLPerf Inference benchmark. We ran it at increasing queries per second (QPS) until the models were saturated, and the observed latency was recorded for each request.
Models and benchmark code are fully reproducible and available in the vertex-ai-samples GitHub repository. You can walk through and run our benchmark tests yourself using the following notebooks.
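The full harness lives in the notebooks linked below; the snippet that follows is only a rough, serial illustration of the measurement idea (MLPerf loadgen's Server scenario actually issues requests with a Poisson arrival process, independent of when earlier requests complete). The endpoint resource name and request payload are placeholders.

```python
# Rough sketch: send requests at approximately a target QPS and record latencies.
import time
import numpy as np
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(
    "projects/<PROJECT>/locations/<REGION>/endpoints/<ENDPOINT_ID>")

def measure(instances, target_qps, duration_s=60):
    """Send online prediction requests serially and return observed latencies (seconds)."""
    latencies = []
    period = 1.0 / target_qps
    stop = time.time() + duration_s
    while time.time() < stop:
        start = time.time()
        endpoint.predict(instances=instances)  # one online prediction call
        latencies.append(time.time() - start)
        time.sleep(max(0.0, period - (time.time() - start)))
    return latencies

# Placeholder payload; shape it to match your model's serving signature.
lat = measure(instances=[{"values": [0.0, 1.0, 2.0]}], target_qps=10)
print("p50:", np.percentile(lat, 50), "p99:", np.percentile(lat, 99))
```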
- Training a tabular Criteo model and deploying it to Vertex AI Predictions
- Fine-tuning a BERT base classification model and deploying it to Vertex AI Predictions
Criteo model performance testing
The tabular Criteo model was deployed on Vertex AI Prediction using n1-standard-16 with NVIDIA T4 GPU instances and the optimized TensorFlow runtime, TF2.7 CPU, and TF2.7 GPU containers. While using the optimized TensorFlow runtime with the gRPC protocol isn't officially supported, they work together. To compare the performance of the different Criteo model deployments, we ran MLPerf loadgen benchmarks over Vertex AI Prediction private endpoints using the gRPC protocol with requests of batch size 512.
The following graph shows the performance results. The "TF Opt GPU" bar shows the performance of the optimized TensorFlow runtime with precompilation enabled, and the "TF Opt GPU lossy" bar shows performance with both precompilation and precision affecting optimizations enabled.
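As a hedged sketch of the GPU configuration described above, the high-level SDK can deploy the uploaded model on an n1-standard-16 machine with one NVIDIA T4; this simplified version targets a standard endpoint, whereas the benchmarks themselves used private endpoints and gRPC.

```python
# Hedged sketch: deploy the model uploaded earlier on n1-standard-16 + 1x T4.
from google.cloud import aiplatform

aiplatform.init(project="<your PROJECT_ID>", location="<your REGION>")

model = aiplatform.Model(tf_opt_model)  # resource name returned by upload_model above
endpoint = model.deploy(
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
)
```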

The optimized TensorFlow runtime resulted in significantly lower latency and higher throughput. Because the optimized TensorFlow runtime moves most of the computation to the GPU, machines with less CPU power can be used. These benchmark tests demonstrate that enabling the optional precision affecting optimizations (the "TF Opt GPU lossy" bar) significantly improves the model's performance. We compared the results of predictions for 51,200 requests for a model running on the optimized TensorFlow runtime with lossy optimizations and a model running on TF2.7 containers. The average difference in precision in our results is less than 0.0016%, and in the worst case the precision difference is less than 0.05%.
Based on these results, for the tabular Criteo model the optimized TensorFlow runtime provides roughly 6.5 times better throughput and 2.1 times better latency compared to TensorFlow 2.7 CPU, and 8 times better throughput and 6.7 times better latency when precision affecting optimizations are enabled.
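As an illustration of the kind of comparison involved (the exact metric used in the benchmark notebooks may differ), the per-request difference between the two deployments' predicted scores can be summarized like this:

```python
# Hedged sketch: summarize the difference between two sets of predicted scores.
import numpy as np

def precision_diff(baseline_scores, optimized_scores):
    """Return (mean, max) absolute difference between two sets of predicted scores."""
    baseline = np.asarray(baseline_scores, dtype=np.float64)
    optimized = np.asarray(optimized_scores, dtype=np.float64)
    diff = np.abs(optimized - baseline)
    return diff.mean(), diff.max()

# Example with dummy scores; in practice these come from the two deployments.
mean_d, max_d = precision_diff([0.91, 0.12, 0.47], [0.9101, 0.1199, 0.4702])
print(f"average: {mean_d:.6f}, worst case: {max_d:.6f}")
```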
BERT base model performance testing
For benchmarking the BERT base model, we fine-tuned the bert_en_uncased_L-12_H-768_A-12 classification model from TensorFlow Hub for sentiment analysis using the IMDB dataset. The benchmark was run using MLPerf loadgen on the public endpoint with requests of batch size 32.
The following graph shows the performance results. The "TF Opt GPU" bar shows the performance of the optimized TensorFlow runtime with precompilation enabled, and the "TF Opt GPU lossy" bar shows performance with both precompilation and precision affecting optimizations enabled.
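The linked notebook has the full fine-tuning recipe; the following is a condensed sketch, with illustrative hyperparameters, of how the TF Hub encoder and its matching preprocessing model can be assembled into a Keras classifier for IMDB.

```python
# Condensed sketch (not the linked notebook): fine-tune BERT base for IMDB sentiment.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the BERT preprocessing model)
import tensorflow_datasets as tfds

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)

# Text in, single sentiment logit out.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
outputs = encoder(preprocessor(text_input))
logits = tf.keras.layers.Dense(1)(outputs["pooled_output"])
model = tf.keras.Model(text_input, logits)

model.compile(
    optimizer=tf.keras.optimizers.Adam(2e-5),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"])

train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True).batch(32)
model.fit(train_ds, epochs=1)
model.save("saved_model/bert_imdb")  # SavedModel to upload to Vertex AI
```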

To determine the impact of the precision affecting optimizations on model precision, we compared the results of predictions for 32,000 requests for a model running on the optimized TensorFlow runtime with lossy optimizations with a model running on TF2.7 containers. The average difference in precision in our results is less than 0.01%, and in the worst case the precision difference is less than 1%.
For the BERT base model, the optimized TensorFlow runtime provides roughly 1.45 times better throughput and 1.13 times better latency compared to TensorFlow 2.7, and 4.3 times better throughput and 1.64 times better latency when precision affecting optimizations are enabled.
What's next
In this article you learned about the new optimized TensorFlow runtime and how to use it. If you'd like to reproduce the benchmark results, be sure to check out the Criteo and BERT samples, or take a look at the list of available images to start running some low latency experiments of your own!
Acknowledgements
A huge thanks to Cezary Myczka, who made significant contributions in getting the benchmarking results for this project.