A collaboration between Google Cloud and NVIDIA has enabled Apache Beam customers to maximize the performance of ML models within their data processing pipelines, using NVIDIA TensorRT and NVIDIA GPUs alongside the new Apache Beam TensorRTEngineHandler.
The NVIDIA TensorRT SDK provides high-performance neural network inference that lets developers optimize and deploy trained ML models on NVIDIA GPUs with the highest throughput and lowest latency, while preserving model prediction accuracy. TensorRT was specifically designed to support multiple classes of deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformer-based models.
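TensorRT consumes a trained model (commonly exported to ONNX) and compiles it into an optimized, serialized inference engine. As an illustrative sketch (not code from the original post), the following builds such an engine with the TensorRT 8.x Python API; the file paths are placeholders:

```python
import tensorrt as trt

# Build a serialized TensorRT engine from an ONNX export of a trained model.
# Paths are placeholders; assumes the TensorRT 8.x Python bindings.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed: %s" % parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # optional: enable reduced precision

engine_bytes = builder.build_serialized_network(network, config)
with open("model.trt", "wb") as f:
    f.write(engine_bytes)
```

The resulting `model.trt` file is what a Beam pipeline later points at when configuring a TensorRT engine handler.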
Deploying and managing end-to-end ML inference pipelines while maximizing infrastructure utilization and minimizing total cost is a hard problem. Integrating ML models in a production data processing pipeline to extract insights requires addressing challenges associated with the three main workflow segments:
- Preprocess large volumes of raw data from multiple data sources to use as inputs to train ML models to "infer / predict" results, and then leverage the ML model outputs downstream for incorporation into business processes.
- Call ML models within data processing pipelines while supporting different inference use cases: batch, streaming, ensemble models, remote inference, or local inference. Pipelines are not limited to a single model and often require an ensemble of models to produce the desired business outcomes (a schematic sketch of such chaining follows this list).
- Optimize the performance of the ML models to deliver results within the application's accuracy, throughput, and latency constraints. For pipelines that use complex, compute-intensive models for use cases like NLP, or that require multiple ML models together, the response time of these models often becomes a performance bottleneck. This can cause poor hardware utilization and require more compute resources to deploy pipelines in production, potentially leading to higher costs of operations.
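To make the ensemble point concrete, here is a schematic Beam pipeline (an illustration under assumptions, not code from the original post) that chains two inference stages; the stand-in Python callables mark where RunInference transforms with real model handlers would sit:

```python
import apache_beam as beam

# Stand-ins for two models in an ensemble; in a real pipeline each stage
# would be a RunInference transform configured with its own model handler.
def embed(text):          # stage 1: e.g. an embedding model
    return (text, float(len(text)))

def classify(pair):       # stage 2: e.g. a classifier over stage 1's output
    text, score = pair
    return (text, score > 4.0)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["spam", "not spam at all"])
        | "Stage1" >> beam.Map(embed)
        | "Stage2" >> beam.Map(classify)
        | beam.Map(print))
```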
Google Cloud Dataflow is a fully managed runner for stream or batch processing pipelines written with Apache Beam. To enable developers to easily incorporate ML models in data processing pipelines, Dataflow recently announced support for Apache Beam's generic machine learning prediction and inference transform, RunInference. The RunInference transform simplifies the ML pipeline creation process by letting developers use models in production pipelines without writing lots of boilerplate code.
You can see an example of its usage with Apache Beam in the following code sample. Note that the engine_handler is passed as a configuration to the RunInference transform, which abstracts the user from the implementation details of running the model.
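A minimal sketch of that pattern is shown below, using the TensorRTEngineHandlerNumPy handler from Apache Beam's tensorrt_inference module; the engine path, batch sizes, and input data are placeholder values:

```python
import numpy as np
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

# Placeholder values: the engine path and batch sizes depend on your model.
engine_handler = TensorRTEngineHandlerNumPy(
    min_batch_size=4,
    max_batch_size=4,
    engine_path="gs://your_bucket/path/to/model.trt")

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        # Placeholder inputs: four single-feature NumPy examples.
        | beam.Create([np.array([i], dtype=np.float32) for i in range(4)])
        | RunInference(engine_handler)
        | beam.Map(print))
```

Because the handler owns engine loading and batching, swapping in a different model is a configuration change rather than a pipeline rewrite.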