
To run these comparisons, we leveraged JAX, a highly efficient library that allows AI models to be compiled with XLA, a compiler designed specifically for AI workloads. Using XLA, we can build a compiled representation of Conformer-2 that can be conveniently ported to different hardware, making it easy to run on various accelerated instances on Google Cloud for straightforward comparison.
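As a rough illustration of what this looks like in practice (a minimal sketch, not the actual Conformer-2 code), `jax.jit` traces a model's forward pass and hands the computation graph to XLA, which compiles it for whatever backend JAX is running on. The `conformer_forward` function, shapes, and parameters below are illustrative placeholders.

```python
import jax
import jax.numpy as jnp

def conformer_forward(params, audio_features):
    # Placeholder for the real encoder: a single dense projection stands in
    # for the attention/convolution blocks of the actual model.
    return jnp.tanh(audio_features @ params["w"] + params["b"])

params = {
    "w": jnp.ones((80, 512)),   # illustrative feature -> hidden projection
    "b": jnp.zeros((512,)),
}
batch = jnp.ones((16, 80))      # dummy batch of 16 frames of 80-dim features

# jit compiles the function with XLA; the same Python code then runs
# unchanged on CPU, GPU, or TPU backends.
compiled_forward = jax.jit(conformer_forward)
out = compiled_forward(params, batch)
print(out.shape, jax.devices())
```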
Experimental setup
The Conformer-2 model that we used for testing has 2 billion parameters, with over 1.5k hidden dimensions, 12 attention heads, and 24 encoder layers. We tested the model on three different accelerated instances on Google Cloud: Cloud TPU v5e, G2, and A2. Given the cloud’s pay-per-chip-hour pricing model, we maximized the batch size for each type of accelerator under the constraint of the chip’s memory. This allowed for an accurate measurement of cost per hour of audio transcribed for a production system.
To evaluate each chip, we passed identical audio data through the model on each type of hardware, measuring the inference speed on each. This approach allowed us to evaluate the cost per chip to run inference on 100k hours of audio data with no confounding factors.
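The sketch below shows the general shape of such a throughput measurement under stated assumptions; it is not our benchmark harness. A stand-in jitted function replaces the real model, the batch size and 30-second utterance length are assumed, and `block_until_ready` ensures JAX's asynchronous dispatch finishes before the timer stops.

```python
import time
import jax
import jax.numpy as jnp

# Stand-in for the compiled model: a jitted projection with fixed shapes.
@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)

BATCH = 16
SECONDS_OF_AUDIO_PER_BATCH = BATCH * 30.0   # assumed 30 s per utterance
NUM_BATCHES = 100

x = jnp.ones((BATCH, 80))
w = jnp.ones((80, 512))

# Warm-up call so XLA compilation time is excluded from the measurement.
forward(x, w).block_until_ready()

start = time.perf_counter()
for _ in range(NUM_BATCHES):
    # block_until_ready forces the async computation to finish before timing.
    forward(x, w).block_until_ready()
elapsed = time.perf_counter() - start

audio_hours = NUM_BATCHES * SECONDS_OF_AUDIO_PER_BATCH / 3600.0
chip_hours = elapsed / 3600.0
print(f"~{audio_hours / chip_hours:.0f} audio-hours processed per chip-hour")
```

Dividing the resulting audio-hours-per-chip-hour figure into the instance's hourly price gives the cost per hour of audio transcribed, which is the metric compared across chips.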
Results: Cloud TPU v5e leads in large-scale inference price-performance
Our experimental results show that Cloud TPU v5e is the most cost-efficient accelerator on which to run large-scale inference for our model. It delivers 2.7x better performance per dollar than G2 and 4.2x better performance per dollar than A2 instances.