Cloud Management Insider

Google’s ‘World’s Fastest’ Supercomputer Breaks AI Performance Records in MLPerf

Fast training of machine learning (ML) models is essential for research and engineering teams working on new products, services, and novel advanced research. The latest results of the industry-standard MLPerf benchmark competition show that Google has built the world’s fastest ML training supercomputer.
Google’s latest ML-powered improvements include more helpful search results and a single ML model that can translate 100 different languages. Using this supercomputer and its latest Tensor Processing Unit (TPU) chip, Google set performance records in six of the eight MLPerf benchmarks.

source – https://cloud.google.com/

Speedup of Google’s best MLPerf Training v0.7 submission over the fastest non-Google submission in any availability category. Comparisons are normalized by overall training time regardless of system size, which ranges from 8 to 4096 chips. Taller bars are better.

Google achieved these results with ML model implementations in TensorFlow, JAX, and Lingvo. Four of the eight models were trained from scratch in under 30 seconds. To put this into context, note that in 2015 it took more than three weeks to train one of these models on the most advanced hardware accelerator available. Only five years later, Google’s supercomputer can train a model five times larger even more quickly.

MLPerf Models At a Glance

MLPerf models are selected to represent sophisticated machine learning workloads that are common throughout industry and academia.

In addition to the industry-leading results at the highest level, Google also used TensorFlow to provide MLPerf submissions on the Google Cloud platform, which are ready for use by companies today.

World’s Fastest ML Training Supercomputer

The supercomputer that Google used for this MLPerf Training round is four times larger than the Cloud TPU v3 Pod that set three records in the previous competition. The system contains 4096 TPU v3 chips and hundreds of CPU host machines, all interconnected through a high-speed, ultra-large-scale custom interconnect. In total, this system delivers over 430 PFLOPs of peak performance.
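A quick back-of-the-envelope check relates the stated totals: dividing the system's peak performance by its chip count gives the implied per-chip peak. The figure below is derived purely from the article's numbers, not from an official TPU specification.

```python
# Sanity check on the stated figures (derived from the article's numbers only;
# not official TPU specifications).
total_pflops = 430        # stated peak performance of the full system, in PFLOPS
chips = 4096              # TPU v3 chips in the system

# Convert PFLOPS to TFLOPS (x1000) and divide across chips.
tflops_per_chip = total_pflops * 1000 / chips
print(f"implied peak per chip: {tflops_per_chip:.0f} TFLOPS")  # ~105 TFLOPS
```

This is consistent with a system whose aggregate peak scales linearly with chip count.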

Large-scale training with TensorFlow, JAX, Lingvo, and XLA

Training complex ML models using thousands of TPU chips required a combination of algorithmic techniques and optimizations in TensorFlow, JAX, Lingvo, and XLA. To provide some background, XLA is the underlying compiler technology that powers all of Google’s MLPerf submissions, and JAX is a new research-focused framework based on composable function transformations. The record-setting runs above relied on model parallelism, scaled batch normalization, efficient computational graph launches, and tree-based weight initialization.
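As a minimal, hypothetical sketch (not Google's benchmark code), JAX's composable function transformations can be illustrated by stacking `jax.grad` (automatic differentiation) and `jax.jit` (XLA compilation) on an ordinary Python loss function:

```python
# Illustrative toy example only: a linear-model loss, differentiated and
# XLA-compiled by composing JAX transformations. Assumes `jax` is installed.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # mean squared error of a linear model x @ w
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

# Compose transformations: grad differentiates with respect to the first
# argument (w); jit compiles the resulting function through XLA.
grad_fn = jax.jit(jax.grad(loss))

w = jnp.zeros(3)
x = jnp.ones((4, 3))
y = jnp.ones(4)
g = grad_fn(w, x, y)  # gradient of the loss with respect to w
```

Because `grad_fn` is itself an ordinary function, further transformations such as `jax.vmap` or `jax.pmap` (for multi-device data parallelism) can be layered on in the same way.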

source – https://cloud.google.com/

All of the TensorFlow, JAX, and Lingvo submissions in the table above (ResNet-50, BERT, SSD, and Transformer) were trained on 2048 or 4096 TPU chips, each in under 33 seconds.

TPU v4: Google’s fourth-generation Tensor Processing Unit chip

Google’s fourth-generation TPU ASIC offers more than double the matrix multiplication TFLOPs of TPU v3, a significant boost in memory bandwidth, and advances in interconnect technology. Google’s TPU v4 MLPerf submissions take advantage of these new hardware features with complementary compiler and modeling improvements. The results demonstrate an average improvement of 2.7 times over TPU v3 performance at a similar scale in the last MLPerf Training competition. Stay tuned for more information on TPU v4.

source – https://cloud.google.com/

On average, TPU v4 results in Google’s MLPerf Training v0.7 submissions show a 2.7-fold improvement over Google’s comparable MLPerf Training v0.6 TPU v3 results. The gains come from both hardware innovation and software improvements.

Rapid, ongoing progress

Google’s MLPerf Training v0.7 submissions demonstrate its commitment to advancing ML research and engineering, and to delivering those advances to users through open-source software, Google’s products, and Google Cloud.

Google’s second- and third-generation TPU supercomputers are available on Google Cloud today.