
Amazon EC2 Trn1 Instances for High Performance on Deep Learning Training Models Now Available

AWS announces general availability of Amazon EC2 Trn1 instances powered by AWS Trainium Chips. Trn1 instances deliver the highest performance on deep learning training of popular machine learning models on AWS, while offering up to 50% cost-to-train savings over comparable GPU-based instances.

AWS Trainium is the second generation of machine-learning chips that AWS purpose-built for deep learning training. Each Amazon EC2 Trn1 instance deploys up to 16 AWS Trainium accelerators to deliver a high-performance, low-cost solution for deep learning training in the cloud.

Trn1 instances are the first Amazon EC2 instances to offer up to 800 Gbps of networking bandwidth, with lower latency and up to 2x the bandwidth of the latest EC2 GPU-based instances. They use the second generation of AWS's Elastic Fabric Adapter (EFA) network interface to improve scaling efficiency.

Trn1 instances also use AWS Neuron, the SDK for Trainium, which lets customers get started with minimal code changes and integrates with popular machine-learning frameworks such as PyTorch and TensorFlow.
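To illustrate the "minimal code changes" claim, the sketch below shows a standard PyTorch training step; per the AWS Neuron documentation, porting such a script to Trn1 mainly means targeting the Trainium device through PyTorch/XLA (the Neuron-specific lines are shown as comments and are assumptions about that workflow, not code from the article).

```python
# Plain PyTorch training step; runs anywhere torch is installed.
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# On a Trn1 instance with the Neuron SDK installed, the main change
# (assumption, per AWS Neuron's PyTorch/XLA guide) is device selection:
#   import torch_xla.core.xla_model as xm
#   device = xm.xla_device()      # Trainium device instead of "cuda"
#   model.to(device)
#   ... and xm.mark_step() after optimizer.step() to flush the XLA graph.

x = torch.randn(8, 784)               # dummy batch of 8 flattened images
y = torch.randint(0, 10, (8,))        # dummy integer class labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```

The rest of the training loop (data loading, loss, optimizer) stays framework-native, which is the point of the SDK's integration with PyTorch and TensorFlow.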

Trn1 instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that streamline the delivery of isolated multi-tenancy, private networking, and fast local storage.

According to AWS, developers can run deep-learning training workloads on Trn1 instances using AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service and AWS ParallelCluster, with support for Amazon Elastic Kubernetes Service, Amazon SageMaker, and, soon, AWS Batch.

As compute-intensive workloads multiply, the need for high-efficiency chips is growing dramatically. While Trainium can be compared to Google's Tensor Processing Units (TPUs), custom accelerators for AI training workloads hosted on Google Cloud Platform, the two offerings differ at many levels.

It also competes with other newly launched AI chips such as IBM's Power10, which claims to be three times more energy-efficient than previous models in the POWER CPU series, and NVIDIA's A100, which claims up to 6x higher performance than NVIDIA's previous-generation chips.
