LambdaML: Pros and Cons of Serverless for Deep Network Training

A new study, Towards Demystifying Serverless Machine Learning Training, provides an experimental analysis of training deep networks on serverless components (e.g. Azure Functions, Google Cloud Functions, AWS Lambda). The proposed platform, LambdaML, extends Cirrus by decoupling the exchange of intermediate gradient outputs through external storage layers (e.g. blob storage, cache).
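The storage-based exchange can be illustrated with a toy sketch. Everything here is hypothetical and not LambdaML's actual API: a plain dict stands in for an external blob store or cache, and two simulated stateless workers synchronize their gradients through it instead of talking to each other directly.

```python
import numpy as np

# Hypothetical in-memory "external storage" standing in for a blob store
# or cache; keys map worker ids to uploaded gradient arrays.
STORAGE = {}

def put_gradient(worker_id, grad):
    """A stateless function uploads its local gradient to shared storage."""
    STORAGE[f"grad/{worker_id}"] = grad.copy()

def aggregate_gradients(num_workers):
    """Any function can later download all gradients and average them."""
    grads = [STORAGE[f"grad/{i}"] for i in range(num_workers)]
    return np.mean(grads, axis=0)

# Two workers compute gradients on their own batches, then synchronize
# through the storage layer rather than over peer-to-peer links.
put_gradient(0, np.array([1.0, 2.0]))
put_gradient(1, np.array([3.0, 4.0]))
avg = aggregate_gradients(2)   # → array([2., 3.])
```

The indirection is what makes stateless functions workable for training, but every synchronization round pays the latency and cost of the storage layer, which is central to the study's findings below.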

The work is part of a growing effort over the last couple of years to offload parallelization primitives from IaaS to FaaS for video/image processing and machine learning (e.g. Cirrus, Stanford GG, UCSD Sprocket).

Serving deep learning models on serverless platforms is a common practice officially supported by many cloud vendors, especially for lightweight networks that do not require acceleration hardware (e.g. official Azure docs, AWS blog). The case for training is more challenging due to its distributed nature and the aggregation step found in learning algorithms (i.e. gradients computed over batches have to be combined). Training requires communication and synchronization of data either between peer nodes (e.g. Ring-AllReduce) or between worker nodes and an orchestrator (e.g. a parameter server). As the number of workers increases, the limitations of distributed computation become more pronounced.
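The peer-to-peer pattern mentioned above, Ring-AllReduce, can be sketched in a few lines. This is a sequential toy simulation, not a real implementation: n nodes each start with a vector and, after a reduce-scatter and an all-gather phase around the ring, every node holds the elementwise sum while only ever exchanging 1/n-sized chunks per step.

```python
import numpy as np

def ring_allreduce(vectors):
    """Toy Ring-AllReduce over n simulated nodes arranged in a ring."""
    n = len(vectors)
    chunks = [np.array_split(v.astype(float), n) for v in vectors]
    # Reduce-scatter: in step s, node i forwards chunk (i - s) mod n to
    # its right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            k = (i - s) % n
            chunks[(i + 1) % n][k] += chunks[i][k]
    # After n-1 steps, node i holds the fully reduced chunk (i + 1) mod n.
    # All-gather: circulate the reduced chunks, overwriting stale copies.
    for s in range(n - 1):
        for i in range(n):
            k = (i + 1 - s) % n
            chunks[(i + 1) % n][k] = chunks[i][k].copy()
    return [np.concatenate(c) for c in chunks]

out = ring_allreduce([np.array([1.0, 2.0, 3.0]),
                      np.array([4.0, 5.0, 6.0]),
                      np.array([7.0, 8.0, 9.0])])
# every node ends with [12., 15., 18.]
```

Note that both phases assume long-lived peers with direct links to their neighbors; standard FaaS functions cannot accept inbound connections, which is why serverless designs fall back to an external storage layer or a parameter server instead.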

LambdaML runs benchmarks on two deep networks (ResNet50 and MobileNet) and several classical machine learning algorithms (logistic regression, SVM, and k-means) in various settings. The results show that current serverless architectures become unfavorable as the network size increases, due to data-transfer overhead (even for the relatively small ResNet50). On the other hand, lightweight algorithms paired with distributed optimizers (e.g. ADMM) achieve acceptable throughput on FaaS, as they balance speed- and cost-related tradeoffs. Overall, the experiments also indicate that serverless training costs are not lower than IaaS on AWS.
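Communication-efficient optimizers such as ADMM help precisely because they need only one consensus round per iteration rather than per mini-batch. As a minimal sketch (the shard data and parameter choices here are illustrative, not from the paper), consensus ADMM can solve a least-squares problem split across two simulated workers and recover the same solution as pooling all the data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "workers", each holding a local shard (A_i, b_i) of a shared
# least-squares problem: minimize sum_i ||A_i x - b_i||^2.
shards = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(2)]
rho = 1.0                          # ADMM penalty parameter

z = np.zeros(3)                    # consensus variable
x = [np.zeros(3) for _ in shards]  # local primal variables
u = [np.zeros(3) for _ in shards]  # scaled dual variables
for _ in range(100):
    # Local x-updates run independently (one function invocation each).
    for i, (A, b) in enumerate(shards):
        x[i] = np.linalg.solve(A.T @ A + rho * np.eye(3),
                               A.T @ b + rho * (z - u[i]))
    # A single cheap consensus round per iteration, not per mini-batch.
    z = np.mean([x[i] + u[i] for i in range(len(shards))], axis=0)
    for i in range(len(shards)):
        u[i] += x[i] - z

# Compare against solving the pooled problem directly.
A_all = np.vstack([A for A, _ in shards])
b_all = np.concatenate([b for _, b in shards])
x_star, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)
```

Because each iteration triggers only one round of small-vector exchange, the per-round storage latency that dominates deep-network training matters far less here, which matches the study's throughput findings for classical models.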

In the last couple of years, the number of billion-parameter models has grown significantly. Training such models requires an increasing number of functions, which in turn drives up cumulative pay-per-use costs. With some cloud vendors, scaling up the workers may not even be feasible, since subscription plans generally depend on usage and may force a switch to a reserved plan beyond a certain limit. Even so, as a managed compute service, serverless may lower the barrier to entry for training larger models in the future.

MLSys is a new conference established for machine learning systems, and its proceedings are a great source for more information on this topic. You may also find the following list, citing several articles published in the last couple of years, interesting. The ZIP ML research project and its monograph likewise offer a range of articles on optimizations for various training settings.
