Open Source SkyPilot Targets Cloud Cost Optimization for ML and Data Science

A team of researchers at the RISELab at UC Berkeley recently released SkyPilot, an open-source framework for running machine learning workloads on the major cloud providers through a unified interface. The project focuses on cost optimization, automatically finding the cheapest availability zone, region, and provider for the requested resources.

Given a job's resource requirements, the framework automatically determines which locations on AWS, Azure, and Google Cloud offer the required resources (CPU/GPU/TPU) and which of them is the most affordable. SkyPilot then performs three main tasks: it provisions the cluster, with automatic failover to other locations if there are capacity or quota errors; synchronizes user code and files to the destination; and manages job queueing and execution.
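The "cheapest location first, fail over on capacity errors" idea can be illustrated with a minimal Python sketch. This is not SkyPilot's code or API: the `Offer` class, the `CapacityError` exception, the `provision` callback, and every price shown are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Offer:
    """One hypothetical place a job could run (prices are made up)."""
    cloud: str
    region: str
    accelerator: str
    hourly_usd: float


class CapacityError(Exception):
    """Raised by the hypothetical provisioner on capacity or quota errors."""


def rank_offers(offers: Iterable[Offer], accelerator: str) -> List[Offer]:
    """Keep offers with the requested accelerator, cheapest first."""
    matching = [o for o in offers if o.accelerator == accelerator]
    return sorted(matching, key=lambda o: o.hourly_usd)


def provision_with_failover(
    offers: Iterable[Offer],
    accelerator: str,
    provision: Callable[[Offer], None],
) -> Optional[Offer]:
    """Try each matching offer in price order; return the first that succeeds."""
    for offer in rank_offers(offers, accelerator):
        try:
            provision(offer)
            return offer
        except CapacityError:
            continue  # fail over to the next-cheapest location
    return None  # no location could satisfy the request


# Illustrative catalog; these numbers are not real cloud prices.
CATALOG = [
    Offer("aws", "us-east-1", "V100", 3.10),
    Offer("gcp", "us-central1", "V100", 2.50),
    Offer("azure", "eastus", "V100", 2.90),
    Offer("gcp", "us-central1", "A100", 3.70),
]
```

In this sketch, if the cheapest V100 location reports a capacity error, the loop moves on to the next-cheapest one, which mirrors the failover behavior described above in a heavily simplified form.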

Zongheng Yang, postdoctoral researcher at UC Berkeley, and Ion Stoica, professor at UC Berkeley and co-founder and executive chairman at Anyscale and Databricks, explain:

Cloud computing for ML and Data Science is already plenty hard, but when you start applying cost-cutting techniques your overhead can multiply. Want to stop leaving machines up when they’re idle? You’ll need to spin them up and down repeatedly, redoing the environment and data setup. Want to use spot-instance pricing? That can add weeks of work to handle preemptions. What about exploiting the big price differences between regions, or the even bigger price differences between clouds?

SkyPilot is not the first open-source project from the RISELab targeting cloud cost optimization. As previously reported on InfoQ, the research center released Skyplane to optimize the transfer of large datasets between cloud providers, reducing transfer times and costs.


Training machine learning models on the cloud can be costly and inefficient, with some companies recently shifting data and models back to their own data centers to reduce costs and improve performance. Yang and Stoica write:

SkyPilot has been under active development at UC Berkeley’s Sky Computing Lab for over a year. It is being used by more than 10 organizations for a diverse set of use cases, including model training on GPU/TPU (3x cost savings), distributed hyperparameter tuning, and bioinformatics batch jobs on 100s of CPU spot instances (6.5x cost savings).

Among SkyPilot's other benefits, the authors cite the ability to build multi-cloud applications, to leverage best-in-class hardware, and to increase the availability of scarce resources such as high-end NVIDIA V100 or A100 GPUs.


The framework includes Managed Spot, an option to use cheaper spot instances with automatic recovery from preemptions, and Autostop, a feature that automatically cleans up idle clusters. The team released a collection of Jupyter notebooks to help developers understand how the project works.
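As a rough illustration of what recovery from preemptions and idle cleanup involve, the following Python sketch resumes a job from its last completed step after a simulated preemption and applies a simple idle-time check. It does not reflect how SkyPilot implements Managed Spot or Autostop: the `Preempted` exception, the `run_step` callback, and both function names are hypothetical.

```python
from typing import Callable


class Preempted(Exception):
    """Simulates a spot instance being reclaimed by the cloud provider."""


def run_with_recovery(
    total_steps: int,
    run_step: Callable[[int], None],
    max_restarts: int = 10,
) -> int:
    """Run steps 0..total_steps-1, retrying the interrupted step after each
    preemption. Returns the number of restarts that were needed."""
    completed = 0  # stands in for a checkpoint saved to durable storage
    restarts = 0
    while completed < total_steps:
        try:
            run_step(completed)
            completed += 1  # "checkpoint" only after the step succeeds
        except Preempted:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up instead of looping forever
    return restarts


def should_autostop(idle_seconds: float, idle_limit_seconds: float) -> bool:
    """Return True once a cluster has been idle for at least the limit."""
    return idle_seconds >= idle_limit_seconds
```

The key design point in this sketch is that progress is recorded only after a step finishes, so a preemption costs at most one step of repeated work.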

SkyPilot currently supports AWS, Google Cloud and Azure, and provides a CLI and a Python API. According to a Reddit thread, the project plans to support other smaller cloud providers in the future.

SkyPilot is available on GitHub under the Apache-2.0 license.
