Microsoft Open-Sources Distributed Machine Learning Library SynapseML

Microsoft announced the release of SynapseML, an open-source library for creating and managing distributed machine learning (ML) pipelines. SynapseML runs on Apache Spark, provides a language-agnostic API abstraction over several datastores, and integrates with several existing ML technologies, including Open Neural Network Exchange (ONNX).

The release was announced in a blog post by software engineer Mark Hamilton. SynapseML runs on Apache Spark, taking advantage of Spark's management of large-scale fault-tolerant compute clusters. The library has APIs for both Python and Scala, with the ability to generate bindings for Java, R, and C#. SynapseML includes the HTTP on Spark module, which lets users efficiently integrate web services into their pipelines, as well as pre-built wrappers for invoking several such services, including Azure Cognitive Services. Using ONNX, developers can deploy pre-trained models from Microsoft's ONNX Model Hub, or convert models built in other frameworks, such as TensorFlow or PyTorch, to perform distributed inference on Spark. The Spark Serving module allows developers to expose their Spark pipelines as low-latency web services. According to Hamilton:

Our goal is to free developers from the hassle of worrying about the distributed implementation details and enable them to deploy them into a variety of databases, clusters, and languages without needing to change their code.

SynapseML includes integrations with several popular ML frameworks, including ONNX, CNTK, LightGBM, OpenCV, and Vowpal Wabbit. These integrations provide APIs that conform to the Transformer and Estimator abstractions defined by Spark's ML pipelines. The ONNX and CNTK integrations allow users to import pre-trained deep learning models generated by other systems; users can then use the Spark cluster as a distributed inference engine. The OpenCV integration provides Transformer abstractions for manipulating image data within an ML pipeline, for example, flipping or blurring images. The LightGBM and Vowpal Wabbit integrations provide both Estimators for training models and Transformers for inference.
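For readers unfamiliar with the Transformer and Estimator abstractions mentioned above, the following is a minimal sketch of the pattern in plain Python, without Spark. It is not SynapseML or Spark code: the classes `MeanCenter`, `MeanCenterModel`, and `Scale`, the helpers `fit_pipeline` and `run_pipeline`, and the use of a list of dictionaries in place of a DataFrame are all hypothetical, chosen only to illustrate that an Estimator's `fit` returns a Transformer and that a Transformer maps one dataset to another.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, float]  # stand-in for a DataFrame row (assumption for this sketch)


class Transformer:
    """Maps a dataset to a new dataset."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


@dataclass
class Scale(Transformer):
    """Stateless stage: multiplies one column by a constant."""
    column: str
    factor: float

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.column: r[self.column] * self.factor} for r in rows]


@dataclass
class MeanCenterModel(Transformer):
    """Fitted stage: subtracts a mean that was learned during fit."""
    column: str
    mean: float

    def transform(self, rows: List[Row]) -> List[Row]:
        out_col = self.column + "_centered"
        return [{**r, out_col: r[self.column] - self.mean} for r in rows]


@dataclass
class MeanCenter(Estimator):
    """Trainable stage: learns the column mean, then yields a MeanCenterModel."""
    column: str

    def fit(self, rows: List[Row]) -> Transformer:
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCenterModel(self.column, mean)


def fit_pipeline(stages, rows: List[Row]) -> List[Transformer]:
    """Fit each Estimator on the output of the stages before it."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if isinstance(stage, Estimator) else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def run_pipeline(fitted: List[Transformer], rows: List[Row]) -> List[Row]:
    """Apply the fitted stages in order to new data."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


if __name__ == "__main__":
    training = [{"x": 1.0}, {"x": 3.0}]
    fitted = fit_pipeline([Scale("x", 2.0), MeanCenter("x")], training)
    print(run_pipeline(fitted, [{"x": 5.0}]))
```

Because every stage exposes the same `fit` or `transform` interface, a wrapped model from any of the frameworks listed above can in principle sit in the same pipeline as native stages.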

To support integration with Azure Cognitive Services, SynapseML includes an HTTP on Spark module, which is described in detail in a paper by Hamilton's research team. HTTP on Spark provides an efficient way for Spark workers to call out to web services. It augments Spark's parallelism model, allowing Spark workers to multi-task while waiting for HTTP requests to complete. This improves throughput without interfering with Spark's native parallelism. To mitigate the risk of the cluster overwhelming a downstream web service, HTTP on Spark handles rate-limiting header responses to implement "back pressure" and retries. In addition to Transformer wrappers for nearly 50 Azure services, SynapseML also provides a generic HTTPTransformer that allows users to call arbitrary web services.
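The back-pressure idea described above can be sketched in a few lines of plain Python. This is an illustration only, not the HTTP on Spark implementation: the function `call_with_backpressure`, its parameters, and the choice of HTTP status 429 with a `Retry-After` header given in seconds are assumptions made for the example. The `send` argument is any callable that performs one request and returns a status code, headers, and body, so the sketch can be exercised without a network.

```python
import time
from typing import Callable, Dict, Tuple

# (status code, response headers, body); a simplified response shape assumed here
Response = Tuple[int, Dict[str, str], str]


def call_with_backpressure(
    send: Callable[[], Response],
    max_retries: int = 3,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send`, pausing and retrying while the service reports rate limiting.

    Only the numeric (seconds) form of Retry-After is handled in this sketch;
    the HTTP-date form is not.
    """
    status, body = 0, ""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body  # success or a non-rate-limit error: stop retrying
        if attempt == max_retries:
            break  # retries exhausted: surface the 429 to the caller
        # Honor the server's requested pause so the caller slows down
        wait = float(headers.get("Retry-After", default_wait))
        sleep(wait)
    return status, body


if __name__ == "__main__":
    state = {"calls": 0}

    def fake_service() -> Response:
        # Hypothetical service that rejects the first two calls
        state["calls"] += 1
        if state["calls"] <= 2:
            return 429, {"Retry-After": "0.1"}, ""
        return 200, {}, "ok"

    print(call_with_backpressure(fake_service))
```

Letting the server dictate the wait, rather than retrying immediately, is what keeps a large cluster of workers from overwhelming a rate-limited service.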

SynapseML includes several other features. Spark Serving extends Spark's Structured Streaming engine, allowing any Structured Streaming job to be exposed as a web service. SynapseML also includes tools for responsible AI, such as data balance analysis and model explainability. The library includes support for AutoML features, such as finding the best-performing model using hyperparameter search, as well as Spark-native implementations of several models, including an anomaly-detection model for cyber security; an isolation forest model, which performs nonlinear outlier detection; and a conditional k-nearest-neighbor model.

Along with the open-source release, Microsoft announced at its recent Ignite conference the general availability of SynapseML on the Azure Synapse Analytics service. In response to a question on Twitter about the difference between SynapseML and Spark's MLlib, Bart Czernicki, a principal technical architect at Microsoft, replied that the library was repackaged and optimized for Azure Synapse Analytics. In a separate tweet about SynapseML, ML researcher Moein Kareshk said:

Although I haven't had a chance to benchmark the speed, its APIs are clear and easy to learn to prototype #ML products.

The SynapseML code is available on GitHub.