Microsoft Releases .NET for Apache Spark 1.0

Last month, Microsoft released the first major version of .NET for Apache Spark, an open-source package that brings .NET development to the Apache Spark platform. The new release allows .NET developers to write Apache Spark applications using .NET user-defined functions, Spark SQL, and additional libraries such as Microsoft Hyperspace and ML.NET.

Apache Spark is an open-source, general-purpose analytics engine for large-scale data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Initially developed by the AMPLab team at UC Berkeley, it can be used in conjunction with different data repositories, including the Hadoop Distributed File System, NoSQL databases, and relational data stores. Because Spark can keep working data in memory (RAM) rather than writing intermediate results to disk, it can be up to 100x faster than Hadoop MapReduce for some large-scale workloads.

According to Jeremy Likness, senior program manager for .NET Data at Microsoft, the release addresses a long-standing community demand:

".NET for Apache Spark launched two years ago to address the increasing demand from the .NET community for an easier way to build big data applications. A recent survey confirmed the biggest motivation to use the package is to take advantage of existing .NET development skills and resources, including the enormous .NET ecosystem of existing libraries and frameworks."

.NET for Apache Spark brings key Spark functionality to the .NET development ecosystem, including the DataFrame APIs for Spark 2.3, 2.4, and 3.0 (which allow the use of Spark SQL queries) and support for Spark's machine learning library (MLlib). .NET developers can also write user-defined functions (UDFs) in .NET and use them in their Spark applications.

The package also provides an API extension framework for additional libraries, including Delta Lake (a storage layer for ACID transactions in Spark), Microsoft Hyperspace (an indexing subsystem for Spark), and ML.NET (Microsoft's machine learning framework). ML.NET is particularly interesting for .NET developers because it can in turn be extended with other machine learning libraries such as TensorFlow.

Performance is another critical feature of this release. According to Microsoft's benchmarks, .NET for Apache Spark programs that do not use UDFs run at the same speed as equivalent non-UDF Spark applications written in Scala or PySpark. If the applications include UDFs, the .NET for Apache Spark programs are at least as fast as PySpark programs, and often faster.

Source: Microsoft

The official release article also outlined planned features, such as LINQ support and additional deployment options, among them integration with CI/CD DevOps pipelines and publishing or submitting jobs directly from Visual Studio.

.NET for Apache Spark supports all .NET applications targeting .NET Standard 2.0 (.NET Core 3.1 or later is recommended). The package is available as an open-source project on the .NET Foundation's GitHub and can be downloaded from NuGet. It can also be used in Apache Spark cloud offerings, including Azure Databricks and AWS EMR Spark. For on-premises deployments, it offers multi-platform support for Windows, macOS, and Linux.
