Microsoft Releases Azure Data Factory

Any cloud provider that believes in data gravity is trying to make it easier to collect and store data in their facilities. To make data movement between cloud and on-premises endpoints easier, Microsoft recently announced the general availability of Azure Data Factory (ADF). However, this pay-per-use service is being positioned as part of Microsoft’s analytics suite, not as a pure-play Extract Transform Load (ETL) tool.

In a post earlier this month on Microsoft’s Machine Learning blog, Microsoft VP Joseph Sirosh described ADF and explained the benefits of the service.

With ADF, existing data processing services can be composed into data pipelines that are highly available and managed in the cloud. These data pipelines can be scheduled to ingest, prepare, transform, analyze, and publish data, and ADF will manage and orchestrate all of the complex data and processing dependencies without human intervention. Solutions can be quickly built and deployed in the cloud, connecting a growing number of on-premises and cloud data sources.

Using ADF, businesses can enjoy the benefits of using a fully managed cloud service without procuring hardware; reduce costs with automatic cloud resource management, efficiently move data using a globally deployed data transfer infrastructure, and easily monitor and manage complex schedule and data dependencies. All of this can be done through an intuitive monitoring and management UI available from the Azure portal, and developers are accelerated through a familiar Visual Studio plug-in experience for solution building and deployment.

ADF works by processing datasets through a pipeline composed of activities. A “dataset” describes data structures stored within a given data store. ADF offers a number of data store connectors including Azure SQL, Azure DocumentDB, on-premises SQL Server, on-premises Oracle, on-premises Teradata, on-premises MySQL, and more. The “activities” in ADF perform actions on a given dataset. An activity may relate to data movement or data transformation. Data movement activities, such as copying data from a data store, are responsible for transferring data between endpoints. Data transformation activities take the raw data and run queries against it. There are seven transformations available, most of them relying on the Hadoop-based Azure HDInsight service.

Hive. Runs SQL-like Hive queries on an HDInsight cluster.
Pig. Runs Pig queries on an HDInsight cluster.
MapReduce. Runs MapReduce programs.
Hadoop Streaming. Invokes a streaming job.
Machine Learning Batch Scoring. Uses the Azure Machine Learning web service.
Stored Procedure. Invokes a stored procedure in an Azure SQL database.
.NET. Runs custom activities written in C#.
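
The dataset, activity, and pipeline concepts described above can be illustrated with a small conceptual model. This is a minimal sketch in plain Python, not ADF's own definition format or API: the class names (Dataset, Activity, Pipeline), the in-memory "stores" dictionary, and the sample order data are all assumptions made for illustration. It shows one data movement activity feeding one data transformation activity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset: a named description of stored data."""
    name: str
    store: str  # label only, e.g. "on-premises SQL Server"


@dataclass
class Activity:
    """Hypothetical stand-in for an activity acting on datasets."""
    name: str
    kind: str  # "movement" or "transformation"
    source: Dataset
    sink: Dataset
    run: Callable[[Rows], Rows]


@dataclass
class Pipeline:
    """An ordered group of activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, stores: Dict[str, Rows]) -> Dict[str, Rows]:
        # Each activity reads its source dataset and writes its sink dataset.
        for activity in self.activities:
            stores[activity.sink.name] = activity.run(stores[activity.source.name])
        return stores


def sum_by_customer(rows: Rows) -> Rows:
    """Transformation step: total the order amounts per customer."""
    sums: Dict[str, int] = {}
    for row in rows:
        sums[row["customer"]] = sums.get(row["customer"], 0) + row["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(sums.items())]


def build_demo_pipeline() -> Pipeline:
    raw = Dataset("raw_orders", "on-premises SQL Server")
    staged = Dataset("staged_orders", "Azure SQL")
    totals = Dataset("order_totals", "Azure SQL")
    copy = Activity("copy_orders", "movement", raw, staged, lambda rows: list(rows))
    aggregate = Activity("sum_by_customer", "transformation", staged, totals, sum_by_customer)
    return Pipeline("orders_pipeline", [copy, aggregate])


if __name__ == "__main__":
    stores = {
        "raw_orders": [
            {"customer": "contoso", "amount": 10},
            {"customer": "fabrikam", "amount": 5},
            {"customer": "contoso", "amount": 7},
        ]
    }
    result = build_demo_pipeline().execute(stores)
    print(result["order_totals"])
    # [{'customer': 'contoso', 'total': 17}, {'customer': 'fabrikam', 'total': 5}]
```

In the real service, the transformation step would be handed off to something like an HDInsight cluster or a stored procedure rather than run in-process; the sketch only captures how activities chain datasets together.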

To access on-premises endpoints, ADF uses a tool called the Data Management Gateway. The Gateway runs on local Windows servers and uses certificate-encrypted credentials to access the on-premises data stores. Outbound calls are done over standard HTTP ports. Gateway instances are tied to individual data factories, and only one instance of a Gateway can be on a given server. So, users should expect to run a fleet of servers hosting Gateways if they have many factories in use. To create data factories, developers can use the (beta) Azure Portal, PowerShell, Visual Studio, or a REST API.

Image source: https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/

ADF is part of the Cortana Analytics Suite that was announced in July. Other products in that suite include Azure Data Catalog, Azure Machine Learning, HDInsight, Power BI, and Azure Stream Analytics. How does Microsoft plan to integrate this separate set of services into a single suite? In a ZDNet article about the ADF release, Andrew Brust explained how the packaging and integration may work.

General availability is slated for "later this fall" and promises to deliver a single subscription for all of the Azure Big Data and analytics services. Pricing will be disclosed in the fall as well.

Microsoft also promises to bring integrated vertical industry solutions to Cortana Analytics customers. These are essentially use case templates/accelerators, for industries that will likely include manufacturing, healthcare and financial services. While they may not be full-fledged products per se, and definitely won't constitute true integration of the services, they will nonetheless serve as canonical examples of how to use the services together.

Certain of the services have point-to-point integration already in place. Azure Data Factory has connectivity to Azure Stream Analytics, and the latter has connectivity to Event Hubs. Power BI knows how to talk to Apache Spark running on HDInsight. Azure Data Lake emulates HDFS (the Hadoop Distributed File System), which has native connectivity from the Power Query component of Power BI. Azure SQL Data Warehouse features Microsoft's PolyBase technology, which integrates HDInsight and other Hadoop distributions.

Microsoft doesn’t appear to be positioning this service as a traditional (cloud-enabled) ETL product such as Informatica or SnapLogic. While it can perform some similar ingestion and transformation functions, ADF looks to be primarily targeted at analytics scenarios and gathering insight from disparate data sets. ADF is priced on a per-activity basis, and charges vary depending on whether the activity occurs frequently or not, and whether it’s running against cloud or on-premises endpoints. Users pay for data movement on an hourly basis, and inactive pipelines incur a nominal charge.
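
To make the shape of that pricing model concrete, here is a rough sketch of how a monthly estimate could be put together from the three components mentioned: per-activity charges that depend on frequency and endpoint location, hourly data movement, and a flat charge per inactive pipeline. The Rates class, the function name, the frequency and location labels, and every number in the example are placeholders invented for illustration; they are not Microsoft's published prices or billing rules.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rates:
    """Hypothetical rate card; the caller supplies all values."""
    activity: Dict[Tuple[str, str], float]  # keyed by (frequency, location)
    movement_per_hour: Dict[str, float]     # keyed by location
    inactive_pipeline: float                # flat charge per inactive pipeline


def estimate_monthly_cost(
    rates: Rates,
    activities: Dict[Tuple[str, str], int],
    movement_hours: Dict[str, float],
    inactive_pipelines: int,
) -> float:
    """Add up activity, data movement, and inactive pipeline charges."""
    activity_cost = sum(
        rates.activity[key] * count for key, count in activities.items()
    )
    movement_cost = sum(
        rates.movement_per_hour[loc] * hours for loc, hours in movement_hours.items()
    )
    return activity_cost + movement_cost + rates.inactive_pipeline * inactive_pipelines


if __name__ == "__main__":
    # Placeholder rates chosen only to keep the arithmetic easy to follow.
    placeholder = Rates(
        activity={
            ("low", "cloud"): 1.0,
            ("high", "cloud"): 2.0,
            ("low", "on_premises"): 3.0,
            ("high", "on_premises"): 4.0,
        },
        movement_per_hour={"cloud": 0.5, "on_premises": 0.25},
        inactive_pipeline=1.0,
    )
    total = estimate_monthly_cost(
        placeholder,
        activities={("low", "cloud"): 2, ("high", "on_premises"): 1},
        movement_hours={"cloud": 4},
        inactive_pipelines=3,
    )
    print(total)  # 2*1.0 + 1*4.0 + 4*0.5 + 3*1.0 = 11.0
```

Anyone budgeting for the service would need to substitute the actual rates from the Azure pricing page.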

To learn more, take a look at the learning map for the product, or read the frequently asked questions.