Microsoft Releases Azure Data Factory

Any cloud provider that believes in data gravity is trying to make it easier to collect and store data in its facilities. To make data movement between cloud and on-premises endpoints easier, Microsoft recently announced the general availability of Azure Data Factory (ADF). However, this pay-per-use service is being positioned as part of Microsoft’s analytics suite, not as a pure-play extract, transform, load (ETL) tool.

In a post earlier this month on Microsoft’s Machine Learning blog, Microsoft VP Joseph Sirosh described ADF and explained the benefits of the service.

With ADF, existing data processing services can be composed into data pipelines that are highly available and managed in the cloud. These data pipelines can be scheduled to ingest, prepare, transform, analyze, and publish data, and ADF will manage and orchestrate all of the complex data and processing dependencies without human intervention. Solutions can be quickly built and deployed in the cloud, connecting a growing number of on-premises and cloud data sources.

Using ADF, businesses can enjoy the benefits of using a fully managed cloud service without procuring hardware; reduce costs with automatic cloud resource management, efficiently move data using a globally deployed data transfer infrastructure, and easily monitor and manage complex schedule and data dependencies. All of this can be done through an intuitive monitoring and management UI available from the Azure portal, and developers are accelerated through a familiar Visual Studio plug-in experience for solution building and deployment.

ADF works by processing datasets through a pipeline composed of activities. A “dataset” describes data structures stored within a given data store. ADF offers a number of data store connectors including Azure SQL, Azure DocumentDB, on-premises SQL Server, on-premises Oracle, on-premises Teradata, on-premises MySQL, and more. The “activities” in ADF perform actions on a given dataset. An activity may relate to data movement or data transformation. Data movement activities, such as copying data from a data store, are responsible for transferring data between endpoints. The data transformation activities take the raw data and run queries against it. There are seven transformations available, most of them relying on the Hadoop-based Azure HDInsight service.
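The relationship between datasets, activities, and pipelines can be sketched in a few lines of Python. This is a conceptual model only, not ADF's API or its pipeline definition format: the `Activity` and `Pipeline` classes, the dataset names, and the sample rows are all hypothetical, and datasets are stood in for by in-memory lists. It shows a data movement step feeding a transformation step, with the pipeline working out the order from the dataset dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]  # a dataset is modeled as a list of records (assumption)


@dataclass
class Activity:
    """One pipeline step: reads named input datasets, writes one output dataset."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., Rows]


@dataclass
class Pipeline:
    """A named group of activities, executed in dependency order."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, store: Dict[str, Rows]) -> Dict[str, Rows]:
        # Repeatedly run every activity whose input datasets already exist.
        pending = list(self.activities)
        while pending:
            ready = [a for a in pending if all(i in store for i in a.inputs)]
            if not ready:
                raise RuntimeError("unsatisfied dataset dependency")
            for act in ready:
                store[act.output] = act.run(*(store[i] for i in act.inputs))
                pending.remove(act)
        return store


def copy_rows(rows: Rows) -> Rows:
    """Data movement: copy records from one store to another."""
    return [dict(r) for r in rows]


def total_by_region(rows: Rows) -> Rows:
    """Data transformation: aggregate amounts per region."""
    totals: Dict[str, int] = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return [{"region": k, "total": v} for k, v in sorted(totals.items())]


# Activities are deliberately listed out of order; the pipeline resolves it.
pipeline = Pipeline("sales_rollup", [
    Activity("aggregate", ["staged_sales"], "sales_by_region", total_by_region),
    Activity("copy_from_onprem", ["onprem_sales"], "staged_sales", copy_rows),
])

store = {"onprem_sales": [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 5},
    {"region": "EU", "amount": 7},
]}
result = pipeline.execute(store)
print(result["sales_by_region"])
# [{'region': 'EU', 'total': 17}, {'region': 'US', 'total': 5}]
```

In the real service the transformation work would be handed off to an engine such as HDInsight rather than run in-process; the sketch only illustrates how activities chain through the datasets they consume and produce.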

To access on-premises endpoints, ADF uses a tool called the Data Management Gateway. The Gateway runs on local Windows servers and uses certificate-encrypted credentials to access the on-premises data stores. Outbound calls are done over standard HTTP ports. Gateway instances are tied to individual data factories, and only one instance of a Gateway can be on a given server. So, users should expect to run a fleet of servers hosting Gateways if they have many factories in use. To create data factories, developers can use the (beta) Azure Portal, PowerShell, Visual Studio, or a REST API.

 

Image source: https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/

ADF is part of the Cortana Analytics Suite that was announced in July. Other products in that suite include Azure Data Catalog, Azure Machine Learning, HDInsight, Power BI, and Azure Stream Analytics. How does Microsoft plan to integrate this separate set of services into a single suite? In a ZDNet article about the ADF release, Andrew Brust explained how the packaging and integration may work.

General availability is slated for "later this fall" and promises to deliver a single subscription for all of the Azure Big Data and analytics services. Pricing will be disclosed in the fall as well.

Microsoft also promises to bring integrated vertical industry solutions to Cortana Analytics customers. These are essentially use case templates/accelerators, for industries that will likely include manufacturing, healthcare and financial services. While they may not be full-fledged products per se, and definitely won't constitute true integration of the services, they will nonetheless serve as canonical examples of how to use the services together.

Certain of the services have point-to-point integration already in place. Azure Data Factory has connectivity to Azure Stream Analytics, and the latter has connectivity to Event Hubs. Power BI knows how to talk to Apache Spark running on HDInsight. Azure Data Lake emulates HDFS (the Hadoop Distributed File System), which has native connectivity from the Power Query component of Power BI. Azure SQL Data Warehouse features Microsoft's PolyBase technology, which integrates HDInsight and other Hadoop distributions.

Microsoft doesn’t appear to be positioning this service as a traditional (cloud-enabled) ETL product such as Informatica or SnapLogic. While it can perform some similar ingestion and transformation functions, ADF looks to be primarily targeted at analytics scenarios and gathering insight from disparate data sets. ADF is priced on a per-activity basis, and charges vary depending on whether the activity occurs frequently or not, and whether it’s running against cloud or on-premises endpoints. Users pay for data movement on an hourly basis, and inactive pipelines incur a nominal charge.
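The shape of that pricing model can be expressed as a small estimator. The sketch below is an illustration under stated assumptions: the `Rates` structure, the "low"/"high" frequency labels, and every number in the example are placeholders invented for the arithmetic, not Microsoft's published prices, and the sketch assumes a flat charge per activity for a billing period.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rates:
    """Placeholder rate card; none of these values are real ADF prices."""
    # (frequency, location) -> charge per activity for the billing period
    activity: Dict[Tuple[str, str], float]
    # location -> charge per hour of data movement
    movement_per_hour: Dict[str, float]
    # flat charge per inactive pipeline
    inactive_pipeline: float


def estimate_cost(
    rates: Rates,
    activities: Dict[Tuple[str, str], int],
    movement_hours: Dict[str, float],
    inactive_pipelines: int,
) -> float:
    """Sum the three charge types described in the article."""
    total = sum(rates.activity[key] * count for key, count in activities.items())
    total += sum(rates.movement_per_hour[loc] * hours
                 for loc, hours in movement_hours.items())
    total += rates.inactive_pipeline * inactive_pipelines
    return total


# Made-up numbers chosen only to make the arithmetic easy to follow.
placeholder_rates = Rates(
    activity={
        ("low", "cloud"): 1.0, ("low", "onprem"): 2.0,
        ("high", "cloud"): 3.0, ("high", "onprem"): 4.0,
    },
    movement_per_hour={"cloud": 0.5, "onprem": 0.25},
    inactive_pipeline=1.0,
)

cost = estimate_cost(
    placeholder_rates,
    activities={("low", "cloud"): 4, ("high", "onprem"): 2},  # 4*1.0 + 2*4.0
    movement_hours={"cloud": 10},                             # 10*0.5
    inactive_pipelines=3,                                     # 3*1.0
)
print(cost)  # 20.0
```

Anyone budgeting for the service would need to substitute the current rates from Microsoft's pricing page for the placeholders.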

To learn more, take a look at the learning map for the product, or read the frequently asked questions.

Community comments

  • SQL cannot produce the above statistics – SQL is obsolete and out of business.

    by Ilya Geller,

    SQL is obsolete and out of business.

    SQL, Structured Query Language, is a programming language designed for managing data held in a relational database, and was intended to manipulate and retrieve the data. SQL structures EXTERNAL questions in the sense that it was designed to convert incorrectly formulated EXTERNAL questions into the right ones.
    SQL works with (usually manually) structured data, where structured data refers to information with a high - but never absolute! - degree of organization, such that the database is easily searchable by a simple, straightforward search engine.
    SQL structures queries which have nothing in common with the data itself! Actually SQL operates with EXTERNAL descriptions of the data. For instance, for the query 'What is the rate?' the clarification 'of Russian ruble to Australian dollar?' - could be an EXTERNAL description.

    I, however, discovered and patented how to structure any data without SQL, the queries - INTERNALLY: Language has its own INTERNAL parsing, indexing and statistics and can be structured INTERNALLY. (For more details please browse on my name ‘Ilya Geller’.)

    For instance, there are two sentences:
    a) ‘Pickwick!’
    b) 'That, with the view just mentioned, this Association has taken into its serious consideration a proposal, emanating from the aforesaid, Samuel Pickwick, Esq., G.C.M.P.C., and three other Pickwickians hereinafter named, for forming a new branch of United Pickwickians, under the title of The Corresponding Society of the Pickwick Club.'
    Evidently, 'Pickwick' has different importance in the two sentences, given the extra information in each. This distinction is reflected in the weights of the phrases that contain 'Pickwick': the first has 1, the second 0.11. The greater weight signifies stronger emotional ‘acuteness’, where the weight refers to the frequency with which a phrase occurs in relation to other phrases.

    SQL cannot produce the above statistics – SQL is obsolete and out of business.

  • Re: SQL cannot produce the above statistics – SQL is obsolete and out of bu

    by Ilya Geller,

    I was right.

  • Re: SQL cannot produce the above statistics – SQL is obsolete and out of bu

    by Ilya Geller,

    Even more:
    What is AI chatbot phenomenon ChatGPT and could it replace humans?
