Pivotal recently released Spring XD 1.1 GA with new features including stream processing with Reactor, RxJava, Spark Streaming and Python. The release also adds support for Kafka, batching and compression with RabbitMQ, and container group management when running on YARN. The Spring XD project provides over 25 sample applications for developers.
As part of the release, Pivotal Product Manager Sabby Anandan wrote, "there's no reason developing Big Data applications has to be time-consuming and complicated". In Spring XD: Data-Driven Connectivity Within a Unified Platform, Anandan notes that XD's high-level DSL allows you to build streams from the command line, without the need to set up IDEs or build scripts. He also mentions that its embedded Admin UI can remotely monitor and manage streams, batch jobs, and the entire cluster.
Shortly after announcing the Spring XD release, Pivotal open sourced its big data suite. In an InfoQ article on that announcement, Abel Avram wrote:
Pivotal came later to the Big Data market, after some of the earlier players such as HortonWorks, Cloudera and MapR. But now, to address "fragmentation and vendor lock-in" in the big data space, Pivotal has decided to open source a number of products from its Big Data Suite, namely Greenplum Database (a parallel processing data warehouse), HAWQ (an ANSI-compliant SQL on Hadoop query engine), and GemFire (a distributed in-memory NoSQL database).
Alex Handy, Senior Editor at Software Development Times, wrote in his article "Pivotal pivots to open source":
[Spring XD] is to Hadoop what Spring was to Java EE.
He continues:
Put simply, that means Spring XD simplifies the configuration and boilerplate work that comes with building Map/Reduce and other YARN queries. Spring did this for Java EE, simplifying a process that, for years, had enterprise developers grinding their teeth over endless configuration and XML files.
To learn more about this release, InfoQ sat down with Spring XD co-leads Mark Pollack and Mark Fisher, as well as Product Manager Sabby Anandan.
InfoQ: How does Spring XD simplify the development of big data applications?
DSL vs. API development
Spring has always been motivated by the idea attributed to Alan Kay (http://en.wikiquote.org/wiki/Alan_Kay): "simple things should be simple; complex things should be possible", and in fact, Spring XD takes that motivation to a whole new level. It provides multiple distributed runtime options, and supports a wide range of use-cases where no code is required. With Spring XD, something as significant as ingesting data from HTTP to HDFS is "simple", and even for cases where custom stream processing might get complex, the model of Spring XD allows developers to focus only on that code and then drop it into a folder where it will be dynamically added to the registry of modules. It will then be immediately available within the DSL just like any out-of-the-box module. There is a strong separation of concerns between the processing logic and the infrastructure.
InfoQ: Can you provide a real-world example with code?
For the ingestion case mentioned above, a user can simply submit "http | hdfs" via the XD shell, a REST API, or the web UI. For some examples of custom modules that do involve code, see the Spring XD Samples repository: https://github.com/spring-projects/spring-xd-samples
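As a rough illustration of the "http | hdfs" definition above, the following Python sketch splits a pipe-delimited stream definition into a source, optional processors, and a sink. It is not Spring XD's parser: the function name `parse_stream_definition` and the returned dictionary layout are assumptions made for this example only, and the sketch ignores module options and quoting.

```python
def parse_stream_definition(definition):
    """Split a pipe-delimited definition into source, processors, and sink.

    Hypothetical helper for illustration only; not part of Spring XD.
    """
    # Each module is separated by a pipe; surrounding whitespace is ignored.
    modules = [part.strip() for part in definition.split("|")]
    if len(modules) < 2 or not all(modules):
        raise ValueError("a stream needs at least a source and a sink")
    return {
        "source": modules[0],          # first module produces the data
        "processors": modules[1:-1],   # anything in between transforms it
        "sink": modules[-1],           # last module writes the data out
    }


if __name__ == "__main__":
    print(parse_stream_definition("http | hdfs"))
```

The point of the sketch is the shape of the model: the user names modules and their order, and the runtime is responsible for wiring them together.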
InfoQ: The major theme for the 1.1 GA release is streaming. What motivated adding support for Reactor, RxJava and Spark?
We wanted to provide a way to do stream processing using the functional style APIs offered in these projects. With our modular architecture, that means those APIs can now be connected within a stream to any of the data sources or sinks that are available now or in the future.
InfoQ: Of the streaming libraries mentioned, which was the easiest to integrate with? The most difficult?
First of all, just to be clear, our main goal was to ensure that the developer experience for using these streaming libraries is as simple as possible and also as consistent as possible. So, here we’re discussing the effort required to integrate those libraries within Spring XD itself in such a way that it met that goal. Integrating with Reactor and RxJava was much easier than integrating with Spark Streaming. The latter required more than API-level integration since we are delegating workloads to the Spark cluster, which involves custom serialization mechanisms and a lifecycle management model that is still evolving. We even ended up submitting a couple of pull requests to the Spark Streaming project along the way, but we were glad to contribute. An interesting aspect of integrating with RxJava and Reactor was how to assign incoming messages to specific stream instances. We can assign all the incoming events across processing threads to a single stream instance, so you can calculate global state, but we can also assign events to multiple stream instances, one per processing thread, since stream instances are not thread-safe. You can further divide this into finer-grained stream instances, for example by mapping a stream instance to a Kafka partition ID. In the case of Kafka we ensure that all events for a given partition will be handled by the same thread, so ordering is preserved.
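The partition-affinity idea described in this answer can be sketched in Python. This is a simplified model under stated assumptions, not Spring XD's implementation: `PartitionedDispatcher`, `process_in_order`, the worker count, and the modulo mapping are all hypothetical. Each partition ID is always routed to the same single-threaded worker queue, so events from one partition are handled in the order they were dispatched.

```python
import queue
import threading


class PartitionedDispatcher:
    """Route each partition's events to one fixed worker thread (hypothetical)."""

    def __init__(self, num_workers, handler):
        self._queues = [queue.Queue() for _ in range(num_workers)]
        self._handler = handler
        self._threads = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self._queues
        ]
        for t in self._threads:
            t.start()

    def _run(self, q):
        # Each worker drains exactly one queue, in FIFO order.
        while True:
            item = q.get()
            if item is None:  # sentinel: stop this worker
                break
            partition_id, event = item
            self._handler(partition_id, event)

    def dispatch(self, partition_id, event):
        # The same partition always maps to the same queue, hence the same thread.
        index = partition_id % len(self._queues)
        self._queues[index].put((partition_id, event))

    def close(self):
        # Ask every worker to stop, then wait for queued events to be handled.
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()


def process_in_order(events, num_workers=2):
    """Dispatch (partition_id, event) pairs; return the events seen per partition."""
    seen = {}
    lock = threading.Lock()

    def handler(partition_id, event):
        with lock:
            seen.setdefault(partition_id, []).append(event)

    dispatcher = PartitionedDispatcher(num_workers, handler)
    for partition_id, event in events:
        dispatcher.dispatch(partition_id, event)
    dispatcher.close()
    return seen
```

Several partitions may share a worker in this sketch, but no partition is ever split across workers, which is the property that keeps per-partition ordering intact.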
InfoQ: According to a recent Typesafe report, the Scala community is showing intense interest in Apache Spark (88% of Spark users are working in Scala, 44% in Java, 22% in Python). You have a Scala API; do you see a lot of developers using it?
We provide support for Spark Streaming processors or sinks within a Spring XD stream. Those can be written in either Java or Scala, and in both cases it's just a matter of implementing a single interface that takes the DStream as input and (for a processor, but not for a sink) produces a DStream as output. In other words, there is no intrusion of any other API, just pure Spark. The Spark Streaming support is new as of this 1.1 release, but for those modules, we expect most developers will choose Scala, since as you pointed out in the question, that is the most common choice for Spark users in general. We also have a Scala DSL for Spring Integration, and that can be used for custom modules as well. Given that the Java 8 DSL is also based on lambdas and thus quite concise, there's less incentive to use Scala for that reason. That said, because implementing a custom module is completely isolated to the code that produces and/or consumes messages, it's more convenient for developers who wish to use Scala in those well-defined extension points than trying to fit it into some monolithic system that is not itself Scala-based.
InfoQ: What are your plans for the future of Spring XD?
Though we believe in shorter iterations and incremental deliverables, we have aggressive goals for 2015. Spring XD on Pivotal Cloud Foundry is very much an active work-in-progress, and we envision closing the gap between Big Data and Cloud through a straightforward developer and devops experience. Having Spring XD as a Pivotal Cloud Foundry service allows us to build domain-specific PaaS solutions such as “Telco as a service” and “Healthcare as a service”. We are also building an HTML5-based UI Canvas to further simplify the developer experience through drag-and-drop support for out-of-the-box modules, along with add-ons such as metrics and monitoring capabilities. An Apache Ambari plugin for Spring XD is another short-term goal in our pipeline.
InfoQ: What's the best way to get started with Spring XD?
You can get started with Spring XD in less than 5 minutes. Depending on the operating system, you have a few options. OS X users can install Spring XD using Homebrew. RedHat/CentOS users can install Spring XD using the yum repository. If you prefer downloading the bits directly, you have the option of manually setting up the environment. The getting-started guide covers the steps and other general requirements.
To learn more about Spring XD, see Charles Humble's article titled Introducing Spring XD, a Runtime Environment for Big Data Applications.