Interview: Pervasive's Jim Falgout on Multi Core Programming and Data Flow

Pervasive Software's DataRush product brings the concept of data flow to Java. The company's blog recently explained data flow using the following description of a Kahn process network (KPN):

KPN is a common model for describing signal processing systems where infinite streams of data are incrementally transformed by processes executing in sequence or parallel. Despite parallel processes, multitasking or parallelism are not required for executing this model.

In a KPN, processes communicate via unbounded FIFO channels. Processes read and write atomic data elements or tokens from and to channels. Writing to a channel is non-blocking, i.e. it always succeeds and does not stall the process, while reading from a channel is blocking, i.e. a process that reads from an empty channel will stall and can only continue when the channel contains sufficient data items (tokens). Processes are not allowed to test an input channel for existence of tokens without consuming them. Given a specific input (token) history for a process, the process must be deterministic so that it always produces the same outputs (tokens). Timing or execution order of processes must not affect the result and therefore testing input channels for tokens is forbidden.
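The channel rules quoted above can be sketched in plain Java. This is only an illustration built on `java.util.concurrent.LinkedBlockingQueue`, not DataRush code: the class name `KpnSketch`, the `run` method, and the `EOS` end-of-stream marker are all assumptions made for the example. A real KPN stream is infinite, so the marker exists only to let the demo terminate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Kahn-style pipeline sketch: source -> square -> sink, linked by FIFO channels. */
final class KpnSketch {
    // Hypothetical end-of-stream marker; not part of the KPN model itself.
    private static final int EOS = Integer.MIN_VALUE;

    static List<Integer> run(int count) throws InterruptedException {
        // A LinkedBlockingQueue created without a capacity is effectively unbounded,
        // so add() does not stall the writer, while take() blocks on an empty queue.
        BlockingQueue<Integer> numbers = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> squares = new LinkedBlockingQueue<>();

        // Process 1: writes tokens 0..count-1, then the marker.
        Thread source = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                numbers.add(i);
            }
            numbers.add(EOS);
        });

        // Process 2: blocking read, transform, non-blocking write.
        Thread square = new Thread(() -> {
            try {
                while (true) {
                    int token = numbers.take(); // consumes the token; no peeking
                    if (token == EOS) {
                        squares.add(EOS);
                        return;
                    }
                    squares.add(token * token);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        square.start();

        // Process 3 (the calling thread): drains the final channel in FIFO order.
        List<Integer> results = new ArrayList<>();
        for (int token = squares.take(); token != EOS; token = squares.take()) {
            results.add(token);
        }
        source.join();
        square.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(5)); // prints [0, 1, 4, 9, 16]
    }
}
```

Because each process only ever performs blocking reads and never tests a channel for emptiness, the result is the same no matter how the threads are scheduled.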

The company also recently announced a new release candidate (RC) of the next version of the product, which includes a major rewrite of both the dataflow engine and the library to eliminate XML as the composition language. InfoQ recently sat down with Pervasive architect Jim Falgout to discuss DataRush and the data flow concept.

As you mention in your FAQ, the industry has addressed scale in recent years with application fabrics, grids, and clusters, all containing multiple machines. You are instead betting on the simplicity of getting all of the performance possible out of a single machine with many cores. That is your strategy, right?

For many problems it is. The commoditization of SMP is in full swing. Most people don’t realize the massive horsepower you can get in a commodity box, and it’s just the beginning of the multicore SMP commoditization curve that will race forward in the next 5-10 years. I just went online and priced an example of this – and guess what? For a little over $9,000 you can get 16 high-speed 64-bit cores, 32 gigs of RAM and 4 terabytes of hard disk. The amount of storage you can connect today to a 16 or 32 core machine is impressive. And you still get the familiar high productivity of a single address space.

However, there is a scale of data problems, commonly called internet scale, that is beyond the capacity of a single machine. For that scale of problem, which is becoming more common, a larger architecture is required. To reach that audience, we are looking at several different avenues to provide a clustered solution using Pervasive DataRush. For example, integrating DataRush with a system such as Hadoop could allow Hadoop to provide coarse-grained (machine node) parallelism and DataRush to provide fine-grained (core) parallelism.

Does this mean we no longer believe in the “revenge of the SMP” class machine? Not at all. For some data-intensive problems, these kinds of robust SMP machines will elegantly fill that need. But we recognize that there will always exist a class of problems whose solution requires scaling out beyond even the SMP machines available today.

Data flow programming seems to be a concept that developers are still wrapping their heads around. How would you quickly explain it?

The best analogy I’ve found is to compare it to Unix (Linux) command line piping. If you’ve used Unix or Linux before, you are familiar with the ability to pipeline commands together to achieve some overall purpose. Each command is unaware of the others in the pipeline in the sense that, when the command was programmed, it was written to a specific contract. This contract is to accept standard input and write to standard output. As long as commands adhere to that protocol, they can be “wired” together in ways that the original programmer may not have considered. Shell programming in this sense is very simple. It also provides the ability to utilize coarse-grained pipelined parallelism. Each command works independently of the others, so they can concurrently consume system resources (i.e. processing cores).

Dataflow programming is very similar in concept. Each operator in a dataflow graph is a piece of code that adheres to a contract. Much like a Unix shell command, the dataflow operator reads from its input queues, transforms the data in some way, and writes to its output queues. The dataflow application developer can use many operators, wiring them together to produce an overall application. Beyond the capability of shell programming, dataflow programming provides the ability to have multiple inputs and multiple outputs. Dataflow programming can also provide other parallelism constructs such as horizontal and vertical partitioning of data.
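To make the "multiple inputs and multiple outputs" point concrete, here is a small hand-rolled sketch in plain Java. It does not use the DataRush API; the class `DataflowGraphSketch`, the operators `source` and `sumAndProduct`, and the `EOS` marker are hypothetical names chosen for the example. Each operator is a `Runnable` that follows the same contract: read from input queues, transform, write to output queues.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Hand-wired dataflow graph: two sources feed one operator with two outputs. */
final class DataflowGraphSketch {
    private static final long EOS = Long.MIN_VALUE; // hypothetical end-of-stream marker

    /** Source operator: no inputs, one output. */
    static Runnable source(long[] values, BlockingQueue<Long> out) {
        return () -> {
            for (long v : values) {
                out.add(v);
            }
            out.add(EOS);
        };
    }

    /** Operator with two inputs and two outputs: emits a+b and a*b per pair. */
    static Runnable sumAndProduct(BlockingQueue<Long> left, BlockingQueue<Long> right,
                                  BlockingQueue<Long> sums, BlockingQueue<Long> products) {
        return () -> {
            try {
                while (true) {
                    long a = left.take();
                    long b = right.take();
                    if (a == EOS || b == EOS) {
                        break; // stop as soon as either input is exhausted
                    }
                    sums.add(a + b);
                    products.add(a * b);
                }
                sums.add(EOS);
                products.add(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** Sink: collects one output stream into a list. */
    static List<Long> drain(BlockingQueue<Long> in) throws InterruptedException {
        List<Long> out = new ArrayList<>();
        for (long v = in.take(); v != EOS; v = in.take()) {
            out.add(v);
        }
        return out;
    }

    /** Wires the graph together and returns [sums, products]. */
    static List<List<Long>> run(long[] xs, long[] ys) throws InterruptedException {
        BlockingQueue<Long> left = new LinkedBlockingQueue<>();
        BlockingQueue<Long> right = new LinkedBlockingQueue<>();
        BlockingQueue<Long> sums = new LinkedBlockingQueue<>();
        BlockingQueue<Long> products = new LinkedBlockingQueue<>();

        ExecutorService pool = Executors.newFixedThreadPool(3); // one thread per operator
        try {
            pool.execute(source(xs, left));
            pool.execute(source(ys, right));
            pool.execute(sumAndProduct(left, right, sums, products));
            return Arrays.asList(drain(sums), drain(products));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(new long[] {1, 2, 3}, new long[] {4, 5, 6}));
        // prints [[5, 7, 9], [4, 10, 18]]
    }
}
```

As with shell pipes, none of the operators knows what it is connected to; the wiring lives entirely in `run`, which is where a framework would take over.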

How do you complement the existing java.util.concurrent APIs?

We utilize the java.util.concurrent APIs that became available in the Java 5 release to build DataRush. One of our main precepts is to do the “hard work” of parallel programming so the user of DataRush doesn’t have to.

Looking forward to the Java 7 release, which will include the Fork-Join and ParallelArray constructs, we feel we are a natural fit for those algorithms or datasets that extend beyond what you can do in memory. While memory becomes cheaper and denser, the size of problems continues to increase. This constrains the type and size of problems you can solve with strictly in-memory methods. The style of dataflow programming is not that different from that used with ParallelArray. Dataflow programming, however, extends that capability to include massive datasets that go beyond the abilities of current memory systems.

One way we’ve thought of DataRush from a positioning standpoint is this: the Java 7 constructs of Fork-Join and ParallelArray provide excellent abilities for in-memory problems on a single machine. DataRush extends that by adding the ability to solve problems beyond the constraints of memory and to support many inputs and outputs, but is targeted at single-machine solutions. Hadoop and similar grid-enabled frameworks add scale-out capability.
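For comparison, the in-memory style mentioned above looks roughly like the following when written against `ForkJoinPool` and `RecursiveTask` from `java.util.concurrent`. This is a minimal sketch: the class name `ForkJoinSumSketch` and the `THRESHOLD` cutoff are assumptions for illustration, and ParallelArray is not shown. Note that the whole dataset must already be in memory as an array, which is exactly the constraint the answer describes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Divide-and-conquer sum over an in-memory array using fork/join. */
final class ForkJoinSumSketch extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // assumed cutoff for sequential work

    private final long[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    ForkJoinSumSketch(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: sum sequentially.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Otherwise split in half, run the left half asynchronously,
        // compute the right half in this thread, then combine.
        int mid = from + (to - from) / 2;
        ForkJoinSumSketch left = new ForkJoinSumSketch(data, from, mid);
        ForkJoinSumSketch right = new ForkJoinSumSketch(data, mid, to);
        left.fork();
        long rightSum = right.compute();
        return left.join() + rightSum;
    }

    /** Convenience entry point: sums the whole array on a temporary pool. */
    static long sum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new ForkJoinSumSketch(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data)); // prints 50005000
    }
}
```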

What is the most surprising thing that Pervasive has found while developing DataRush?

Probably one of the most surprising things we run into every day is how little mainstream awareness there is about multicore and the software issues around it. The media coverage for multicore has been extensive for several years now, so people know of the issues. But we are surprised at how many organizations have no idea how inexpensive super-computers are today and what is available. When we talk about 16-core machines, customers still think of that capability as way too expensive and years away. However, a 16-core machine stacked with memory and disk space can be purchased for under $10K.

Another surprising thing is how regularly people default to SQL as the hammer for every data-intensive nail when, in fact, an RDBMS can be very inefficient for certain types of data-intensive analytic applications. We recently cut a roll-up process for a customer from 3 hours to 22 seconds by getting it out of a relational database and onto DataRush.

As for the software issues, it appears that many organizations are waiting for the “big” companies to solve the programming model problems for them, somehow automagically, even though industry leaders such as Tim Mattson from Intel have stated emphatically that there will be no auto-compiler that will take existing code bases and parallelize them automatically. Software architects at all levels need to realize this and start investigating different programming models such as dataflow to help them build parallel-enabled software.

While you provide a number of tools to assist developers, is it really as simple as maximizing CPU usage?

No, not at all. As cores become more abundant, compute resources will no longer be in short supply. At that point I/O capabilities become key and can quickly become a bottleneck. DataRush provides connectors to different data sources that attempt to parallelize the I/O as well as parallelize the computation. Large data-size problems are as much about getting to the data quickly as they are about running the computations quickly.

That said, for the computationally intense parts of a problem, utilizing the available CPUs is absolutely key to providing an efficient solution. With two or four cores, keeping those cores busy takes careful programming. Going to 16, 32 and more cores, it becomes very difficult. Having a framework that can help your solutions scale as machine sizes increase becomes critical. Otherwise developers will spend all of their time working on scaling issues instead of solving business problems that pay the bills.
