Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Interviews Billy Newport Discusses Parallel Programming in Java

Billy Newport Discusses Parallel Programming in Java


1. Hi, my name is Ryan Slobojan and I am here with Billy Newport, Distinguished Engineer at IBM software group. Billy, what are some of the challenges that exist for doing parallel programming effectively in Java?

I think that the problem is that the level of abstraction isn't there yet and I think for a lot of people it's still rocket science. There's probably a couple of mad scientists somewhere that can design data structures that scale up, like Doug Lea, Brian Goetz, you know, people in IBM that have done it before, but it's not common knowledge how to do it. And I think there has been a lack of focus on frameworks, for doing parallel programming even in the multi-core cases for example, so you have Doug Lea who has obviously been super good at moving the state of the art forward in Java with the Java concurrency packages. Now he is working on the Fork/Join stuff, but I think we still need something easy, maybe sets of data structures that you can plug in visitors to, so that it just works for most people. And then I think people need to stop thinking about multi-core as meaning multi-core single process, and start thinking about multi-core as meaning multi-core period, where there could be a number of machines cooperatively holding the data that you want to process. And I can give examples like Hadoop or whatever Google is doing, stuff like that there, but Hadoop is very file focused so I think we need something like Hadoop, or Apache Hive, or Pig or stuff like that but for Java.

We were talking with Chris Wensel about his Cascading framework to say, "Well maybe could Cascading become like a high level framework for working with data structures, like a scheduler for describing your algorithm in very high level terms, in terms of grouping and things like that, and then let a scheduler figure how to actually run it, execute it on a thread pool distributed over several processes where the data has been split up, or if it was in a file". But why should there be a difference really in terms of the programming model for doing programming for Hadoop or programming for data structures in memory, or programming for data kept in a data grid. It would be nice if there is one thing for everyone to learn that we could then build up some assets and collateral, like around common data structures, common algorithms for processing data structures; and those algorithms could then be written once. And in this language it is the language's job, the compiler's job is to really map that high level language into the primitives that the scheduler then executes, places on the grid or places in thread pools to actually make it run.

That's one of the things that concerns me is that there is not enough work being done, so why don't we try to solve it, we are doing lots of other things as well. We have something like X10 in IBM research which is something we're doing in that area, but I think that is a wide open space; I don't see anybody having the answer. I see product specific solutions like Cascading, Pig, Apache Hive, Hadoop, stuff like that, but I don't see anybody making a generic framework that you can code at a very high level to, so that you don't have to worry about all the concurrency aspects of it, and scheduling that would aim that you write code as portable across scale if you want to be able to do parallel programming. So I think that would be an interesting place for the next open source company to get bought by somebody.


2. There are several approaches right now to doing parallelism, there is the actor model, there is message passing, there are many different languages that are doing many different things. Which do you think is most likely to be the best match for Java?

I think none of them, because Actors is too low level; if I want to describe a search algorithm, I'd like to describe a search algorithm. I don't want to have to map it on to a threading model, or an actor model or write it myself with synchronization blocks or stuff like that. So the actor stuff and message passing seem really similar to me, in that it's almost like you are making a mail box for receiving events that you want the actor to process. There is only one thread mutating the state and the results go out to another actor to keep going forwards but it's very much like, as we were talking before, apartment level threading from COM back in the late nineties early 2000s. And it's not the level of abstraction I think that people are waiting for; if you give someone an actor, does that mean they are going to be able to code up a fully distributed search over it? No they are not going to be able to do it; you still have to be a smart guy.

So I think that actors are like primitives, but I think in terms of frameworks we need something much more abstract, where you can talk about what you are doing to the data structure and not how it's implemented. Because with an actor it has got nothing to do with the data structure, it's an actor, but if I just want to specify a Lisp comprehension in Scala, something like that, and then I want something to take that description and then parallelize it; probably the best example to something like that would be like Simon Peyton-Jones with Haskell, or the full functional languages where they have tried to do that, where you write your code and if it's pure code i.e without mutation, reference transparent code is what it is called, then it's relatively easy for a compiler to take an algorithm expressed in that form, and turn it into a parallel piece of code.

And Simon has done a ton of work in Haskell there and I am sure the Erlang guys are looking down that direction as well, but I think that high level view, and high level way of expressing algorithms, is what we need for doing parallel programming, not another set of primitives that someone could use to do it. Because now you are talking about how to do it not what you want to do and I think for most people it's too hard the how, what they want to do is the what. So I think that is the tricky part. And then the framework we talked about in the first question is basically that's what we need higher, and maybe Java needs a level on top of it to make that possible but I think actors are just a primitive, they don't really make it; they solve certain kinds of concurrency problem, but you still have to know how to do it. And I think what you really want is a smart guy to write a compiler to run your code in parallel on, and it might not be as good as a rocket scientist could do but it's certainly better than the average guy could do, which is really the game; so I think that is what we need again.


3. It seems that one of the major trends in the Java industry over the last few years has been simplification. Is this what you are seeing, that this kind of simplification, this raising of the abstraction, that this will enable parallel programming to be much more effective?

Yes, because I think that as we are seeing - I wrote a blog entry, it's probably one of my most popular blog entries ever, where I said "Multi-core was bad for Java". And then I got descended on for that but it's still true because while systems programmers, I had people from BEA and Sun retort "That's not true", because it's in their interest to say it's not true. And they were saying GC scales perfectly with the number of cores, and of course it does because the guy that is conversing with me is working at BEA or Sun and he is an expert in multi-threaded programming so surprise, surprise, GC scales. But it's not really about GC scaling, it's about the code the JVM is running scaling and that's the issue.

So I think there has been lots of moves towards making things easier, like Scala is an example down that direction, like they built in support for Lisp processing, Lisp comprehensions; they all have been "stolen" in functional languages before, but they are now available in Java, or at least running on a Java JVM, so you can make your code more precise and not have to worry about the boilerplate code of writing for loops with indexes and this.

And in a way that's a level of abstraction, to make expressing the algorithm cleaner, and I think there is still a lot of room in Java, especially around the multi-core stuff, because as cores slow down, and make no mistake they are slowing down: I mean you could get single cores running at 3.84 GHz two or three years ago and now the Nehalem stuff is running at 2.2 GHz, there's ones at 2.93; they have turbo boost mode to make one core faster if the other cores are not doing anything, but on what servers is there only one core at work? Most of the time they are all busy so the clock speed is going to be 2.2, 2.3 GHz and we've seen customers complain that they were on a Xeon box before with two cores, we gave them a Quad core one and then their database query times went up, because the clock speed on the Quad core was lower than the clock speed on the double core. So like the database is single-threaded processing the queries and so the queries take longer to run or page views take longer to come up, but it has more processing power in a benchmark because an expert wrote the multi threaded code.

So I think as we get used to slower, wider processors and then as well as that in the multiple machine space, where we have large data sets that don't fit, I think there is plenty of room right now to try to come up with an abstraction so that, like we said before, people can talk about what they want to do, and express what they want to do not how they want to do it. And I guess if I was going to make an open source company now that's what I would make; I would make a company that builds a new abstraction for doing parallel data structure processing and I'd make it work so that it ports on top of stuff in one JVM, stuff on the popular data grids, Hadoop type stuff. And I think that would be a fairly powerful product to make available and it would have a large audience, if it was easy to consume, because there is a lot of people with this problem trying to figure out how to use this 32 way Niagara box with two sockets, you know with sixteen cores per socket, how do we make this go fast? They can write the algorithms in terms of loops, but that means it's not parallel. So they need a way to express the algorithm where the JVM can actually make it run fast and we can make a plug-in for the JVM, maybe to run it in parallel on multiple machines if that is what it takes.

But I think that what he said in terms of making things, the way to succeed is not with features, it's letting people do what they were doing before faster. And that means your product can get adopted faster because it's a no brainer; drop it in, "da, da, da boom", its working and you are in there versus having to go through the normal cycle of selling them a new feature, waiting for it to get deployed.

And we took that to heart with eXtreme Scale, in that we could have designed eXtreme Scale to be part of WebSphere and to only work on WebSphere 7 or something, but that's really slowing down the adoption curve for a new product, so instead we designed it to work with just J2SE 1.4, 5 or 6, we designed it to work with Web Sphere 6.0, 6.1, 7, JBoss, Tomcat, WebLogic, because we are trying to learn a lesson from the open source guys which is the reason I think Hibernate, SpringSource are so successful is exactly that; people could just get a jar, drop it in to what they were doing already and with very little overhead get using it straight away, so think viral middleware. We took the same approach with eXtreme Scale where we designed this to be very easy to download, very easy to start using in your existing environment, so we are getting rid of all the barriers that would stop you getting productive at something.

And in the same way I think if somebody can build a multi-core abstraction for dealing with data structures at a very high level, like a Haskell type thing that runs in Java, something like that, with efficient compilers and schedulers for the various paradigms doing parallel programming like Hadoop, Doug's stuff, and the single JVM, data grids, I think there would be a fairly big market for something like that.


4. Does .Net's LINQ feature do this sort of querying in a parallel fashion, somewhat how you are describing?

Not really. I mean LINQ does queries to databases and then the results come back and it does another set of queries on those to give you the answer back, but that's not the same thing. It's not running it in parallel; it's not compiling it down to like what Cascading does where you specify in Groovy with their DSL a high level description of what you want to do and then the Cascading compiler takes that and then figures out how to run it in parallel on all these boxes. I mean that's not what LINQ does; that's not what SQL does, although SQL could but I think it's too hard to do. I mean it's too hard in SQL to compile SQL down to run on a partition data model. That's state of the art stuff even today.

I was thinking of attempting to do it and got told to run as far as I can away from that problem because lots of people tried to do that before. And you see Facebook doing stuff like Apache Hive, which is an attempt to do it, but it's not meant for real time. It's meant to give people that are storing data in Hadoop an easier way than using Pig to write the scripts to analyze all the log files that they are collecting from the web servers and things like that. So they write them in SQL instead and then they've got a very simple mapping to take the SQL, convert it into Pig, and then run it on the VM, or maybe not into Pig but into some other language that they can run on the boxes. But it still doesn't help the cases where you are doing joins on columns that aren't partition predicates. It works in the simpler cases - I can make anything work in the simpler case - but it's the edge cases, which are very common, that tend to be the issue. So I think that is really the problem.


5. With this automated parallelism across multiple VMs and multiple processes, there would be issues around automatic discovery of things. How would you prevent for instance random node A from clustering with random node B and causing unforeseen issues?

I think that is mainly an issue with multicast based products where they discover other JVMs that are on the same multicast port and then they assume they are in the same cluster and they all come up, and that's the way some of the competitor products to us work; I guess like Oracle's product. But in eXtreme Scale, for example, we don't do things that way in that what we have is a clustered management server, that we call a catalogue service, and it's the well known thing that everybody knows about to bootstrap into the grid, and for servers in the grid to bootstrap into the grid as well as clients.

And then when your container starts up it will connect to that service and you can have that service running on your laptop. There is nobody else using your service to configure your clustered JVMs so it's completely separate. Because we're not multicasting so there is no danger, unless somebody is being malicious. And of course with Coherence obviously you can use different port numbers in the multicast groups but then you have to know what port numbers everyone else is using to get around it I guess, right.

So the way we have done it, traditionally IBM customers have been adverse to running multicast just because nobody wants to open up TTL flooding across routers, things like that. So TCP was always more preferred. In the early days UDP had a big advantage over TCP mainly because Java 1.3 didn't have non-blocking IO, so if you used TCP IP sockets in 1.3 you had to have a thread per socket, which didn't scale that well. But if you went UDP it was connection-less so you effectively had non blocking IO if you used UDP on Java 1.3, but when we started doing our stuff we had non-blocking IO so we don't need UDP to do it. And no one runs 1.3 anymore I guess, or it's not that common.

And the other thing is that TCP stacks these days, rumor has it the Internet has caused TCP stacks on most operating systems to be totally optimized. There is no way, I think, that you are going to write a reliable streaming protocol on top of UDP that's going to be faster, or use less CPU power, than the Linux TCP stack, or the Microsoft one, or the AIX one, or some network cards have it built into the existing hardware. So how are you going to beat that? We use TCP IP and we are proud of it, because there is no point in trying to beat that game, that game is over, and TCP is a heavily optimized protocol now, it's fair to say, so there is no point in using UDP anymore.

There might be for some specialist things but in general the differentiator between product A and product B is not because it's using UDP or TCP. The chances are the TCP product is going to be faster because you have got all the man years that have gone into optimizing TCP stacks and all that code is written in C; you are not writing it in Java and it's not generating garbage and it's just a battle that is not worth fighting so we use TCP for that reason.

And so those kinds of issues don't really come up because it's an architectural thing; the multicast-based products you have to have a well known white board where you write down which port number you are going to use or a WIKI page, or something like that, or in products like ours it doesn't enter into it because you just start up your own management service, and you hook your containers up to it, and you have your own grid, so it doesn't arise anyway, and that is the way it works.


6. One of the other challenges that I can see is there is a cost to distributing work to another node because IO is orders of magnitude slower than just doing work on a local machine. So I need to do some kind of up-front analysis of, this is the amount of work that I have and it's probably going to take at least this much so there is benefit to farming this out to one of the other VMs, having it calculated there and then returned. How do you do the scheduling?

That's one of the problems we talked about earlier. I mean one of the challenges for using the data grid stuff is figuring out where is the best place to run the code. And then what you will see is data grids will be optimized or partitioned a certain way, for a certain use case. But then what can happen is another use case comes along and from a business point of view they don't see the problem, "Well I bought a grid, run this algorithm against it". But the way the data has been laid out is optimized for algorithm A and then when algorithm B comes along it might require the data to be partitioned a whole other way to run efficiently, and that's the big challenge.

Products like Hadoop solve that by basically resorting the data, because there is infinite disk space. They run for a couple of days, resorting the petabyte of data, to be ordered on this key now instead of that key, and then they run the algorithm on that one, at that point; but that's a batch type thing so if you are working with lots, hundreds, of terabytes or petabytes of data that is what you expect. But in data grids they run on memory typically and memory is still expensive so it's not infinite. We can look at the emerging technologies like SSDs to overflow, but even SSDs are still not as cheap as disk and one of the reasons for using a data grid instead of something like Hadoop is to keep the data in memory so that it's fast. If you are going to keep all the data on disk you may as well use Hadoop.

We've been looking at in the lab things like, can we combine Hadoop type stuff with eXtreme Scale so that Hadoop is a scheduler for managing work on top of a data grid and not on HTFS - their file system. Why couldn't it be? We have got a research project going on that is looking at that. I have talked with Chris Wensel obviously at Cascading to see if he could make the Cascading engine opened up to where we could plug in data grids instead of Hadoop, same type thing, where you have a higher level abstraction.

So I think it all comes down to the skill level of the people and I think we are making it as easy as we can. Anthony Chaves just wrote a book on eXtreme Scale and he talked to me before he wrote the book and I was like, "Well you could write a book about just APIs in eXtreme Scale, but that is not that interesting because the APIs are going to evolve over time so the book will have a short shelf live, if that's the book that you wrote. But would it make more sense to talk about how to write algorithms that run on a data grid, and the problems you are trying to solve", and I think that is what he did.

The book, whilst it is about eXtreme Scale, a lot of the value of the book is that it is trying to educate people to be able to code algorithms that will run efficiently on any data grid, eXtreme Scale or not. So it's a great book to read from that point of view because it hasn't been written from the usual generic professional, "Blah and it's just product documentation that you are buying in the book". I certainly spent a lot of time talking to him about customer problems, anonymously of course, but the issues they were seeing, and how we solved them.

And the game was to just give enough of them to get Anthony's brain re-wired so now he is seeing "That's how it works", and the game of the book really was to try and give the reader of the book that experience where you throw scenarios at him, and then slowly the dots join and then the guy gets it and then he can actually write the code efficiently. Because that's one of the major challenges with any kind of a product, is learning the patterns for applying it effectively, and I think Anthony's book certainly goes a long way towards those kinds of patterns for data grids.


7. It seems in a lot of ways that what's being described has many parallels with an operating system. You have, for instance, the data grid which is like the RAM, Hadoop which is like paging out stuff from the RAM. You have the distribution of a scheduler, you have the different processes which could be considered something akin to threads. And do you see parallelism and grids as being something like an operating system on top of a series of larger pieces, but almost recreating physical machine?

Absolutely, because if you look at a normal operating system, for example paging, it is very stupid - I mean the way it pages things it has no idea what is going on or what pages the program will try to do next. They try to do FIFO, LIFO, they have LRU, different algorithms, but they don't know, and they don't let the application plug in a different algorithm that suits what is running.

Something like Hadoop is optimized for streaming data; it tries to read through a very large file, like a 60MB file or whatever is the chunk you are going through, but it's not loading the whole 60MB file up at once and then the OS could page it in and out, which would be one way to do it. But it's cleverer than that. It knows it is going work on that file a record at a time, so it reads a block of records in, you zip through them, you get an output that is written to disk, you read the next block of records in. You are not paging but you are still working with a much bigger data set. So I think they can definitely add value on top of operating systems in that they have better knowledge.

And once you have these higher level abstractions for describing what you want to do, then the how to can be done in a way where someone that's expert in memory management, it can run in that kind of a model, whereas if you left that to a naïve programmer that's inexperienced in something like that he might actually read the whole file into RAM and let the OS page it, and the difference in performance is massive.

I think that OSs are cool but they've never really had enough hooks so people have to write paging, how to page stuff in and out of disk manually to really get the performance they want. I think the higher level abstractions will let them do that in cleverer ways for certain types of algorithm and then more and more people will get better performance because they are not trying to reinvent the wheel; they are using patterns which have been codified, implemented, tested and you are not the only one discovering for the first time how to write an algorithm.

If you look at the cloud and stuff, what's an operating system? If you look at people running PaaS (platforms as a service) you don't even see the boxes. You give it the app and the app runs, so you could treat the cloud as a box with an operating system, but under the covers it is doing a lot more than a conventional OS does. So I think operating system is kind of, in the past it was what controlled a single machine, but I think now it's much bigger than that; it has parallels in what we talked about before in terms of multi-core programming. Multi-core programming is not just about multi-core programming on a single process anymore, and operating systems are not about controlling the resources of a single box anymore.

You have Hypervisor, which is like an operating system for an operating system, you have provisioners, which are like operating systems for provisioning out VMWare, or Zen, or Z stuff. So I think the definitions are going to change in terms of an operating system because people will get bored with inventing new terms for different kinds of operating operating systems. So it's a big topic - an exciting space to be in in that area.

Apr 16, 2010