Bio: James Spooner, VP of Acceleration at Maxeler Technologies, is responsible for project delivery with Maxeler's acceleration customers. Working predominantly in Oil & Gas and Finance, James and his team deliver solutions in seismic modelling, data processing, fixed-income analytics and trading platforms.
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
Hi. I work for Maxeler. I’m the VP of Acceleration; so what does that mean? I’ve got a team of people who work with customers to apply data flow computing to their problems and make things 20-40 times faster which means a reduction in space and power, for instance.
Our main focus is oil, gas and finance; they seem to be the people with the right sort of problems that we actually focus on and get the speed-ups, but it applies equally well in other big compute industries like biotech and pharma.
It’s a bit of both; we have some very good tools and very good people, but it’s actually the approach: it’s about making the most use of the silicon. The semiconductor industry did a very good job, up until sort of the middle part of last decade, of using silicon to make computers faster without the programmers having to make any extra effort. But come 2007, the clock frequency stopped getting faster. They keep adding transistors and they add other things, but it gets to a point where the memory is limiting you and the amount of power is limiting you; the amount of power you can dissipate in a chip is around 130-150 watts now, and you can’t do anything more with that. You can’t raise the frequency, so you have to start being very clever about how you use silicon.
Right. We use silicon from a bunch of different places, and we do use control flow silicon, which is what a CPU is doing, control flow, and it’s pretty good at doing things like popping up a window when you get a chat message or something like that. But when you’re starting to do massively data parallel operations, or you want to do something very specific for a very large amount of data, then you start running into inefficiencies; a lot of what’s in your CPU is there to make the programming task easier and abstract away having to think about how to use the CPU.
That’s not entirely true. The CPU is designed to make serial programming easy and what’s happened over the last five to ten years is that serial programming isn’t the solution any more. Parallel programming, there’s a lot of research out there that’s sort of the Holy Grail of computer science where people try to make automatically parallelizing compilers and all sorts of stuff; but fundamentally, having to think about multiple things at once is just something that requires a bit of thought.
Data flow is really saying, "Hey, look, we know what our heritage is; we know that control flow and the sort of serial execution did us well for a long time but let’s start re-thinking the way we use the silicon and actually getting more out of it."
We use basically a Java environment which allows our programmers and our customers to express how the data should flow through the silicon. There’s nothing complex about it; there’s no clever parallel inference or anything like that. It’s very simple, and it makes it very quick to take what your data flow used to be and compile it down onto something which runs on faster hardware.
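To make "expressing how the data should flow" a little more concrete, here is a minimal sketch in plain Python. It is not the Java environment described above and does not use any Maxeler API; the node names (`moving_average`, `scale`, `build_graph`) are hypothetical and chosen only for illustration. The point it tries to show is that the graph is wired up once and the data then streams through it.

```python
from collections import deque


def moving_average(stream, window=3):
    """Emit the mean of the last `window` values once the window is full."""
    buf = deque(maxlen=window)
    for x in stream:
        buf.append(x)
        if len(buf) == window:
            yield sum(buf) / window


def scale(stream, factor):
    """Multiply every value flowing past by a constant."""
    for x in stream:
        yield x * factor


def build_graph(source):
    # Wire the nodes together once; the data then streams through them
    # one element at a time, with no per-element decisions about what runs next.
    return scale(moving_average(source, window=3), factor=2.0)


if __name__ == "__main__":
    print(list(build_graph(iter([1, 2, 3, 4, 5]))))
```

In this sketch the generators stand in for hardware pipeline stages; a real data flow engine would lay the operations out in silicon rather than interpret them.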
That’s correct or data flow engine as we like to call them.
They get turned into wires and transistors and all the other good things inside.
There’s a little bit of magic behind the scenes but there’s no writing of hardware description language; it’s all abstracted away but it’s abstracted in a good way, what is being produced is clearly visible; for the implementation you don't really need to know how they’re all connected, you just need to know the maths and science of the problem.
It’s all done automatically, so we have what we describe as Maxeler OS, which is an operating system for a data flow engine. What does that mean? So you put an acceleration data flow engine inside your PC; it’s going to show up as a line in one of the /proc interfaces. You need something which actually takes the operating system that you have and extends it out to the thing that’s going to be doing the computing; so this is memory management, data streaming, streaming between the different data flow engines inside the chip, looking at multiple of them together, looking at utilization and, of course, the reconfiguration. If you reconfigure the silicon, you need that to happen, but I mean that’s all built into it; basically you just put what our compiler gives you into the same linker that your software already uses. You run the binary and it works; there’s no cut hands or cables or anything like that obviously there.
The flash is very quick, on the order of 100 milliseconds or so.
For the whole chip so you can take your program, run it and in 100 milliseconds it takes it, flashes it and the data flow is on there.
Yes, so there’s a lot of work behind the scenes because basically what you’re doing is you’re building a custom sort of chip for every different application; now obviously to do that normally it takes a team of Intel engineers, probably a few thousand people sitting in a room for a few years and you’ll get the architecture; we don’t do that. This is something that someone can do in an afternoon, just wire things up and you got the architecture.
At the end of the day, what we’re selling is the speed and the power efficiency and also what you can do with it because at the end of the day, you’ll find that you’re interested in doing the same thing that you’re doing now just on a different architecture. They’re interested in doing 100 times more simulations to work out what’s going to happen when Europe finally sorts itself out. They’re interested in working out whether to spend $100 million on a bore hole in an oil well; that’s the sort of thing they’re interested in doing, and if they can do ten times more computing, 100 times more computing within the same budget, it’s a clear one.
GPUs are interesting because there’s a control flow element in there as well, so there are pros and cons in doing that. We find that, because there’s a fixed architecture there, we still don’t get the full performance out of the underlying technology. What you really want is to take the business problem, if you like, where all they care about is answers per second or answers per dollar, and then you’ve got the silicon down here, and it’s about doing an optimization. The more flexibility you have, and the closer you work with the modelers and quants and the programmers all the way down to the wire, the more you’ll get out of it. It’s breaking the traditional horizontal stratification of IT and saying, "Look, we want to do this thing as fast as possible".
We do a review, sort of every six to twelve months or so, of all the technology to make sure; at the end of the day, we’re not wedded to any particular thing.
From our point of view, data flow has dramatic benefits; how you implement it is really a matter of who’s done the best R&D in the last two to three years.
There’s a bunch of different levels of parallelism. At the end of the day what you want to do is take, I guess, the transistors and do as much computing as possible while using as little power as possible, so it’s sort of a dichotomy there. One level is pipeline parallelism; that’s the traditional sort of parallelism that you use for raising clock frequencies in CPUs. Another is data parallelism, so we just lay out multiple pipelines.
The interesting thing is where you have this pipeline where the data is moving, not the instructions. You can talk about it with a sort of Formula 1 racing analogy: you have a single pit lane, a single pit crew and you have one car, okay, that’s fine. But if you have ten cars and they all decide to pit at once, the guy in the last car is going to have an order of magnitude more pit time than he should. Whereas if you did something like the Ford factory, where you have ten small mini pits and the cars are driven in, they get the wheels taken off and are shipped to the next one; they get the fuel put in and are shipped to the next one; they get the windscreen cleaned and so forth; you get something with cars that continue moving all the way through.
Now if you take that to the n-th degree and you get really really good pit people, the cars could keep driving through at 230 miles an hour through the pit; now, that’s kind of not something that you do in Formula 1 but it’s something that you can do in silicon.
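The pit lane analogy can be put into numbers with a small sketch. This is an illustrative model only: the function names are hypothetical, every car is assumed to arrive at time 0, each stage takes one time unit, and cars are allowed to queue freely between stations.

```python
def single_pit_finish_times(num_cars, stage_times):
    """One pit crew does every stage for a car before starting the next car."""
    service = sum(stage_times)
    return [service * (i + 1) for i in range(num_cars)]


def pipelined_finish_times(num_cars, stage_times):
    """Each stage is its own station; a car moves on as soon as the next station is free."""
    free_at = [0] * len(stage_times)  # time at which each station next becomes free
    finishes = []
    for _ in range(num_cars):
        t = 0  # assumption: all cars arrive at time 0
        for s, duration in enumerate(stage_times):
            start = max(t, free_at[s])  # wait for the car ahead to clear the station
            t = start + duration
            free_at[s] = t
        finishes.append(t)
    return finishes


if __name__ == "__main__":
    stages = [1, 1, 1]  # wheels, fuel, windscreen: one time unit each
    print(single_pit_finish_times(10, stages)[-1])
    print(pipelined_finish_times(10, stages)[-1])
```

Under these assumptions the last of ten cars leaves the single pit at time 30 but leaves the three-station pipeline at time 12, because once the pipeline is full one car completes every time unit.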
It’s a basic approach and what you find is it’s both throughput and latency; so what you find is you get different levels of parallelism. You’ve got very, very coarse grained parallelism, which is like the Cloud: you take something, you block it out and you put it on to multiple nodes in the cloud and they all do something, they return it. Then you’ve got sort of fine grained parallelism, where the CPU is doing data parallelism like SIMD and the rest of it. And then you have ultra fine grained parallelism, which is really where data flow comes in, where you’re taking all of these sort of operations and arranging them in such a way that there’s no scheduling to be done up front and you can stream things through.
The problem with coarse grained parallelism is that it doesn’t give you a latency advantage, it just gives you throughput. With fine grained parallelism, you pay a lot because the CPU is second guessing what is going on. So what you need is some sort of plan: you’ve got the control flow to organize what’s happening, but you’ve got the data flow to do the compute and to keep the latency low.
You put them in the rack, or we can bring a rack for it, depending on whether there’s a rack already. Essentially they look like any standardized equipment that has power, network, etc.; it really just integrates with the rest of the system. So if you want to, say, call out... when the compiler generates the data flow engine and the configuration to go on to the hardware, it also generates a software API, which is basically like instantiating objects. You say, "I want a new one of these," here’s the thing that I want to process, and you call the one function; you give it the data and it gives you back the results.
Now, the thing is you might want to do this on more than one core at once, and the Maxeler OS is designed to interleave all of those together, so you can use your 24 or your 32 hypothetic cores in your latest control flow processor to be queuing all of this data, sending it in and waiting for the response to come back. You can overlap all of those so the CPUs aren’t stalled either; so what you’re doing is using all of the silicon in the CPU and all of the silicon in the data flow, and when you overlap, that’s when you get this speed-up that we talk about.
We use a variety of interconnects; we have our own custom one, MaxRing, and we also use sort of more standard protocols like PCI Express, InfiniBand and 10 Gb Ethernet. It’s really anything with wires or fibers, we can hook it up, and it really depends on the application. If you want to sort of make the data available more like a Cloud system, where you have data flow engines available to multiple machines, you might want that shared over some very, very high speed network like InfiniBand; or, if you want something which is in an exchange, which is buying and selling stocks, you might want something different for that, depending on what suits you.
It’s quite interesting; in general, it’s the same rule you’d apply if you’re calling a function in any software. At the end of the day, you want that data to be as regular as possible; you don’t want to have lots of different loads of data and have an if statement working out which data it is; it just doesn’t make any sense. If you lay out the data nicely, you can get very, very good results; you can get lots of data parallelism without worrying about switching things around or converting types or something at the last minute.
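The calling pattern described here (create an object, call one function with the data, get results back, and do that from many CPU threads at once) might look roughly like the sketch below. `FakeEngine` and `run_overlapped` are hypothetical names invented for illustration; this is not the generated Maxeler API, and the "engine" is just ordinary Python standing in for the hardware.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    """Hypothetical stand-in for a generated engine interface (not a real API)."""

    def __init__(self, scale):
        self.scale = scale

    def process(self, block):
        # One call: hand over a block of data, get a block of results back.
        return [self.scale * x for x in block]


def run_overlapped(engine, blocks, workers=4):
    """Submit blocks from a pool of CPU threads and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(engine.process, blocks))


if __name__ == "__main__":
    engine = FakeEngine(scale=2)
    print(run_overlapped(engine, [[1, 2], [3], [4, 5, 6]]))
```

Because the stand-in engine is pure Python, this only shows the shape of the overlap; the benefit described in the interview would come from the threads waiting on real hardware while other threads prepare and queue the next blocks.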
It is interesting, I mean, I’ve noticed that there’s a lot of talk here at QCon about abstraction and abstraction versus simplicity; and abstraction versus parallelism; and these sorts of topics are interesting and using abstraction can be very seductive, and making very complex metaphors. You might even do this for job security if you like; if you want to take the business logic which you know very well and it’s very complicated and the algorithm which is very complicated.
If you wanted no one else to ever touch this code, you can tangle these things together and object orientation is a lovely way of doing it; you have...there’s this object here and this object here and you hide all the data inside - it’s wonderful.
It’s really sort of coming to a time now where people are hurting from that; IT budgets are hurting from people doing that and they should reconsider.
Of course, yes. Anytime you hide data, you restrict the ability to do parallelism, because you’re hiding it behind a reference; so you’re going through a cache miss just before you get the data. How can you possibly do vectorized operations? How do you know it’s going to be data aligned? What you want is, if you have a thousand things with ten elements, you want them in a 2D array of a thousand by ten. You might want to wrap that up in something just to describe what it is, and have some method to hide the implementation so you don’t have to worry about what to do with the data; but at that point, you’ve got the implementation in one place, you’ve got all the data together, and then whatever computer platform you’re using can be very, very effective with that data and exploit as much parallelism as there is available.
Sure. I talked a little bit about oil and gas and finance and I could get into that a little bit more. Generally these organizations have data centers, or multiple data centers, with thousands of cores; and generally speaking, their applications are not distributed; the distribution of applications is not even. You don’t have 10% in this application and 10% in this application etc; what you have is one application uses 30% of the data center, another application uses 40% of the data center, and the rest is crammed in there. The reason for that is just that fundamentally some things are harder than others: reverse time migration in the oil and gas space, calculating anything long dated or exotic or highly structured integrated space, multiple elimination in oil and gas. All of these things take a lot of memory, do a lot of computing, and they run for hours. When the ship comes back from surveying your oil field with containers and containers of tapes (there’s no 3G signal here, it’s tapes), they ship these containers into the data centers, they read them back off, and then they do computing, and they will spend months and months processing that data. Those use cases benefit amazingly. As time goes on and people tend to accept the heterogeneous architecture, the use cases are becoming more and more common and more and more simple, and we’re seeing a lot more of that at universities and all sorts of things; and of course, there’s the low latency stuff as well, which is a big topic.
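As a rough illustration of the "thousand by ten" layout, the sketch below contrasts one object per record with a single contiguous buffer wrapped in a small class. The class names `Item` and `Table` are hypothetical, and the standard-library `array` module is used only as a simple way to get a flat block of doubles in Python.

```python
from array import array

ROWS, COLS = 1000, 10


class Item:
    """Object-per-record layout: each record hides its own list behind a reference."""

    def __init__(self, values):
        self._values = list(values)

    def total(self):
        return sum(self._values)


class Table:
    """One contiguous buffer of rows * cols doubles, wrapped to describe what it is."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = array("d", [0.0]) * (rows * cols)  # flat, row-major storage

    def set(self, r, c, value):
        self.data[r * self.cols + c] = value

    def row_totals(self):
        # The implementation lives in one place and walks the buffer in order.
        c = self.cols
        return [sum(self.data[r * c:(r + 1) * c]) for r in range(self.rows)]


if __name__ == "__main__":
    table = Table(ROWS, COLS)
    table.set(1, 0, 1.5)
    table.set(1, 1, 2.5)
    print(table.row_totals()[:3])
```

With the `Table` layout, callers still go through methods, but all the values sit together in one regular block, which is the property the answer above argues a compute platform needs in order to exploit parallelism.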
They don’t like latency at all; they want to get to the exchange first and buy and sell.