John Nolan on the State of Hardware Acceleration with GPUs/FPGAs, Parallel Algorithm Design
00:30:37 video length
Bio John Nolan has 20+ years of experience in technology, is an ACM Distinguished Engineer and winner of a Royal Society prize for innovation. John has worked in diverse domains from finance, advertising, engineering and telecoms. John is currently an independent consultant working with a combination of large enterprises and start-ups.
The target audience for GOTO conferences are software developers, IT architects and project managers. GOTO Aarhus is an annual event in Denmark. The idea for GOTO (formerly known as JAOO) came about as the management at Trifork was dissatisfied with the conferences that existed and wanted to create a forum that would give the development staff inspiration, energy and desire to learn coupled with the chance to network with the IT community.
I am John Nolan I am currently an independent consultant, as we like to say which means that I work by myself, I have a number of clients but I am stigmergist, I am known as stigmergist.com. My specialism is getting things done, I'm a technologist with more than twenty years' experience in software and hardware, and my thing has always been changing the cutting edge into the bleeding edge, that's usually the new ideas which seem like a good idea but no one has actually made practical so my whole career is turning impractical ideas into practical money making ideas. So I have plenty of scars to show but that's what I do, that's what I am still doing.
Well, by training I am an engineer, my background, many years ago, actually I did control engineering at university, so to get away from computing and writing software so I went to do hardware which I didn't really know much about it, so my background is actually hardware. And then I have a lot of experience in finance, building large systems there, through my banking contacts, actually in a bar which is where I do a lot of my business, I ended up having a conversation with someone about calculation problems they were having in a particular set of trading analysis circumstances, and being me I was like "You are taking the wrong approach".
So they were trying to do the big software grid like solution, and my reaction was "Well you can use hardware for doing this" and I've known of a Field Programmable Gate Arrays, FPGAs, had become quite large and were getting much better than they used to be so they went through a thing in the eighties where they grew and became much larger and I have done some work in the eighties in control systems with FPGAs and they were quite good, but the tools just weren't there. They were really suitable so I said "You might be able to use FPGAs for doing this" and the of course cool idea, and bankers love cool ideas, was that maybe you could fit it all into hardware and of course with FPGA you end up getting a specific hardware design from a software specification essentially, you write a program and you get it out, that's the general idea and of course the belief by most software people is that therefor will be faster.
This is usually due to their ignorance of how software actually works and how the software you write ends up going all the way down to run on the hardware. So that was kind of where the conversation came from and then when we came to look at the problem I pointed out these were also very large array of problems, and that things like the GPUs were set up well for this, the way the hardware is arranged and there were a few very exotic types of hardware, array processors, a very similar kind of thing and there were also very large compute blocks that were available and of course it being banking one of the things was that one of the pieces of hardware was a hybrid super computer which was partially FPGA, partly CPU that was in the mapping technologies that is in Predator drones they were trying to vend so of course this appeals to traders greatly that they will have a predator drone on their desk. So this was one of the options. Of course, when they saw the price that didn't quite work.
So that's were it kind of where it came from, so this is about four years ago I sat down to do this in seriousness with the FPGAs, GPUs, array processors and exotic hardware, sort of benchmark them.
3. So you are saying that just compiling something to hardware doesn't make it fast which blows everybody's mind, so what makes, what's the problem, where's the impedance miss match between those two?
The thing that you have to remember about hardware is that if you want to get speed out of hardware that a lot of the way you do things is to pipeline and to break your problem into… it's a throughput problem. Most of the processors these days are faster because they do better throughput, they're not actually faster in processing, the clock speeds have not necessarily jumped up actually. They did in recent times due to change in the hardware technology. But you have to change the way you approach the problem, and if you are doing stream and signal processing, you change your problem into a more that kind of approach, you can get the speed, but nearly all the programming models that you get, C and upwards, they are not that basis, they are not about pipelining, they are not about that kind of concurrency parallelism.
So the problem is you end up writing code in a sequential form, which when you then transfer it to hardware is not using the hardware efficiently, essentially what you end up with, if you think of an FPGA as a block of computing, instead of using all of it at the same time, is if you write a sequential piece of code, you essentially use a little bit, and a little bit, and you sort of run around the chip, so you are not actually using it efficiently and effective, that's where a lot of impedance miss match comes. So there are two sides when you think about performance when doing software to hardware one is just the pure performance in terms of speed. So when you compile it down you can make it run fast, but the other side is productivity, which is how long does it take me to write and then get that out to run on the hardware? And in current technologies, the FPGA style of things when you are writing applications to run on FPGAs, code for things like option prices and quite large pieces of code, it can take many hours for it to compile, to deploy it.
And when you write a piece of code and in fact in one circumstance it took two days for the system to compile and figure out how, because when it's compiling code and try to figure out how to lay it out optimally on the chip, and do all this, that took two days to figure this out. Once you have it, it takes two hundred milliseconds for it to blast on and go as fast as you want. But you have a productivity problem, because we are very used to that cycle of write it, change it, do it kind of thing and this compute cycle of understanding the data, doing the problem, and then using it, you want that kind of investigating and a two days gap is ridiculous even if it's a few hours, even if it's a half of hour, it's just unproductive.
So you always have to watch for the sooner and faster problem, so you can be faster, in fact one particular problem we made a hundred and ninety two times faster than the equivalent C code on a processor doing this, but it took three days for it to compile so this kind of the sooner and faster, that's a balance. That was FPGAs but with GPUs the array processors, they are an interesting one in the middle because they have that ability to do the fast development cycle. Now you do have to change the algorithm the way you write code, to be appropriate for the hardware, and it's not sequential programming, you can't just take a Java programmer and have them write efficient code, they have to change the way they think about it, but it's definitely fast and productive.
The balance you have to watch out with GPUs is the power balance, and I mean literally power and heat because they use a lot of power and a lot of heat, it's like having a couple of kettles boiling under your desk if you put a lot of GPUs into a box and run them high power, you have to think about these things when you are running data centers and so forth, seventy percent of the cost of data centers is heating and cooling, but if you think of using GPUs you are going to double that seventy percent and you have to watch out.
There is a whole load of compromises and you have to think about this, so if you want to go just for performance, the software view of hardware is that it's automatically going to be faster and better - but not always. The other problem with software and hardware is that compilers and high-level languages are doing many things, that never get to the hardware, so a friend of mine who was on the original SmallTalk team, he used to say C might be faster but SmallTalk is sooner. So if you write your algorithms and in software there is a lot of this, it will optimize out and remove problems, so there is a lot of stack between you and the hardware these days, which is getting rid of a lot of things to make it efficient to run on the hardware. So standard software development misses this.
Interestingly there is a movement the other way around which says what happens if we start at the hardware and work our way up, so what happens if we write a language that starts right down at the bottom, and uses hardware efficiently, what does that look like? How does that change things? So there is a small group of people looking at it the other way around, many of them from the functional programming area.
They are not as low as assembly language, often there are bit's which are done in assembly specifically to make it run, but they are more like a C-like levels and then functionally higher than that, but they don't look like C but there is a mid level here, C is quite of a mid-level language, and so it's kind of keeping it as close as possible so it's very hardware sympathetic. So it's not trying to be too clever, it's trying to give you direct access but it's sort of affecting your model, it's thinking of the way to solve the problem in the terms of the way the hardware and the architecture is most efficient.
Over the last fifteen to twenty years this whole thing about abstractions you can kind of blame objects for it, this whole "You can have the model at the top", it's a penguin is a type of animal, all this thing that nobody ever does because it's wrong and abstraction, and adding that abstraction level in makes it inefficient but if you go back to the old style where you would write it dependent on the hardware, and compromise your abstraction, based on the implementation, you can get more efficient, but it's sort of going back to that kind of old school way of looking at it, but with modern hardware, especially as the hardware engineers are saying "That's nonsense, why are you writing software like that?", there is a big gap between software and hardware.
And if you talk to modern software developers, their view of how hardware works is based somewhere around the eighties, they don't really get how it all actually hangs together, there's a dying breed of developers who are actually half way between these two stools, that still know hardware and still know software.
Yes, there are a lot of techniques, everybody needs to know those things, I mean that is what some of these layers abstract away and try to help you, there is still a level of abstraction but understanding how to use pipelining and what multi-level caching is and when to move things between caches - isn't that one of the three computing problems, when to move things between caches. It's also breaking of the standard software models that abstraction is a good thing, that modularization is a good thing, it's kind of what happens when you do the reverse you get hardware. And when you sit and look at it, it's not well built for software either, I mean looking at FPGA and these low level hardware stuff, I am dealing with hardware engineers, they don't get software, they think software is about configuring hardware, it's this kind of thing where the software view of hardware is bad, but the hardware view of software is also very odd.
So being in the middle and looking at it, I mean the tools are very like ninety nineties, they feel very a decade behind, even the concept of incremental compilation on FPGA is only just coming in, that when you make a change you don't have to recompile the world and there's a lot of this kind of newer ideas that lead to meld together and I think it's an interesting time.
No, most of them are actually research products or are proprietary things that people are doing at the moment, obviously people who like doing these kind of things are governmental agencies who can't be named, and hedge funds and banks really don't want to talk about it. I can't actually answer the question, sorry about that. Dave Thomas was talking about some of the work he was doing on the server thing, so that's kind of like that. It's a lot of stuff going on in the functional area, I know of a few people who are doing their own thing and they are doing this, the reason they want to keep it proprietary is they are getting an advantage from it having an edge, being able to actually real time process some of this stuff that all the people are taking hours to process, it's a big advantage in some of these circumstances.
Array processors are sort of like GPUs you can consider GPUs a kind of array processors and it's where you can essentially process blocks of data simultaneously so think of it like a hundred and ninety two CPUs running together on a block of data. So if you have a piece of data a hundred and ninety two floats wide you can put them all through in one step essentially that is the easy way of thinking of array processors, that doesn't mean you have to set up your program to take advantage of that in an effective manner. If statements are a bad idea because if you have an if and it fails in this branch that processor is not doing anything until the one that is true is going through so to get good efficiency, the rules change. I distinguish array processors from GPUs because there was a number of products that came out around at that time, I don't know how many of them are still in existence, which were based on the idea of low power.
So instead of having several kettles running these things were extremely low power, so twenty to twenty five watts of power as opposed to sort of kilowatts, as with some of these GPUs, so the idea was to put a lot of these in place and use them. I think a lot of those who went down that road they've proved the technology and what's happening with CPUs is now the multi die CPUs are starting to bring on board array processors and FPGAs on to the die in the CPU so I think we'll see over the next couple of years where this shift will change, so that so to sell multi CPUs processors, the processors will turn into FPGAs and array processors, you already see it with some of them, they have the CPUs and the GPUs on the same chip, that's essentially where it's going so I think those companies who are doing those boards, created the IPR and essentially try to sell, I presume, to people like AMD and Intel.
So I think we'll see it coming along, but I think we'll see it hidden behind conventional programming stacks, which will be slightly unfortunate because of that whole the compiler guys are going to have to become even cleverer to understand what people are trying to express through the high level software. I think that will be a mistake in some ways, but I have hopes that some of this other work will take advantage of it in a more direct manner.
Yes, you can take an algorithm and say I can do this, and do the automatic vectorization for the queues and pipelines, I've used some of them and they are quite effective, they can be very effective if you sit down and use Intel, for example the Intel compilers have a lot of this and you can also, if you write in C level code, you can do the whole moving between caches and make sure things fit in and use pipelines effectively and stuff like that. They're pretty good. The auto vectorization, I think you have to write your code in a style so it can take advantage of it and if you are writing in that style anyway it's kind of you're giving a hint. It's that abstraction problem again, I don't think there's a good model for getting above the calculation level to effectively use array processors and FPGAs and that sort of this kind of things.
Once you go above the basic calculation level and you want to be more abstract about expressing things, I think we are a distance from that at the moment which is why some of this functional stuff is quite interesting.
9. You mentioned that some of the algorithms are a problem, how you write algorithms is a problem. So, are there some general methods or ideas for how to write code in a more parallel way or algorithm design ideas?
You're hitting one of my rants right now, which is about writing algorithms. I think there is something gone wrong in software education and how software development is taught, to me, in my mind, which is people are taught about algorithms in the sense that they are taught how a sort works, they're taught here is this algorithm, but they are not taught how to come up with new algorithms, how to generate an algorithm, how to understand the constraints and work backwards and figure out what the efficient way of doing something is. And I see this repeatedly, I like optimization problems, I like making things go faster, and I particularly like making things go faster by coming up with a good algorithm and I see a lot where there is a very brute force kind of approach, going back to your original question.
This whole thing about software and software developers wanting hardware because it's going to be faster, in my mind it's driven a lot by writing inefficient algorithms because normally if it runs faster, and this is very Monte Carlo simulation, sort of try all the options, do this and do that. They're ignorant of many opportunities that have been around for eighty or a hundred years of algorithmic optimization techniques, there is just a gap. One particular example I was called in to a client and they wanted something that was taking forty minutes to run, and said "Have a look at this", and I went through the code, just did keyhole optimization on the code, got it about fifteen times faster, which they were happy with, but I wasn't happy with it because the way they were solving it was one of these big things, and being me I said "There is a better way of solving this problem".
So over a weekend I sat down and redid it basically using a Simplex kind of approach. When I brought it back in I said it runs much faster, they said how much times faster, and I said it's almost infinitely faster, and they said why, and I said I ran the problem and it took half a second, that's why. And they were just amazed, but then when I started explaining how I solved the problem, which was by using a different algorithm, the concepts of the algorithm and the concepts of looking at the problem that way and understanding the opportunity just wasn't there. And I think there is this thing about understanding what the options are, what the constraints of the problem are, what are you trying to solve.
Some of the thing is not doing all the calculation, techniques, this kind of the abstract back figure out the shortest path, get the software to figure out what the shortest calculation is and do that. Almost move from the calculation to an abstraction of it or a representation of it to find the shortest way of calculating it and do that. Not using if statements is one of my favorite bug bears, if statements are very inefficient, especially in parallel code or in sequential or array based code, very inefficient. And there's ways of writing it, it always cracks me, people are happy writing if, but if you've got GOTO everyone gets upset, but if and GOTO are actually the same thing when you look at it and the way people use it. If with a return statement is a GOTO.
There is a lot of snobbery in how you write code, which actually, I think fundamentally undermined a lot of algorithmic kind of approaches. And it's just that people don't think about the algorithm, step back from it and look at the whole problem, and say "What are the actual constraints, what are the possibilities here". If you go and look up algorithms on the internet and sort to say "Teach me", people will show you good algorithms, it's often sort, for some reason. A lot of algorithms are stated in mathematical manner, which is also one of the things, I find is that the way you solve mathematics is a mathematical problem, but computation is different. If you can express it so it's complete in the symbolic mathematical sense but to actually calculated it is actually algorithmic and computation and that's different.
So quite often you find people say oh, we need to do double representation to be accurate but that's a fallacy, that's not necessarily accuracy, that's representation, you can do it integers and you can do it faster, there's many tricks to it but, I wish there was a definite guide, kind of "Here's where to start", but I honestly wouldn't even know where to start teaching someone how to do it. It's an experience thing to me, but that's because I've dealt with a lot of algorithms and that's sort of the way I think. But I did tweet about this and I got a number of people they agreed with me because when they were in college they were taught these algorithms as in here is a good algorithm, here is a good algorithm but never taught you how to create or understand or analyze a problem in order to come up with an answer from a tool kit, kind of approach.
There is that compromise, you've hit the nail on the head there, which is the understandability of a piece of code, you write code for it to be rewritten, to be read, and that's sort of the thing about high level languages, if you write to be machine efficient it's not necessarily human readable, so there is always a balance and a compromise. If you take the program with an if statement in as an expression of what someone is trying to achieve, there are ways of transforming that into a more efficient representation to execute, so an if is a branch circumstance. So let's say you have something that is a collection of numbers and you say if it's greater than five double it, if it's less than five just add it on.
Well, if we look at that we can obviously say if we sorted them, and do the first ones that are below five first, we could do those in parallel and then the ones above it, do those in parallel. You can see how you could look at it and actually sort of take that statement and say I understand the meaning of it. And this also comes back to a thing I was saying earlier, if a program is a statement of intent turning it into an expression which is a symbolic representation and then getting the machine to refactor it into an efficient manner is kind of what you are aiming for.
That's what a lot of VMs do, if you look at what virtual machines do and hotspot compilations, that's essentially kind of the idea, which is also why actually a lot of software runs faster than it does on the hardware, because you have virtual machines in between that are being smart about this stuff. It's all sort of part of the same ball, I have a different rant "Why if is a bad statement" to do with object and message sending and bindings and stuff, so that's just one of my bug bears and people find it odd reading my code sometimes because I don't use if statements very often, they find it bothersome when they read code and it doesn't have if statements in.
Yes, instead of using if statements I tend to use a lot of polymorphism, state machines, things like that.
The rich man's if, it's no if at all, there is no condition, it just is in the state, so you move to the state, you know what's next. Polymorphic approaches with state machine obviously give you a lot more flexibility, if you need to extend the state machine or restrict it or change it on the fly, which is the other thing which I have done, a SmallTalk system which actually could change the state machine, we could upgrade the code as it was running because it was all state machines, didn't have to stop the system and start it again because it could just transition live. There are many advantages, makes it a lot easier to test as well, some of testing extreme programmer, polymorphic kind of approach makes it a lot easier to test, if statements are hard to test.
Of a class of problems. I think they are quite good for abstraction, as an abstraction mechanism, revealing some of the work. As an abstraction mechanism, going back to how you can be optimal about this, if you have state machines that transition so that when a calculation is in a certain place instead of having a method which says if you're in a state do this, when you call that method it is actually the method because that's the state it is in, means you've got less code, it's easier to test and if you are using FPGAs or array processors you can download that specific behavior into the hardware, it takes twenty milliseconds to flash it in.
So you don't have to have the whole program on the processor at a given point in time, as it transitions between states you can push it down, then you can say when it's these circumstances, that's the state, and it just makes it a lot easier and simpler, it's smaller, it's less code, you can optimize it better. For a certain way of looking at it, again it's an algorithmic kind of view of the world, how you look at it, it can be more efficient as well, so it's easier to test, you can make it more efficient and you can get dynamic extendibility kind of comes in a lot more and you've got one less piece of syntax to worry about, don't need an if statement.
14. I think Guy Steele had a talk [Editor's note: http://www.infoq.com/presentations/Thinking-Parallel-Programming ] about finding a different way of designing algorithms to step away from fold and more to a map instead of sequentially describing algorithm do it in parallel or maybe even speculative in a way, have you done something like that?
I didn't see Guy's presentation, but yes, I know where he is coming from if I understand what you're saying. You do need to completely step away from the conventional thinking of it. There is a whole set of things about not knowing what things are, when there's missing values in data what do you do, if it's dirty, if it's incomplete. I also worked on some stuff about, I wanted to work on some research, I used to be at Sun Research, on small wireless devices, so of course in that circumstance the processor and the data isn't always there, the processor isn't always there what do you do when it's missing and stuff like that. And some of the guys working on looking at computational units, CPU's essentially, where parts of them could fail make them extremely fast, extremely cheap, extremely dense because currently they have to be perfect and some huge percentage of processors are thrown away because they are flawed.
But what would happen if you could actually stack these into compute blocks and you could allow parts of them to not actually run correctly? There is a whole sort of different way of looking at computation, going back to this idea of what's computation and algorithm. If you start asking the question what would happen if we wrote code that would allow for failure and didn't always work and some of the values weren't always there, how would you start writing algorithms that work that way and then you get into self-organizing systems and how all that stuff goes together. Hence the name stigmergist, by the way, stigmergy is the study of self-organizing systems that interact with an environment, and everything interacts with the environment and then comes back.
I'm very much into self-organizing things and I think there is a whole new area that we need to look at, and there's a lot of people researching, but it's true research. Nobody's found the gold and magic solution in this particular area, but yes, I do think we need to step back from algorithms and the sequential view of the world, hence why the functional stuff is quite popular at the moment. I think data flow stuff us coming back as well.
No, I've heard of it, just a few minutes ago somebody mentioned it to me but no I've never used that kind of thing. But I think if you look at some of the things like Erlang, one of the things about that is it's got that attractiveness to it for those reasons. So I think there is a set of stuff there which is going to be quite interesting but you've got huge impedance, that is another thing when doing like the FPGAs and the hardware, the impedance that software developers or scripters, people that live in the data. The programming model is just so different there is a whole learning the difference, sort of the unlearn and relearn kind of thing, it will mess with your mind, you have to throw a lot of things away and listen to yourself, failing then go back to it. So I think there is a still a gap it's not going to happen next year or not in the next five years maybe after five years something will come along.