Duncan Coutts on Parallelism and Concurrency with Haskell, Distributed Programming with Cloud Haskell
Bio Duncan is well known in the Haskell community and runs Haskell consultancy company Well-Typed ( http://www.well-typed.com/ ). He helps maintain several popular libraries and tools such as Cabal and bytestring. He has several years experience in packaging the Haskell toolchain and took a leading role in establishing the Haskell Platform.
Tech Mesh, the alternative programming conference,focuses on promoting useful non-mainstream technologies to the software industry. The underlying theme is "the right tool for the job", as opposed to automatically choosing the tool at hand. By bringing together users and inventors of different languages and technologies new and old), speakers will get the opportunity to inspire, to share experience, and to increase insight.
Well I’m a Haskell enthusiast, I guess and also a Haskell consultant, I run a small company and we are called Well-Typed and we help companies that use Haskell. So we do training and consulting and development and also other cool stuff related to Haskell.
Well, they might be using it heavily in which case they might want help with the tool chain, support, improvements to the compiler , integration between Haskell stuff and whatever they are doing or they might be just thinking about getting into Haskell. So they might heard of this great idea and want advice or help or they might have quite a few of our clients are small start ups and they might have one Haskell programmer or two and their manager just wants someone to build a core as backup, so we do that sort of thing and we also do quite a lot of work with a sort of R & D technology transfer kind of things, so working with Academics, University or Microsoft Research and getting technologies from Academia into the real world, so a variety of things related to Haskell.
3. Your talk yesterday was about a rather interesting topic nowadays, which is parallelism and concurrency, so can you tell us what is the difference between parallelism and concurrency, aren’t they the same?
Yes, so people often use them synonymously but it helps to clarify your thinking by describing what the difference is. Parallelism is all about trying to make your programs run faster by using more hardware like multiple CPU cores or GPU cores. Everyone wants their programs to be faster and parallelism is one way to do that using more hardware. Concurrency is about how your program is structured into multiple threads of control like threads or processes and those threads or processes can synchronize by locks or mutable shared variables or message passing, so concurrency is really about how you structure your program whereas parallelism is about making it run faster, so those are orthogonal things, you can clearly do both at the same time, if you have multiple operating system processes running on multiple cores, then you’re doing concurrency and you’re getting parallelism, You can also do concurrency without parallelism of course, that’s what we did before we had multiple cores, we still had multiple processes running on the same core, just being multiplexed, so that’s concurrency without parallelism. And then of course most programs are not concurrent and not parallel , but is also possible to do parallelism without doing concurrency and that’s the really interesting one and that’s what the focus of my talk is about.
Both. You get to write programs which do not express any concurrency themselves, which means you can’t make any of the concurrency mistakes, you can’t introduce deadlocks, there are no locks, you're not writing anything which has locks, you are not describing any threads. There’s no non-determinism. So all these nice things, you just write a declarative program and somehow the underlying machinery evaluates it in parallel. You have to give some hints or descriptions about how it should be evaluated in parallel or what should be evaluated in parallel but you’re not explicitly writing anything that is concurrent.
The people who do the implementation they have to do a lot of hard work to make that happen but it means that you get to write a nice program. Let me give you an example, to make it a bit more concrete: SQL and the database engine, a database execution engine. When you write a SQL query, that’s very clearly you are not writing a concurrent program. It’s declarative, pure and you expect the answer, there’s no non-determinism in a SQL query, it should always give you the same answer given the current state of database, it’s a kind of pure declarative thing but when it’s executed by the database, the database has the opportunity to execute that query in parallel, not all the standard open source ones do that but maybe Oracle does, it’s possible.
So that gives the distinction between what the implementation is doing, the evaluation mechanism, may be using threads and locks, it may be using concurrency, but it’s achieving parallelism, but the program that you wrote was not concurrent, which means you couldn’t make any of those concurrency bugs, you couldn’t introduce any of those mistakes. So that’s an example, SQL is not a full program language obviously, but that kind of idea of the program you write, having no concurrency in it, but the way it’s executed, runs on multiple cores. That idea can extend into programming languages. And so, Haskell has several different ways of doing this, libraries for doing parallelism without concurrency.
Yes, one example is describing algorithms that work on arrays, like bulk operations on arrays and this is something, for example, that you do in Mathematica or these sorts of systems, you describe your algorithm in terms of operations on vectors, arrays, matrices and then the library and execution engine takes care of splitting that work up and doing it on multiple cores, so you write a program which just looks ordinary, sequential, purely functional and yet it gets translated into a program that runs on multiple cores.
Parallel map is a nice simple example of that sort of thing, that’s part of this idea of “parallel evaluation strategies”. So you’ve got pure expressions; because we're working in Haskell, these are expressions that don’t have side effects, so it’s always possible to evaluate them sort of more or less in any order, it doesn’t matter, which means you can evaluate it in parallel and you’ll still get the same answer. If you’re working in a hybrid language that’s mixed, functional, imperative, then you’ve got side effects to worry about and you don’t always have that guarantee that if you evaluate it in this order, the side effects would happen one way around or the other way around in a pure functional language or in a hybrid language where you’re very careful then you could evaluate multiple parts of the expression in parallel and get the same answer and not have to worry about scheduling giving you different answers depending on… you know, run it one time it gives you one answer, it will always guarantee you’ll get the same answer and that’s why parallel map is a good example of that, you’ve got a big long list of items and you have a function you want to apply to each one, you can do that in parallel and there’s libraries in Haskell which give you strategies for building things like parallel map or more complicated strategies.
You might have a strategy over a tree or you might have a strategy where there is parallelism within work items, so you might have maybe a short list of complicated structures, so there might be parallelism available within each of those structures, so you might have… so we have a function in the strategies library called parList and it takes as an argument another strategy, which is the strategy to use for each of those elements, so that might be whether you can combine strategies or another good one is… ok, granularity is always a problem with parallelism, no matter what language you’re dealing in. You always have to make sure that the amount of work you’re doing in parallel, each sort of chunk or task of parallel work, is big enough to overcome overheads.
There's always going to be overheads in doing stuff in parallel and you’ve got to make sure that the amount of work that you are doing overcomes overheads. So sometimes what you need to do is split things up to get more work items or combine things together to get fewer work items to get a reasonable number. So you can use strategies like I just mentioned parList, but there is a strategy called parListChunk where it divides… if you’ve got a very long list with lots of tiny work items, you might take a hundred of them together in a batch and say let’s do that in parallel and then take the next hundred and do that in parallel… and that’s reducing the overheads, you’re batching things up and that’s another strategy. So you can take these strategy combinators and combine them together to get something and tune the parameters to work well between two and sixteen cores or something. So that gives different strategies, it depends on what sort of data structures, what sort of algorithm you’ve got.
Yes, he wants you to write in terms of collections and then associative reductions rather than linear reductions because you have to think very carefully about the data dependencies within your language, within your program. That’s always easier when you’re in a functional setting. You don’t have to worry about side effects having hidden dependencies due to the side effects. You don’t have to worry about the explicit dependencies. So you have to stop thinking about lists which are fundamentally sequential and start thinking about arrays or sequences or trees for example and then think about not summing up your list or your tree sequentially with a foldl but doing it in a tree like associative reduction, so that’s what Map/Reduce is all about, you do a map across all the elements and then you do a tree like reduction and you could do the whole thing in parallel. So you still have to rewrite this program to achieve good parallelism, there’s no free lunch, doing parallelism without concurrency, gets rid of the concurrency bugs but achieving good parallel speedups is hard, you still have to think about how your program is structured.
Yes, you think about the issues of granularity and data dependencies and profiling your program, you’ve got profiling tools to help diagnose some of these problems. So you expect it to get a speed up because you rearranged the code and you don’t and you scratch your head and look at the profiler and then you change some things and you… often it's down to this issue of granularity, that’s the standard problem.
Concurrency in Haskell is very traditional, it’s got “forkIO” which is a function that forks off a new thread, so you’ve got threads and this all happens in the IO monad, so if you know anything about Haskell, you know that all the imperative programming that we do in Haskell goes on within the IO monad, so concurrency in Haskell is part of that IO world, so you fork off multiple threads of control and they can communicate via shared mutable variables.
Now of course in Haskell we don’t really have this notion of mutable variables, by default everything is pure and immutable but for the concurrency we introduce these things called “MVars”, or there’s also transactional variables and you explicitly identify the mutable variables that would be shared between threads and those come with a lock, it's a combination of a shared mutable variable and a lock so you can never accidentally read or write the mutable variable without taking a lock. But it’s basically very traditional, still threads and locks and you can still get deadlocks and all those traditional things are still there. On top of that, you can build things like actor libraries and async, futures, all that sort of stuff is built on top of threads and these MVars mutable variables.
Yes, this MVar is a single mutable variable, there are mutable arrays as well which you can use and obviously you can build structures of these things.
Yes, when you read from an MVar, it’s called “taking an MVar”, so MVars have a sort of empty or full notion. When they’re empty, if I’m trying to read from an empty MVar, I block until someone else fills it, so in a way it acts like a one place queue and that gives you the synchronization between threads, as well as the shared variable aspect. That gives you a slightly cleaner way of doing the same sort of things you do ordinarily with condition variables and mutexes and locks and shared variables. But it’s still fundamentally ordinary, concurrency with locks and you can still get deadlocks by taking locks out of order.
Additionally to that we’ve got software transactional memory, which was sort of originally prototyped in Haskell and it works really nicely in Haskell where it doesn’t work quite so nicely in other languages and that gives you composable concurrency, so instead of MVars, which are locks, you’ve got transactional variables and you perform atomic transactions, transactions consist of multiple reads and writes on a whole set of transactional variables and then you could perform that atomically and the runtime sometimes takes care of doing it and if there was a conflict, rolling it back and trying it again. That gives you composable concurrency whereas normal lock based programming, its just fundamentally noncomposable.
A transaction looks just like any other ordinary IO code, but it’s in a different monad, it’s in the STM monad, so for the duration of your transaction, you’re in a STM monad and the atomically function wraps that up and performs it atomically as an IO action, so inside, syntactically it looks just the same as ordinary IO code in Haskell, you’re using 'do' and you are reading from variables and writing to variables and that sort of thing and then you say atomically do that whole thing and that’s an atomic transaction. So you can write one bit of code in STM and another bit of code in STM and combine them together into a single atomic transaction and that’s the sort of composability that you can get with STM and you can’t get with MVars and with locks.
So we’ve had MVars around for much longer, so people, I think initially were… it’s taken people a little while, I think, to get into the idea of STM and also people initially worried that they have a big performance overhead but it turns out that’s not really the case. Once you take into account doing everything safely with MVars and that means taking account exceptions, async exceptions, you have to have… you have to be quite careful with your try/catch/finally, all that sort of stuff when you’re using MVars. So once you’re doing that properly, the costs end up being rather similar, whereas with STM, you don’t have to really worry about exceptions, if you get an exception during a transaction, it's just rolled back and you don’t have to set up extra exception handlers on a stack in the implementation. So it means that in the end they come up pretty similar, which is quite nice, I hadn’t really realized that until recently, Simon Marlow has been going around telling people that STM is not as slow as you think it is, it’s actually a pretty reasonable default and the properties are just much nicer.
Werner: Ultimately it doesn’t have to be slower than anything else, because you have to do the same thing anyway, you have to do the same tasks anyway.
Yes, there is a slight overhead to this validating of the transaction log, that’s where the overhead comes from, but when you balance that out against all the other things you want to do with MVars then the costs end up being quite similar in the end and it’s so much nicer from a programming point of view using STM than MVars, so people are gradually moving towards that as the sort of “go to de-facto tool for that sort of thing”.
Werner: And it's not transparent STM, as you say, you have to mark out these things.
Again you're explicitly identifying your transactional variables, because by default all your variables in Haskell are immutable so you are specifying exactly which are your transactional variables and that’s one reason it's cheap in Haskell because you only have to record in your transaction log the modifications to the transactional variables and there may be very few of those whereas in an ordinary imperative language you might have to record… you doing mutation everywhere for everything, that’s how things work, so you’re having to record in your transaction log lots of accesses reads and writes to all the mutable variables that you are using so in the end that’s one of the reasons that’s actually cheaper in Haskell than it is in other languages because we can do it with just the transactional variables that are really essential for doing the synchronization or whatever.
Werner: It seem to be similar to the way Clojure’s STM works, it also uses explicit variables for that.
Yes, again by default they’re pure so you’ve got to explicitly indicate when you want to have a mutable transactional variable.
Werner: It turns out that there are in large Clojure programs, there are only a tiny number of these mutable places that are necessary, so Rich Hickey mentioned in Datomic, which is a database, they need two or three variables.
Yes, that’s very often the case, if you’re going to look at some of the abstractions built on top of STM or MVars like semaphores and queues and all these other sort of things, you’ll find they use a very small number of transactional variables and very often, yes. I was going to say another cool thing about concurrency in Haskell, is about IO, how we do file or network IO in Haskell with concurrency. Because I think this is one of the really good stories about IO and concurrency in Haskell as compared to many other mainstream languages, is that we have proper lightweight threads and traditional blocking IO.
So you know there is the debate about events and threads, performance and how nice it is to program and that all sort of stuff. Basically in Haskell I think we’re doing it right, we have very lightweight threads, you can have tens of thousands of threads, no problem, so following the model of one thread per client is perfectly ok, even if you’ve got tens of thousands of clients and then IO in a sense of file network, input output, is the way you’d want it to happen, you block just that thread, just that thread blocks, all the other threads carry on and then you get ordinary sequential actions in your thread, whereas if you look at a Node.js program, you write something out to the network and then on the call back you chain off to the next thing and then when that one completes, you chain to the next thing and what are you doing? You are just doing ordinary sequential IO in the sense of “Do this, do that, do the next thing…” but each of those things could block, but if you’re using lightweight threads and blocking IO, you could just write it “Do this, do that…”and each of them will block at the appropriate points, so that gives you simple model, this is just simple imperative program.
Werner: The way it’s supposed to work
The way it’s supposed to be, exactly, and even in languages like .NET languages, where they do have threads, so it’s not nearly as bad as in Node.js where you have got to chain of these callbacks, but you still have this async business, because their threads are really expensive, they don’t have proper library threads, they have to, again, they’re chaining these async things together and yes, ok, you can use monads to make that slightly better and blah, blah, blah, but it’s not as simple as just classical, simple, imperative programming , reading from this file writing to a network just in a single block and if you have lightweight threads and blocking IO, you get a nice programming model and you get the performances advantages of events, right? So, in the runtime system it really is using epoll and all that sorts. It’s running , the GHC runtime system is using one OS thread per core and multiplexing your tens of thousands of library Haskell threads on top of that and then it’s using epoll on the next and other similar things on other operating systems to do the event style handling. So you get the performance advantages of that event stuff but the nice programming model of classical, imperative programming, I would admit it's a slightly funny thing to say. In Haskell we have classical, imperative programming, but that is a very sensible model.
Werner: It’s basically an m:n sort of scheduler
Exactly, it’s m:n threading, yes, whereas all the other systems go for this 1:1, which there’s advantages in that too but it doesn’t mean that you can’t do… it means you have to play all these tricks like using async for everything. So we have async in Haskell as well, you can build async and futures on top of… and do that when that’s the right model for what your program is doing. If you really are doing multiple things at once simultaneously, than async and futures is great, but if you look at most examples of async and futures in sort of .NET languages, what they’re really doing is sequential things they're saying “fork this off” do this asynchronously and when this one completes do the next one, and when that one completes, do the next one… That’s not async really at all, that’s just sequential programming, so, we can do the sequential when it’s sequential and asynchronous one when you want it to be asynchronous.
Werner: Basically let the runtime do all the nasty things for you, as it’s supposed to be, let the solved problem be done by the runtime or the compiler.
And it’s very efficient, so the webserver frameworks in Haskell are able to achieve extremely good performance dealing with tens of thousands of clients and scales across multiple cores, and then benchmark it throws Node.js out of the water, it's competitive against things like nginx and others.
Werner: You just made a lot of friends in the Node.js community. You might want to wait for those, but it’s clearly an advantage here, over systems like Java or .NET which have the heavy weight threads and also have to these tricks in async to become competitive again.
Yes, exactly because they want to avoid the model of having ten thousand Java threads and because they're so a heavy weight and that would be a bad idea.
Well, the cloud part of that is not my fault. What Cloud Haskell really is, it is distributed concurrency in Haskell. So distributed across multiple physical machines, so distributed memory, that’s what it really means, distributed memory and of course these days, Cloud is the buzzword to use, so that’s why it got called, that’s just the way it ended up. So what it’s doing, it is taking the Erlang model of distributed concurrency, that is lightweight processes that communicate by message passing and monitoring and linking for error handling all that sort of stuff. It’s basically a complete copy of the Erlang model because the ErlangErlang model is, by demonstration extremely successful at this.
So it's just taking that model and implementing it as a library in Haskell so that we get the benefits of that model for distributed concurrency and we can use it in Haskell along with the other advantages that we have in Haskell so we can use it with… So we only have to use that for the distributed aspect, you don’t have to structure your entire program as actors, you might structure it using… pure parallelism within a single machine and only using the distributed actors for cross machine communication. So it means that you have a substrate for writing reliable distributed applications. So you have these light weight processes, they communicate with message passing, they can monitor each other, so that if one goes down the other one gets notified of it and that’s the right underlying infrastructure the Erlang guys have demonstrated that, that’s the right underlying infrastructure for building reliable applications. So on top of that you can libraries like… what Erlang's OTP give you on top of that basic substrate that Erlang has, is like supervision trees and also sort of higher level patterns for doing reliable distributed concurrency.
This was an idea developed by Simon P.J. at Microsoft research and we were implementing for him, with him, a new implementation of this, which is reliable and robust and cleaner code and better architected. It was a prototype, and we were doing a new version of it, and we released that about a month ago or so. Now there are people working on OTP style libraries for the supervision trees and that sort of thing. We think it’s pretty good, we’re quite happy with it. We think it’s a very promising idea.
People just started to use it because we started telling people to use it about a month ago [Editor's note: the interview was recorded in December 2012], so there are some people using it, experimenting with it and people looking at writing these higher level libraries on top of it… so far so good. We’ve also got, one of the nice things is that we have multiple back ends for the system. Because there are lots of different environments in which you want to do distributed concurrency, you’ve got the Cloud like Azure and Amazon so they’re cloud environments and then you’ve got your local cluster and these are rather different environments: one uses like a cluster job scheduler and the other uses the Azure web interface for managing instances, and the way things are configured, the way they start up are different in these different environments.
So we have these different backends to allow you to… which is sort of customized to these different environments to make it easier to write code for those environments. So we have an Azure backend and we have a very simple backend that just broadcasts on the local network to find peers. So that’s one of the key things that are different between different environments is how do you discover your peers, and you might discover them or might be told about them or you might create them if you’re in a VM environment you create your peers by starting up new instances. That aspect gets dealt with by the backend and then the main part of your program is independent of the backend. At that level you just got nodes and you spawn process on nodes and processes talk to each other, and you have a reasonable separation between the generic distributed parts of the code and a sort of initialization set up environment parts.
And so the backends are quite small, that’s quite nice, it’s quite easy to write a new backend. It’s harder to write a new transport layer. We also have a design which allows multiple transport layers, we already implemented a TCP layer but that allows for maybe a TCP with SSL or… People here… Yesterday we were talking about OpenFlow and you could integrate perhaps OpenFlow into that transport layer and your high level Cloud Haskell program doesn’t have to worry about that, it’s just thinks in terms of abstract nodes, ids for the nodes and it’s network that it knows about and it’s separate from the physical real underllying network.
Werner: And the backend is just a configuration…?
The backend just sort of sets things up so it maps between the real peers on the network, the real nodes and these abstract node ids that you have within the system.
Werner: So it’s Cloud Haskell, I guess…?
Yes, you google for Cloud Haskell, you go to the wiki page, there’s a whole bunch of resources and links to the packages on Hackage and videos and other documentation [Editor's note: we googled it for you, here's the link: http://www.haskell.org/haskellwiki/Cloud_Haskell ]
My favorite monad, well I have to admit I do quite a lot of programming in the IO monad but I don’t think I could say that’s my favorite one. It wouldn’t be really politically correct for me as a Haskell programmer to say IO monad. Tricky, there’s so many.
STM is a pretty cool one, I don’t use it very much, I use it quite selectively. Ok, there is one I had to write the other day, I had to write a new monad from scratch, which doesn’t happen very often. Normally I just combine existing ones like State or others So I had to write one the other day which was to do with matching and a confidence of matches… you’re trying to look at a bunch of strings and match against various things. So that was quite nice. That’s not a standard one.
Werner: What does it do, it matches… so monads sequentialize operations, I think, is that fair to say?
That’s one thing that it can do, so certainly for State and Error and IO and STM, you’re thinking like in an imperative way, but there are also monads which are like logic backtracking monads and my matching one was actually one of those types, it is able consider many different possibilities.
Werner: Like a parsing monad?
A little bit like a parsing monad or the list monad. The list monad is rather unlike, sort of imperative style monads. It’s more like logic programming that you can select from… take the current value of this variable which might be one of many possibilities and that’s represented by the list of all these possibilities and so the matching thing I was doing it was, in way, kind of a special case of this sort of logic stuff that you’re trying all the possible matches and you need to be able to combine those things, because once you’ve matched this… all the possibilities matching that, and that… and then you just combine them, and that combining turns out to have a monadic structure. So it’s an interesting monad in that way, like the list or logic backtracking monads…
Werner: Have you named it?
It was called the Match monad. It was just used in internally in one program I’ve been working on.
Sure, most people, most of the time when you’re writing a custom monad you’re just taking building blocks and sticking them together, so the standard ones that people are using all the time is the State monad, the Reader monad, the Writer monad and the Error handling monad, and most applications you see are just those things stuck together and then wrapped up. It’s not very often you have to use something more complicated than that. Ah, I know what the answer is: the Continuation monad that’s my favorite one, yes, definitely…
Werner: Because it allows you to… it’s sort of like async workflows?
Yes, it lets you simulate… asynchronous stuff.
Werner: Sequential programming essentially… Is that right? No? Yes?
It lets you do lots of things, and it’s rather mind bending. It let’s look at how you’re writing sequential code, but at any point you can block and return to a caller. So you can use it for example when you’re parsing. So if you want to make a parser like a restartable parser where you supply it with more input; you take the parser and you shove a block of data into it and then it tells you “I’m not done yet” and you shove another block of data and it goes “I’m done, I can give you the result”, but when you’re writing inside that parser monad you need to be able to say “Get me a new block of data and then carry on with that block of data” and getting that new block of data, is when you use the continuation to get out to the outside again. Internally in that monad, in that parser you’re looking as if you’re writing just sort of sequential code but at some point when you say “Get me the next block of data, because I need it” that’s using this sort of continuation aspect and then you’re escaping to the outside and it suspends the parser and it says “Now I need more data” and then you go “Ok, well here’s more data”. So, that is sort of co-routining, that you’re using the continuations for. But you can use continuation for all sorts of things like that. It’s rather powerful and slightly mind bending. I like it because it’s slightly mind bending.
Werner: As continuations always are.
Yes, continuations, but continuations wrapped up inside a monad makes them less mind bending because they’re not everywhere. You just expose a few monadic primitives and those don’t force you to think about continuations; the only guy who has to think about continuations is the guy implementing the monad. So when you use that parser you don’t have to think about continuations, it just works nicely, that doesn’t break your head it’s only looking inside the monad that breaks your head. And that is quite nice, that’s why I like it.
Werner: So, yes the continuation monad, we should all look at that, take another look at continuations, and thank you, Duncan.