
Martin Thompson Discusses Reactive System Design


1. Hi. My name is Harry Brumleve. I'm an editor with InfoQ. I'm here at QCon San Francisco with Martin Thompson. Martin, can you tell us a little bit about yourself?

I'm an independent consultant. I go around fixing software problems at the moment, particularly around performance and stuff. I've been doing that for a few years now, but I'm probably better known for previously being the Co-Founder and CTO of LMAX and various other startups going back to the early '90s.

   

2. That's great. So what are you here to talk about at QCon?

Ah, a couple of different things. One is my top ten performance myths: I have a talk coming up in a couple of days where I'm going to enumerate the top ten things I keep seeing people having problems with, where they actually believe it's the other way around. There's a lot of myth and folklore in our industry. The other one is a sort of retrospective on being on the bleeding edge of software performance for the last 20-odd years. I've seen Java from the start and have been constantly trying to beat it into shape to perform well enough.

   

3. Nice. And you're also a strong proponent of the Reactive Manifesto. Could you tell us why the Reactive Manifesto is relevant?

It's really interesting. I saw it pop up earlier this summer, and I felt I wanted to contribute to it because I thought it was very strong in many areas. But one area I thought could do with a bit of fleshing out was the responsive side; it was originally focused on being interactive. Right through my whole career I've seen how systems are built, and the Reactive Manifesto sums up very well how I've been building systems.

I think it's very important that systems are designed to scale, and that they do it in a sensible way so that they're responsive and resilient. I think event-based systems are the best way to go for anything of scale. We're smoking the crack cocaine of synchronous systems: people think it's easy to begin with, but as soon as it gets big and complex, it becomes a mess that's hard to maintain, hard to scale, hard to make perform, and hard to make reliable. The answer is really asynchronous, proper event-driven systems, and we're driving back to that now. It's what we used to do before the web came along.

   

4. And so a lot of the reactive topics that address asynchronicity focus on the client layer, specifically JavaScript and AJAX. But can backend services also contribute to reactive systems?

I think very much so. A good service design is fundamentally reactive: the service sits there waiting for things to happen in the world, and it responds to them. You're reacting to events. By building systems that way, you can deal with failure particularly well, and with scalability, because you take incoming events and respond to them in a non-blocking fashion. Typical synchronous designs end up being blocking by nature, and Amdahl's law starts to rule.
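To make that shape concrete, here is a minimal sketch of such a service, assuming a simple queue-fed, single-threaded design; all class and method names are hypothetical, purely for illustration:

```java
// A minimal sketch of an event-driven service: one thread reacts to
// events from an inbound queue and never blocks its callers.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReactiveService implements Runnable {
    public record Event(String type, String payload) {}

    private final BlockingQueue<Event> inbox = new ArrayBlockingQueue<>(1024);

    // Producers enqueue and return immediately; they never wait on us.
    public boolean offer(Event e) { return inbox.offer(e); }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Event e = inbox.take(); // the service sits waiting for events
                react(e);               // and responds to the world
            }
        } catch (InterruptedException ignored) { }
    }

    private void react(Event e) {
        // Handle the event and emit further events; no blocking calls here.
        System.out.println("reacting to " + e.type() + ": " + e.payload());
    }
}
```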

And now that we're going into the multicore world, Amdahl's law bites ever harder as our systems struggle to scale. People don't realize this. There are some real fundamentals, like Little's law and Amdahl's law, some basic mathematics; when you apply them to systems design, you see the limitations very, very quickly. Yet we seem to be more interested in fashion than in doing some basic sums.

   

5. Could you talk a little bit more about those different laws that people seem to be ignoring as they're writing non-reactive systems?

Yes. So let's start with Amdahl's law, which is kind of interesting. If we want to parallelize something and there's no interaction between any two of the things going on, it parallelizes very well. But most things aren't like that. There's usually some sequential component, and everything has to wait on that sequential component to happen, so it becomes the fundamental limiting factor.
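For reference, Amdahl's law in its standard form, with a small worked example:

```latex
% Amdahl's law: if a fraction p of the work parallelizes across N
% processors and (1 - p) must run sequentially, the best speed-up is
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
% Even with 95\% parallel work (p = 0.95), unlimited cores cap the
% speed-up at S(\infty) = 1 / 0.05 = 20\times.
```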

When you start building synchronous systems, people don't realize where the synchronization points are. It could be synchronization on a pool of connections to a database. When people are talking to web servers and other things, they're holding a thread open from the client right through to the server, keeping an HTTP connection open for the whole request until the response comes back.

It's much more scalable if you don't even use HTTP to send data down: close the connection straight away so it can be reused, and have a Comet- or WebSockets-based backchannel pushing events back. When we do that, we can flow data through much faster. That's when the second law starts kicking in, quite interestingly, because wherever you have contention points you get queues building behind them, and Little's law is about how we measure and manage those queues. If we take so long processing something that queues build in front of it, latency is directly impacted, because things wait in those queues.

You can see it especially in badly designed synchronous systems: under load, latency starts getting longer and longer. Well-designed systems can keep latency almost flat right up to the point of saturation, and you can do that by building reactive, event-based systems with algorithmic approaches that don't have those bottlenecks and that amortize the costs when bursts of load come in.
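Little's law itself is one line, and the queueing behaviour described above falls straight out of it:

```latex
% Little's law: the average number of items in a system, L, equals the
% average arrival rate, \lambda, times the average time in the system, W.
L = \lambda W
% e.g. 1000 requests/s with 50 ms average residence time means
% L = 1000 \times 0.05 = 50 requests in flight; if arrivals exceed
% service capacity, W (and therefore the queue) grows without bound.
```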

   

6. So that does make it sound a little more complex than just writing software. Do you need a PhD in order to understand and apply the Reactive Manifesto?

Good question. I don't have a PhD, so maybe I don't know how to apply the Reactive Manifesto! I think there is some inherent complexity in software; I've heard it said that software is one of the most complex things humans build. There is energy we need to put in to get it right, and it's something we must learn. What I've discovered in building event-driven and asynchronous systems is that the initial curve is a little bit higher, but the complexity doesn't keep increasing. It flattens out very quickly, and it scales up well from the development point of view as well.

When you build synchronous systems with lots of coupling and entanglement, you start off nice and easy. But then as it grows, it becomes much more complex, and systems end up turning into a really big ball of mud that can't be maintained. So I think it's a fallacy to think that we can just build synchronous systems and it's all nice and easy. There is a hurdle we've got to get over, but once we're over that hurdle, things get much better.

   

7. So then is there a need for a specialist in the Reactive Manifesto style, or is it something that an average developer can incorporate into their practices as well?

I think an average developer can, but it's going to take some time and some thinking. So, for example, let's deal with having systems that are resilient. There are a number of facets, or traits, to the Reactive Manifesto, and one of them is that the system should be resilient. Well, for things to be resilient, you've got to think about failure cases and how to cope with them, and do people actually sit down and do that?

For example, if you build systems that are synchronous in nature, what happens if you don't get a response? Do you just block? Do you wait? Do you time out? What if you need three or four calls in a row and one fails partway through: how do you unwind your system? This is where the complexity starts to come in. If you design it event-driven, with state machines modeling what's going on, you take step one and you've moved to the next state; you take step two and you move to the next state. If things fail, you know exactly which state you're in. It's nice and clean. You can time out, and you can have a supervisor watching what's going on that decides on the next steps.
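As a sketch of that idea, here is a tiny state machine for a multi-step workflow; the step and state names are hypothetical, and a supervisor could poll timedOut() to decide what to do with stuck instances:

```java
// A multi-step workflow modeled as an explicit state machine, so a
// failure or timeout leaves us in a known state rather than blocked
// partway through a synchronous call chain.
public class OrderWorkflow {
    enum State { NEW, RESERVED, CHARGED, SHIPPED, FAILED }

    private volatile State state = State.NEW;
    private volatile long lastTransitionNanos = System.nanoTime();

    // Each event moves the machine to the next state; on failure we know
    // exactly which steps completed and can compensate from there.
    void on(String event) {
        switch (event) {
            case "stockReserved"  -> transition(State.NEW, State.RESERVED);
            case "paymentCharged" -> transition(State.RESERVED, State.CHARGED);
            case "parcelShipped"  -> transition(State.CHARGED, State.SHIPPED);
            default -> throw new IllegalArgumentException(event);
        }
    }

    private void transition(State expected, State next) {
        if (state != expected) throw new IllegalStateException(state + " -> " + next);
        state = next;
        lastTransitionNanos = System.nanoTime();
    }

    // A supervisor polls the machines it watches and times out stuck ones.
    boolean timedOut(long timeoutNanos) {
        return state != State.SHIPPED && state != State.FAILED
            && System.nanoTime() - lastTransitionNanos > timeoutNanos;
    }

    void fail() { state = State.FAILED; } // supervisor decides the next step
}
```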

The thing that we get trapped by is that people don't realize things will go wrong. Once you're doing anything of any scale, especially distributed systems, things start going wrong: hardware fails, software fails, everything fails. If you just design for it up front, the complexity doesn't come back and bite you.

Once you start thinking along those lines, there are other things in it. How do you make your systems responsive? Well, you've got to start measuring, and people aren't measuring all the time. I find that as you build, you should constantly be measuring: does it functionally do what it should, but also does it meet all the non-functional requirements like resilience, responsiveness, throughput handling, and security, these different things. Build that in fundamentally as you go. We've learned from Agile that if the distance between a change and noticing that the change has a bad effect is short, it is very easy to correct.

Whenever there's a huge gap between an action and the response you see later, it's very hard to tell what caused what, and that means you have to keep those feedback cycles constantly short. So you should be performance testing all the time. You should be testing for anything that actually matters to your system on a regular basis; then you understand it, then you know it, and then you start going really fast, because you've got a system that's built on solid foundations.
People say, "Let's go quick. Let's get this knocked out," but now they've got a big ball of stuff. They don't really understand how it works under certain sets of conditions. What happens when it's under pressure? What happens when we pull the plug out? Then they spend a long, long time maintaining it, trying to find out how to get it to work in certain scenarios or how to extend it. So I think this rushing to get stuff done is very much a false economy.

   

8. It seems to me that the Reactive Manifesto requires a more thoughtful and explicit approach to building services. Should this effort be spent up front when developing a system, or should development be an iterative process yielding improvements with each cycle?

I think it's iterative. Any good software development is always iterative, but with the right amount of upfront thinking, so that it's built on solid foundations. The traits and descriptions in the Reactive Manifesto are not new; a lot of these ideas have been around for a long, long time. If you look at systems like Tandem, like Erlang, like CSP, we're talking about technologies that are 30-plus years old, that work incredibly well, and that were used in environments where they had to work correctly: people expect their mobile phone to work, they expect their telecommunications to just work. These systems are built that way.

Financial exchanges: if you have a software bug and your system goes down, you have to answer to the regulator for the bug that caused your system to go down. Most people don't think like that when they're building a website. In that world you build systems to a very different level of quality, but it doesn't actually take that much more engineering. I think it takes more to get the first release out, but then you end up releasing faster and faster over time because you've got the solid foundations in place. The important thing is that we've got to look at the sustainability of development over the long haul.

If you're developing software for one-off use, say a website for this conference that will be used for the conference and never again afterwards, would you put the same level of engineering into it as into something a business is based on, year after year, bringing in the revenue for that business? The latter should be designed so that it is of the quality it needs to be, so that it's maintainable and has longevity into the future.

You can see it in the metrics when we look at software development; I like to take everything back to measurement. Measure what the team does, measure what the software does, and it comes back over and over again that only around 20% to 30% of the total cost of software is the initial development. At least 70% to 80% goes into maintenance from that point onwards. People don't take that into account when they're estimating.

And a lot of the maintenance burden is because the original software is not of sufficient quality. It's highly coupled, it's not well understood, it's a bit of a mess, it's untidy, it's buggy. It just becomes a burden you carry forward forever, whereas if it's reasonably clean, you're building on something solid and you can go much faster.

   

9. Talking about building in accordance with the Reactive Manifesto, there are some tools and languages that would probably yield some benefit right out of the box: thinking about functional languages such as F# and Scala, they have immutable data and automatic thread management concepts. Would it be reasonable to assume that imperative languages such as C#, Java, or C++ could achieve similar results?

That is a very good question. For me, it's much more about the paradigm than the language; every language is a DSL on top of machine code. What matters is your approach to programming. You've mentioned things like immutable data, and I think that's interesting: if data is not changing underneath you, you can reason about it much better. It is a very powerful concept and quite useful in many ways.

So it doesn't matter which language you're writing in, though certain languages lend themselves to certain paradigms of thinking. Clojure, for example, lends itself very well to immutable paradigms in how it works. In other languages it's possible too: a good design in C, C++, Java, or others can do exactly the same sorts of things. The scalability side is interesting, because people can mean it from one of two perspectives, which is how I took the question: is it scalability of building the software itself, that is, scalability for the development team, or is it scalability of the executing software, the running code?

If you take the former, for example, I think immutability and a lot of the techniques of pure functions and the functional approaches in general do allow you to design systems that are much easier to reason about, much easier to test, and I think there's a lot to be learned from doing that regardless of what language you actually write in.

Now, when it comes down to scalability of the end system, some of those things can be useful in design. Unfortunately, we have issues with the modern VMs they run on. With immutability we have to do lots of data copying; data typically has to stay around for a long time, which is the sore spot of the garbage collectors, and we get into stop-the-world pauses. Amdahl's law comes back to haunt us again at this stage as our core counts go up: we end up waiting on safepoints a lot for this to happen. We also create contention points, because when we design our data structures to be persistent, immutable data structures, we have to design them as trees.

People show simple examples, but start dealing with a complex data model and you've got objects related to other objects related to other objects, and eventually you have to update that model at some point to make a change. That does a path copy from leaf to root in the tree, which drives all the contention to the top of the tree, and that becomes the contention point. Amdahl's law kicks in, Little's law kicks in, and then we've got issues.
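A minimal sketch of that path copying, using a hypothetical persistent binary search tree: every update allocates a fresh spine up to the root, and all writers then race to publish the new root, which is exactly the contention point described above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable node: any change below it means rebuilding it too.
final class Node {
    final int key, value;
    final Node left, right;
    Node(int key, int value, Node left, Node right) {
        this.key = key; this.value = value; this.left = left; this.right = right;
    }
}

final class PersistentTree {
    // Single shared reference to the current root: the contention point.
    private final AtomicReference<Node> root = new AtomicReference<>();

    // Path copy: returns a new root, rebuilding every node along the path.
    private static Node insert(Node n, int key, int value) {
        if (n == null) return new Node(key, value, null, null);
        if (key < n.key) return new Node(n.key, n.value, insert(n.left, key, value), n.right);
        if (key > n.key) return new Node(n.key, n.value, n.left, insert(n.right, key, value));
        return new Node(key, value, n.left, n.right); // replace existing key
    }

    // Concurrent writers all contend here, retrying until their new
    // root wins the compare-and-set: the sequential component Amdahl
    // punishes as core counts rise.
    void put(int key, int value) {
        Node cur, next;
        do {
            cur = root.get();
            next = insert(cur, key, value);
        } while (!root.compareAndSet(cur, next));
    }
}
```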

So it tends not to scale as well from a performance point of view. I've built systems both ways, and I find the functional approaches easier to reason about, with some very nice features for development. But given current JVMs and the current techniques people use for concurrency on those sorts of data structures, they don't scale as well, probably an order of magnitude less in raw throughput, when you see how they actually stand up.

   

10. So you've built reactive systems well before the manifesto was established. Where do you see the future of reactive systems going?

I think as we go forward and core counts go up, we have to take a different approach, because as we build with multiple cores, we can't keep building software the way we have been. It's not going to scale. So there are a lot of things we need to do. One is to have a lot less sharing between different threads of execution. This is where approaches from CSP, Communicating Sequential Processes, which is, for example, now being adopted in Clojure, are very effective. Actor-based systems are also very effective.

I like to think of our modern servers as very much like the transputers of old. I started programming transputers in the early '90s, and message passing is the way to scale these sorts of systems. It works incredibly well. It becomes event driven, and you don't go poking around in the memory of other machines.

For example, say you and I wanted to share some information: I want to know your name. I could ask you what your name is by sending you a message and you could respond, or I could go over, crack open your skull, find the neurons that represent your name, and read them directly. Why are we writing software like that? Yet that is typically what happens. It's not scalable; it's fundamentally broken on so many levels. We have to stop sharing memory to communicate, and instead communicate to share memory. That has to be the future, because we've got local cores with local caches, and that's where things run fast. We then have very good interconnects between those cores for passing messages.

Think of a modern server almost as a network, a distributed system in itself. We have cache interconnects that our messages can pass through, and we can use main memory as a great big overspill to page out to, but these things need to run independently. If separate cores start working on the same data, you're just going to be hit with latency. We will never overcome latency, because of physics; we're not going to get around the physics. Throughput we can keep improving, and it will get better and better. So we need things running in parallel, very efficiently, with minimum communication overhead, and we can't do that if we share memory directly. It just won't scale.
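The name-asking analogy maps directly onto a queue-per-owner design. A minimal sketch, with all names hypothetical: one thread owns the data, and everyone else gets it by sending a message rather than reading the owner's memory.

```java
// "Communicate to share memory": the owner thread is the only one that
// touches the name; others ask for it by message over a queue.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AskName {
    record Request(BlockingQueue<String> replyTo) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Request> inbox = new ArrayBlockingQueue<>(64);

        Thread owner = new Thread(() -> {   // sole owner of the state
            String name = "Martin";
            try {
                while (true) inbox.take().replyTo().put(name);
            } catch (InterruptedException ignored) { }
        });
        owner.setDaemon(true);
        owner.start();

        // Ask for the name instead of reading the owner's memory directly.
        BlockingQueue<String> reply = new ArrayBlockingQueue<>(1);
        inbox.put(new Request(reply));
        System.out.println(reply.take());
    }
}
```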

Harry: Thanks a lot, Martin. Have a great QCon. I really appreciate your time.

Oh, thanks for the time, Harry.

Dec 02, 2013
