
Reactive Systems Architecture



Jan Machacek and Matthew Squire give us the answer to the click-baity headline “Four things that make the biggest impact in distributed systems”, together with architectural and code examples to help avoid repeating their mistakes.


Jan Machacek is senior principal engineer at Walt Disney Company. Matthew Squire is technical team leader at BAMTECH Media.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Machacek: We're going to tell you about a system that may or may not run inside our data centers, that helps us with our media processing, and it feels like an IoT style. Let's put in more buzz words.

IoT-Like Media Processing

Squire: One of the things we do at Disney streaming is we're involved in streaming live events, sports events and things like that. What happens in those cases is the media comes to us from a provider, and it reaches one of our data centers. Then it goes through some special hardware to be processed, and then what we need to do is maybe transcode that video, segment it, various things like that, before broadcasting it out to the web.

What we have is a lot of different devices, as we're going to call them, and some of them are literally physical devices. Others could be software, virtual machines, it doesn't particularly matter; the important thing is that these devices fail every so often. Sometimes they just fall over, and so if we're broadcasting lots of streams, we need some way to observe and understand the status of all the devices involved in the media processing pipeline. We want to build a system which allows us to control the stream processing devices but also gives us a global view of that pipeline, so that the operations teams can understand what's happening across the various data centers and various streams.

Machacek: The key word is global, and the goal was to take all these devices, which definitely aren't reactive in the way that we would understand reactive, and bring them into this reactive world. For better or worse, we built this, and what we would like to do today is to share some of the lessons we learned while building this kind of system. Hopefully that will allow you to go to work, maybe not tomorrow but the day after, and go, "Yes, this is exactly what I want to do. This is how we're going to improve our systems."

From Distributed System to Monolith

We have a couple of clickbait-y slides; don't worry, there is some substance behind them and we'll talk about them. These are exactly the kinds of things that we've observed while we were building the system, and we'll give you the tips to avoid them. The first one is: a distributed system without durable messaging easily grows into a monolith; a distributed monolith, which is the even worse kind.

Squire: Like I described, we have a device, and what we ultimately want is to put something in front of that device so that if I want to send a command to the device, it will go through this thing that sits in front of it. Then it reaches the device, and maybe we record some information, ongoing information about what commands this device has received in its lifetime. What its responses have been to those commands. It sounds to me like this is all pretty simple. We'll have what we'll call the shadow, which looks after the device, and we'll have the device. Can we just connect all these services up by REST APIs?

Machacek: Yes, that sounds like a perfectly sensible thing. This is a simple thing: we have a box somewhere, it has an HTTP API, we want to build another reactive component in front of it, and it's REST. It would be fine if we didn't need this global view of the world, if we didn't have lots of these devices, and if these devices didn't like to fail. This doesn't really quite work.

One of the main problems with this is that our code is perfect but this device is broken, and inevitably what people would ask us is, "What did you send to this device? How come this thing failed? It must have been you. What did the device do? Maybe you can replay what you did." I was going to suggest we just write it to a log file.

Squire: Because we've got lots of devices and lots of data centers, maybe we can just have some kind of distributed logging thing and pull all these log files together. I like log files because I can open them in Emacs and I can understand them.

Machacek: Some of the problems were, of course, that the log files were getting big and the log aggregation would be problematic. If you have a distributed system across multiple regions, multiple data centers, log files aren't necessarily the thing to do. How about writing it to a database? There were suggestions of doing that; you could take one of these wonderful cloud databases, connect to a stream of updates, and have a database as an integration point.

Squire: Well, we could use DynamoDB and even if you've got multiple regions, we can use a global table, everything is fine.

Machacek: That's also not a good idea if you want to integrate services, if you want to make sure that they can recover, if you want to replay them. Where we're actually heading is you want to have some sort of durable message queue, you want to be able to write these messages and consume them. You want to make sure that when a new service starts up, they can subscribe to events that other services have published. Going down the usual reactive architecture style; it's durable. The question is, how long do we want to keep the messages in these queues?

Squire: Forever.

Machacek: Yes, forever, except forever costs a lot of money. You want to define some sort of policy. Maybe you want to allocate disk space, maybe you want to allocate time, in any case, for a long time. That allows you to build a system where other services can subscribe to events that other services publish and have this sort of wonderfully fluid architecture. This all sounds like a perfect thing to build, unless you're building some sort of global workflow-based system, in which case, you really shouldn't be doing this. That's not our talk, so we can get away with this.
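The durable, replayable log idea can be sketched in miniature. This is a toy in-memory model, not Kafka: the names `DurableLog` and `Record`, and the count-based retention, are illustrative stand-ins for the disk-space or time budgets mentioned above.

```scala
// A toy model of a durable, replayable log, in the spirit of Kafka.
// `DurableLog`, `Record` and the count-based retention are illustrative.
import scala.collection.mutable.ArrayBuffer

final case class Record(offset: Long, payload: String)

// Retention policy: keep at most `maxRecords`, a stand-in for the
// disk-space or time budgets a real broker would use.
final class DurableLog(maxRecords: Int) {
  private val records = ArrayBuffer.empty[Record]
  private var nextOffset = 0L

  def publish(payload: String): Long = {
    records += Record(nextOffset, payload)
    nextOffset += 1
    if (records.size > maxRecords) records.remove(0) // enforce retention
    nextOffset - 1
  }

  // A service that starts up later can replay everything still retained.
  def subscribeFrom(offset: Long): Seq[Record] =
    records.filter(_.offset >= offset).toSeq
}
```

The point of the sketch is the `subscribeFrom` call: a newly started service replays history from an offset, which is exactly what a REST call or a log file cannot give you.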

Squire: Fine, X is able to subscribe to messages, it can pull them off Kafka, but how does it understand what those messages mean?

Machacek: What do you mean what they mean?

Squire: I need to be able to interpret those messages in some ways and have some sort of structure, maybe a schema.

Machacek: We'll get to that in a moment. Yes, you need to very carefully define the protocols that your applications use. There are wonderful tools that you can use, and we'll get to that towards the end of the talk, where you can actually use these protocol definitions to usefully damage your system, and that's actually incredibly helpful. We have one more: a distributed system without supervision is binary, working or failed. I don't think there's much more to add to it.

Distributed System without Supervision

Machacek: Here's what it looks like. Let's have some code; you might look at this and go, “ok, here's what we're going to do”. Let's make this device request, whatever that is; it gives us a future of a device response, and then we run it, so to speak, and onComplete, on success, great. On failure you should at least log it and then maybe try again. You're cringing, I get that, this is terrible, you don't want this infinite loop, so you might want to add a timeout. Let's invent this magical number: a thousand milliseconds later, we're going to try the same request. What could possibly go wrong?
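The naïve loop being described might look like this sketch. `deviceRequest` and `DeviceResponse` are illustrative stand-ins, not the speakers' actual code; the point is the unbounded retry with a magic number.

```scala
// A sketch of the naïve loop described above: fire the request and, on
// failure, wait a magic 1,000 ms and try the same request again, forever.
// `deviceRequest` and `DeviceResponse` are illustrative stand-ins.
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

final case class DeviceResponse(status: String)

def deviceRequest(): Future[DeviceResponse] =
  Future.successful(DeviceResponse("ok")) // stand-in for a real HTTP call

def runNaively(): Unit =
  deviceRequest().onComplete {
    case Success(resp) => println(s"done: ${resp.status}")
    case Failure(e) =>
      println(s"failed: ${e.getMessage}, retrying")
      Thread.sleep(1000) // the invented magical number
      runNaively()       // unbounded retries: the anti-pattern
  }
```

As the next paragraph explains, this is exactly the shape that turns one timeout into a cascade of timeouts.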

Here's what's going to go wrong, we've seen this in a lot of our systems, and hopefully we'll give you a tip on how to avoid it. Naïve timeouts cause, what do they cause? They cause more and more timeouts, and it's really difficult to figure out what happened. They cause other timeouts, it's unpredictable, you really don't know what happened and they cause excessive downstream load if you combine them with these sort of naïve re-tries.

Here's what happens. You might say, ok, circuit breaker, how about that? Let's do a really simple thing, so requests are coming in, then when the requests stop coming in, we're going to open the circuit breaker which protects the downstream device, ok, keep that thought. You then design this kind of cascade, and I've seen this loads of times. Service one calls service two calls service three. Then you start with this, you say, "Ok, I'm going to give this one 2,000 milliseconds for a timeout." Seems reasonable, someone put two seconds out. Then the downstream one you'll reduce it by a little bit, you'll go 1,950 because you know that the upstream one wants 2 seconds. This one wants 1,900, that's the first level of mess.

Then what happens is this, you say, "Ok, let's add retries into it. Let's give three retries to this one and two retries to this one and three retries to this one. Great British flags for everyone." Ok, then you do this retry timeout calculus, and you have 600 milliseconds here, and divide that by 2, it’s a total mess. Notice also that this number is bigger than the timeout on the first service, so if the first timeout happens, you can send another request because you've configured it that way. It's just a terrible idea.

Squire: I can't reasonably be expected to know what timeouts are involved and how many layers of timeouts there are when I'm talking to an arbitrary service. What's a better approach?

Machacek: I'll make it worse, and then I'll give you a better approach. You think this is not going to happen to you because you've written this software. The trouble is, this first service, that's team old radio that wrote it, team old telephone wrote this other service, and team floppy disk wrote this other one. They have published the protocols, but they forgot to publish a detailed description of when they are going to respond: what are their deadlines? That's a much saner way of thinking about it. It seems that on the previous slide we forgot one of the reactive principles, and we said, "Oh, timeout," which means the service is just going to forget to respond. That's a terrible idea; you need to respond with something. Otherwise, if you don't respond, you're dead, and someone is going to have to restart you or do something to you.

It's much saner to say each service has a deadline, and in this deadline it's going to respond, even if it responds with, "I don't know," or even if it responds with a pre-baked, pre-computed response. Then what you add to it is some sort of quality of service. We found out that not all messages that we send to these devices are the same.

Squire: We're just forwarding all messages; someone tells us, "Tell this device to do such and such." We pretty much just assume we can forward what we get onto a device and everything is fine. The problem is that if you treat it that way, there are hidden details that do all sorts of crazy things we couldn't possibly predict. We failed to pass through that particular header.

Machacek: Where I'm getting with the QoS is, if I want to tell the device to stop encoding a stream, that's a little bit more important than just a status message. Baked into this QoS policy that you define on each of your services, you might want to think about a retry budget; not a count, but a budget. You would say over a fixed period, over a time window, I am able to retry 20% of the requests, not each request three times, and I have a deadline in which I have to respond to my upstream service. You then externalize all of these things into some file of some kind; it doesn't matter what it is, but you need to pull it out.
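The retry budget can be sketched like this. `QosPolicy`, `RetryBudget` and the numbers are illustrative, not the authors' code; the point is that retries are bounded as a fraction of all requests in a window, not per request.

```scala
// Sketch of a retry budget: over a window you may retry a fraction of all
// requests, rather than retrying each request N times. `QosPolicy`,
// `RetryBudget` and the numbers are illustrative.
final case class QosPolicy(deadlineMillis: Long, retryBudget: Double)

final class RetryBudget(policy: QosPolicy) {
  private var requests = 0
  private var retries = 0

  def recordRequest(): Unit = requests += 1

  // A retry is allowed only while retries stay under budget * requests.
  def mayRetry(): Boolean = {
    val allowed = retries < policy.retryBudget * requests
    if (allowed) retries += 1
    allowed
  }
}
```

With a 20% budget and ten requests seen, only two retries are granted, no matter how many individual requests fail.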

If you have these timeouts baked into your code, it's almost a disaster. In 20,000 lines of code of Scala, you'll never find it, which we will talk about in just a moment when it comes to monitoring. We had some fairly unpleasant experiences by getting woken up in the middle of the night because we didn't really know exactly what our services were doing. Anyway, pull this out, externalize it, think about deadlines and QoS, rather than timeouts and retries.

Now we're here again; we know that this is a bad idea, but bear with us just a moment. You got rid of this by doing deadlines and QoS, so let's improve it. In connection with the previous slide that talked about configuration, about pulling out all these details: this is typical HTTP client code. It does wonderful things, it does asynchronous back-pressure, it talks to some RESTful API at the back end. Not a problem, this looks reactive, you can even recover with retries, this is all fantastic. Is this the right thing to do? There's so much hidden. What is hidden? Matt had the unpleasant experience of discovering that.

How do you think about supervision in this case? Here's the thing. You have these low-level components - I'll call them actors or components - and they make these device requests. We've already figured out that we have this deadline; we say that these components, when they work, they really work, they respond. Even if they cannot talk to the device underneath them, if they are lively, you know that they are working. If they don't respond - so these guys, when they run, they make device requests - if they really fail, if there is a genuine timeout, connection denied, that sort of business, then they should fail.

A failure is different from responding with "I didn't hear back". A failure means something much worse has happened, and you allow it to propagate to a supervising component, because that's the place where you can maybe recover or make a sensible decision, and then it's turtles all the way down. You build this up all the way down to the level of the JVM, and then you're running in some sort of container, so you build it up. Your goal is to keep your software running and not restart too much.
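A minimal supervision sketch, with no actor library, might look like this. The `supervise` helper is hypothetical: it runs a component, restarts it a bounded number of times on genuine failure, and otherwise escalates the failure to the next level up.

```scala
// A minimal supervision sketch with no actor library: a hypothetical
// `supervise` helper runs a component, restarts it a bounded number of
// times on genuine failure, and otherwise escalates the failure upward.
import scala.util.{Failure, Success, Try}

def supervise[A](maxRestarts: Int)(component: () => A): Try[A] = {
  def loop(restartsLeft: Int): Try[A] =
    Try(component()) match {
      case ok @ Success(_) => ok
      case Failure(_) if restartsLeft > 0 =>
        loop(restartsLeft - 1) // restart: the supervisor decides, not the caller
      case failed => failed    // out of restarts: escalate to the next level up
    }
  loop(maxRestarts)
}
```

The caller never sees the intermediate failures; either the component recovers within its restart allowance, or the failure propagates upward for a higher layer to handle.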

Distributed System without Back-Pressure

Squire: A distributed system without back-pressure will fail or make everything around it fail. If you think about what we're trying to do with commanding these devices, what we really don't want to do is break them in some fundamental way.

Machacek: You would be thinking about it and saying, "Well, what can happen to me?" So it's true that even with back-pressure, you are either the biggest thing on the internet or someone else can DDoS you, there is no question about that. You can at least be sensible and be nice to your downstream components and not hit them with everything, as long as they have a sensible way of reporting whether they can consume more traffic.

Let's take a look at this slide. You might remember we worked for this company that makes these movies. One of the critical design decisions when building the Death Star was this. Let's just go with the defaults, it will be fine.

Squire: What could possibly go wrong?

Machacek: What could possibly go wrong? It's linked to the previous slide where I talked about the deadlines and the timeouts. You have to pull out these implicit defaults, this sort of default configuration for components. Not because you don't trust them, that default configuration might be perfectly fine, but you want to be able to see it and inspect it.

Here's what we mean. It's the same code. Where are the defaults buried? They're buried here: it's the TCP pool size, the TCP connect timeout, the TCP receive window. How many retries? Over what timeframe? When we do retry, can we do a DELETE twice? Maybe we can. Can we do a POST twice? Can we do a PUT twice? And there's more, and Matthew discovered more stuff when it comes to the details of networking in Kubernetes.

Squire: Yes, and it's important to bear in mind, some of these defaults aren't things that we could control in JVM. They are things that happen above us in the kernel perhaps, or things in the container, maybe in the Kubernetes network layer. All of these things as well can influence your performance.

Machacek: Our tip, the first thing that you should do maybe on Monday, is to hunt through your code and find out where things are configured. Pull them out, even if you have to repeat it, even if you take the default configuration and then write your own that's exactly the same, because that's neatly linked to this.
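Pulling out the defaults can look like this: a settings class that repeats every default explicitly, so the values are visible and reviewable even when they match the library's own. The field names and numbers here are illustrative, not a real client's API.

```scala
// What "pull out the defaults" can look like: a settings class that
// repeats every default explicitly, so the values are visible and
// reviewable even when they match the library's own. Field names and
// numbers are illustrative, not a real client's API.
final case class HttpClientSettings(
  poolSize: Int = 32,                  // TCP connection pool size
  connectTimeoutMillis: Int = 10000,   // TCP connect timeout
  receiveWindowBytes: Int = 65535,     // TCP receive window
  maxRetries: Int = 0,                 // how many retries, over what timeframe?
  retryIdempotentOnly: Boolean = true  // never replay a non-idempotent POST
)
```

Even if every value here is identical to the library default, the point is that it now appears in your code review and your configuration, not only in someone else's source.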

Distributed System without Observability

Squire: A distributed system without observability and monitoring is a stack of black boxes. We're building a thing that allows us to observe and monitor other things, but we also need to observe and monitor our thing, so it's monitoring and observability all the way down.

Machacek: Indeed, my first question of course would be, what's the difference between monitoring and observability?

Squire: That's a wonderful one. I found that if you Google this, you find a lot of very long-winded articles trying to make what I think is a very succinct point, which is that monitoring is able to tell you when your application fails. Observability is the facility for you to find out why.

Machacek: That actually seems reasonable. You'd start at the pointy end of things: you'd monitor the OS?

Squire: Yes. It could be a New Relic agent or something like that, looking at the real resources on your real operating system.

Machacek: Then you monitor presumably the containers as in is the service running.

Squire: Is the container running? Do we have enough of them?

Machacek: Ok, we have containers. Now, what do we get out of our JVM? No, wait, observability, what is that?

Squire: Actor systems are full of all sorts of exciting things. We want to know message queue sizes, we want to know whether there are dead letters, other kinds of errors, whether actors are restarting. Timeouts as well, of course.

Machacek: Timeouts, you want to monitor the sizes of thread pools, you want to monitor the sizes of queues, and you want priority queues. People, have priority queues; don't have these unbounded queues, they'll kill you. Drop messages if you have to, but then observe. Observe that your messages are getting dropped, and then you instrument the JVM. You want to know how much memory it's using, how much CPU it's using.

Squire: All the standard things, your heap size, task size, these sorts of things.

Machacek: Now, sadly, that's all connected to our mobile phones, which is another business. All of this is actually really closely connected with that configuration that I urged you to pull out. How are you going to configure your incident management for when, say, an endpoint starts to fail? Oh, I know, when the response takes more than 1,500 milliseconds. Why 1,500? How do you know? We made that mistake of just pulling a number out of thin air. We had the [inaudible 00:20:35], we had all these dashboards, and went, no, most of it is under 1,500, so surely if it's more than 1,500, that's some error.

Squire: We start out very cautious and then the result is a lot of us got woken up at 2:00 in the morning for things that really weren't issues.

Machacek: If you've pulled out all of your configuration details, you can then configure your incident management to really match up with what your software is expected to do, with what you have told it to do, with how your components are supposed to talk to each other, and then your incidents will be the ones you really have to worry about.
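Deriving the alerting threshold from the configured deadline, rather than from a number pulled out of thin air, might be as small as this. `ServiceConfig` and the 20% headroom are illustrative.

```scala
// Sketch: derive the alerting threshold from the service's configured
// deadline instead of a number pulled out of thin air. `ServiceConfig`
// and the 20% headroom are illustrative.
final case class ServiceConfig(deadlineMillis: Long)

// Alert only when a response exceeds what the service itself promises.
def alertThresholdMillis(cfg: ServiceConfig, headroom: Double = 1.2): Long =
  (cfg.deadlineMillis * headroom).toLong
```

When the deadline changes in configuration, the alert threshold follows it, instead of a stale 1,500 ms waking someone at 2:00 in the morning.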

Distributed System without Robust Access Control

Ok, we have another one. You've built this, this is all fantastic, it's configured, it's running, you have all these services.

Squire: It's monitored, it's observable. Everything you want.

Machacek: It's monitored, you're running some reasonably secured cloud, you have VPN connections everywhere, you have firewalls.

Squire: You've got some kind of protocol as well.

Machacek: Sure, we'll get to that. Why are we worried about this?

Squire: Precisely because we have durable message queues and we have a known format for writing messages in our queue. Two things can happen, firstly, somebody can accidentally write the wrong messages, it's malformed, it's got some wrong value somewhere that's going to break something in a horrible way. Or indeed, it could be part of, say you've run an integration testing and it's the wrong environment with arbitrary data that shouldn't have been published there.

What could also happen is someone could be malicious. An open access message queue like this in Kafka, is just as bad as having no checks on a REST endpoint. You're going to do your sanity checks on your REST API, you're not going to accept arbitrary data on Kafka.

Machacek: That's exactly it. You can't just have security at the REST API/load balance level and say, "If anything comes in by the time it's reached internals of my system, it's all cool, I'll just accept anything." We've taken precisely the opposite approach and said everything is malicious, everything is bad, unless we can really prove that whatever we ingest in our systems really comes from a thing or a user that is allowed to publish such a message or make such a request. How do we do it? First question I have is this. These private keys, public keys, are they baked into the containers or what?

Squire: Surely all I need to say is things like RSA and keys, and you'll believe that it's secure, and then I'm done. No, perhaps not. The idea is this: we have a message, it's published, and it has some kind of structure to it, and maybe we have some common headers for all of our messages. What we can do is say that each service, when it publishes a command that other services might execute, or indeed an event which other services might observe and react to in some way, can sign it. We'll use a private key to sign the message, we'll put the signature alongside the message payload, and then Service B will pick it up and use the known public key of Service A to verify that signature. It will compute the same hash, see if it gets the same result back, and if it's valid, then it will do whatever it's going to do, and if it's not, then it will reject it.
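The sign-then-verify flow can be sketched with nothing but JDK crypto. A real deployment would add key management and the capability token the talk goes on to mention; `MessageSigning` is an illustrative name.

```scala
// Sketch of sign-then-verify using only JDK crypto. A real deployment
// would add key management and capability claims; `MessageSigning` is
// an illustrative name.
import java.security.{PrivateKey, PublicKey, Signature}

object MessageSigning {
  // Service A signs the payload with its private key.
  def sign(payload: Array[Byte], priv: PrivateKey): Array[Byte] = {
    val s = Signature.getInstance("SHA256withRSA")
    s.initSign(priv)
    s.update(payload)
    s.sign()
  }

  // Service B verifies with A's known public key before acting.
  def verify(payload: Array[Byte], sig: Array[Byte], pub: PublicKey): Boolean = {
    val v = Signature.getInstance("SHA256withRSA")
    v.initVerify(pub)
    v.update(payload)
    v.verify(sig)
  }
}
```

The signature travels alongside the payload on the queue; any consumer holding the publisher's public key can check it before doing anything with the message.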

Machacek: We've gone too far. Why the token?

Squire: The token allows you to specify a set of capabilities, the things that services are allowed to do. When Service A publishes a message, it will make a claim, it will say, "I want to do this thing, I have these capabilities, and here's my cryptographic signature to prove both of those." This is all sounding very much like JWT, because it is.

Machacek: Yes, I think that's exactly what it is in fact, that's what's baked into the token. In this world we don't encrypt the payloads between most services, there are some services that do really interesting stuff, I wish we could tell you about, but it's all encrypted, so we can't. Believe me, it's really cool, you'll see that later on when it's all released.

Squire: The next question is going to be about these keys.

Machacek: Yes, the keys. I was going to say about the keys. Where do they live?

Squire: Well, what we do is we just commit them to Git and then ... No, don't do that. The keys need to be stored somewhere secured, they need to be generated for each service, in each environment, they need to be unique per environment, they need to be handled very carefully. Maybe we store them in some kind of key management service, every cloud provider has their own version of this. We do not commit them to Git, we do not email them to people or put them on Slack or anything like that, apart from public keys. Public keys are, by their nature, public, unlike private keys.

Machacek: What this allows us to do is to really be much more comfortable about our testing, so we can really run the performance, we can run destructive tests, and we can be sure that the keys that we were given, even if we misconfigure it, even if we make a horrible mistake, we're not going to break a system that's already running, dare I say in production. That would be a very bad thing.

Distributed System without Chaos

Now we get to even more interesting stuff. This is all fantastic, this all works. Hopefully you've picked up some pretty interesting things, but now, chaos testing. You probably know about chaos testing; you might have heard about this other streaming company, whatever the name is, that does chaos testing. We use that, but there's one more step before you get into chaos testing your infrastructure. It took us a pretty long while to realize that we should actually do this. How about this? I have a protobuf message, and the first line of bytes, the 8, 1, 12, and so on, that seems like a perfectly good protobuf message. Then what I've done is just appended about 2 kilobytes of these 99s; it's just a number, the value 99.

Squire: Looks perfectly reasonable to me.

Machacek: Perfectly reasonable. The problem is here: this is a maliciously-constructed message, and what's going to happen is this will fail. Does anyone know how it's going to fail? Excellent, extreme silence, good. You think, "Of course it's going to fail, this is an invalid message." If I say, "parseFrom," that has to throw an exception of some kind. Ok, let's change it, let's change it to "validate." Surely this feels right, this is code generated by the protobuf generators, so you say, "X.validate," and what you should get out of it, in the Scala world, is a Try of X. It's either a success, in which case you're holding the instance of X, or it's some sort of exception. What you're actually going to get is a stack overflow exception, and that's pretty unpleasant. You think, "No, it's my fault for using protobuf, I should be doing JSON." Ok, you can try this: open up 2,000 levels of nested JSON and see what happens. It depends on the parser, but with some parsers you get exactly the same thing.
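A defensive wrapper for untrusted payloads might bound the size before parsing and treat parsing itself as fallible. Here `parse` is a hypothetical stand-in for whatever `parseFrom`/`validate` the generated code provides; note that Scala's `Try` catches exceptions but not a fatal `StackOverflowError`, which is why the size bound matters.

```scala
// A defensive wrapper for untrusted payloads: bound the size before
// parsing, and treat parsing itself as fallible. `parse` stands in for
// whatever parseFrom/validate the generated code provides. Note that
// Try catches exceptions but not a fatal StackOverflowError, so the
// up-front size bound matters.
import scala.util.{Failure, Try}

def safeParse[A](bytes: Array[Byte], maxBytes: Int)(parse: Array[Byte] => A): Try[A] =
  if (bytes.length > maxBytes)
    Failure(new IllegalArgumentException(
      s"payload of ${bytes.length} bytes exceeds limit of $maxBytes"))
  else
    Try(parse(bytes))
```

A 2 KB message padded out with junk, or a 10 MB header, is rejected before the parser ever burns CPU on it.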

This is the first level of unpleasantness. Of course, to get to a stack overflow exception, your services have to do some computation for a while, which means that all your deadlines are blown out of the water. You can really send 10 messages like that, and you've DDoS'd yourself; that's pretty unpleasant. We can make it even better: this is a protobuf definition of a message that we sent to one of the services. The good thing about protobuf, or in fact any kind of protocol definition language, is that it's machine-readable, so you can write a tool. You can go, "Ok, I can see a string," and you can now start thinking about how to generate these strings. What are possible strings? An empty string, a sensible string that says "get", and a 10-kilobyte excerpt from Shakespeare.

Bytes, similarly. What's an array of bytes? An empty array, a megabyte of zeros, 10 megabytes of zeros, 10 megabytes of random text. You have enums, which fix your choice of values. In addition to that, you can add a little bit of heuristics: "method", that looks like HTTP methods. You can have a little generator; in fact, we have a generator that we can share with you.

Squire: We don't even need to be too clever here, we're just looking for key words in the protocol definition, saying that's probably time, will generate a time.
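Such a generator's value tables might be sketched as below. These particular pathological values are illustrative samples, not the authors' actual generator.

```scala
// Sketch of a fuzzing generator's value tables: for each field type in
// the schema, emit a handful of pathological values. These particular
// values are illustrative samples, not the authors' actual generator.
object Fuzz {
  val strings: Seq[String] = Seq(
    "",                 // empty string
    "get",              // a sensible-looking value
    "\u0054\u03a4",     // a real 'T' next to a Greek Tau that looks like one
    "a" * (10 * 1024)   // a 10-kilobyte wall of text
  )

  val byteArrays: Seq[Array[Byte]] = Seq(
    Array.emptyByteArray,               // empty array
    Array.fill(1024 * 1024)(0.toByte),  // a megabyte of zeros
    Array.fill(1024 * 1024)(99.toByte)  // the 99-padding trick, scaled up
  )
}
```

Crossing these tables against every field in a protocol definition yields exactly the kind of deliberately malicious payloads described next.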

Machacek: That's exactly it. One of the payloads that you generate, given this protobuf definition, is this one, and that seems perfectly reasonable. Then you can start messing with it. This is a generated test: make the date in the past, which is going to cause an exception here. This is a good date, but now you're going to schedule something for minus four or five years. That's going to fail. You then change the date; this is not a good date, so now this is going to fail with an exception, and you should check for it.

Then you can generate this. These are fantastic, these are weird Unicode modifiers; I would encourage you to try them and see what breaks, because an awful lot of things start breaking. This was a bit of a problem with one of the card payment services; never mind. The good news is we've generated all of this stuff, which is a total mess. Except in the code that you see below, the only validation you have is for the date and for the retry misfire strategy. The code here is going to work, and what you're going to do is schedule this stuff to happen at some point later. This happened to us and it was horrific, because we would accept a message, consume it, process it, seemingly everything was fine, and then our service would crash in the future.

The next stage was horrific. We built this tool that allowed us to use our protobuf definitions to publish deliberately malicious messages on our message queues and then really see what breaks. When we were happy that everything stayed up, then we added all the infrastructure chaos, where we turned off services and scaled things up and down. What have we seen? We've seen one of these: you could generate a date with a little smiley face that really isn't a valid date. You can spot that, that was easy.

Squire: That's not a typo.

Machacek: That's not a typo, this is a real generated date. What about this one? Any tips on why that might have failed? We've seen this in our logs, and believe me, you don't want to see that in your logs. Any tips on what might be wrong? It took us a pretty long time too, don't worry.

The hint is that the T, this T, isn't exactly the right T; it's a special Unicode T. Yes, you see it in your logs and you want to kill someone, because this is just not right. Why is it failing? This is a valid date. What's wrong? Of course your service fails and then it restarts, and it consumes the same message from the durable log; it's insane. You should really test for this, and you should build a tool, or use this tool, that allows you to do this crazy testing. Then we have this, which was also great fun.

Squire: You're being silly now. This is obviously a bug in your JVM, it's not my problem.

Machacek: No, I agree it's not our problem, except we are the ones who get woken up because our software is broken. Whoever is using our services phones us up and says, "I tried to use your service, it's down, it's not there." Just a bug in the JVM. Don't worry, it's fixed, it's gone.

What you want to do is you want to generate the messages that unearth it, generate horribly broken messages. This was one of those weird 10 megabytes of bytes that pretend to be content type, which pretends to be some string with broken Unicode, that's it. You wouldn't have thought, no one is going to send me 10 megabytes worth of header. If they do, then they're going to bring your service down.

Then of course you have the final point. This is more our feeling now that we're a big company: this is perfectly good, but of course we could cause a data breach further down the line. In our position now, it doesn't really matter that it wasn't our fault. If we make a mess, our customers are going to talk to us and they'll say, "You made it, you are at fault," so test everything.

Squire: This was an exploit in Apache Struts.

Machacek: Yes, it was a total mess. There's a big list of these naughty strings, if anyone's interested, completely safe for work. They're actually really good, try every one of them and see what breaks, you'll be surprised.

Do Tell another Anecdote

We have one more thing. Back in our consultancy days, years back, this is where the talk would end. I would say, "Ok, you guys need to be building these reactive systems. This is all fantastic, message-driven, scalable, elastic, this is how you do it." But now we thought, did we improve? We are all nodding, and I agree this happens to be the way to build these systems, but is it really?

We measured some things. We took about 30 internal projects that we have at work, and for every commit, every file, every project, we built a system that classifies the kind and quality of the code. We trained it on Stack Overflow. What we wanted to know is, given this file, what's in it? What is it talking about? Is it doing Scala with back-pressure with actors, or is it doing Spring Framework with Spring Data? Based on text analysis of the Stack Overflow answers, we also had a quality signal. What I mean by that is, did the Stack Overflow text around it say, "This is a good idea, do this," or did it say, "No, this is a terrible idea, don't do this"?

We matched this with production performance from our PagerDuty, and here's what we found. This is what we actually measured: we had many source files, every source code file in every commit, all the history of our projects, and we had some incidents. These aren't exactly the real numbers, so don't read into them too much, but it's at least an illustration.

Then we were able to build functions that provide interactive diagrams of what's inside our code base. Here's one example: these are all the files in all the projects at a particular point in time. Our question on the top diagram would be, where are the best examples of performance testing in these files? Automatic discovery, which was really cool. If you're writing code, you say, "Where do I look for this really good idea? Some other team must have done performance testing. Where do I look? Oh, it's here." It of course correlates with what I'd call the project's score, which is how well they do in production. What I mean by how well they do in production is the fewest incidents and the quickest time to recovery; that's what we measure.

Then you have another one; the diagram below it shows a similar kind of thing. It gives me the best examples of logging and Kafka, for some reason. This is interesting: you can explore our code base this way. We can do one better, we can visualize all projects and ask, ok, what's a bad idea? That's not that interesting. What's more interesting is, what do the really good projects do, and what do the really bad projects do? If I want to write software that performs as well in production as possible, what should I do?

Four Things Successful Projects Do

We've done some more analysis on this, as you might expect. Grand finale, the clickbait: four things successful projects do throughout their commit history. That was one of the most important findings, it's throughout the commit history. Here they are. You want structured, performance-tested logging throughout the commit history. If your project spans 1,000 commits and you do this at commit 950, it's too late, it doesn't make a difference. You have to do it at commit four. What about monitoring and distributed tracing? Same story.
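As a rough illustration of what "structured logging" means in practice, here is a minimal hand-rolled sketch that emits one JSON object per event with named fields, rather than interpolating values into a free-text message. In a real project you would use a logging library; the event and field names here are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StructuredLog {
    // Render one event as a flat JSON object with named fields.
    static String logEvent(String event, Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{\"event\":\"").append(event).append('"');
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"").append(e.getValue()).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("service", "media-pipeline");
        fields.put("streamId", "stream-42");
        // One machine-parseable line per event, queryable by field.
        System.out.println(logEvent("segment.transcoded", fields));
    }
}
```

Because every line carries the same named fields, log aggregation and alerting can query on them instead of regex-matching free text.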

Squire: Yes, you want all these things in place. It might seem strange to say, what performance testing can I even do? I've just created Hello World in Scala; it has one endpoint, a health check. Am I going to performance-test that? Yes. If you get the groundwork in there in the first place, then you're off to the right start.
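A sketch of what that groundwork might look like at commit one: even a trivial health-check handler can be driven through a warm-up loop and a measured loop, with the build failing if an assumed latency budget is blown. The handler, iteration counts, and 1 ms budget are all invented for illustration.

```java
public class HealthCheckBench {
    // Stand-in for the one endpoint a brand-new service has.
    static String healthCheck() {
        return "{\"status\":\"ok\"}";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) healthCheck(); // warm up the JIT

        int iterations = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) healthCheck();
        long avgNanos = (System.nanoTime() - start) / iterations;

        System.out.println("avg latency: " + avgNanos + " ns");
        if (avgNanos > 1_000_000) { // 1 ms budget, assumed
            throw new AssertionError("health check too slow: " + avgNanos + " ns");
        }
    }
}
```

The numbers are meaningless for Hello World; the value is that the harness, the budget, and the failing-build wiring exist before the service grows.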

Machacek: I guess a consequence of doing that is that it forces you to set up your CI and CD pipelines. Yes, you have to have this code, but you also have to have it running right from the start, really at commit number 10 out of 1,000, not 500. By then it's too late.

Performance testing, same critical thing. If you don't do it right from the start and just add it at the end of the project, then from what we've seen internally, you're not going to be happy in production. You will have incidents, and they will take a long time to resolve. Finally, reactive architecture and code. This feels like a big, blanket statement, but it is the case: you want to think about recovery, you want to think about supervision, you want to think about timely responses to your requests. Interestingly, language doesn't make that much of a difference, you'll be pleased to hear. Framework doesn't really make that much of a difference either; we can leave that up to you.
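Timely responses and supervision can be sketched in a few lines with `CompletableFuture` (Java 9+): cap a downstream call with a timeout and recover to a fallback, so a slow dependency degrades the response instead of hanging the whole request. The 5-second dependency and the 200 ms budget are made-up numbers.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimelyResponse {
    // A dependency that is far slower than our response budget allows.
    static CompletableFuture<String> slowDependency() {
        return CompletableFuture.supplyAsync(() -> {
            try { TimeUnit.SECONDS.sleep(5); } catch (InterruptedException ignored) { }
            return "real result";
        });
    }

    static String callWithTimeout(long budgetMillis) {
        return slowDependency()
                .orTimeout(budgetMillis, TimeUnit.MILLISECONDS)
                .exceptionally(t -> "fallback") // recover instead of crashing
                .join();
    }

    public static void main(String[] args) {
        // The caller gets a degraded answer in ~200 ms, not a 5 s hang.
        System.out.println(callWithTimeout(200));
    }
}
```

The same shape appears in actor supervision or circuit breakers: the failure is contained at the call site and turned into a timely, if degraded, response.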

You don't have to fight at work; as long as you do these things, you'll probably be fine. We have two papers about this which we will share on our blog. One describes the protocol testing, what we found out and what the impact was, and the other one is about large-scale knowledge discovery and how that works. We'll publish a blog post; that's probably the safest way to communicate with you guys.




Recorded at:

Jun 14, 2019