
Using Randomized Communication for Robust, Scalable Systems


Summary

Jon Currey examines the evolving use of randomized communication within HashiCorp’s Consul, a popular service mesh solution. Along the way he considers how to evaluate academic research for production use, and what to do when real-world deployments go beyond the researchers’ assumptions.

Bio

Jon Currey leads HashiCorp's research initiatives, with the mandate to impact their open source tools and enterprise products, while contributing back to the community with novel work and pragmatic whitepapers. Prior to HashiCorp, he conducted research at Microsoft Research, Samsung Research, and Nortel. He has shipped production systems at Apple, Oracle and several startups.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Currey: Thanks for being here and for choosing this particular talk. I work for HashiCorp. I will talk a little bit about myself and HashiCorp in a bit, but let's dive into something to whet your appetite and to make this meaningful. The area we're going to be looking at today is Service Discovery and Failure Detection; this is one of the core things that Consul does. There are a bunch of other solutions, ZooKeeper, etcd, a lot of different technologies out there that will do this for you. Of course, Kubernetes is increasingly taking on this function as well.

Here's a peculiar problem that some users came to us with: there was a distributed denial of service attack. It doesn't really matter what kind of system this is; I'm specifically not saying whether these are VMs or containers or what. Whether it's bare metal, on-prem, public, or private doesn't matter; it's a generic issue that we saw in a number of different settings. I'm going to use the denial of service attack as the first motivating example, and then I'll tell you about some of the other ways that it could also manifest.

Fair enough, we've got these nodes on the edge of our deployment. Load balancing, firewall, web server, whatever makes sense in this particular context. They're getting hammered by this DDoS attack, and we would expect some of those nodes to suffer and maybe get to the point where they are marked as failed and are indeed failed, that's what we expect.

What we started to see occasionally, what our users saw, was that there were nodes in the interior that were also being marked as failed, and over time those nodes might come back. They're not going offline and staying offline; in fact, they are flapping. When you go and look at the logs on one of these interior nodes that's not directly under attack, you see that the node is healthy: there's nothing in the logs, nothing in the CPU utilization or anything that would show any problem with this node, yet the way that we had architected our failure detection system was marking them as failed. Even other healthy nodes thought that these nodes were failed. This is a globally consistent problem across the system, so not cool.

The DDoS attack is a simple motivating example, but there were other situations. There were some web servers that were overloaded doing video transcode. It doesn't have to be at the edge; edge nodes are somewhere that can get overloaded, but if you're doing video transcode and you don't configure things right, you could overload your video transcode service. Also, if you choose to use burstable instances, on AWS you've got the T2 micro instances, and I think all the other providers have them these days, where on average you're only allowed to use 10% of the CPU, and you have these credits. As you use the CPU, the credit goes down, and if you get to zero, you're throttled. That was another scenario that caused this: the exhaustion of those credits.

Behind all of this, the common thing is that you've got some nodes that are having their resources depleted, usually CPU, though it could be the network. It was those resource-depleted nodes that were causing other nodes to be accused of being failed.

I'm going to tell you how we solved that problem, in the context of the way that we use randomization. Randomization is something that has come up repeatedly as we have built out Consul and solved some of the problems that came up as people started to deploy it at greater scale. Randomization has been our friend, and I'm going to show you how.

Also, the meta-agenda here is using academic research in production systems because that's where we got most of the ideas about randomization from, so because of this double agenda, we're going to move quite quickly through the technical stuff. I'm going to point you to some previous talks that I have given on that stuff for the details. I will flag when we're going to hit fast forward and point you to those talks. Of course, these slides will be available afterwards, so all of the references, all the stuff that I'm citing here, you should be able to find afterwards from the slides, I'll make sure also all the references are linked. Of course, please just contact me if you want to talk about this stuff more.

Who Am I?

I am Jon Currey. I am an industrial researcher. What is an industrial researcher? Well, I'm in industry, I work for a for-profit company, I'm not in a university, and I do research, academic research, as in, I write and try to get papers published at peer-reviewed academic conferences. When you do that for a while, Google picks up on it, whether or not you want it to. They will create a Google Scholar page for you, which will do all the usual page-ranking stuff: it will collect together all of the papers that you've been associated with and the people you've worked with, and give you some cool metrics to see how you're doing against everybody else. We'll talk about using things like Google Scholar as a way to discover good and relevant research for your production systems.

I've spent half of my career in research and half in production systems. I was at Microsoft Research; I was very lucky to get to work at this lab in Silicon Valley. Starting in 2005, MapReduce obviously had come up, Hadoop was just coming out, and we were very interested in what we could do for Bing and AdSense and all sorts of things within Microsoft. There was a whole line of work based on using data flow graphs; what you now see in YARN and Tez and the things that eventually made it into Hadoop followed this approach. We also got to have some product impact: Bing still uses Dryad, which is essentially Microsoft's version of Hadoop.

We also got to work on Kinect. They needed to train the skeletal tracking machine learning so that people could move in front of their Xbox and it would see what was going on. They tried to make that work with an HPC platform and couldn't, so they came to us, and it was really cool to be able to do that. Then we got laid off; politics, and the lab was shut down in 2014. Seven of my colleagues immediately went to work with Jeff Dean at Google Brain and are coauthors of TensorFlow. There were some good ideas there, and Jeff Dean was salivating at the idea of getting this stuff into TensorFlow.

However, the other half of my career has been spent building production systems. I worked on iTunes early on; I was the first engineer on video and built out the whole video and audio encoding pipeline. At Oracle, I worked on the J2EE stack. At Nortel, back in the '90s, it was middleware: a real-time CORBA for embedded systems, if that means anything to anybody these days. And Iris Financial, a financial software company.

I have definitely walked multiple miles in your shoes, stabbing myself in the eye with blunt instruments as I went. iTunes continued to have these crazy outages years after I left; we could never really solve it. We were doing DevOps without the tools, and it was a learning experience to say the least. Bittersweet: at Iris we built a high-performance trading platform, which allowed an enterprising trader at Citigroup to dump 11 billion euros of European government bonds onto the market in less than two minutes and shock the market. It took multiple years of lawsuits to unwind that one. There are social consequences to the software we write.

These days I'm at HashiCorp. We have a range of tools, and they're all open-source core, and always will be. As I was just explaining, if it's anything to do with performance, scalability, and reliability, that stuff will always be open source. We do need to monetize so that people can come and work on this stuff, but we make the cut at the level of things like corporate governance, geo-scale failover, and disaster recovery. These are the kinds of features where, when you need them, come on guys, pay it back to the open-source community and allow all the other, smaller organizations to benefit from the core technology for free. That's our approach.

Consul, Serf

For service discovery and availability, we actually have two tools. This is a little stack: two tools and a library at the bottom. People mostly know about Consul; Consul is built on top of Serf, and it actually links Serf in as a library. In turn, whether you're running Consul or Serf as a service, at the bottom is memberlist. It's an open-source library, and it's where most of the service discovery and failure detection of both Consul and Serf lives. We use an approach called group membership, and we're going to talk about that.

Product Requirements and Research

When we were building our discovery and failure detection solution we had some basic requirements obviously, that you would think about as you go to build any production system. You need this thing to be robust, it's where the failure detection occurs after all. It needs to be robust to node and network failures. It has to be scalable, we were thinking for the future. What's the point of architecting something that we know won't go beyond 100 nodes?

This is a key one, and I really liked the fact that it was baked in from the start: simplicity. Whenever you're given a choice, prefer the simpler solution. For our users, if we can keep it simpler, it's going to be simpler to manage; for us, it's also simpler to implement and has fewer bugs. I loved the fact that when I came to HashiCorp, this was a first-order requirement.

We had these product requirements, but what I also really liked about HashiCorp was the research focus. Product requirements at HashiCorp always feed into an investigation, at least, of what previous work has been done in academic research. Industry increasingly does this too: Facebook, Microsoft, Google, all of these people are going to the academic conferences and talking about the work they've done in that critical, peer-reviewed way. By going and looking at published research, you're also looking at very principled descriptions of some of the best and largest systems on the planet.

When you start doing this, what you discover is that it's not a one-way street. As you start to read the research, you will find metrics and criteria, ways of characterizing and assessing the system, that you may not have thought about when you were beginning to define your product requirements. Even if you don't end up using any of the research, going and looking at it can help you think about how you describe the problem and how you measure the results.

However, if you are interested in dipping a toe in and seeing if you can actually start to use research, there's not a big commitment to get started. At HashiCorp, we use an RFC model; we're a remote-first engineering organization. We have engineers all over the world, and we try to give people flexible working, so a minimum amount of face-to-face meetings. There are some stand-ups and some team meetings, but as much as possible, every new feature is the subject of an RFC document. Somebody takes it, writes it up, and it gets circulated for review. We use Google Docs; we love the way you can give comments and do live collaboration on the document.

When it's relevant, not always, but when it's relevant, people will put a research section into the design document, and there they will start to collate the relevant papers. You can do this in an agile way. People will say, "Oh my God, it takes so long to read all these papers." Ok, break the problem down. Don't read all the papers now: create a backlog, prioritize them, pick the top paper. Maybe this week we can read this one paper. When somebody reads a paper, summarize it very briefly: the pros and cons, the trade-offs they considered, but also evaluate it against your product requirements. How would this hit our product requirements? Does it perhaps suggest a new product requirement?

This is not a completely alien process, you do this when you do competitive analysis, you do this when you do market analysis. You're going to have these matrices with the checkboxes as this one has this and this one has that. The research systems are just another kind of competing technology in some sense.

Where are you going to get this research from? Google is obviously indexing the planet, so Google is an awesome place to start. You may not know the right terminology to begin with: plug stuff in, plug in the names of the projects that inspire you in this area, and start looking for the keyword terms. The right terms will emerge from the mist over time. There are a large number of research databases out there; you can also Google "research databases" to see how many there are. Different ones have different core properties, and many of them have recommendation systems, so don't just go there and look at the papers.

Create an account, favorite the ones you like, because they're going to start suggesting stuff for you. They're really good at this, I use this, it's a great discovery tool, there's so many papers being published now, you need to leverage all the technology that you can. We have this issue now of having too much information and it's an issue of trying to drill down. This is why you have a backlog, you have to have an agile iterative approach to this, something like being on one or two of these research databases could really help you to manage your time as you try to navigate the research.

Then the websites, of course: a lot of stuff is still published through the Association for Computing Machinery, IEEE, USENIX. There are a large number of peer-reviewed venues that publish, and sometimes they have the paywall issue. Things are changing, and I haven't got time to talk about all the nuances of that, but put "PDF" on the end of the title in Google. Very often, despite the constraints of the organization, the authors will have uploaded the paper, or somebody else will. Try before you buy.

Please also do become a member of one of these organizations. If you habitually find that ACM or IEEE is useful to you, join, have your company pay. It's part of your professional training, some of these even offer a program where your whole organization can join. HashiCorp is an ACM preferred employer, it allows us to give ACM membership. It's free to employees who want it, but we were able to get it at a discount rate because we've signed up to this program and we offer this to all our engineers.

Then Twitter: academics are on Twitter, friendly research-interested people are on Twitter, so don't overlook it. It's another collaborative filter in a sense. If you think a person has done some great work, or has tweeted about some great work, follow that person and you will see more great work.

In all of this, it helps to think about research as a knowledge graph, because that's what it is. The nodes, the vertices in the graph, are the entities: who published? From what institution? A paper at what conference? Then you have the edges: who were they advised by? Where do they publish? Citations, of course: paper X cites paper Y. That's why this is an amazing graph, because every single paper references other papers. There's a really strongly embedded graph here, and it's amazing to explore it.

This is not lost on academics, there's academic research on the research as a knowledge graph, but you don't have to go down that particular rabbit hole to use this. Just keep it in your mind and develop your own entity model, these guys here, they've got a particular abstract model, but what works for you? Do you care about which lab it's in? Maybe not, but over time you'll zero in on parts of the knowledge graph that are more interesting to you.

Group Membership Protocols

We applied this initially as we were trying to build Serf and then Consul; this is the memberlist part, the discovery and failure detection part. We discovered that there is a whole literature on group membership protocols. It's a nice abstraction: there's the concept of a group of members. In the literature they typically call them processes, but don't think of an operating system process; it just means a member of the group.

Discovery works by a new member joining the group. It contacts a current member of the group, and from that it announces itself to the group; it discovers the other members of the group and they discover it. Once you've got this group, they obviously have an expectation that they should be able to talk to each other. If that doesn't work out at some point, the group can also manage failure detection: members can be removed from the group, or their status can be marked as sick or offline.

We zeroed in on peer failure detection, where the processes monitor one another. There are no special nodes outside of the group; it's not that the group is being observed by anybody else, the group members are the observers of one another. We liked this because there are no special nodes to administer, which goes back to the simplicity and robustness. There's a ton of different work in this line, and usually there is redundant monitoring; this is a no-brainer. If the members of the group are the only ones responsible for monitoring one another, each member has to be monitored by multiple other members, because otherwise it's not a very robust system: if one of those members goes down, the people it was monitoring are no longer being monitored. There's going to have to be redundancy if we're going to have peer failure detection.

The simplest approach is that everybody just sends a heartbeat to everybody else, or conversely, everyone pings everybody else, whether it's push or pull. This was the state of the art in 1996, when people were trying to build highly available parallel databases on the order of 10 nodes. It made sense: everybody just checked that everybody else was around. Then in the late '90s, you start to see webscale stuff coming up, and people thinking about peer-to-peer for file sharing and all of this interesting stuff. This didn't work anymore because it's order N-squared: as the number of nodes N in the group goes up, the number of messages goes up as its square, so this is not going to work. There was an interesting flurry of work in the late '90s starting to think about ways to break this horrible scalability problem.
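
To make the scaling argument concrete, here is a small back-of-the-envelope sketch in Go (the language Consul, Serf, and memberlist are written in). The fan-out constant is an assumption purely for illustration; the point is just that all-to-all monitoring grows with the square of the group size, while a fixed-fanout scheme grows linearly.

package main

import "fmt"

func main() {
	// Illustrative only: rough message counts per protocol round.
	const fanout = 4 // assume, say, 1 direct probe plus 3 indirect probes per node
	for _, n := range []int{10, 100, 1000, 10000} {
		allToAll := n * (n - 1) // every node heartbeats every other node
		fixed := n * fanout     // every node sends a fixed number of messages
		fmt.Printf("n=%5d  all-to-all=%9d  fixed-fanout=%6d\n", n, allToAll, fixed)
	}
}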

SWIM

SWIM is an awesome piece of work that came out of Cornell in 2002, and it's what we chose in the end. It's what underlies the failure detection and group membership in memberlist, and hence going up into Serf and Consul. The name is a bit of a mouthful. It is scalable because the number of messages per node is fixed: it's not proportional to the number of nodes, it's a fixed number of messages for each node, meaning the total number of messages in the system is on the order of the number of nodes, not the square.

Weakly consistent, this is a key point. It's not the case that everybody in the group is going to have the same view of the membership of the group simultaneously, it's going to take time for updates to propagate. You may or may not have a problem with that, in the paper they actually called out that if you have a problem with that, you could sample the view of one member and you could make that a consistent view for the whole group.

That's exactly what we do: we use the Raft consensus protocol, another fantastic piece of research that I'm not going to be talking about today. When you go from Serf to Consul, Consul still has the underlying gossip, weakly consistent view. Things do converge, and they converge very well actually, even at the order of 10,000 nodes. If you need everybody to be on the same page, the current leader of the Raft server group will be writing its current view into the key-value store, and anybody who wants that consistent view can go and retrieve it. You can have it both ways: if you want the consistent view, you can get it, but the underlying protocol is weakly consistent.
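
As a rough sketch of that "leader publishes, readers fetch" pattern from a client's point of view, here is what reading and writing a snapshot through Consul's KV store could look like with the Go API client. The key name and the serialized view are invented for illustration; Consul's real consistent view comes from its Raft-backed state rather than from user-level KV writes like this.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical key, purely for illustration.
	const viewKey = "cluster/membership-view"

	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "Leader" side of the pattern: publish the current membership view.
	_, err = client.KV().Put(&api.KVPair{
		Key:   viewKey,
		Value: []byte(`["node-a","node-b","node-c"]`),
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Reader side: anyone who needs a consistent snapshot fetches it.
	pair, _, err := client.KV().Get(viewKey, nil)
	if err != nil {
		log.Fatal(err)
	}
	if pair != nil {
		fmt.Printf("consistent view: %s\n", pair.Value)
	}
}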

SWIM is also infection based, a gossip or epidemic approach: when one node has some information that it wants to share, it talks to a few other members rather than trying to talk to everybody. That is how we avoid N-squared scaling. And it's a membership protocol. I haven't got time to give you all the details on this; I've done a talk on it in the past, and we have a white paper about it as well if you want the details. I will come back to the link for that later, but we're going to very briefly look at this to observe a couple of things.

These are the three parts of the SWIM protocol. I'm not going to talk you through all the parts; just observe that it has three stages, three steps, two of which are involved in failure detection. After it thinks there's a failure, it then has to disseminate that through the system in this gossip manner. The reason this scales nicely is that each of these steps has a fixed number of messages exchanged: there's only one random peer selected in the first round, three in the second, and three in the third. I've mentioned the randomization.
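
Here is a minimal sketch, in Go, of what one such protocol period looks like: a direct probe of one random member, indirect probes through a few others if that fails, and only then a suspicion that gets gossiped. The function names and the simulated probe results are stand-ins for illustration, not memberlist's actual implementation.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// probe and indirectProbe are stand-ins for real UDP pings; here they just
// succeed or fail at random so the sketch is runnable.
func probe(target string, timeout time.Duration) bool { return rand.Float64() > 0.3 }

func indirectProbe(via, target string, timeout time.Duration) bool { return rand.Float64() > 0.3 }

// oneProtocolPeriod sketches a single SWIM-style round: a fixed number of
// messages regardless of how large the group is.
func oneProtocolPeriod(self string, members []string, k int, timeout time.Duration) {
	// Stage 1: pick one random peer and ping it directly.
	target := members[rand.Intn(len(members))]
	if probe(target, timeout) {
		return // got an ack, nothing more to do this round
	}

	// Stage 2: ask k other random peers to ping the target on our behalf.
	for i := 0; i < k; i++ {
		via := members[rand.Intn(len(members))]
		if indirectProbe(via, target, timeout) {
			return // somebody reached it, so it isn't dead
		}
	}

	// Stage 3: no ack at all, so suspect the target and let the gossip
	// (dissemination) component spread that suspicion, piggybacked on a few
	// messages rather than broadcast to everyone.
	fmt.Printf("%s suspects %s has failed\n", self, target)
}

func main() {
	members := []string{"b", "c", "d", "e", "f"}
	oneProtocolPeriod("a", members, 3, 500*time.Millisecond)
}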

We could potentially do this without randomization. We're trying to avoid the all-to-all, N-squared communication; great, so everyone's just going to talk to a few other people. But that would greatly increase the chance of false negatives, the chance of something being missed. If you picked three or four fixed monitors, you could have a correlated failure. You'd now have to try hard to make sure all those monitors were not on the same rack, otherwise it's quite possible that some network disconnect would lead to you not seeing that somebody had died.

What's really nice with the randomization is that every node is periodically checking every other node. They're not checking that particular node so often, but collectively, all the nodes are still checking all of the other nodes. This greatly reduces the chance of a particular node failure not being discovered, we're going to come back to that a little bit later.
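
The SWIM paper itself suggests a refinement along these lines, randomized round-robin target selection: shuffle the member list and walk through it, so each pass probes every peer exactly once, just in random order, and nobody waits indefinitely to be checked. A tiny sketch of that selection, with made-up member names:

package main

import (
	"fmt"
	"math/rand"
)

// probeOrder returns a shuffled copy of the member list. Walking the result
// end to end probes every peer exactly once per pass, just in random order;
// reshuffle when the pass is done.
func probeOrder(members []string) []string {
	order := append([]string(nil), members...)
	rand.Shuffle(len(order), func(i, j int) { order[i], order[j] = order[j], order[i] })
	return order
}

func main() {
	members := []string{"b", "c", "d", "e", "f"}
	for pass := 0; pass < 2; pass++ {
		for _, target := range probeOrder(members) {
			fmt.Printf("pass %d: probe %s\n", pass, target)
		}
	}
}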

The proof is in the eating, right? We've been using this for over five years now. We have made some refinements to it, which we mention in our white paper. We have a lot of users who are routinely taking this up to the order of tens of thousands of nodes in a single group. I know that some of the large cloud providers, for example, would internally never put 20,000 machines in a single logical unit; they would have the "5,000 machines is enough, thank you very much" conversation with certain users. We're not saying you shouldn't go past 5,000, that it only works to 5,000. The good news is, if you want to put 20,000 nodes in one group, we can handle it.

Network Distance

We were really happy with SWIM, though I don't think we really appreciated the benefit of randomization at that point. We had diligently implemented the protocol, and then sometime later we had some new requirements coming in. One of the things people were interested in was computing network distance: I now know these other members, but there are multiple nodes offering me the same service, so which one should I talk to?

Of course, in terms of load balancing, round robin or sharding, there are all sorts of strategies that might be used here. As part of that, you may be interested to know which of those nodes is closest to you, logically closest in terms of round trip time, or even to rank them by distance and feed that into your load balancing. The same goes for failover, in the event that you need to go to a whole different data center: now the question is, which is my new closest data center?

Continuously understanding the latency/distance to all the nodes in the system is a useful thing. We had the request, we built a solution, and in Consul, you can go and ask about the round trip time. Everything on the command line is against the REST API, so you can do this programmatically or from the command line you can script it or code it. We have this capability within Consul to go and ask for what we call the network coordinates of any given node or set of nodes.
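
For example, assuming the current shape of Consul's Go API client (the coordinate endpoints and the consul rtt command expose the same data), fetching coordinates and estimating round trip times could look roughly like this; treat the exact method names as an assumption to verify against the client version you have:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the network coordinates of the nodes in the local datacenter.
	entries, _, err := client.Coordinate().Nodes(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Estimate round trip times from the first node to the others.
	if len(entries) > 1 {
		from := entries[0]
		for _, to := range entries[1:] {
			fmt.Printf("%s -> %s: ~%v\n", from.Node, to.Node, from.Coord.DistanceTo(to.Coord))
		}
	}
}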

Vivaldi

In the end, we implemented a system called Vivaldi, a great piece of work out of MIT. Frans Kaashoek has been a serial innovator since his own Ph.D. in Amsterdam. Russ Cox is one of the Go developers, if you're into Go; this is a paper he worked on in grad school at MIT.

Because it's so cool, I have to share with you how Vivaldi works. Imagine each peer as a ball; it's a physics experiment, and we're going to push these all together. Then each time we get a round trip time from a ping between two of them, it's as if we insert a spring between those two balls, if there wasn't one there already. The natural length of the spring reflects the round trip time, the ping time, or the distance between them. Then we relax the system: we let the springs try to find their natural length. As you get more pings and more round trip times, what's really cool is that essentially everybody moves around in the space and the thing adjusts itself.
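
Here is a toy, two-dimensional sketch of that spring update, just to make the idea concrete. The real Vivaldi algorithm (and the implementation Consul uses) adds an adaptive confidence weight and a "height" term for access-link latency, both of which are omitted here.

package main

import (
	"fmt"
	"math"
)

type coord struct{ x, y float64 }

func dist(a, b coord) float64 { return math.Hypot(a.x-b.x, a.y-b.y) }

// observe nudges our coordinate along the line to the peer so the predicted
// distance relaxes toward the measured RTT, like a spring whose rest length
// is the RTT. delta controls how far we move per sample.
func observe(self, peer coord, rttMs, delta float64) coord {
	d := dist(self, peer)
	if d == 0 {
		// Coincident coordinates: nudge in an arbitrary direction.
		return coord{self.x + delta*rttMs, self.y}
	}
	ux, uy := (self.x-peer.x)/d, (self.y-peer.y)/d // unit vector away from the peer
	err := rttMs - d                               // positive means we are too close
	return coord{self.x + delta*err*ux, self.y + delta*err*uy}
}

func main() {
	a, b := coord{0, 0}, coord{1, 0}
	for i := 0; i < 50; i++ {
		a = observe(a, b, 25.0, 0.25) // repeated 25ms RTT samples between a and b
	}
	fmt.Printf("estimated distance a->b: %.1fms\n", dist(a, b))
}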

Inside of Consul, I think we should actually add a visualization of this. There's essentially this little multidimensional world where these little nodes are all drifting apart to be the appropriate distance from one another, and when they settle, the distances between them reflect the real round trip times.

What's really cool is, you don't have to have everybody ping everybody. We can't do the N-squared thing; the only way to know the time or distance between two servers can't be to have a direct ping between them. Here we are essentially sampling only very sparsely all the possible routes between all the different nodes. This is where it turned out the randomization was really nice. If we had opted for a group membership and failure detection system based on something else like a ring or a tree topology, which were the other things that were considered, not only were we concerned about the brittleness of those, with very little work done on proving how stable they would remain under continuous churn, for example.

They would also only have a very narrow pattern of communication; you wouldn't have a lot of lateral communication across these hierarchies. We think the Vivaldi round trip time estimates would not converge anywhere near as fast as they do if we didn't have the randomized pattern. We're very happy at this point; score. That's two different ways the randomization helps us now.

Back on the research tip: for the Vivaldi thing, we trawled "network coordinates", "network distance"; different communities, or sub-groups within the same community, sometimes have different terms, and you have to manage that in your backlog and your document as you're searching the research graph. Interestingly, Vivaldi had been done 10 years prior. What's cool about that is that, with that perspective, the research community had gone in and done follow-up work responding to it.

Those papers are not cited in the Vivaldi paper, but if you go to Google Scholar and click on the Vivaldi paper, they've done all the indexing work: you can see the list of every paper that's ever cited it. That's gold. You can immediately go and look at those papers, put them in your backlog, and try to triage which ones you think are more likely to be relevant. You read the abstracts, skim them; an agile, iterative approach to processing this new subgraph of the knowledge graph that you've bumped into.

I haven't got time to take you through that story, but Armon Dadgar, the cofounder and CTO of HashiCorp, did a fantastic Papers We Love talk. I'm going to plug Papers We Love now; I don't know how many people know about Papers We Love or have been to Papers We Love. I will mention it again later, but if you're interested in research that's relevant to production systems, Papers We Love is a fantastic resource that I encourage you to check out. Armon did a great talk going through all the different bits of research that we looked at and the bolt-on extensions that we used to make Vivaldi stable.

Flappers

What about that flapping business from the beginning? I teased you that randomization was going to be the thing that helped us here. The first thing to point out is that this problem does not occur because of the use of randomization in SWIM; that was lucky. I'll talk about the production side first and then the research. We figured it out, and it came through a number of escalations from paying and non-paying users: "Help us out here, what's going on?" We were intrigued after we saw it a couple of times.

The epidemic nature of the SWIM protocol is mostly good at handling slow-running or dead nodes. In this case, messages are trying to get from B to A. D is dead, so D can't help; people don't know that yet, it hasn't been announced. E and C and others are trying to send messages to D, which aren't going to get there anytime soon, and D isn't going to propagate the message. Luckily, we can flood around it: bad or slow or dead intermediaries can be handled by SWIM.

What we found was that in this case B was being accused of being dead, and it's gossiping back saying, "No, I got that rumor that I'm dead. I'm not dead." A, unfortunately, was the one who was asking, and the Achilles heel is when the node that was asking whether another node is dead is itself running slowly. It's not dead, but it's running slowly, and the reply can even have come back and be sitting in the machine. It could be in the process, in a Go channel or a message queue; if it just doesn't get processed and that timeout doesn't get canceled in time, then B is going to be accused of having failed when it hasn't.

Suddenly you see why all those different forms of CPU starvation matter: the burstable instance, the competing process, a video transcode going overboard, or a firewall doing overtime in a DDoS. Whatever it was, the memberlist agent on node A was only getting to run periodically. It wasn't dead; if it were dead, or if other people thought it were dead, they would ignore it. At this point, no one else knows that it's unhealthy yet, and it's going around accusing other people of being unhealthy. That was the root of the pathology here.

It's something they missed, something the academics missed. Werner Vogels, now the CTO of AWS, was a professor at Cornell at the time; they thank him in the paper, because he helped them get together 55 computers. This was distributed systems research in 2002 at an Ivy League school: they had good resources for a university, but 55 computers. It's amazing it works as well as it does, given that they couldn't drill down into these subtle pathologies.

I don't have time to give you all the details here; I'm going to show you a couple of slides to give you the feel for our solution, but that other talk will take you through the details. We call our solution Lifeguard, and the heuristic is based on local health. As we were saying, the issue is that the node that had the suspicion, the node that issued the probe wanting to know about somebody else, locally has a problem itself. We call this local health, and it's a very basic heuristic. We essentially say, "We know this protocol: if you sent five messages in the last 500 milliseconds, you expect to get back five replies," or whatever the right number is for the protocol. "If you don't get messages back from anybody, given the nice randomized embedding here, it seems more likely that it's a problem with you rather than with everybody else."

All we did was back off. It's a simple back-off protocol: that node just says, "Hmm, no one is talking to me right now. What are the chances it's me versus everybody else? I'm going to slow down and give those guys longer." We'll talk about the implications for overall failure detection in a couple of slides, because that would be an obvious concern: if nodes are running more slowly, failure detection is going to take longer.
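
As a rough sketch of that heuristic (not memberlist's actual implementation, which tracks a few more events and also stretches the suspicion timeouts), you can think of local health as a small saturating counter that stretches your own probe timeout when your recent probes stop getting answered:

package main

import (
	"fmt"
	"time"
)

// localHealth is a sketch of the "local health" idea: a saturating counter a
// node uses as an estimate of its own health, based purely on whether its
// recent probes got the replies they expected.
type localHealth struct {
	score, max int
}

func (lh *localHealth) probeSucceeded() {
	if lh.score > 0 {
		lh.score--
	}
}

func (lh *localHealth) probeMissed() {
	if lh.score < lh.max {
		lh.score++
	}
}

// timeout stretches the base probe timeout when we look unhealthy, so a slow
// node backs off before accusing healthy peers of being dead.
func (lh *localHealth) timeout(base time.Duration) time.Duration {
	return base * time.Duration(lh.score+1)
}

func main() {
	lh := &localHealth{max: 8}
	base := 500 * time.Millisecond
	fmt.Println("healthy timeout:", lh.timeout(base))
	for i := 0; i < 3; i++ {
		lh.probeMissed() // nobody answered us; maybe the problem is us
	}
	fmt.Println("after 3 missed replies:", lh.timeout(base))
}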

It worked really well: on the left is without Lifeguard, on the right is with it. There are three parts because the protocol has three parts; we slow down each of the three stages of the protocol depending on the signals we get. This is testing in pathological situations, torture testing: we're not killing nodes, but up to a quarter of the cluster is being tortured, and we're able to really heavily filter out this pathology.

I intimated that randomization really helped with this; we were surprised at how easy it was to tune this thing. Like, "Wow, we got lucky with the tuning parameters." Then we changed the tuning parameters and we still got lucky. Well, no, we weren't actually getting lucky. What's going on here is that every node is still checking every other node, just not as often. If there are three nodes we think are bad, depending on how aggressively we've tuned this, we just slow down those three nodes. Because of this probabilistic, randomized approach to failure detection, there are still 97 nodes checking all 100 nodes. You've essentially weakened the strength of your failure detector by 3%; it doesn't mean that it will take much longer to detect a real failure, because you've still got 97 healthy failure detectors on the job, so it's graceful degradation. It actually gets better as the group gets larger, so this was a very pleasant discovery. Now we're up to three ways that randomization wins; we even used it to plug the one apparent flaw in SWIM's original design.
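
A quick back-of-the-envelope calculation shows why the degradation is graceful. Suppose each node probes one random peer per protocol period; then the chance that a given failed node gets probed by at least one healthy node in a period barely moves when a few probers slow down. This is only a rough model, and it ignores that the slowed nodes still probe, just less often, so it understates the real behavior.

package main

import (
	"fmt"
	"math"
)

// detectProb is the probability that at least one of h probing nodes picks a
// given failed node as its random target in one period, in a group of n.
func detectProb(n, h int) float64 {
	return 1 - math.Pow(1-1.0/float64(n-1), float64(h))
}

func main() {
	n := 100
	fmt.Printf("99 healthy probers: %.3f\n", detectProb(n, 99))
	fmt.Printf("96 healthy probers: %.3f\n", detectProb(n, 96))
}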

Lifeguard

This is where it gets interesting with the research. We were able to pick the literature search back up: we had that backlog, we went back and looked at the other papers, we re-read the papers we'd looked at before. We went further back into the backlog and explored deeper into the graph to try to find where somebody else had hit this problem, either with SWIM or generally. This local health notion actually still doesn't seem to have appeared in the literature, which is a really interesting discovery to be made at an open-source tool company rather than at a university or at Google.

What we did get from looking at the work was: what are the right metrics to use? What are the right ways to construct good benchmarking experiments, so that we can actually test stuff in a principled manner and prove causality in some way? Once we had those benchmarks, they had persistent value. You can use them as part of your CI: you want to see if there's a regression, who killed the performance, why is failure detection suddenly taking 50% longer? Build the benchmarks and keep the benchmarks, because they become a live part of your engineering organization. You can also use them for competitive analysis and so forth.

You can also start to write a paper, which is what we did. Ultimately, we were able to get the paper published at IEEE DSN, the conference that SWIM came out at, which was fitting. We brought it home, and we talked to the authors; the lead author of SWIM was very happy that we had taken their work and managed to extend it and prove it out.

It was an iterative process. I'm the research team, the beginning of the research team at HashiCorp, but I was working with engineers who are on the hook for continuing to deliver the product. We did this iteratively: we took those benchmarks, we had an internal report, a webpage where things are run and the results generated. Ok, now we can have a white paper; we can communicate and preserve this knowledge within the engineering org. Sales might be interested to share it under NDA with a prospective client, or with a large customer who is concerned to understand how robust our stuff is: "Well, look, we did this."

arXiv is an interesting place where people put preprints. It used to be for physicists, but since deep learning came along there's been such an explosion; people just want to get their work out there, and they can't wait nine months for it to get published. If you start doing this, you're going to read a lot of arXiv papers; that's where things appear before they get published in peer review.

Eventually, as we iterated on this, we built it out and got to a published version of the paper, which had a lot of value for the org: I can go to the CEO with it, we've got blog posts, we've got tweets, we do conference talks. There are all sorts of different ways it adds value, in addition to strengthening the product.

Now we're in the graph, which is really cool, because I get these emails telling me, "Hey, there's a new paper that cited your paper," and those papers are really relevant to our work. It's fascinating. Sure, I could go back and keep checking the SWIM work, but by becoming a node in the graph, it's the social filter: people will come and talk about the same stuff if it's relevant.

Benefits of Research and Where to Begin

Conferences sometimes say, "If you want to go to this conference, here are the talking points to tell your boss." If you're going to go back and say, "Hey boss, we need to start thinking about research," here are the talking points. We could get better algorithms out of this; we could at least get relevant metrics. This could help with talent: maybe we can get some interns who want to work on a research project, and maybe some full-time people who think that a company that does research is a cool place to work. It will boost our reputation with customers and potential customers, and internally, employee satisfaction could go up. All of this is the experience of HashiCorp: the inbound requests I now get for internships, the number of people we hire who say, "When I saw that you guys had done that piece of work, it tipped the balance for me." There are many ways this can benefit your organization, even if you don't end up implementing one of those algorithms.

Where do you begin? Papers We Love: go to the Papers We Love website. London has an awesome Papers We Love chapter, and it's all over the world. They have a GitHub repo where people file papers by topic. Go to the meetup; all the videos from previous meetups are on YouTube, so you can go and watch these awesome talks. Go there and network with other people in industry who are interested in research.

"The Morning Paper," Adrian Colyer, Monday through Friday, not every week of the year, he'll drop a really interesting paper in your inbox with his explanation of it. It's a really nice resource and it just gives you that little drip feed of the stuff to think about. Then at work, start a reading group, have a brownbag, do a little mini Papers We Love within the company, this is something I found really cool. Actually, Bryan Cantrill from Joyent organized Systems We Love. That was nice because his observation was, "There's all this amazing technology that never did have an academic paper." If you want to broaden it from Papers We Love to Systems We Love, go ahead, because you can also talk about a really cool system, it doesn't have to have been published in an academic conference.

Then maybe you have a colleague who started graduate school. Maybe they did a master's or dropped out of a Ph.D. program, you may already have someone who knows a little bit about the research method in your organization.

Outbound: email the authors, tweet with them, go to a conference; as part of your professional training, you can go to an academic conference. Blog or tweet about your problems, which may attract interest. Have a Ph.D. candidate as an intern, or even an undergraduate who's interested in research. And you may already have people in your organization who have a Ph.D., or who worked towards one and then stopped and pivoted.

If you do bring in a Ph.D. intern, there's mutual benefit: they would love to engage with industry. You've got a real problem, you've got a code base, you've got data, and they usually don't have those things in academia. They would love to come and engage with you on an industrial problem, and the work they do as an intern, they're an employee, so it's your company's intellectual property. There's no compromise here. It gets tricky when they go back to school; you have to be a bit careful, and either patent what you want to patent before they go back or give up on patenting. You don't have to do this over Twitter: go to a conference, go to the poster session, talk to them about their work, tell them about your problem, see if there's a mutual match, or email people.

Last of all, researchers are people too, they're doing a job, they get graded. They get raises based on impact, they're talking about it on Twitter. In my experience, when you reach out to somebody to talk about their work, they're always excited to hear from you. I hope that you will now, if you aren't already doing it, go out and think about trying to apply research in your production systems.

 


 

Recorded at:

Jun 24, 2019
