Open Source Linkerd Project Celebrates First Anniversary in Quest to Become TCP/IP of Microservices
Buoyant, a cloud-native services company, announced the one-year anniversary of Linkerd (pronounced “linker-DEE”), an open source “service mesh” project for cloud-native microservices-based applications. As stated in the announcement:
Similar to TCP/IP’s transformation of network communication in the 1990s, which enabled an industry-wide shift from mainframes to client/server architectures, Linkerd’s growing adoption as a fundamental network layer for next-generation cloud applications is enabling enterprises to shift their computing architectures from monolithic applications to microservices without sacrificing reliability.
Released as version 0.1.0 in February 2016, Linkerd was created by former Twitter engineers, William Morgan, now CEO at Buoyant, and Oliver Gould, now CTO at Buoyant. Linkerd was built on Finagle, “a protocol-agnostic, asynchronous RPC system for the JVM that makes it easy to build robust clients and servers in Java, Scala, or any JVM-hosted language,” deployed in production at Twitter.
The diagram below demonstrates how Linkerd can be deployed to form a service mesh for application instances:
Buoyant recently released version 0.9.0 of Linkerd, the latest milestone in the project's development.
William Morgan, founder and CEO of Buoyant, spoke exclusively to InfoQ about this milestone.
InfoQ: What are your current responsibilities at Buoyant, that is, what do you do on a day-to-day basis?
William Morgan: I'm the CEO, which means in an engineering-heavy company like Buoyant I spend most of my time just trying to keep up with the engineers. Their job is to build great products; my job is to build a great company around them that can support them, and can translate the value they're creating into something that the external world can appreciate.
InfoQ: Can you tell our readers more about your company, Buoyant?
Morgan: We build open source operational reliability software for cloud-native environments. Our mission is to empower companies everywhere to build safe and resilient mission-critical applications. We're a very engineering-heavy company, and we draw on our collective experience at companies like Twitter, Google, Microsoft, and Yahoo to build the things that engineers, especially SREs and DevOps folks, will need to operate their applications reliably and safely. We've all been on-call and had to wake up at 3:00am for silly reasons, and so our goal is to try and reduce that. Call it empathy for our fellow on-call engineer.
InfoQ: What is Linkerd and why do microservices and cloud-native applications need a new "service mesh" at the communications layer?
Morgan: Linkerd is a “service mesh,” which is an infrastructure layer dedicated to dealing with time-sensitive service-to-service communication. In contrast to traditional network stuff, the service mesh operates at the request level. So we’re not talking about packets or bytes, but rather requests which result in responses. Service Foo will talk to service Bar, and wait for it to respond, and when it does, Foo will process the result, and then relay a result of its own back to its caller. And if Bar doesn’t respond in time, then Foo has to do something about it.
Of course this request-response pattern has been around since the beginning of networked programming, but what is changing is that with microservices, with cloud-native applications, this communication is happening tens or hundreds of times inside of an application for every call that’s made to the app. So if there are hundreds of services, and each service runs as hundreds of instances, and you’ve got hundreds of requests every second, then you end up with an immensely complex request flow going through the application. And instances are dying or getting overloaded or being rescheduled all the time…it gets really complicated.
The goal of the service mesh is to decouple the operational complexity of this model. Move it outside of the application, so the application stays pure. The app code just says, “Hey, I’m service Foo, and I need to send this request to service Bar”. And the operational stuff, things like retries, timeouts, deadlines, load balancing, and service discovery, which are not only incredibly difficult to get right, but are critical to get right - the application won’t stay up for long if they’re not right - is handled separately. It lives in a separate layer, where it can be managed independently of what’s going on with the application.
Linkerd, in its current form, is a userspace proxy because that’s what is easiest for people to use. It’s kind of like HAProxy on steroids. But ultimately the service mesh concept reaches far beyond the proxy model.
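To make the idea concrete, here is a minimal, purely illustrative Python sketch of the kind of operational logic Morgan describes - per-request timeouts, retries, and load balancing across instances - that a service mesh factors out of application code. The function and instance names are hypothetical; this is not Linkerd's implementation.

```python
def call_with_reliability(instances, send, timeout=0.5, retries=3):
    """Send a request to one of a service's instances, retrying on failure.

    Illustrates the operational concerns a service mesh handles outside
    the app: per-request timeouts, retries, and (naive round-robin)
    load balancing across instances.
    """
    last_error = None
    for attempt in range(retries):
        # Naive load balancing: rotate through the instance list.
        instance = instances[attempt % len(instances)]
        try:
            return send(instance, timeout=timeout)
        except TimeoutError as err:
            last_error = err  # this instance was too slow; try another
    raise last_error


# Usage: a fake transport where one instance always times out.
def fake_send(instance, timeout):
    if instance == "bar-2":
        raise TimeoutError("bar-2 did not respond in time")
    return "response from " + instance


result = call_with_reliability(["bar-2", "bar-1"], fake_send)
print(result)  # the retry falls through to the healthy instance
```

With the mesh in place, the application code reduces to the single `send` call; everything else in the wrapper is handled by the proxy layer.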
InfoQ: Can you tell our readers a little bit about your experiences at Twitter with Finagle and how Linkerd builds on Finagle (one of the key technologies often credited for helping kill the "Fail Whale" at Twitter)?
Morgan: Finagle was a huge part of how Twitter defeated the Fail Whale (I’ll say “defeated” instead of “killed”, because I like to imagine the whale swimming off forlornly to go harass someone else, rather than, like, whale guts everywhere).
Twitter decided to break down the monolith into lots of different services. We didn’t have the word “microservices” back then, but that’s really what we were doing. And Finagle started out quite simple; it was going to be the library we used to make calls from one service to another. And almost every service at Twitter would use a Finagle client to talk to another service, which would use a Finagle server to receive that request. So Finagle was sitting on both ends of every request.
It turns out that a vast number of problems that Twitter had with uptime and reliability in the early days were due to the way that services were interacting with each other. So Finagle was the place where we could solve all these problems. Over time Finagle got load balancing, and service discovery, and retry logic, and circuit breaking, and routing, and naming abstractions…all sorts of good stuff. From the Finagle perspective it was almost like Twitter was a bunch of Finagle-controlled connections with some messy application code in between.
And so many operational lessons were encoded into Finagle. Twitter would go down in some new or novel way, and tolerance for the root cause would be encoded in Finagle, and that cycle would repeat. So over time Finagle became incredibly mature and able to handle all sorts of weird edge conditions. Because that’s the story of distributed systems, they’re 1% normal operation and 99% weird edge conditions.
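One of the tolerance mechanisms Morgan mentions, circuit breaking, can be sketched briefly: after repeated failures, calls to a struggling service fail fast instead of piling on. This is an illustrative sketch of the pattern only, not Finagle's actual implementation (Finagle's equivalent is its failure-accrual logic).

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch: after max_failures consecutive
    failures the circuit "opens" and calls fail fast until reset_after
    seconds have passed, at which point one trial call is allowed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

The point of encoding this once, in the communication layer, is exactly the cycle Morgan describes: each outage teaches the layer a new tolerance, and every service benefits.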
InfoQ: In your earliest days of Linkerd you cited some goals to bring the power of Finagle to the masses. How does Linkerd expand on Finagle and make those abstractions for service communication easier for mainstream enterprises to tap into?
Morgan: The big difference is that Linkerd wraps Finagle up into a standalone proxy that you just run without having to understand the details of how it’s implemented. It doesn’t care what languages your services are written in or what server framework you’re using. Finagle is a library, so you can really only use it if you’re on the JVM. Linkerd can be used anywhere.
Another difference is that Linkerd gives you just the operational model. Finagle is two things: it’s a programming model and an operational model. And the programming model is very elegant, very expressive, it’s functional programming of RPC calls and it lets you write just beautiful code. Amazing code. The best code. We’ve thrown all that away. We just give you the operational stuff. In fact, we work very hard for you to not have to know anything about Finagle when you’re using Linkerd, other than “hey it’s built on this thing that’s really reliable, and powers Twitter and Soundcloud and Pinterest and a bunch of other companies”.
InfoQ: Linkerd for microservices has been compared to how TCP enabled the move to client/server two decades ago. Is this a reasonable comparison?
Morgan: Eminently reasonable. I see myself as Vint Cerf in a hoodie. No, of course that’s a bit of an aspirational comparison. But the opportunity is there. For one, it’s a layer of abstraction. TCP allows a network programmer to say, “hey, send these bytes from here to there” without worrying about packet loss or duplication, about routing, about flow control, about the fact that multiple applications might be sharing the same IP address. Similarly, Linkerd allows the application programmer to say “hey, send this request from here to there” without worrying about timeouts, about deadline propagation, about service discovery and balancing across multiple endpoints, about retries or circuit breaking.
More broadly speaking, there’s a huge, industry-wide shift that’s happening in the way that applications are being architected. That’s the whole “cloud-native” transformation. Companies are moving onto things like Docker and Kubernetes and microservices, onto the cloud-native stack, and they don’t really have a choice about it. It’s just a question of when. The scale at which they’re expected to operate is ever-increasing; but the virtualized hardware has really poor reliability semantics; and they all need to iterate really fast on product at the same time. Those roads lead to cloud-native architecture.
And that huge shift, at such a fundamental level in the stack - it’s analogous to the way that the shift from mainframes to client/server enabled a whole industry built around TCP/IP. A whole industry around networking hardware was created from that shift in the 1980s, companies like Cisco. A similar thing is happening now, up the stack a bit. So that’s our secret plan: be the Cisco of microservices.
InfoQ: What are some example use cases where Linkerd is important?
Morgan: Any system where there are multiple services, and where performance and reliability are critical and SLAs are in play, is a candidate for Linkerd. Many of Linkerd’s earliest adopters were payments processors and banks like Monzo and Zooz, who were building their infrastructure on top of cloud-native platforms and really cared about reliability. Every second of downtime is money lost. The transactions going through their system literally have money attached to them. That’s the kind of use case where Linkerd really excels.
InfoQ: Tell us about the roadmap for Linkerd. What's been added over the last year since it was first introduced, and how are you expanding its capabilities?
Morgan: Well, performance and resource consumption are always things we’re working on. We want Linkerd to be smaller, faster, lighter. We have some pretty cool top secret things we’re working on to make that happen that I’m really excited about, which we’ll be announcing soon.
Also in the works is support for more protocols and support for raw TCP. And, of course, tighter integration with our cloud-native brethren, especially Kubernetes - it’s pretty easy to plug Linkerd into Kubernetes now, but there are some things in the pipeline that will make it much, much easier.
Finally, we believe that multiplexing protocols like gRPC and HTTP/2 are going to be the obvious choice for large scale distributed systems in the future. Linkerd has great support for those protocols already and we’re continuing to invest in making that even better.
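For readers who want a sense of what running Linkerd looks like in practice, the proxy is driven by a YAML configuration declaring routers and namers. The sketch below is illustrative only, loosely following the style of the 0.9.x getting-started docs; field names and defaults vary by version, so consult the Linkerd configuration reference before using it.

```yaml
# Illustrative minimal Linkerd (1.x-era) config; ports and paths are
# placeholders, and exact fields depend on the Linkerd version.
namers:
- kind: io.l5d.fs        # file-based service discovery, for local testing
  rootDir: disco

routers:
- protocol: http
  dtab: |
    /svc => /#/io.l5d.fs;   # route /svc/<name> through the fs namer
  servers:
  - port: 4140
    ip: 0.0.0.0
```

Applications then send requests through the local proxy port instead of addressing services directly, which is how the mesh interposes on every call.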
Additional information on Linkerd can be found via the following resources:
- Getting Started with Linkerd
- How to Use Linkerd
- Linkerd Joins the Cloud Native Computing Foundation blog post by William Morgan