This week on The InfoQ Podcast, Wes Reisz talks with Oliver Gould, CTO of Buoyant. Buoyant is the maker of the Linkerd service mesh and the recently released Conduit. In the podcast, Oliver defines a service mesh, clarifies the meaning of the data plane and the control plane, discusses what a service mesh can offer microservice application owners, and, finally, discusses some of the considerations that went into developing Conduit.
Key Takeaways
- A service mesh is dedicated infrastructure that handles inter-service communication.
- There are two components to a service mesh: the data plane handles communication, and the control plane handles policy and configuration.
- Linkerd and Conduit are two open source service meshes made by Buoyant. Conduit has a small memory footprint and takes a convention-over-configuration approach to service mesh deployment.
- Adopting Rust (the language used to implement Conduit's data plane) requires thinking about memory differently, and the best way to learn it is to read other people's code.
How did Buoyant get started?
- 1:30 I left Twitter on a Friday and started the company on a Monday.
- 1:40 I really liked the tech I was working on at Twitter, and the team.
- 1:45 The company changed over the time I was working there, from 250 to 2,500 people.
- 1:55 I learned a lot there, from orchestration to Mesos to distributed systems in a containerised world.
- 2:10 We thought we had learned a lot and could help people with those problems in production.
How has the vision evolved?
- 2:30 Although the libraries we had been working on in Twitter were really nice to program in, they had a lot of operational code in them as well.
- 2:45 When we had incidents, we would remediate them by reworking the library to avoid the problem or to degrade more gracefully.
- 2:55 I understood the operational value: in a containerised world everything goes over the network, and you need tools to manage that.
- 3:10 We started by putting Finagle in a proxy - so Linkerd is proxy-heavy, but it allowed teams to innovate faster.
What is a service mesh?
- 4:00 In a microservices architecture, communication between services happens over the network, and that communication becomes operationally important.
- 4:20 I view a service mesh as the tooling that helps you manage your microservices operationally.
- 4:30 Twitter has a bit of a service mesh with Finagle being able to control some aspects of the communication.
- 4:40 What we have seen with service meshes is taking all that code out of the application and into the proxy layer where we can have a uniform way of handling the traffic.
What’s the difference between libraries like Finagle and service meshes?
- 5:10 Netflix and Twitter are able to dictate how software is written, and they can impose a way of working.
- 5:25 In the real world we can’t impose that, and companies tend to have diverse technology stacks.
- 5:30 Technology stacks change over time and so there is the need to solve this problem outside of libraries.
- 6:00 Having the unit of deployment be a Kubernetes pod, with a set of colocated services, and containers as the packaging format, really makes it appealing.
What tools exist for service meshes?
- 6:20 I would say Finagle, Hystrix, Linkerd, and a swath of new projects in the last year: Istio, NGINX - it seems like everyone has jumped on the bandwagon.
- 7:20 They are all doing the same sort of thing - proxies that do some extra routing.
- 7:25 The service mesh is really about putting this between every call in a microservice.
- 7:30 In our old architectures we didn’t have that many services - a front end balancer, a big app, and a database.
- 7:40 As we have decomposed those monoliths, the communication overhead has increased a lot.
- 7:45 You can build a service mesh with HAProxy and NGINX - and we’ve talked to many people who have.
- 8:05 They have a bunch of configuration scripts or custom tools to help manage it.
- 8:15 We are building a more well-defined service mesh API so that you don’t need the hand-rolled scripts.
What’s the threshold for needing a service mesh?
- 9:10 We have seen people go to production with Linkerd, and configuring hundreds or thousands of these is non-trivial.
- 9:30 What we are trying to do is focus on incremental adoption, to make the service mesh something people want to use.
- 9:55 We are focussed on observability - is the service running, what are the response latencies and so on.
- 10:20 Once you get that nailed down in an organisation, you can start looking at adding more features or complexity later.
What is Conduit and how does it relate to Linkerd?
- 10:55 Linkerd is a generic tool, and when we started it we were in a different world.
- 11:20 It’s great for stitching together lots of different environments.
- 11:30 The flexibility is great for that kind of need.
- 11:40 As we have seen more adoption of Kubernetes specifically, running on the JVM means Linkerd takes a lot of resources as a sidecar process.
- 11:55 The smallest we can get Linkerd down to is a couple of hundred megs.
- 12:00 That’s small for a JVM but maybe quite large for a demo host.
- 12:10 We also saw that Kubernetes was a lot simpler, and offered primitives with its APIs that mean we don’t have to build an abstraction over the environment.
- 12:25 We wanted to have a more tightly integrated system with Kubernetes.
- 12:30 We wanted to tackle the resource costs at the same time.
- 12:40 We rewrote the proxy in Rust, and we have a control plane written in Go.
- 12:45 The other big lesson is that having to learn a configuration file format to get started is a negative - you just want to turn on and go.
- 13:00 With Conduit we don’t have a configuration file with the proxy - you can only interact with the control plane, which is a set of APIs.
- 13:10 When you start it in Kubernetes you get visibility and stats by default.
- 13:15 We will add more flexibility to the design later.
Conduit is open source software, right?
- 13:25 Buoyant only builds open source software, so yes.
- 13:30 Linkerd, namerd, and Conduit are all 100% open source.
There is a paved road for Netflix’s tools - same for Conduit?
- 14:00 Yes - we have some constraints about what types of application we support, which will be loosened over time.
- 14:05 We favour convention over configuration - if you do it a certain way it will just work, but if you want to do it a different way then you can override things.
What is a good fit for Conduit today?
- 14:20 Today, gRPC and HTTP are the supported protocols.
- 14:30 We will have general TCP protocol support in a future release.
How is it installed?
- 14:50 You install the Conduit controller with a kubectl apply.
- 15:00 It runs Prometheus to gather telemetry data.
- 15:10 You can run the Conduit proxy as a sidecar, injecting it into the pod configuration as you deploy it.
- 15:30 The way that works is to set up iptables routing so that all of the traffic goes through the proxy and we see it on the wire.
- 15:40 We are thinking of adding support for TLS so we can turn that on for you.
What is the difference between the control plane and the data plane?
- 16:10 The data plane is the communication between your services.
- 16:30 The control plane is the set of services and APIs into which we put all of the policy and configuration.
- 16:35 The control plane handles service discovery, routing rules, and timeouts.
- 17:00 In Linkerd we have a control plane as well - namerd, which handles route management and service discovery.
Why did you decide to implement the data plane in Rust and the control plane in Go?
- 17:40 We have been looking at Rust for some time.
- 18:00 In 2015 the Rust community was pretty small; there wasn’t even async I/O, which is important in a proxy.
- 18:10 Fast forward two and a half years, and the networking libraries are fully featured.
- 18:30 It was a good time to get involved, and we have hired a number of people from the Rust community to help drive that.
- 18:55 With Linkerd we found that people put code into the data plane, which made it harder to analyse the data plane’s performance.
- 19:05 What we want to allow people to do is add policy into the control plane, or modify the control plane, and Go allows easier adoption.
- 19:25 Having Go makes it easier for people to get involved and contribute.
What did you find about developing with Rust?
- 20:15 Rust’s borrow checker and memory model are very different (a minimal sketch follows this list).
- 20:20 I spent about two weeks banging my head against that.
- 20:30 Now I think about programs differently - how memory is laid out, what the execution flow is and so on.
- 20:50 Switching my mental model to be memory-first has improved my code.
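To make that shift concrete, here is a minimal, hypothetical Rust sketch (not Conduit code) of the ownership and borrowing rules the borrow checker enforces:

```rust
// Every value has exactly one owner; the compiler tracks who may
// read or mutate it, and frees it deterministically - no GC.

fn consume(s: String) {
    println!("owned: {}", s);
} // `s` is dropped (freed) here

fn inspect(s: &str) -> usize {
    s.len() // read-only access through a shared borrow
}

fn main() {
    let body = String::from("request body");
    consume(body); // ownership moves into `consume`...
    // println!("{}", body); // ...so this would not compile:
    //                       // error[E0382]: borrow of moved value

    let header = String::from("content-type");
    let len = inspect(&header); // a borrow does not move ownership
    println!("{} is {} bytes", header, len); // still usable here
}
```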
How can Java developers get their head around Rust?
- 10:30 I really like Rust’s type system.
- 10:30 Avoid writing OO in Rust and learn to love its type system (see the enum sketch after this list).
- 10:30 The best way to do that is to read other people’s code.
- 10:30 The other big hurdle was adopting Tokio, the async networking framework (also sketched after this list).
- 10:30 It has its own model as well, like the notification model.
- 10:30 Pair programming and reviewing are an important way to learn Rust.
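As an illustration of leaning on the type system rather than OO, here is a hypothetical sketch (the names are invented, not Conduit's): model a closed set of cases as an enum and let the compiler check that every match is exhaustive, instead of reaching for a class hierarchy with virtual dispatch.

```rust
// Hypothetical example: a proxy's routing decision as an enum.
// Adding a new variant forces every `match` over it to be
// updated, which the compiler enforces at build time.

enum RouteDecision {
    Forward { endpoint: String },
    Retry { attempts: u8 },
    Reject { status: u16 },
}

fn describe(decision: &RouteDecision) -> String {
    match decision {
        RouteDecision::Forward { endpoint } => format!("forward to {}", endpoint),
        RouteDecision::Retry { attempts } => format!("retry ({} attempts left)", attempts),
        RouteDecision::Reject { status } => format!("reject with HTTP {}", status),
        // No catch-all arm: exhaustiveness is the point.
    }
}

fn main() {
    let d = RouteDecision::Reject { status: 503 };
    println!("{}", describe(&d));
}
```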
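The notification model Oliver mentions comes from the futures 0.1 API that Tokio was built on at the time (later Tokio versions replaced it with std::future and Waker). A rough sketch, assuming the futures 0.1 crate: a future is polled, and if it returns NotReady it must arrange for its task to be notified so the executor knows to poll it again.

```rust
extern crate futures; // assumes futures = "0.1"

use futures::{task, Async, Future, Poll};

// A toy future that needs several polls before completing.
struct CountDown(u32);

impl Future for CountDown {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        if self.0 == 0 {
            Ok(Async::Ready(())) // done; the executor drops the task
        } else {
            self.0 -= 1;
            // Not ready yet: schedule a wake-up so the executor
            // polls us again. Real I/O futures are instead notified
            // by the reactor when a socket becomes readable/writable.
            task::current().notify();
            Ok(Async::NotReady)
        }
    }
}

fn main() {
    // `wait` drives the future to completion on the current thread.
    CountDown(3).wait().unwrap();
}
```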
What have you learnt over the last three years about the people side of service meshes?
- 23:15 We have seen different approaches to adopting service meshes.
- 23:20 It might be incremental, where there is an existing monolith application and they are taking small microservices out of it.
- 23:30 The other way is a greenfield approach where they turn on all the features with the new development.
- 24:00 The incremental approach has generally seen more success.
- 24:30 We see people get feature hungry and try to do everything at once.
- 24:50 The people making these decisions aren’t often the end users.
Does Conduit support tracing?
- 26:00 That would require application code changes, and we don’t want to be in a position where we require application code changes.
- 26:15 We will introduce tracing eventually but it’s not a first feature.
- 26:20 We have also put in a tap feature that allows you to put in a query and see results in real time.
- 26:50 We can sample requests that meet a certain pattern and route the results through, instead of having to log all requests to a central database for subsequent analysis.
- 27:15 It’s those kinds of tools that the service mesh unlocks, but we are focussing on the short term first.
So if you had some kind of correlation ID, you could use that feature to trace requests as they come in?
- 27:40 Yes, and we hope to have such tracing functionality out of the box in future.
- 27:50 It will probably be something like: if you have these headers then you get tracing for free.
- 28:20 You shouldn’t have to know the questions you are going to ask up front.
- 28:30 If you want to know the status code for requests from this IP to this path, you don’t want to have to add logging and redeploy to answer the question.
What do the latencies look like?
- 28:50 Going native is a whole new world from a JVM.
- 28:55 Our P99s are sub-millisecond.
- 29:00 We may still fall down for some tests but we are going to get there.
What does the overhead look like?
- 29:25 The first improvement is moving from a 100MB JVM to a 2MB proxy.
- 30:05 On the CPU side we have more still to do.
- 30:10 We want to be a small sidecar proxy.
What’s happening with Linkerd?
- 30:40 We have support contracts for Linkerd, and development continues.
- 30:45 However, we aren’t pushing as hard on feature work.
- 31:00 The focus has shifted as Linkerd has moved into production.
- 31:10 With Conduit we are focussed much more on the control plane.
- 31:20 I think you’ll see a lot more feature development happening in Conduit rather than Linkerd.
What does the Conduit roadmap look like?
- 31:30 In Q2 we are focussed on transparency and opening up the protocol support.
- 31:45 We are looking at security, to establish identity and cryptographic security by default.
- 32:15 The other thing we can do is make the telemetry rock solid.
- 32:20 We need to give people tools so that we only get the data that’s needed.