Oliver Gould on the Three Pillars of Service Mesh, SMI, and Making Technology Bets

Sep 20, 2019

In this podcast we sit down with Oliver Gould, co-founder and CTO of Buoyant. Oliver has a strong background in networking, architecture and observability, and worked on solving associated technical challenges at both Yahoo! and Twitter. Oliver is a regular presenter at cloud and infrastructure conferences, and alongside his co-founder William Morgan, you can often find them in the hallway track, waxing lyrical about service mesh -- a term they practically coined -- and trying to bring others along on the journey.

Service mesh technology is still young, and the ecosystem is still very much a work in progress, but there have been several recent interesting developments within this space. One of these was the announcement of the service mesh interface (SMI) at the recent KubeCon EU in Barcelona. The SMI spec seeks to unlock service mesh integrators and implementers, as it provides an abstraction that removes the need to bet on any single service mesh implementation. This can be good for both tool makers and enterprise early adopters. Many organisations, including Microsoft, HashiCorp, and Buoyant, are working alongside the community to help define the SMI.

In this podcast, we summarise the evolution of the service mesh concept, with a focus on the three pillars: visibility, security, and reliability. We explore the new traffic “tap” feature within Linkerd that allows near real-time in-situ querying of metrics, and discuss how to implement network security by leveraging primitives provided by Kubernetes, such as Service Accounts. We also discuss how reliability features, such as retries, timeouts, and circuit-breakers, are becoming table stakes for infrastructure platforms.

We also cover the evolution of the service mesh interface, explore how service mesh may impact development and platforms in the future, and briefly discuss some of the benefits offered by the Rust language in relation to building a data plane for Linkerd. We conclude the podcast with a discussion of the importance of community building.

Key Takeaways

A well-implemented service mesh can make a distributed software system more observable. Linkerd 2.0 supports both the emitting of mesh telemetry for offline analysis, and also the ability to “tap” communications and make queries dynamically against the data. The Linkerd UI currently makes use of the tap functionality.
Linkerd aims to make the implementation of secure service-to-service communication easy, and it does this by leveraging existing Kubernetes primitives. For example, Service Accounts are used to bootstrap the notion of identity, which in turn is used as a basis for Linkerd’s mTLS implementation.
Offering reliability is “table stakes” for any service mesh. A service mesh should make it easy for platform owners to offer fundamental service-to-service communication reliability to application owners.
The future of software development platforms may move (back) to more PaaS-like offerings. Kubernetes-based function as a service (FaaS) frameworks like OpenFaaS and Knative are providing interesting features in this space. A service mesh may provide some of the glue for this type of platform.
Working on the service mesh interface (SMI) specification allowed the Buoyant team to sit down with other community members like HashiCorp and Microsoft, and share ideas and identify commonality between existing service mesh implementations.

Show Notes

What have you been up to over the last year?

02:45 About a year ago, we were just about to launch Linkerd 2 officially, so it's been a wild ride over the last year.

What's new in Linkerd 2.5?

03:15 As with all Linkerd releases we are reacting to customer feedback from previous releases.
03:25 We have observability features, Prometheus support, Tap (an on-demand diagnostic and tapping feature).
03:35 We want to make these securable, with a secure data plane with TLS between proxies, and we want to start doing that in the observability plane as well.
03:50 We've started to secure the Tap communications so that an unauthorised user cannot start to access Tap requests.
04:00 We've been doubling down the feature set and making it production ready for those that are using it in production.

Can we talk about visibility, reliability and security?

04:35 Observability is our starting point - as you go into Kubernetes, things are harder to understand over the nodes and containers.
04:45 We need to make the network debuggable - if you're moving to a microservices architecture with a service mesh and it's not more debuggable, it's going to be awful to maintain.
05:00 One of the main reasons to use a service mesh is to make it visible.
05:10 Once you have that, you can start to do a lot more interesting features - but you can't, for example, talk about securing something until you can talk about how you measure and audit that.
05:20 Starting with Prometheus out of the box was a big thing we learned from Conduit.
05:25 We tie labels in through the Kubernetes metadata so that we can do arbitrary Prometheus queries about the traffic flowing in the Kubernetes nodes and resources.
05:35 We've written these new features like Tap where we can on-demand request metadata from the proxies, instead of using logging and Splunk.
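To make the idea of querying labelled traffic metrics concrete, here is a minimal sketch of the kind of aggregation such a query performs: grouping request counts by a Kubernetes-derived label and computing a success rate. The label names (`namespace`, `deployment`, `classification`) and the sample values are assumptions made up for illustration, not Linkerd's actual metric schema.

```python
from collections import defaultdict

# Hypothetical samples: (labels, request count). Names and values are illustrative only.
SAMPLES = [
    ({"namespace": "shop", "deployment": "web", "classification": "success"}, 980),
    ({"namespace": "shop", "deployment": "web", "classification": "failure"}, 20),
    ({"namespace": "shop", "deployment": "cart", "classification": "success"}, 450),
    ({"namespace": "shop", "deployment": "cart", "classification": "failure"}, 50),
]


def success_rate_by(samples, group_label):
    """Sum request counts per value of group_label and return the success ratio."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for labels, count in samples:
        key = labels[group_label]
        totals[key] += count
        if labels["classification"] == "success":
            successes[key] += count
    # Skip groups with no traffic to avoid dividing by zero.
    return {key: successes[key] / totals[key] for key in totals if totals[key] > 0}


rates = success_rate_by(SAMPLES, "deployment")
```

Because the grouping label is a parameter, the same data can be sliced by namespace or any other attached label, which is the point of tying metrics to Kubernetes metadata.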

What's the UX of Tap?

06:00 The Linkerd UI uses Tap in ways you might not know; there's a top endpoints dashboard which shows success rate per endpoint; that all uses Tap data.
06:20 There are some ways that we surface that, but Tap is also available as a CLI command which can dump a stream of text that you can run ad-hoc queries against.
06:30 There are certainly things that we could do better, if a GSoC [Google Summer of Code] student wants to take that on next year.
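As a rough illustration of what an ad-hoc query over a dumped text stream could look like, the sketch below filters a stream of events for failing responses. The `key=value` line format and the field names (`type`, `path`, `status`, `latency_ms`) are hypothetical and chosen only for this example; they are not the real Tap output format.

```python
import io


def parse_event(line):
    """Split a line of whitespace-separated key=value tokens into a dict."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def failing_paths(stream, min_status=500):
    """Yield (path, status) for every event whose status is at least min_status."""
    for line in stream:
        event = parse_event(line)
        status = event.get("status")
        if status is not None and status.isdigit() and int(status) >= min_status:
            yield event.get("path", "?"), int(status)


# Hypothetical sample stream standing in for text piped from a CLI.
SAMPLE = io.StringIO(
    "type=rsp path=/cart status=200 latency_ms=12\n"
    "type=rsp path=/checkout status=503 latency_ms=840\n"
    "type=rsp path=/cart status=500 latency_ms=31\n"
)
results = list(failing_paths(SAMPLE))
```

In practice the stream would come from standard input rather than an in-memory buffer; `failing_paths` accepts any iterable of lines.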

What about TLS?

06:50 A lot of what we try to do is not to introduce more things to think about.
06:55 In the same way that the visibility is on by default, you don't need to do anything else apart from adding an annotation to your pod.
07:05 TLS needs to work the same way, where you aren't thinking about SNI and how to name things.
07:20 Our goal is to tie it into the Kubernetes identity - we didn't want to have to do identity management - and that allows us to use service tokens to bootstrap identity.
07:35 We generate TLS certificates inside the pods, so the keys never leave the pods.
07:40 It's quite a nice model; it deals with HTTP traffic to a Kubernetes service, where you are using Linkerd on both sides of that connection.
07:50 In Linkerd 2.7 we are working towards making that ubiquitous across all TCP connections, as long as Linkerd is on both sides of that connection.
08:00 That will make it easier to validate and make auditing decisions.
08:05 We're less focused on the authorization, which is the policy side of that - that's part of the SMI [Service Mesh Interface] spec we've been working on with other folks.
08:15 Ultimately I see that as a later need, where the first need is to understand what's going on in the network, with trusted identity and knowledge that it is secure - if you don't have that, you can't do authorization effectively.
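The relationship between platform-provided identity and auditing can be sketched as follows. This is a simplified model, not Linkerd's implementation: the `WorkloadIdentity` type, the naming scheme in `name()`, and the `audit_connection` helper are all hypothetical, and certificate issuance is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkloadIdentity:
    """Identity derived from a Kubernetes service account and its namespace."""
    service_account: str
    namespace: str

    def name(self, trust_domain="cluster.local"):
        # Hypothetical naming scheme, used here for illustration only.
        return f"{self.service_account}.{self.namespace}.serviceaccount.{trust_domain}"


def audit_connection(client: Optional[WorkloadIdentity],
                     server: Optional[WorkloadIdentity],
                     tls_established: bool):
    """Build an audit record; 'secured' requires TLS and a known identity on both sides."""
    secured = bool(tls_established and client is not None and server is not None)
    return {
        "client": client.name() if client else "unknown",
        "server": server.name() if server else "unknown",
        "secured": secured,
    }


web = WorkloadIdentity("web", "shop")
cart = WorkloadIdentity("cart", "shop")
record = audit_connection(web, cart, tls_established=True)
```

Only once records like this exist for every connection does it become practical to layer authorization policy on top, which matches the ordering described above.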

Could Service Mesh Interface and Open Policy Agent come together in the future?

08:40 There's a talk in the works with Linkerd and Gatekeeper integration, which does some OPA policy as you deploy.
08:55 There are a lot of things active in that space, and there seems to be a lot of need in organisations.

Is there anything new for reliability?

09:20 There are improvements; we're improving our load balancers incrementally, for instance - but there are no huge breakthroughs in that space.
09:30 Where it's really important is that the mesh sits between the platform owner and the application owner; having reliability in the platform owner's space is really important.
09:50 The mesh proxy sits in the platform owner's control; it's part of the offering of the platform.
10:00 If you're offering Kubernetes as a service to your customers, they don't want to worry about service discoverability or retries - you want that to be built into the system.
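To show what "built into the system" spares application owners from writing, here is a minimal sketch of retries with exponential backoff under an overall deadline. It is a generic illustration of the technique, not how Linkerd's proxy implements it; the function name, parameters, and the `RetriesExhausted` exception are assumptions for this example.

```python
import time


class RetriesExhausted(Exception):
    """Raised when every attempt failed or the overall deadline passed."""


def call_with_retries(operation, max_attempts=3, deadline_s=1.0, backoff_s=0.05,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation(), retrying on ConnectionError until attempts or deadline run out."""
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Stop early if the caller's overall time budget is already spent.
        if clock() - start >= deadline_s:
            break
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            if attempt < max_attempts:
                # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
                sleep(backoff_s * 2 ** (attempt - 1))
    raise RetriesExhausted("retries exhausted or deadline passed") from last_error
```

The `sleep` and `clock` parameters are injectable so the behaviour can be tested without real delays. When a mesh proxy provides this behaviour, none of this logic has to live in each service.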

Is Linkerd looking at a universal data plane?

10:45 No, we see a universal data plane as the inverse of SMI: SMI is upward-focussed, for consumers of a service mesh, whereas a universal data plane is for those implementing a service mesh.
11:00 Our goal for Linkerd is not to be generic across proxies; we want to be in control there, and we do different things than other data planes, like automatic protocol detection and Prometheus stats.
11:20 Having tight control of being able to iterate the product together instead of having barriers is really important to us.
11:30 If you look at how other service meshes have evolved, they are dealing with friction between those pieces.
11:35 For example, Istio is spending a lot of time re-writing metrics for Envoy or dealing with a lot of latency issues.
11:45 The decoupling is not an important piece for us.
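Automatic protocol detection, mentioned above, can be illustrated by inspecting the first bytes a client sends. The sketch below is a simplified assumption of how such sniffing might work, not Linkerd's actual logic: it recognises the HTTP/2 connection preface and common HTTP/1 request methods, and treats everything else as opaque TCP. It does not handle partially received prefixes.

```python
# The HTTP/2 client connection preface, as defined by the HTTP/2 specification.
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# Request-line prefixes for common HTTP/1 methods.
HTTP1_METHODS = (
    b"GET ", b"HEAD ", b"POST ", b"PUT ", b"DELETE ",
    b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ",
)


def detect_protocol(first_bytes):
    """Classify a connection from its first bytes; unknown traffic is opaque TCP."""
    if first_bytes.startswith(HTTP2_PREFACE):
        return "http/2"
    if first_bytes.startswith(HTTP1_METHODS):
        return "http/1"
    return "opaque-tcp"
```

For example, `detect_protocol(b"GET / HTTP/1.1\r\n")` classifies the connection as HTTP/1, while a TLS handshake record falls through to the opaque TCP case.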

Getting feedback from production use cases is a critical step in development.

12:10 We have a specific philosophy for not only product development, but software engineering in general.
12:15 I come from the BSD/Unix backgrounds of "worse is better".
12:25 We're focussing on making things that are usable rather than fully specified and correct.
12:30 Rather than starting with standards and building up from there, our goal is to find something that is going to scratch an itch in our workflow and make it so that you don't have to solve problems for yourself.
12:40 There's so much in this space where it's easy to add complexity; if we can add a couple of tools that pare back that complexity and allow you to see your system and operate a little better, then we focus on that and expand it.
12:50 That's Linkerd's philosophy; start with that kernel of usefulness (like observability) and then grow that until we can do all the things that you need.

Are you doing any work in being able to support multiple clusters?

13:30 I don't think it's just big enterprises that care about this; I think many companies are using multiple clusters.
13:40 I think it's top of mind for everyone at the moment: how we deal with federation and multi-cluster.
13:50 My philosophy is that I want to build on top of Kubernetes, rather than being something that abstracts over it.
14:00 I'm hoping we can lean on Kubernetes SIGs to solve a bunch of these problems, like replicating data sets of pods across clusters and making them discoverable via DNS.
14:10 That doesn't have to be solved with Linkerd; in fact, the more things that are Kubernetes standards (like service tokens) that we can integrate with, the better.

Is Linkerd looking at event-driven or message-driven protocols?

14:35 We've been doing a lot of infrastructure work to get the frameworks to be able to add these things.
14:40 In the Rust ecosystem, we've been working on Tokio and Tower, which are a manifestation of a lot of that work.
14:50 We've been able to move the proxy in that direction in the last few months.
14:55 We're starting that work by adding TLS into everything, and getting identity into non-HTTP protocols before we go and expand the protocol surface area.
15:05 That's definitely on the road map; we have a GSoC student who has prototyped a Kafka codec into the proxy.

What's your take on integration between platforms and other technologies like enterprise service busses?

15:50 My view is that Kubernetes is eating the world; we need Kubernetes to learn how to talk to these systems in primitive ways, instead of the service mesh incurring those costs.
16:00 Some of the other service meshes are building abstractions over other schedulers and environments; we really do not want to do that.
16:20 We are building everything on Kubernetes, and we shouldn't be making a bunch of core infrastructure decisions ourselves; we should just let Kubernetes handle that.

What are the biggest pain points for enterprises adopting Kubernetes and other service meshes?

16:35 The people - the management and teams and how they structure these projects.
16:45 We see a lot of projects being successful where they move one team over at a time.
16:50 You need to be in a world where you are comfortable running bi-modally, with new projects and old projects co-existing at the same time.
16:55 Some companies are much better at mandating moving, usually where you have strong CI/CD practices and users are decoupled from the deploy process.
17:10 Everyone is taking small steps in this space; I have some friends at Twitter who, as part of their Kubernetes migration, were going to do it quickly, but they are doing a lot of work to get there.
17:25 I think we're figuring out what mature Kubernetes looks like at this time.

Do you think that a typical (infrastructure) platform is too complex for developers these days?

17:50 I'm sure we will move to a simpler deployment process in the future.
17:55 We have serverless things that are floating through now.
18:00 I'm sure there will be abstractions that will build up; we've torn down the platforms and said that PaaS is bad.
18:15 It's the same cycle over again.
18:20 What I hope is that we find that there are a set of things that are custom in everyone's organisation which shouldn't be commoditised into infrastructure.
18:30 However, so much of the infrastructure can be pulled out of the organisation domain and doesn't need to be an organisational decision, like monitoring.

What role could Linkerd play in a serverless world?

18:50 The data plane is there to light it up with visibility, security and reliability.
19:00 I don't think Linkerd is the end of the road in terms of functionality that you need to build up your platform, but it is a vital component.

What's your take on the Service Mesh Interface initiative?

19:20 When it was first mentioned to me, I didn't understand who it was for.
19:30 As I've talked to a lot more people who cared about it, it's dawned on me as to why it's useful.
19:35 Many of us have been working on service meshes for a couple of years, and building opinions and APIs in the process.
19:40 SMI was a chance for us to sit down with HashiCorp and Microsoft and find out what we've done in common.
19:45 We've all done traffic splitting; if we were going to create an API for this in Kubernetes, what would it look like?
19:55 We did a similar exercise with authorisation; there are more iterations that we need to do on these things.
20:05 That's how standards get written; you implement it, play around with it, and then get feedback from the users.
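The core behaviour behind a traffic-splitting API can be sketched as a weighted choice between backends. The example below is a generic illustration under assumed names (`pick_backend`, and the backends `web-v1` and `web-v2`); it is not the SMI specification or any mesh's implementation.

```python
import random


def pick_backend(backends, rng=random):
    """Choose a backend name with probability proportional to its weight.

    backends maps a backend name to a non-negative integer weight.
    """
    names = list(backends)
    weights = [backends[name] for name in names]
    if sum(weights) <= 0:
        raise ValueError("at least one backend needs a positive weight")
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical split: roughly 90% of requests to the stable version, 10% to the new one.
SPLIT = {"web-v1": 90, "web-v2": 10}
chosen = pick_backend(SPLIT, rng=random.Random(0))
```

Expressing the split as plain data is what makes it a candidate for a shared API: tools can edit the weights without knowing which mesh enforces them.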

What's the tradeoff between vendor-driven and user-driven specs?

20:35 We were in the position to make an easy decision, especially on traffic split where there wasn't a lot of discovery to do - we put a prototype together and asked what people thought of it.
20:55 You can't just ask people what they want; they'll just ask for a faster horse.
21:00 All the SMI things will iterate and progress versions.
21:10 A lot of people will think they have to pick one to bet on.
21:20 You're either going to bet on marketing dollars or take a wait and see approach.
21:25 What SMI does is say that you don't have to wait on a particular vendor; if you want to do CI/CD, then build on the traffic split API and trust that whatever service mesh you end up with will support it.
21:40 That is the way of assuring users that there is a path forward.
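As an illustration of what CI/CD tooling built on a traffic split might do, the sketch below computes the next pair of weights for a canary rollout from an observed success rate. The function name, the step size, and the threshold are assumptions chosen for this example; a real pipeline would also need to write the weights back through whatever API the mesh exposes.

```python
def next_weights(canary_weight, success_rate, step=10, threshold=0.99):
    """Return (stable_weight, canary_weight) for the next rollout step.

    Weights are percentages that sum to 100. If the canary's success rate
    falls below the threshold, all traffic is sent back to the stable version.
    """
    if success_rate < threshold:
        return 100, 0
    new_canary = min(100, canary_weight + step)
    return 100 - new_canary, new_canary


# Hypothetical rollout: three healthy observations followed by a failing one.
weight = 0
history = []
for observed in (1.0, 0.995, 0.999, 0.90):
    stable, weight = next_weights(weight, observed)
    history.append((stable, weight))
```

Because the logic only reads a success rate and emits weights, it does not depend on which service mesh ends up enforcing the split.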

Do you think there's a potential for SMI to get stuck on the lowest common denominator?

22:10 [Kubernetes] Ingress will rev, so there will be another API - and to a point, it's been very successful.
22:20 As an implementor, I found it very frustrating to write ingress implementations, but that's a different story.
22:30 It would be a problem if SMI was afraid to rev its revisions and release next versions.
22:40 The problem with Ingress was that it got lodged and then there was no more future road map.
22:45 All of these things are never finalised.
22:50 Traffic split we're good with, authorisation will have a further roadmap, and we'll see what other APIs come out in the future.

What are you personally most excited about working on now?

23:20 One of the things I really enjoy is making technical bets, which may pay off in the future.
23:30 There have been times in the process [of using Rust] that I felt like everyone who had told me there would be problems was right.
23:40 Then you hit a milestone and can look back at the progress and the technical decisions you have made, which enable us to move more freely now.
23:55 When looking back to when we released Linkerd 2 and now seeing the excitement when we have a new release - that long term growth is what motivates me, not the short term.

Building community takes a lot of effort.

24:20 I'm talking to people at 7 am when I wake up because many people are in different timezones.
24:30 We're here to make people successful, and a lot of time it's just listening to them and their problems, and wanting to talk to them about solutions.
24:40 That's what I like to do with my team here at Buoyant, and with people all over the world.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.