Matt Klein on Lyft’s Envoy, Including Edge Proxy, Service Mesh, & Potential AI Use Cases

On this podcast, Wes Reisz talks to Matt Klein about Envoy. Envoy is a modern, high-performance, small-footprint edge and service proxy. While it was originally developed at Lyft (and still drives much of their architecture), it is a fully open-source, community-driven project. On this podcast, Matt addresses what he sees as the major design goals of Envoy, answers questions about the performance impact of sidecar proxies, discusses observability, and thinks out loud about the future of Envoy.

Key Takeaways

  • Envoy’s goal is to abstract the network from application programmers. It’s really about helping application developers focus on building business logic and not on the application plumbing.
  • Envoy is a large, community-driven project, not a cohesive product that does one thing. It can be used as a foundational building block and extended into a variety of use cases, including as an edge proxy, as a service mesh sidecar, and as a substrate for building new products.
  • While there is a performance cost to using sidecar proxies, the rich feature set is often a worthwhile trade-off. With that said, work is being done that is greatly improving Envoy’s performance.
  • Envoy is built to run Lyft. There were no features that were in Envoy when it was open sourced that were not used at Lyft.
  • Envoy emits a rich set of metrics and logs and has a pluggable tracing system. Observability first is one of the main project goals.
  • Lyft deploys Envoy master twice per week.
  • Envoy’s roadmap includes work on automating settings (such as rate limits and retries), a focus on ease of operation (for example, reporting where requests were routed and what the internal timings were), and additional protocol support, such as Kafka.

Why do you think Envoy resonated with the QCon NY audience so well?

  • 01:50 It’s been an amazing journey - when we open-sourced Envoy, I thought we were solving a lot of problems that people in the industry had.
  • 02:00 The reaction that we’ve had since open sourcing two years ago has exceeded my wildest imagination.
  • 02:20 We have a lot of people who are rolling out microservice architectures, and there are a lot of common problems that aren’t easy to solve.
  • 02:30 Envoy and the technologies currently being built around it are by no means perfect, but they go a long way to help people roll out these architectures.

Envoy came out in September 2016. What was Lyft’s environment back then?

  • 02:55 We started development of Envoy in May 2015, and at that time Lyft was a typical hyper-growth startup with a monolithic PHP application with a MongoDB backend.
  • 03:20 When I joined there were 70-80 developers - now it has grown to 10x that.
  • 03:35 Lyft had started its microservice journey, with 10-20 services written in Python.
  • 03:45 Lyft was a polyglot organisation: we had microservices written in PHP and Python, plans to bring on Go or Java, and NodeJS for the front-end.
  • 04:10 We were in an environment where people wanted to roll out their own services, but wanted to surface observability in a cross-language manner.
  • 04:20 We were in a situation where the microservice roll-out had stalled - people wanted to do it, but there were reliability and debugging problems (particularly from a networking perspective) such that they didn’t trust microservices.
  • 04:45 I had seen this evolution at Twitter and at Amazon, and I knew there was a potential solution: running a proxy software agent (not a multi-language library) to abstract the network.

What’s the difference between Envoy and Finagle?

  • 05:30 Two things - polyglot is a big one for me, just because these libraries become complicated.
  • 05:40 You either have to write it in multiple languages, or you have to pick one language and then export to the other languages.
  • 05:55 Even if you have a single language, people underestimate the library upgrade problem.
  • 06:00 At Twitter, as libraries like Finagle are developed, it may take half a year before you can get the latest library in all the microservices.
  • 06:00 If you’re trying to do rapid development, or if you are trying to roll-out load balancing changes or tracing, which might require making a change to every service at once, having to deal with the upgrade cycle can be complicated.
  • 06:40 So it’s the combination of library upgrade pain and polyglot services together that made the out-of-process architecture a no-brainer.

How would you define Envoy?

  • 07:30 The goal of Envoy is to abstract the network from application programmers.
  • 07:40 We’re building big internet applications at this point; most organisations spend a considerable amount of time writing non-business logic.
  • 07:55 Between monitoring, scaling, and marshalling and unmarshalling, there are so many concerns.
  • 08:10 Any time spent on that is wasted money.
  • 08:25 So we’re trying to abstract things from the application programmers to allow them to focus on writing the business logic that makes the company money.
  • 08:35 From an Envoy perspective, we’re doing what it takes with networking and observability, and removing the plumbing so that application programmers don’t need to worry about it.
  • 09:00 It includes load balancing, routing and other things - but that’s where we draw that line.
  • 09:10 We’ve been very clear about what the core of Envoy is, and we provide a rich set of extensibility points for filters, access logging, tracers, and stats backends.
  • 09:35 That allows users to know what’s in the core code, and take what’s important to them and build up their own system.
  • 09:50 Envoy is a complex piece of software, and does a lot of things.
  • 10:30 From a documentation perspective, it can be difficult to satisfy end users as well as core developers.
  • 10:45 Envoy, we believe, can be a powerful building block for building other systems.
  • 11:05 It is becoming fairly ubiquitous, and it’s being bundled as part of a number of other products.

Is there a one-to-one deployment of Envoy and containers?

  • 11:30 It can be run in a number of different ways.
  • 11:35 It can be run as an edge proxy at the internet edge, or as a firewall, rate-limiting or traffic-shaping proxy.
  • 11:50 It can be run as a sidecar process alongside each service, proxying server-to-server traffic (see the sketch after this list).
  • 12:00 It’s possible to run Envoy on a kubernetes host, connecting to all the pods.
  • 12:10 It can be run in an internal network or center proxy configuration.
  • 12:20 We run a number of these kinds of deployments at Lyft; we run it at the edge, as a sidecar proxy, and in a couple of other proxy deployments - it’s a versatile piece of software.
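
As a rough illustration of the sidecar model described above, here is a minimal sketch (in Python) of how an application talks only to its local Envoy and names the logical service it wants. The port number and the "users" service name are hypothetical; the actual routing, discovery and load balancing live in the Envoy configuration, not in the application.

```python
# Minimal sketch of the sidecar pattern: the application sends every
# service-to-service request to its local Envoy listener instead of
# addressing remote hosts directly. The port (8080) and the "users"
# service name are hypothetical; the real values come from the Envoy
# configuration, not from this code.
import json
import urllib.request

SIDECAR = "http://127.0.0.1:8080"  # local Envoy egress listener

def get_user(user_id: str) -> dict:
    # The app only names the logical service via the Host header;
    # Envoy resolves it to healthy upstream hosts and load balances.
    req = urllib.request.Request(
        f"{SIDECAR}/v1/users/{user_id}",
        headers={"Host": "users"},
    )
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(get_user("42"))
```

The point of the pattern is that the proxy hop handles service discovery, load balancing, retries and stats, so the application code stays this small.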

What are the common entry points to using Envoy?

  • 12:35 One entry point is those wanting a flexible edge proxy.
  • 12:50 Envoy has a couple of advantages over using the de-facto standard nginx.
  • 13:05 Another possibility is replacing a cloud’s load balancer with Envoy.
  • 13:20 We have people building service meshes, building their own control planes on top of Envoy.
  • 13:35 We have Istio [https://istio.io], which is built on top of Envoy.
  • 13:40 We’re seeing both edge and service-to-service.
  • 14:10 We’re not trying to build a singular product with Envoy.
  • 14:20 It’s a community driven project, with no enterprise solution or open core.
  • 14:35 The contributor growth has been amazing.
  • 14:45 We’re not going to build Envoy into a database, but when people come with reasonable features, the extension points let people build their own solutions.
  • 15:05 It’s a very rich piece of software that doesn’t just do one thing.
  • 15:15 It doesn’t just focus on a plug-and-play kubernetes experience, or on a UI for service routing.
  • 15:20 It can be used as a foundational building block as part of these products.
  • 15:30 It’s likely that in a couple of years, Envoy will be running in a large number of places, but most people won’t be aware that their products use it.

Is there a performance impact to using Envoy?

  • 16:30 It’s true that if you’re sending traffic on each hop through an out-of-process proxy, there is a cost to that.
  • 16:40 There are two separate things: firstly, for 99.99% of applications (in my opinion), performance is important, but they aren’t going to be highly loaded in production.
  • 17:25 When you look at what the proxy gives you - observability, service discovery, load balancing, rate limiting and so on - the advantages are worth it.
  • 18:10 The second point is - people are doing lots of work on performance.
  • 18:40 Envoy itself isn’t slow; it’s the context switches between the application and the kernel, and back into Envoy.
  • 19:10 I think a lot of the timings can be cut down a lot.
  • 19:20 Between the benefits of the system and the costs - and the fact that the costs will decrease over time - it isn’t a concern unless you’re counting single-digit microseconds.

How does the project configuration work?

  • 20:05 You can opt in to almost any feature - you can decide what type of load balancing you want to use, and there is a rich set of stats and logs.
  • 20:15 Our goal is to give people tools, not to mandate how those tools must be used.
  • 20:25 Some users who come to the project expect it to solve all the problems by default.
  • 20:35 There isn’t just one way of doing things, so at the Envoy level it’s impossible to provide a configuration or a product to do everything that they want.
  • 20:50 Envoy is opinionated in some ways, but it allows an incredible number of things to be configured.
  • 21:00 The flip side is that this makes it harder to use.
  • 21:05 We look for opinionated products to be built on top, which will hopefully tune those knobs and guide the user to better defaults for their deployments.
  • 21:15 As an example, if you have kubernetes, you can make a number of assumptions about how you can configure Envoy.
  • 21:30 The closer you are to how the software is deployed or provisioned, the more opinionated you can become.
  • 21:40 When people start using a PaaS, we can make things opinionated because we know how it will be deployed - but because Envoy can be used in many different types of deployments we can’t bake in defaults for everything.
  • 22:05 There’s also the idea that because Envoy handles timeouts, application developers don’t need to be aware of them.
  • 22:25 In reality, you still have to be aware of retries and timeouts because Envoy can’t know if it’s safe to retry a particular request.
  • 22:45 You can have rules for different request types to know if they are idempotent or not; but Envoy doesn’t make assumptions about idempotency.
  • 23:05 You’re never going to fully get away from some of these concepts, but you can get away from writing retry logic with exponential backoff or load balancing yourself (a sketch of that hand-written logic follows this list).
  • 23:25 Envoy doesn’t make these things disappear, but it does remove the worry of how they are implemented consistently.
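
For contrast, this is a hedged sketch of the kind of hand-written retry logic that Envoy can take over when it is configured at the proxy layer. The backoff values and the idempotency rule are illustrative, not Envoy's actual policy; the point is that the safety decision (is this request replayable?) remains with the application even when the mechanics move into the proxy.

```python
# Sketch of per-request retry logic that a proxy like Envoy can take
# over once configured. The backoff values and the idempotency check
# here are illustrative, not Envoy's actual policy.
import random
import time

IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def call_with_retries(send, method: str, max_attempts: int = 3):
    """send() performs one attempt and raises ConnectionError on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            # Only retry requests we know are safe to replay; the proxy
            # cannot guess this, which is why idempotency stays an
            # application-level concern even with Envoy in place.
            if method not in IDEMPOTENT_METHODS or attempt == max_attempts:
                raise
            # Exponential backoff with jitter: up to 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, 0.1 * 2 ** (attempt - 1)))
```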

Can you talk about Envoy’s design philosophy about observability and metrics?

  • 24:25 Envoy emits a lot of metrics.
  • 24:35 When we open-sourced Envoy, we were using it at Lyft, and we were using all those metrics.
  • 24:40 Of course, now there are lots of features that people have added that we don’t use at Lyft.
  • 24:45 Envoy grew to be a product that hundreds of developers at Lyft would use, so every stat that was added was there because we were trying to debug systems.
  • 25:00 One of the complications of using HAProxy or nginx is that it can be difficult to figure out what is going on, because of the lack of stats or tracing.
  • 25:25 Our philosophy with Envoy is to emit the kind of metrics that will help people run this system in production (a sketch of pulling these from the admin endpoint follows this list).
  • 25:35 From a logging perspective, we have a rich set of filters and sampling - because you can’t process 100% of proxy logs, for example - and the same goes for figuring out what tracing gets turned on.
  • 25:55 Our goal is observability first.
  • 26:05 We’re trying to abstract the network, but the network is unreliable - and when it breaks we have to know what’s broken.
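
As a small, hedged example of what "observability first" looks like in practice, the sketch below polls Envoy's admin /stats endpoint and picks out per-cluster upstream request counters. The admin port shown (9901) is just a commonly used example value; the real port is whatever the bootstrap configuration sets.

```python
# Sketch of pulling counters from Envoy's admin /stats endpoint.
# The admin port (9901 here) is whatever the bootstrap config sets;
# the stat names filtered at the bottom are examples of the per-cluster
# upstream request counters Envoy emits.
import urllib.request

ADMIN = "http://127.0.0.1:9901"

def read_stats() -> dict:
    with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
        text = resp.read().decode()
    stats = {}
    for line in text.splitlines():
        # Counter/gauge lines look like "name: value"; skip histograms.
        name, _, value = line.partition(": ")
        if value.isdigit():
            stats[name] = int(value)
    return stats

if __name__ == "__main__":
    for name, value in read_stats().items():
        if name.endswith(("upstream_rq_total", "upstream_rq_5xx")):
            print(name, value)
```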

What’s the story about tracing with Envoy?

  • 26:25 We have a pluggable tracing system, like our pluggable stats backend.
  • 26:45 When we open-sourced Envoy, we added Zipkin support.
  • 26:55 Envoy has its own tracing abstraction layer, so it can generate tracing data for different back-ends (the application’s part - propagating trace context headers - is sketched after this list).
  • 27:20 I think tracing can be difficult to use to match events.
  • 27:30 At Lyft, we have found that tracing is useful in non-firefighting debugging, looking into performance problems, understanding what is going on.
  • 27:45 Sometimes in distributed architectures, people don’t know which services are actually talking to them.
  • 27:50 It’s an amazing way to visualise what’s going on and to support debugging difficult problems.
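
A hedged sketch of the one piece the application still owns when the proxy generates the spans: copying the trace context headers from the inbound request onto any outbound calls, so the spans join into one trace. The header names shown are the Zipkin/B3 style; other tracing backends use different headers.

```python
# Sketch of trace-context propagation: the application copies these
# headers from the request it received onto the requests it makes, so
# that proxy-generated spans stitch into a single trace. The set shown
# is the Zipkin/B3 style; other tracers use different headers.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)

def propagate_trace_context(inbound_headers: dict) -> dict:
    """Return the headers to attach to outbound calls."""
    return {
        name: inbound_headers[name]
        for name in TRACE_HEADERS
        if name in inbound_headers
    }
```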

What does the roadmap look like?

  • 28:10 We have so much development now, I have a hard time keeping track.
  • 28:15 It’s also how we do releases; we do releases because we know people want them.
  • 28:30 They are arbitrary point-in-time releases, about every three months or so.
  • 28:35 Envoy master should be release candidate status all the time - Lyft deploys twice per week from master.
  • 28:55 For most of the major contributors, it’s important to keep a high quality at all times.
  • 29:00 It’s also why I have a hard time keeping track of releases - because we’re always running master.
  • 29:10 From a roadmap perspective; there’s other protocols that people want to add, like Kafka.
  • 29:20 There’s going to be more work on making things automatic.
  • 29:40 Whenever you set up a networking mesh infrastructure, it’s hard to set up defaults, like what’s the error budget.
  • 29:45 Things like how many retries should you allow, how do you set your circuit breakers and rate limits.
  • 29:50 There’s some interesting research and production use cases about building more automatic systems to do this.
  • 30:00 Netflix has recently open-sourced some code they use that does adaptive concurrency control.
  • 30:10 Instead of requiring the user to configure what the circuit breaker limit is or what the error rate is, Envoy would be able to dynamically determine what the throughput could be (a toy sketch of the idea follows this list).
  • 30:25 We’re going to look at better ease of operation, like better debugging facilities.
  • 30:30 You can ask Envoy to emit all sorts of debugging information with each request - where it got routed, what the internal timings were.
  • 30:40 We want to have more adaptive settings for concurrency timeouts.
  • 30:45 We want to add additional protocol support like Kafka - there’s a never-ending list of things.
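
To make the "more automatic" idea concrete, here is a toy sketch of latency-based adaptive concurrency limiting: instead of a hand-tuned circuit-breaker limit, the limit shrinks when observed latency inflates and grows when it does not. It is illustrative only and is not the Netflix library's or Envoy's actual algorithm; the constants are made up.

```python
# Toy sketch of adaptive concurrency limiting: derive the allowed
# concurrency from observed latency rather than a hand-tuned limit.
# Illustrative only; not the Netflix library's or Envoy's algorithm.
class AdaptiveLimit:
    def __init__(self, initial: int = 20, floor: int = 1, ceiling: int = 1000):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling
        self.min_rtt = None  # best latency seen, treated as "no queuing"

    def on_sample(self, rtt_seconds: float) -> int:
        if self.min_rtt is None or rtt_seconds < self.min_rtt:
            self.min_rtt = rtt_seconds
        # gradient < 1 means latency is inflating -> shrink the limit;
        # gradient near 1 means no queuing -> cautiously grow it.
        gradient = self.min_rtt / rtt_seconds
        if gradient < 0.9:
            self.limit = max(self.floor, int(self.limit * gradient))
        else:
            self.limit = min(self.ceiling, self.limit + 1)
        return self.limit
```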

How do you see machine learning affecting the networking layer?

  • 31:10 It’s not just machine learning - in particular it’s what happened in the last ten years with neural networks.
  • 31:20 Rule based systems were difficult to program.
  • 31:30 Envoy sits in the path of a huge amount of information, and it can emit logs, stats, and - in future - dumps of TCP data.
  • 32:00 You could imagine that data going to a machine learning system that builds a model on the fly.
  • 32:15 An anomaly detection system - whether for DDoS or a bad host - becomes possible by taking that output data and training a model on what’s normal (a simple sketch of the idea follows this list).
  • 32:45 If data coming in isn’t recognised as normal, it could be flagged for further investigation.
  • 33:10 From a security perspective, there’s alerting.
  • 33:20 You can imagine a next generation system which is taking information, figuring out what is happening, and then taking corrective action such as blocking a set of IPs.
  • 33:50 You could even look at traffic patterns between hosts, and maybe take a host out of rotation.
  • 34:05 If you look at state-of-the-art in the networking space, it’s a lot of manual configuration. We can do better in the future.
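
As a deliberately simple sketch of the anomaly-detection idea (a statistical baseline rather than a neural network, with made-up numbers), one could flag hosts whose error rate stands out from the rest of the fleet using per-host metrics a proxy like Envoy already emits.

```python
# Deliberately simple sketch: take per-host metrics that a proxy like
# Envoy already emits and flag hosts that look abnormal. A real system
# might train a model instead of a z-score test; the host names and
# error rates below are made up.
from statistics import mean, stdev

def flag_outlier_hosts(error_rates: dict, threshold: float = 2.0) -> list:
    """Return hosts whose error rate is more than `threshold` standard
    deviations above the fleet average."""
    values = list(error_rates.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [host for host, rate in error_rates.items()
            if (rate - mu) / sigma > threshold]

if __name__ == "__main__":
    rates = {f"host-{i}": r for i, r in enumerate(
        [0.010, 0.012, 0.011, 0.009, 0.010, 0.013, 0.011, 0.250])}
    print(flag_outlier_hosts(rates))  # the 0.25 host stands out
```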
