Stefan Prodan on Progressive Delivery, Flagger, and GitOps

Jul 28, 2020

In this podcast, Stefan Prodan, developer experience engineer at Weaveworks and creator of the Flagger project, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: how progressive delivery extends the core ideas of continuous delivery; how the open source Flagger Kubernetes operator can be used to implement a progressive delivery strategy via canary releasing with an API gateway or service mesh; and the new “GitOps toolkit” that has evolved from the Flux continuous delivery operator.

Key Takeaways

Progressive delivery is a term used to describe the incremental rollout of any changes within a system that optimises for reducing risk and limiting “blast radius” of any negative outcomes.
Progressive delivery uses techniques such as canarying, feature flags, and A/B testing at scale. Being able to observe the system is a prerequisite for progressive delivery.
Flagger is an open source add-on to the Kubernetes API (an “operator”) that augments the service deployment and release machinery. Flagger adds more strategies and options for releasing functionality, such as canary releasing and A/B testing.
Flagger supports the incremental release of functionality via both API gateways and service meshes. This covers the ingress (north-south) and service-to-service (east-west) use cases, respectively. Tests can be run automatically upon deployment of a new service, and synthetic test traffic can be generated and evaluated.
Weaveworks is evolving Flux, a CNCF-hosted continuous delivery operator for Kubernetes, into a “GitOps toolkit”. This toolkit defines clear abstractions and interfaces for each stage of a typical continuous/progressive delivery pipeline, and allows community or custom modules to be composed into a complete toolchain.

Transcript

00:01 Daniel Bryant: Hello, and welcome to The InfoQ Podcast. I'm Daniel Bryant, news manager here at InfoQ and product architect at Datawire. And I recently had the pleasure of sitting down with Stefan Prodan, developer experience engineer at Weaveworks and creator of the Flagger project. Stefan has been doing a lot of work in the continuous delivery and progressive delivery spaces for quite some time now. And so I was keen to tap into his experiences here. I wanted to understand the problem space for something like progressive delivery and also explore how tooling like Flagger and Flux fit in. I was keen to pick Stefan's brains around the role that API gateways and service meshes play in this domain and hear his experiences with running Canary releases, testing APIs, and integrating this within delivery pipelines. Hello, Stefan, and welcome to The InfoQ Podcast. Thanks for joining me today.

00:45 Stefan Prodan: Hi, Daniel. Thanks for having me. I am very happy to be here.

00:49 Introductions

00:49 Bryant: Could you briefly introduce yourself for the listeners, please?

00:51 Prodan: Hi everyone. I am Stefan Prodan. I work for Weaveworks. I'm part of the developer experience team. I'm really an engineer. I also do things around community and open source here at Weaveworks. I have mainly worked on open source projects for almost three years now, and I am happy to talk today about the open source projects that I'm contributing to.

01:11 Could you briefly explain the problem space where the Flagger progressive delivery tool sits in?

01:11 Bryant: Awesome stuff. So off mic, we've already chatted about Flagger and Flux. We're definitely going to cover both of those. I was most interested in the Flagger tool first up. So Flagger is being marketed as a progressive delivery operator for Kubernetes and it's just seen its 1.0 launch. Super exciting. You and I were chatting about that on Twitter. Could you briefly explain the problem space that Flagger sits in?

01:32 Prodan: Yes. So Kubernetes adds a lot to deployment machinery. When, let's say, you want to release a new version of your app on a production cluster, Kubernetes offers you a couple of deployment strategies for how you can do that safely. You can think of Flagger as an add-on to the Kubernetes API and to the deployment machinery that adds more strategies, adds more options, for how you can do deployments. For example, you can use public traffic to test your application, or you can trigger some tests and make sure those tests are passing in your production system before you expose your new version to the end users. So it's kind of an extension to the deployment part.

02:15 Bryant: Very nice. So looking on the website, the Flagger website, it was saying safer releases, flexible traffic routing, and extensible validation were the key things. What do you mean by those three things?

02:26 Prodan: So Flagger, it's an operator. What an operator means in Kubernetes land is that you run it inside your cluster and then you control it through the Kubernetes API. So you have a thing called a custom resource, which, let's say, is a YAML file that you can place in a Git repository and that you can apply with kubectl. And that resource in Flagger is called Canary. So you can define a policy on how you want to do this type of deployment. And even if the resource is called Canary, you can actually do more than just Canary deployments. You can also do things like A/B testing or normal Blue/Green deployment and so on.

03:08 Prodan: Now for the observability part, when you deploy a new version of your app, you want to look at how it behaves in your production cluster. And what Flagger lets you define is a way to progressively shift traffic towards the new version. And what you can also define in this deployment policy is what kind of metrics and what kind of KPIs you are interested in.

03:35 Prodan: For example, the easiest example I could give is you deploy a new version and you want the error rate to be under a certain threshold. Let's say if my application returns 500 errors for more than 1% of the total traffic, then maybe it's not a good thing to deploy it in production. So Flagger can do an automatic rollback based on that KPI that you have defined.

04:01 Prodan: Now, defining KPIs is not something that's fixed in a way. Like, okay, Flagger comes with two built-in KPIs. One is error rate and the other one is latency. So we can say if my new version's average response time is above, let's say, half a second, then my app is too slow. I'm going to break some SLA and so on. I don't want that to end up in production, but usually you have specific KPIs for each application that you develop. So the extensibility part is where you can define your own KPIs, your own metrics, and Flagger will take those into account when doing the analysis of the deployment.
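
The threshold idea described here can be sketched in a few lines of Python. This is only an illustration of the concept, not Flagger's implementation: the `MetricCheck` type, the metric names, and the connection limit of 100 are all hypothetical, while the 1% error rate and half-second latency come from the examples in the conversation. Treating a missing sample as a failure is a design choice of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MetricCheck:
    """One KPI: a name plus a predicate saying whether a sample is acceptable."""
    name: str
    is_ok: Callable[[float], bool]


def failed_checks(checks: List[MetricCheck], samples: Dict[str, float]) -> List[str]:
    """Return the names of checks whose sample is missing or out of range."""
    failed = []
    for check in checks:
        value = samples.get(check.name)
        if value is None or not check.is_ok(value):
            failed.append(check.name)
    return failed


# Hypothetical checks: two generic ones plus one application-specific KPI.
checks = [
    MetricCheck("error_rate_percent", lambda v: v <= 1.0),    # at most 1% errors
    MetricCheck("avg_latency_seconds", lambda v: v <= 0.5),   # at most half a second
    MetricCheck("open_db_connections", lambda v: v <= 100),   # custom KPI (assumed limit)
]

samples = {"error_rate_percent": 0.4, "avg_latency_seconds": 0.8, "open_db_connections": 12}
print(failed_checks(checks, samples))  # ['avg_latency_seconds']
```

A non-empty result is what would count against the new version in an analysis like the one Stefan describes; an empty list means every KPI is within bounds.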

04:42 Bryant: Very nice, Stefan. So, as an example here, I might have in my application code, let's say, average basket size, like if I was doing an eCommerce app. And if I deployed something and the basket size massively dropped, so people couldn't add products to their basket, then I want to roll back my latest release. So Flagger would be perfect for that. I could define that metric, basket size, say, and have Flagger watch as I roll out my new version of the app, so that measure doesn't fall below a critical threshold. Is that right?

05:09 Prodan: Yeah. We can imagine many, many use cases for doing your own KPIs. In a way, multiple teams should collaborate when defining those KPIs. So for example, let's say the engineering team will have its own KPIs, like, okay, success rate, latency, maybe how many connections are open to the database, right? If the new version opens one connection per request, that could create a bottleneck at the database level. So those, let's say, are more on the technical side of the KPIs, but you can also have this kind of metric check coming from the business side, like you said, with the basket or how many clicks are on a particular button or banner, stuff like that, right? So we can also get into the A/B testing land, where you just want to test things and compare them.

05:57 Who is the primary audience for Flagger? Is it developers? Is it operators, or a mix of the two?

05:57 Bryant: That actually leads perfectly to my next question here. Because I was going to say, who is the primary audience for Flagger? Is it developers? Is it operators or a mix of the two perhaps?

06:06 Prodan: Yeah, I think it's a mix of the two. And also maybe marketing could be involved in this process or for large e-companies that have dedicated QA teams. All these teams could define their own metric checks, their own KPIs and define an SLA for the app from multiple perspectives, like technical perspectives, impact to the end user and so on.

06:29 Prodan: But primarily I think Flagger is intended to be used at the operations level because it also interacts with service meshes and other components in your cluster. So you have to have an understanding of what your underlying routing provider is. What proxies are you using? So you make the right decision on how you install Flagger for what kind of provider, and then you can collaborate with other teams to define the KPIs.

06:54 Does Flagger support both the release use cases at the edge and within the service graph?

06:54 Bryant: And that actually leads perfectly onto the next thing I was keen to dive into, is around this sort of North South versus East West traffic management. So we say North South is kind of your Ingress, typically where your edge proxies or API gateways operate. And then the East West is around the kind of service comms where you've mentioned already service mesh typically operates. Does Flagger support both use cases at the edge and within the service graph?

07:20 Prodan: Yes, it does. And you can also mix them. So the idea is you have, let's say, different types of applications. Some applications get directly exposed outside the cluster through a proxy, through an API gateway. It's a fancier proxy, let's say, right, with an API that lets you define many other things. So for front-end applications that get exposed, Flagger can talk directly to the API gateway or, let's say, to the Ingress controller, which is a specific implementation for Kubernetes. And there are a couple of Ingress controllers that are supported in Flagger. I am planning to expand that list when Kubernetes Ingress V2 becomes a thing, becomes popular. And that's the story for applications that get exposed outside.

08:09 Prodan: For applications that are running inside the cluster, let's say all sorts of backend apps, APIs, and so forth, you'll not be using an Ingress solution. The traffic goes directly from service A to service B. And in order to control traffic between two services that are running inside Kubernetes, the Kubernetes CNI is not enough, no matter what kind of implementation it is, because the Kubernetes CNI is usually layer three, layer four. In order to route traffic, you need to understand layer seven; let's say HTTP and gRPC are the most used.

08:45 Prodan: So in order for Flagger to manipulate traffic, to change traffic dynamically, it needs a service mesh solution that allows some external automation to define all these routing rules based on layer seven protocols. HTTP headers are a good example. Another example is to route 1% of the whole traffic to a specific deployment and 99% to a different deployment.
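
The 1% / 99% split mentioned here is a weighted routing decision made per request. In practice the proxy or service mesh makes that decision, not application code; the Python below is only a minimal sketch of the idea, and the `pick_backend` function and the backend names are hypothetical.

```python
import random
from typing import Optional


def pick_backend(canary_weight: float, rng: Optional[random.Random] = None) -> str:
    """Return 'canary' for roughly canary_weight percent of calls, else 'primary'."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform in [0, 1), so weight 0 never picks the canary
    # and weight 100 always does.
    return "canary" if rng.random() * 100 < canary_weight else "primary"


# Route 1% of requests to the new deployment and 99% to the current one.
rng = random.Random(42)  # seeded only to make the demo repeatable
hits = sum(pick_backend(1, rng) == "canary" for _ in range(100_000))
print(f"{hits} of 100000 requests went to the canary")
```

The useful property is that the weight is a single number some external automation can change over time, which is what lets a tool shift traffic gradually.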

09:13 What API gateways and service meshes are currently supported by Flagger?

09:13 Bryant: So what gateways and service meshes are currently supported by Flagger?

09:16 Prodan: So in terms of gateways, the first one that was supported was Solo Gloo. Then people asked for NGINX because NGINX is so popular. The NGINX Ingress controller is part of the Kubernetes project itself and it has its own maintainers and so on. There are some downsides of using NGINX due to the fact that the current Ingress implementation in Kubernetes doesn't allow for a declarative model of how routing works. So we have to use annotations. So what Flagger does, it manipulates annotations on the Ingress object to route traffic. And the latest addition is the Contour project made initially by Heptio. And now I think it's in the process of being donated to CNCF. Gloo and Contour are based on Envoy, NGINX is of course NGINX. And once Ingress V2 becomes popular, then I think Flagger will work with any kind of API gateway that just implements Ingress V2, because in the new Ingress API spec, you have a special object that allows Flagger to route traffic based on weight or based on headers. So it could be completely agnostic to the actual Ingress implementation. And that's the Ingress story.

10:34 Prodan: For service mesh, it's even more complicated because, well, each service mesh comes with its own rich APIs, rich set of APIs. So when I first started Flagger, Istio was a thing. People actually started using Istio. And the Istio API allows you to do so many things. And Flagger first worked with Istio and we maintain compatibility with all the Istio versions that are currently used. I think it works from 1.4 to the latest 1.6 and so on. And we have a pretty extensive end-to-end testing infrastructure to make sure that every time we change something in Flagger, it works great with Istio, it works great with Linkerd as well.

11:17 Prodan: So the next service mesh that we implemented is Linkerd, with the note that the Linkerd implementation goes through SMI. So SMI is the Service Mesh Interface project that defines a set of APIs that we hope someday more service meshes will be using. I know, for example, Consul Connect is part of SMI and I hope at some point they will implement, for example, the traffic split object API, and then Flagger will just work with Consul Connect like it works today with Linkerd. And finally in the service mesh area, Flagger works very well integrated with Amazon's App Mesh.

11:56 Bryant: Oh nice. That was also Envoy-powered. Is that right?

11:58 Prodan: Yes. Yes. So Istio and App Mesh are Envoy-powered, and Linkerd comes with its own implementation of the proxy made in Rust.

12:07 Could you explain what you mean by the term progressive delivery and how that relates to continuous delivery?

12:07 Bryant: So something I wanted to move onto now, Stefan, is around progressive delivery. So I know you've mentioned that progressive delivery is core to both Flagger and to Flux that you're working on. Could you briefly explain to listeners what you mean by the term progressive delivery and how that relates to continuous delivery?

12:23 Prodan: Yes. So in continuous delivery, how many of us are doing it today is, for example, if we, let's say, use a GitOps approach, right? We have our workload definitions in a Git repository, all the Kubernetes deployments, services, stateful sets, everything that makes up your production cluster.

12:43 Prodan: Then you, let's say, release a new version of your app. What that means is you, let's say, change the container image inside the deployment spec and you say, "Hey, now from 1.0, it's 2.0." Once you make that change, the GitOps operator, something that watches your repository, will do that change inside your cluster. Right? And what happens is the old version is fully replaced with the new version one pod at a time.

13:10 Prodan: Now Kubernetes offers health checks through liveness checks and readiness checks. Right? So if those two checks are passing, then your new version is fully rolled out in your production cluster and all the users will be using it.

13:25 Prodan: Now, what happens if after a minute or two minutes, your new application starts to crash? Progressive delivery tries to improve this process where you don't fully roll out a new version but you roll it out to a percentage of your users. So you need a way to segment your users. That's the first requirement, how you can segment them. For, let's say, front end apps, you can use things like a cookie or a specific header and you can say, "Only users that are on Safari will be testing my new version," right, "or all users that are coming from a particular region." There are things like Cloudflare, for example, that inject inside the request the country or the region of the user. So you can use this data to do the user segmentation.
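
Segmenting users by cookie or header, as described here, comes down to matching request headers against a set of rules. The following Python is a rough sketch under assumptions: the `in_canary_segment` helper, the `canary=always` cookie, and the country list are hypothetical. `CF-IPCountry` is the header Cloudflare uses to pass the visitor's country code.

```python
import re
from typing import Dict, Mapping


def in_canary_segment(headers: Mapping[str, str], rules: Dict[str, str]) -> bool:
    """True when any rule's regex matches the request header it names."""
    # HTTP header names are case-insensitive, so compare them lowercased.
    lowered = {name.lower(): value for name, value in headers.items()}
    return any(
        re.search(pattern, lowered.get(name.lower(), "")) is not None
        for name, pattern in rules.items()
    )


# Hypothetical segment: users carrying an opt-in cookie, or users from two countries.
rules = {
    "Cookie": r"(^|;\s*)canary=always(;|$)",
    "CF-IPCountry": r"^(RO|GB)$",
}

print(in_canary_segment({"Cookie": "session=abc; canary=always"}, rules))  # True
print(in_canary_segment({"CF-IPCountry": "US"}, rules))                    # False
```

Only requests in the segment would be sent to the new version; everyone else keeps using the current one.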

14:17 Prodan: And while the new version is in testing, then you need some kind of observability component to your delivery process that can look at what happens with the new version when a segment of your users are routed to it and based on that, take some decisions, roll it forward or roll it back. So yeah, progressive delivery tries to extend on continuous delivery and add more safety nets.

14:44 Bryant: I like it, because I've chatted to Steve Smith in London, quite a bit about this concept of continuous delivery is all about going as fast as the business wants, but as safe as the business wants. And what I'm hearing you say here is progressive delivery gives you both really in terms of its focus, perhaps on the safety in terms of you could stop something bad from getting completely rolled out, but it still allows us to go fast in terms of we're constantly releasing new services and gradually shifting traffic between them.

15:12 Prodan: Yes. And there is a race condition here, after all, right? You can release so fast that your tests, your production tests on real traffic, are not fast enough, right? So now you have to catch up. So what Flagger does is when, let's say, you have a version that is being tested, and you release a new version, then Flagger sees that, okay, it's a new version. Let's cancel this analysis and start over with the new version. So it doesn't go through with the current revision because a new one is there. So this way, you have a chance to catch up with, let's say, deployment velocity.
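
The restart behaviour described here can be pictured as a small state update that runs on every analysis interval. This is a simplified Python sketch, not Flagger's code: the `Analysis` type, the `on_tick` function, and the default step of 10 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Progress of one canary analysis: which revision, and its traffic weight."""
    revision: str
    weight: int = 0


def on_tick(current: Analysis, deployed_revision: str, step: int = 10) -> Analysis:
    """Advance the analysis by one interval, or restart it if a newer revision landed."""
    if deployed_revision != current.revision:
        # A new revision arrived mid-analysis: drop the old run and start from zero.
        return Analysis(revision=deployed_revision, weight=0)
    return Analysis(revision=current.revision, weight=min(100, current.weight + step))


state = Analysis("v2", weight=20)
state = on_tick(state, "v2")   # same revision: weight moves from 20 to 30
state = on_tick(state, "v3")   # v3 was released: analysis restarts at weight 0
print(state)
```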

15:49 How does Flagger support webhooks so that you can run tests as part of a rollout?

15:49 Bryant: Yeah. Very interesting. And so I did read on the Flagger site that you've got sort of like web hooks and so forth where you can run tests as part of a rollout. Is that correct?

15:57 Prodan: Yes. A lot of users are relying on Helm to do deployments. Helm has a thing called Helm tests, where you basically have some pods that you add to your Helm chart and inside those pods, you'll be running integration tests, smoke tests, and so on. When using Flagger, when you define the Canary analysis, you could tell Flagger, "Hey, before you start routing any kind of traffic to the new version, run the Helm tests."

16:27 Prodan: If those are failing, it doesn't make sense to expose your app to the end users because it fails, of course. So roll it back. That's one use case on how you can trigger integration tests inside your cluster. But Flagger has a generic webhook implementation, so you can actually implement your own webhook receiver that talks to your custom test runner and so on. And there are a couple of integrations there that I know people did. Like they call external services like CircleCI, like Jenkins and others, start all these tests and Flagger will wait for that webhook response to finish.

17:03 Prodan: And once it's done, let's say, in 10 minutes, then you tell Flagger, "Hey, now you are allowed to move forward and start routing traffic because my tests are done." So it's not in any way tied to a particular technology, like Helm. It works with Helm, but with others as well.
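
A custom webhook receiver of the kind described above can be very small. The sketch below uses only the Python standard library and rests on assumptions: it treats the request body as a JSON object with a `name` field and signals success or failure through the HTTP status code. The `run_smoke_tests` function is a hypothetical stand-in for a real test runner, and the port is arbitrary; check the Flagger webhook documentation for the actual payload and response contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict


def run_smoke_tests(target: Dict[str, str]) -> bool:
    """Hypothetical test runner; replace with a call to your own test tooling."""
    return bool(target.get("name"))


def status_for(body: bytes, runner: Callable[[Dict[str, str]], bool] = run_smoke_tests) -> int:
    """Map a webhook request body to an HTTP status; 200 means the gate is open."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    if not isinstance(payload, dict):
        return 400
    return 200 if runner(payload) else 500


class GateHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the status computed by status_for."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        self.send_response(status_for(self.rfile.read(length)))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; call serve() from your own entry point."""
    HTTPServer(("0.0.0.0", port), GateHandler).serve_forever()
```

Keeping the decision in `status_for` separates it from the HTTP plumbing, so the gate logic can be exercised without starting a server.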

17:20 Bryant: You could imagine someone would almost dark launch a service. They could do the webhook to CircleCI to run some semantic monitoring or some kind of user behavior tests with this service that's not exposed to the public. If those tests pass, then they kick off the progressive rollout?

17:36 Prodan: Yeah. There are also hooks during the rollout. One of the first things that I helped to implement in Flagger was load testing. Why? The problem is you want to deploy at any point in time, but maybe now when you are doing the deployment, there is no one using your application. So Flagger cannot take any kind of decision because there are no metrics to help. No one is doing a new test, so how can Flagger decide if it's good or not?

18:06 Prodan: So a different type of webhook is the one that Flagger runs during the analysis and that can call out to a load test tool that will generate traffic for particular endpoints that you want to test and so on. So it has information and can take that decision. Of course, it's not the same as doing the analysis on real traffic, but at least it's something. So it gives you the possibility to run load testing. And it has a small service as an add-on to Flagger that you can deploy, which uses a tool called "hey". It's an awesome tool with which you can bombard your app and decide, is there a latency problem or are my routes erroring out after a particular number of requests per second? So that's another use case for Flagger hooks.

18:53 How successful has Flagger been? Do you know folks that are using it in production?

18:53 Bryant: Very nice, Stefan. Before moving on to some of the bigger context stuff, I want to dive into Flux as well. How successful has Flagger been? Do you know folks that are using it in production? Is there sort of use cases folks can read online?

19:04 Prodan: I've started a list of companies that are willing to say, "Hey, we are using Flagger in production." Let's say Chick-fil-A is one of the biggest Flagger users. There are others listed on the website. I know there are a lot more users, but when you work for a bank or a big enterprise, it's hard to get that kind of acknowledgment. I'm working with the community on making this list bigger. I think the Flagger team has a lot of feedback from users.

19:34 Prodan: What I really like about the feedback that we are getting is the fact that people actually have all these kinds of, let's say custom deployment patterns that they've implemented to fit some business need for deployment. And having all this feedback, we can make Flagger more dynamic, allow all these use cases to be rolled out. And the last one was the fact that... Let me explain a little how a Canary deployment works.

20:02 Prodan: You have a version running in production. Let's say you have 10 pods with an autoscaler, and you deploy a new version. That starts with one pod. Then you start shifting traffic to that new version. So there is a different autoscaler that will also look at the traffic, look at the resources that are needed, and scale the Canary deployment, right? So you have two autoscalers and Flagger in the middle. And that works great. The problem is before Flagger 1.0, when the analysis ended, and let's say everything worked great, I'm going to fully roll out this new version. Then the rollout happened at once. Like, all the traffic went to the new version at once.

20:46 Prodan: And we had some users saying, "Hey, we are running hundreds of pods. You cannot scale that fast if you do the final switch, 100% traffic." So we implemented a new feature and that's available in Flagger 1.0. When Flagger does the promotion, it also uses a progressive rollout.

21:06 Bryant: A throttling.

21:07 Prodan: Yeah. It moves the traffic gradually. In the same way it does the Canary testing, it now does the promotion. So if, let's say, you move 10% of the traffic every two minutes, then the promotion will happen in the same way, or you can change the weight. But I mean, it behaves better at high load with tons of traffic.
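
The gradual shift described here, for example 10% every two minutes during both analysis and promotion, can be written down as a simple schedule of weights. This Python sketch is an illustration only: the `weight_schedule` helper and the choice of 50% as the end of the analysis phase are assumptions, not Flagger settings taken from the conversation.

```python
from typing import List


def weight_schedule(start: int, end: int, step: int) -> List[int]:
    """Traffic weights to apply, one per interval, moving from start up to end."""
    if step <= 0 or not 0 <= start < end <= 100:
        raise ValueError("need 0 <= start < end <= 100 and step > 0")
    weights = list(range(start + step, end, step))
    weights.append(end)  # always finish exactly on the target weight
    return weights


interval_minutes = 2
analysis = weight_schedule(0, 50, 10)     # [10, 20, 30, 40, 50]
promotion = weight_schedule(50, 100, 10)  # [60, 70, 80, 90, 100]

for i, weight in enumerate(analysis + promotion, start=1):
    print(f"after {i * interval_minutes:2d} min: {weight}% of traffic on the new version")
```

Because the promotion follows the same stepwise pattern as the analysis, the autoscaler for the new version has time to add pods between steps instead of absorbing the remaining traffic at once.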

21:31 Bryant: That's the key thing.

21:32 Prodan: And that's something that came from production users, that they were saying, "Hey, we have a spike in latency when Flagger does the last step." And we identified where the problem is, came up with a solution. And on the pull request, you can also see graphs from the users where they showcase, "Hey, now it works like it should."

21:50 Bryant: There's nothing like actually when your software is being used at large scale in production. You always see the edge cases, the stuff that you just couldn't have dreamed up as you're coding it, right? When the rubber meets the road, you suddenly go, "Whoa," like... And that is a sign of a project being adopted successfully, I think.

22:06 Prodan: Yes. It's one of the signs. Another sign is if you don't get any kind of bug reports, something is wrong.

22:13 How does Flagger fit into the bigger picture? How does it work with tools like Flux?

22:13 Bryant: Okay. Totally. Totally. So I wouldn't mind stepping back a bit now, Stefan, with our time remaining. How does Flagger fit into the bigger picture? Because obviously, Weaveworks are doing a lot of great stuff around, say, Flux, continuous delivery and so forth. I'm kind of keen to get your take on where Flagger sits with these other tools that we are looking at.

22:30 Prodan: So if we think about the whole process as a simple pipeline, Flagger sits right at the end, because it is basically the final gate until your app reaches your end-users. At the beginning of the pipeline, and I'm talking here about what happens inside a Kubernetes cluster, of course you have CI, you have to run unit tests, integration tests, you have to build your app. But the company I work for addresses the continuous delivery part, which is something that you can fully orchestrate inside your Kubernetes cluster with, let's say, no outside processes.

23:06 Prodan: And Flux is the component that brings external changes inside the cluster. How you define these external changes, we thought about Git as the primary way of storing the desired state of your clusters, because you can version it. You can restrict access to it. You can use pull requests when you do changes, right? So multiple teams can collaborate on how your production system works and what kind of operations you are trying to do, if you want to upgrade something, downgrade something, deploy a new version and so on.

23:42 Prodan: So the journey starts with someone making a pull request that gets merged into a branch that Flux watches, and then Flux will apply those changes on the cluster, and that's what's happening. And from there, Flagger detects that change. So Flagger watches for changes, if you want, for Kubernetes events and so on. And when it detects a change, then it runs an analysis for that particular change.

24:08 Prodan: And how Flagger works in this setup, it doesn't only look at, "Hey, someone changed the app itself, the container image." We've seen, let's say in the last years, that the biggest incidents for cloud vendors, for example, weren't a bug in the code. It was a configuration problem. So Flagger doesn't only look at, "Hey, did the code change? Did the container change?" It also looks at all the things that make up your app inside Kubernetes. And what are those things? Could be some ConfigMaps. Could be Secrets, where you have all sorts of other configuration, but in the end, it's configuration, right?

24:48 Prodan: So Flagger reacts to any kind of change, and for that change runs a Canary analysis. And one example is I'm going to change the limits of my container. I'll say, "My container should be doing good with 500 megabytes of RAM." Maybe it doesn't, right? So instead of just rolling that new limit out in production, Flagger sees the limit, spins up a pod with that particular configuration, then starts routing traffic to the new configuration, and maybe after a minute or two, your application will run out of memory. Flagger detects that and rolls it back automatically.
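
One way to picture "react to any kind of change, not only the container image" is to fingerprint everything that makes up the app and compare fingerprints. The Python below is a sketch of that idea under assumptions: the `revision_hash` helper and the dictionary shapes are hypothetical and do not represent how Flagger tracks changes internally.

```python
import hashlib
import json
from typing import Any, Dict


def revision_hash(pod_spec: Dict[str, Any], configs: Dict[str, Any]) -> str:
    """Fingerprint the pod spec together with its ConfigMap and Secret data."""
    # sort_keys makes the hash independent of dictionary ordering.
    blob = json.dumps({"spec": pod_spec, "configs": configs}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


spec = {"image": "app:1.0", "limits": {"memory": "1Gi"}}
configs = {"app-config": {"LOG_LEVEL": "info"}}
before = revision_hash(spec, configs)

# Same image, but a lower memory limit: still a change worth analysing.
tighter = {"image": "app:1.0", "limits": {"memory": "500Mi"}}
after = revision_hash(tighter, configs)

print("needs canary analysis:", before != after)  # needs canary analysis: True
```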

25:24 Prodan: So the idea is with GitOps, you basically define everything that makes up your app in a repository. You have some operator, some tool that watches changes in Git and applies them on the cluster, then Flagger comes and makes sure those changes are a good fit for your end users.

25:44 Is Flux a CNCF project?

25:44 Bryant: Very nice. That's a compelling argument, Stefan. Very nice, indeed. Now I understand Flux is a CNCF project now?

25:50 Prodan: Yes. Flux has been in the CNCF since last year and we've seen a lot of users trying out Flux, using it in production. The Flux production user base is huge. Flux has been around for a long time now. And the Flux maintainers, the Flux team, are right now working on a new proposal for what the future Flux will look like. What we are doing is making Flux more modular. So Flux right now deals with Git operations: it monitors a Git repository, clones that Git repository, verifies PGP keys. It does SSH and so on. That's one thing. It also does reconciliation. It applies all those YAMLs, it does garbage collection over Kubernetes objects and so on.

26:36 Prodan: So we envision a Flux where you can pick and choose what kind of behavior and how you want your cluster to be reconciled. And the first step we did, we created a thing called the GitOps Toolkit, which is developed inside the Flux CD organization.

26:51 Prodan: And the first component of the GitOps Toolkit is called the source controller, which is a specialized Kubernetes agent, a Kubernetes operator that manages sources. A source can be a Git repository. Another source can be a Helm repository where you store all your Helm charts and so on.

27:11 Prodan: Another source could be, I don't know, an S3 bucket, or something like that. And we are now allowing others to deploy and develop specialized reconcilers. For example, we are developing a kustomize controller, which knows how to deal with Kustomize overlays. Another controller will be a Helm controller that is specialized only in applying Helm releases and so on. But you can imagine that other companies, other people will contribute to this toolkit with their own specialized things, that maybe span across clusters or even go outside Kubernetes, where you want to create some other things around Kubernetes in terms of infrastructure, cloud infrastructure and so on.

27:54 Prodan: So that's something we are very, very keen on getting the community involved in because it's right at the beginning. Everything is driven through a Kubernetes API object. So let's say you can define multiple sources. You can say my cluster state is composed of three Git repositories: one Git repository which has all my apps, another Git repository which has, let's say, all the security model, service accounts, role-based access, OPA policies and all this stuff, and maybe some other repository that's not even in your control. You want, let's say, to install Istio from the Istio repository, install the official Istio thing. So you can add the Istio repository, trust it inside your cluster and tell the reconciler, every time Istio does a semver release, if it's in this semver range, I want to automatically upgrade my cluster to that Istio version.

28:50 Prodan: Let's take Ambassador for example, right? There is a CVE in Envoy. You'll be doing a patch release. With the GitOps Toolkit, you can say, "Watch the Ambassador repository. And every time there is a patch release, apply it automatically on the cluster." But if it's, let's say, a major release or a minor release, then open a pull request and stuff like that, because someone has to review it and approve it, or deploy it only on the staging cluster and let's run the new minor version of Ambassador there, test it out and at some point, someone will decide, "Okay, we want to promote that version to production."
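
The policy described here, applying patch releases automatically and sending minor or major releases through review, is a decision over semver components. The Python below is a minimal sketch under assumptions: the `action_for` function and the action names are hypothetical, and the parser handles only plain `MAJOR.MINOR.PATCH` versions with no pre-release or build metadata.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v'."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def action_for(current: str, candidate: str) -> str:
    """Decide what to do when an upstream repository publishes a new release."""
    cur, new = parse(current), parse(candidate)
    if new <= cur:
        return "ignore"             # not newer than what is already running
    if new[:2] == cur[:2]:
        return "auto-apply"         # patch release: same major and minor
    return "open-pull-request"      # minor or major release: needs human review


print(action_for("1.5.0", "1.5.1"))  # auto-apply
print(action_for("1.5.1", "1.6.0"))  # open-pull-request
```

Comparing the parsed tuples rather than the raw strings matters: as strings, "1.10.0" would sort before "1.9.0".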

29:27 Prodan: The idea is that instead of copy pasting all these YAMLs, all these charts and everything, you could reach out directly to the source of it, like the official repository. That's one example that we are trying to target.

29:40 Bryant: I think that's a really compelling use case, Stefan. Because there's something from my Java days, we often had Maven or Gradle plugins that constantly scanned dependencies: are you using the latest version of a dependency? Are there any CVEs known in this dependency? And like you say, with the semver, you at least get the major, minor, patch so you can kind of make decisions on is the functionality changing? Is an API changing? I think doing this kind of thing for infrastructure is quite challenging at times, right, in terms of how do we know when an Envoy security vulnerability happens? Well, we pay attention to it, but do end users? Maybe not.

30:11 Prodan: Maybe not. Yeah. As an end user, let's say an end user is someone from a platform team that has to provide Kubernetes as a service to its own organization, let's say, right? How many things does that person have to watch? His mailing list is huge, right? So instead of doing that, he or she can decide to trust particular repositories, particular organizations, and say, "Hey, they are doing a patch release, I want that automatically deployed on my staging cluster."

30:40 Prodan: Once it's there, you are actually using it. If it has problems, you should be able to detect it while doing your normal deployments and so on. We're trying to do it with the GitOps Toolkit to give people more freedom over what things they want to add to the cluster, but at the same time, securing the whole pipeline as well with things like Gatekeeper, with network policies and all these things, because the more power you give to the operator, in the end, the easier it could be to have a security breach and so on. So we are trying to work on that side as well.

31:16 Bryant: Make it easy to do the right thing.

31:18 Prodan: Yeah. And it's kind of hard.

31:20 Could you explain the new Weaveworks GitOps toolkit?

31:20 Bryant: Yeah, it is. What I'm hearing from you is, because a lot of us are building platforms these days. You mentioned like companies often have platform teams that kind of build a PaaS based on Kubernetes for their internal consumption or for their internal team's consumption. What I heard from you is you can look at the new GitOps Toolkit almost as a platform for continuous and progressive delivery, but with clear interfaces and abstractions that people can define their own stuff. Would that be a fair description?

31:45 Prodan: Yeah. I'm saying that you could use the GitOps Toolkit to build your own continuous delivery platform. And it's up to you what you choose, what types of operators you mix there. Normally in Kubernetes, most things should work smoothly, because you have operators with their own APIs, and that limits the responsibility of an operator. But things are not that easy. The best example is the Horizontal Pod Autoscaler, right? A Horizontal Pod Autoscaler acts on an object like a deployment, right? Flagger also acts on the deployment because it needs to scale it up and so on.

32:26 Prodan: Flux or whatever you are using, RBC or whatever you are using to get that deployment inside your cluster also acts on it, right? So you can see, for example, let's take the replicas field inside the deployment: there are at least three automated operators that will fight over it. There's Flux, which is trying to apply what you have described in Git; there's the Horizontal Pod Autoscaler, which is changing that replica count based on resource usage; and there's Flagger, which will change that replica count based on whether it's a canary or not. Should I scale it to zero or should I start it? So these operators have to work together in some way, so they will not fight each other.
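
To make the fight over the replicas field concrete, here is a toy Python simulation. It is an illustration only: the dictionaries stand in for a Deployment, `apply_from_git` and `skip_fields` are invented names, and this is not how Flux, Flagger, or the Horizontal Pod Autoscaler actually coordinate.

```python
def apply_from_git(live, desired, skip_fields=()):
    """Overwrite live fields with the Git-declared values, except delegated fields."""
    for field, value in desired.items():
        if field not in skip_fields:
            live[field] = value
    return live


# What is declared in Git versus what is currently running (hypothetical values).
git_manifest = {"image": "app:1.2.0", "replicas": 2}
live = {"image": "app:1.1.0", "replicas": 2}

# The autoscaler reacts to load and scales the workload up.
live["replicas"] = 5

# Naive reconcile: Git wins on every field and undoes the autoscaler's decision.
naive = apply_from_git(dict(live), git_manifest)

# Cooperative reconcile: the replicas field is left to the autoscaler.
cooperative = apply_from_git(dict(live), git_manifest, skip_fields={"replicas"})

print(naive)
print(cooperative)
```

In the naive case the replica count snaps back to the value in Git on every sync, and the autoscaler then raises it again. Giving each field a single owner is one way to stop that loop.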

33:07 What other exciting stuff are you working on? What do you think is exciting in the cloud native space?

33:07 Bryant: It's like real life, where we get operators fighting each other in real life. Yeah. This is just like taking it to the virtual world. Yeah. I hadn't thought about that. That is actually a really, really interesting point. Yes. People watch this space. So I'm conscious of time now, Stefan. This has been super interesting. I think you and I could pretty much chat all day on this stuff. As a kind of final little, couple of questions, what are you most looking forward to over the next six, 12 months? What exciting stuff are you working on, perhaps you and the team, or what do you think is exciting in the cloud native space?

33:33 Prodan: I'll mention Flagger first. With Flagger, we have a new way of defining metric checks. So until Flagger 1.0, a user could only define Prometheus queries. So you have to have a Prometheus server somewhere and all your KPIs must be exposed as Prometheus metrics. And a lot of users came to us and said, "Hey, we are using other things than Prometheus. We use Prometheus because, well, it comes with Istio, it comes with Linkerd, it comes with something that we use." Also, Flagger comes with its own Prometheus, if you don't have one. But usually business metrics are in other places, maybe. Right?

34:11 Prodan: So we now have a model that's called metric templates, where you can hook up other metrics providers, and we've implemented Datadog and CloudWatch as the first two. And in the next months, I'm looking at expanding the integrations to other platforms.
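
The idea of pluggable metrics providers behind a common check can be sketched in a few lines of Python. Everything here is hypothetical: the `MetricCheck` class, the provider registry, the queries, and the stubbed values are invented for illustration and are not Flagger's metric template API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A provider takes a query string and returns a single number.
Provider = Callable[[str], float]


@dataclass
class MetricCheck:
    provider: str     # which backend to ask
    query: str        # backend-specific query, opaque to the checker
    max_value: float  # the check passes when the result is at or below this


def run_check(check: MetricCheck, providers: Dict[str, Provider]) -> bool:
    """Fetch the metric from the named provider and compare it to the threshold."""
    value = providers[check.provider](check.query)
    return value <= check.max_value


# Stub providers returning fixed numbers; real ones would call the backend's API.
providers: Dict[str, Provider] = {
    "prometheus": lambda query: 0.4,
    "datadog": lambda query: 250.0,
}

checks = [
    MetricCheck("prometheus", "error_rate_percent", 1.0),
    MetricCheck("datadog", "p99_latency_ms", 200.0),
]

results = [run_check(check, providers) for check in checks]
print(results)
```

Supporting another backend then means registering one more function in the `providers` dictionary, without touching the check logic.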

34:30 Prodan: There are so many things, like there was an issue the other day to add New Relic. Yeah. There are so many platforms out there. InfluxDB is another request. So I'm looking for people that actually use these providers and are willing to help me implement it or implement it themselves, and I'll be reviewing the pull requests. And I think this is a territory where Flagger could expand.

34:53 Prodan: Another big thing that I want to get into Flagger is Ingress V2. Once the first implementation is out, say Ambassador implements Ingress V2, I'll be the first to jump on it. And another space that I really want to get Flagger into is multi-cluster deployments. Linkerd just announced the multi-cluster setup. Istio has a couple of options. And Flagger works with Istio multi-cluster only when Istio is deployed in a shared control plane mode. But okay, you have a shared control plane and Flagger can talk to that. But when you have dedicated control planes per cluster, then Flagger needs some serious changes and refactoring to be able to...

35:34 Bryant: Federating.

35:35 Prodan: Yeah. In order to talk to a fleet, not to a single cluster, all right? So that's a big area for Flagger in the future. Also a lot of people have told me that Cluster API is making, I don't know, creating clusters so easy. It can spin up a new cluster every time you deploy a new version of your app. Well, it's not that easy, but at some point, we'll get there, right? So a Flagger for clusters...

36:01 Bryant: Oh, that's interesting. Right.

36:03 Prodan: ... could also be something interesting for the future where you, instead of just looking at an app, you'll be looking at metrics that are coming from a whole cluster and decide if the new cluster should be promoted or not.

36:14 Bryant: Very interesting.

36:15 Prodan: The principle is the same, right? Canary or A/B testing is the same. The machinery is so much different, because right now Flagger only talks to a single Kubernetes API and watches its deployments, ConfigMaps, secrets. Then, it would have to watch cluster definitions, machines and so on.

36:33 Bryant: We say like turtles all the way down, Stefan. Yeah? But it's containers, then it's clusters, and then like... Yes, there's a lot of stuff in there.

36:39 Prodan: Yeah. Yeah. That's for Flagger. Yeah. In terms of other CNCF projects, I'm watching the Ingress space a lot. I really want to see an Envoy implementation in CNCF, and hopefully Contour will get there, and mixed together with Ingress V2, I think we will have a bright future for Ingress and Egress on Kubernetes. Because right now, we have this Ingress concept, but the actual implementation is somehow stuck in the past, with NGINX Ingress being the default thing. I think one of the Envoy implementations should also be there. I hope it will actually happen this year at some point.

37:17 Bryant: I hope the same. Very cool. This is it. Yeah. A fantastic chat. If folks want to follow you online, what's the best way, Twitter, LinkedIn?

37:25 Prodan: Twitter. Twitter is the only social media I use. @stefanprodan, that's me.

37:30 Bryant: Perfect. I'll be sure to link your Twitter handle in the show notes when we do it as well, so that folks can very easily follow you. But yeah, it's been great talking to you today, Stefan. Thanks very much.

37:37 Prodan: Thank you very much, Daniel, for having me.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
