Stefan Prodan on Flux, Flagger, and the Operator Pattern Applied to Non-Clustered Resources

Mar 07, 2022

In this podcast, Wesley Reisz talks to Stefan Prodan about Flux and Flagger, two tools built on top of Flux CD’s GitOps Toolkit. After covering some of the architectural differences between Flux v1 and v2 and some of the GitOps Toolkit use cases, the two discuss the operator pattern on Kubernetes: why developers may opt to build APIs on top of Kubernetes, and how the pattern can be used on non-clustered resources. The podcast wraps with a discussion of the work being done on Flux v2’s push to GA.

Key Takeaways

Flux CD is a project, originally donated to the CNCF by Weaveworks, for continuous delivery. Flux v2 and Flagger are two projects built on top of the modular GitOps Toolkit.
Late last year, as part of the CNCF graduation process, Flux underwent a third-party security audit facilitated by OSTIF (Open Source Technology Improvement Fund). That audit brought about a formal RFC process and a host of security changes that have hardened and improved the security posture of the GitOps Toolkit. The audit was part of the work of moving Flux v2 to GA.
Kubernetes' operator pattern lets you extend the cluster's behavior without modifying the code of Kubernetes itself by linking controllers to one or more custom resources. Applied to non-clustered resources, the operator pattern allows you to leverage K8s operators outside the cluster to provision external resources like S3 buckets, run Terraform scripts, and create or build databases.

Introduction [00:05]

Wesley Reisz: Welcome to another edition of the InfoQ podcast. My name is Wes Reisz, one of the co-hosts of the show and also one of the chairs for the QCon Software Conference, which will be returning to London (April 4th through 6th). Check out qconlondon.com for more information.

Today on the podcast, we're speaking with Stefan Prodan about Flux CD, Flagger, and the operator pattern applied to non-cluster resources. As many of you may know, operators are software extensions to Kubernetes that make use of custom resources (CRDs) to manage application components. You might use one to package, deploy, and manage a database, or something like Kafka: Confluent has the Confluent operator that runs on top of a cluster. The operator pattern applied to non-cluster resources is the idea of leveraging that pattern to extend beyond the cluster and manage K8s and non-K8s resources alike. Flux CD is a GitOps toolkit for continuous delivery.

Stefan Prodan, principal engineer at Weaveworks and a core maintainer of Flux, uses that pattern in the project, and we'll discuss that today on the podcast. We're speaking with Stefan about the latest on Flux CD and Flagger, a complementary progressive delivery framework, and also the notion of building on top of the K8s API and the benefits and challenges that it brings. We'll start the podcast off by reviewing Flux, Flagger, and the GitOps Toolkit. We'll get an update on the project and hear the philosophy of Flux CD. Then we'll look at how Flux does some things a little differently than other continuous delivery tools. We'll wrap with a discussion of the operator pattern applied to non-cluster resources. As always, thank you for joining us on your walks, jogs, and commutes. Stefan, welcome to the podcast.

Stefan Prodan: Great to be here. I'm very happy to talk about the latest and greatest in Flux CD.

How do you introduce Flux to someone who isn’t familiar with it? [01:42]

Wesley Reisz: For people who may not be familiar with Flux, before we jump into the latest and greatest, what is the elevator pitch? You're on the elevator, and you're telling another technologist what Flux CD is all about. Level-set us. What's Flux CD all about?

Stefan Prodan: Flux tries to break apart the CI/CD landscape, right? We are used to mixing it all together in our CI pipeline, where we build the software, we test the software, then we deploy the software, right? Flux is trying to take the continuous delivery part from outside the cluster to inside the cluster. So, instead of having some tool that connects to the cluster and performs the actual delivery process from there, with Flux it's a paradigm shift, right? It's the cluster that looks at a Git repository, or some other source that describes the desired state, and tries to reach that desired state. So, it's very different from what we had before with CI/CD.

Wesley Reisz: That's that pull versus push discussion that you see a bit with the CD platform. So, push is like reaching into the cluster to say, "this is what I want it to do." Where pull's actually pulling things in. Why is that such an important distinction between push versus pull?

Stefan Prodan: One thing is, how do you detect drift between what you actually run and what you want to run? If you do that from outside, you'll have to connect to the cluster, let's say every five minutes. Is my cluster in the right shape or not? Oh, it's not, let's correct it. Right? So, it's not just a one-off. And CI, for me, and for many of us, is just a one-off, something that happens in your Git repo: you change your source code, that gets built, and that's the end of it, right? Continuous delivery to your production cluster, on the other hand, deals with a live system where things change all the time, and you want to detect any kind of drift. Let's say someone made some changes which are unintended, or some other tool made some change. And those changes may get the production system into a bad shape, right? How do you correct that mistake automatically? Doing that from CI means you bombard your cluster all the time, opening connections, verifying, and all that stuff. It's like running Terraform in a loop, which doesn't quite work.
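The drift detection Stefan describes can be illustrated with a minimal Python sketch. This is not Flux's implementation: the dictionary state model and the names `detect_drift` and `reconcile` are hypothetical, and deleting resources that are absent from the desired state is an assumption made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical state model: resource name -> spec (for example an image tag).
State = Dict[str, str]


def detect_drift(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return sorted (action, name) pairs needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Assumption: anything not described in the desired state gets removed.
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)


def reconcile(desired: State, actual: State) -> State:
    """Apply the corrections from detect_drift and return the corrected state."""
    corrected = dict(actual)
    for action, name in detect_drift(desired, actual):
        if action == "delete":
            del corrected[name]
        else:  # "create" or "update"
            corrected[name] = desired[name]
    return corrected
```

Run on an interval from inside the cluster, a loop like this corrects manual or accidental changes without any outside system having to connect in and check.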

Wesley Reisz: I have a couple questions about the project. I hear Flux and Flux CD used almost interchangeably. So one question, is the project named Flux or Flux CD? And the second question is around v1 and v2. v1 I know is in maintenance mode and it's GA. v2 I know has a bunch of new features still being released, but it's not quite GA yet. Do I have that right?

Stefan Prodan: There is no such thing as Flux CD. Flux CD is the organization on GitHub. The project is called Flux and it has two versions. Version one has been GA for a very long time. Weaveworks donated Flux to the CNCF, and about half a year after that we launched Flux v1 GA. For two years now we have been working on version two, and we have a roadmap for when version two will go GA.

What are the architectural differences between v1 and v2? [04:43]

Wesley Reisz: What are some of the big architectural differences between v1 and v2? I know for example, like v2 had multi-tenancy and supported multiple Git repos. What are some of the others?

Stefan Prodan: Flux version one was built as a monolithic tool, where it could reconcile your whole cluster from a single Git repository. And that was it: you deploy one Flux instance and, at deploy time, you configure how it should clone, which SSH key to use, and so on. From that moment on, all it does is reconcile that repo on your cluster. In Flux version two, the main difference is that we split all these functions of Flux version one into dedicated controllers. And this is how we built the GitOps Toolkit. So we have a source-controller that deals with all types of sources, where you define your desired state (like Git repos, Helm repos, and so on). Then we have these specialized reconcilers. For Helm, it's called helm-controller. For Kustomize overlays and plain YAMLs, it's called kustomize-controller. If we want to configure alerting or webhooks, then we have notification-controller, and we have other types of controllers for driving automation.

For example, we have a built-in automation where we listen for events on your container image registry. Let's say you push a new version of your app as a container image. Flux is able to listen to that event and match the new tag that is pushed against a policy that you can define. Let's say, I want only stable releases to land on my production cluster, and I want those patch versions to be automatically deployed. Flux can do that for you. It will detect a new version in your container registry, and it will patch the manifests back in your Git repo. Then the other components of the GitOps Toolkit will take that modification and apply it on the cluster. And Flux is a distribution of all these controllers combined.

Now, for example, VMware Tanzu only uses some of those. You can pick and choose, and that's the major change from Flux one to Flux two. Maybe you don't want to buy into the whole thing: you want to use sources, but instead of reconciling stuff inside Kubernetes, maybe you want to do a Terraform build and apply, right? And you can do that. There are examples out there. There are controllers out there that use only some components of the GitOps Toolkit to build their own continuous delivery systems.
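The image policy Stefan mentions ("only stable releases", with patch versions deployed automatically) can be sketched roughly as follows. This is an illustration in plain Python, not Flux's policy engine: the function names and the rule that only bare MAJOR.MINOR.PATCH tags count as stable are assumptions made for the example.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumption: only plain MAJOR.MINOR.PATCH tags are "stable"; pre-releases
# such as "1.2.11-rc.1" and floating tags such as "latest" are ignored.
STABLE_TAG = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_stable(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) for a stable tag, or None otherwise."""
    match = STABLE_TAG.match(tag)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def latest_patch(tags: Iterable[str], major: int, minor: int) -> Optional[str]:
    """Pick the tag with the highest stable patch release within MAJOR.MINOR."""
    best_version = None
    best_tag = None
    for tag in tags:
        version = parse_stable(tag)
        if version is None or version[:2] != (major, minor):
            continue
        # Compare numerically so that 1.2.10 beats 1.2.9.
        if best_version is None or version > best_version:
            best_version, best_tag = version, tag
    return best_tag
```

In the flow described above, the selected tag would then be written back to the manifests in Git, and the other controllers would apply that change to the cluster.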

What is Flagger and how does it work with the GitOps Toolkit? [07:03]

Wesley Reisz: Got it. Right. So the GitOps Toolkit is the foundation everything's built on. There's also Flagger, which is a progressive delivery tool. How does Flagger fit into this suite of tools?

Stefan Prodan: With Flux, you make a change in your Git repo. Let's say you have a new app version: Flux bumps the container image tag, let's say from version 1.0 to version 1.1, and it will automatically upgrade your application inside your production system, right? But maybe you want to do this in a progressive manner. You don't want all your users to be impacted by the new version. Maybe there is a hidden bug in there. Maybe there is a performance issue, and so on. And here is where Flagger comes into play. Instead of letting Flux roll the new version out to all your users, Flagger uses service mesh technologies or Ingress controllers like Nginx, Contour, and so on, to slowly route traffic from the old version to the new version. And while it increases the traffic weight from one version to another, it can look at metrics, like Prometheus metrics, Datadog metrics, and so on, to determine if the new version breaks any of the SLOs that you have set. Is the error rate too high?

Maybe the latency is too high. Your users are having issues, and so on. When it detects a problem, it'll automatically roll back. What that means is it shifts the traffic back to the old version, scales the new version to zero, and alerts your dev team: "Hey, something is wrong, these SLOs have not passed all the checks that I've run, do another deployment." So what will happen is the dev team, let's say, finds the issue and pushes a new version of the source code, which yet again gets tested and validated by Flagger. In the end, it ends up in production and all traffic goes to that new version. The point is, Flagger sits at the end of the continuous delivery pipeline, where you want better control over what's actually exposed to your end users.
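The progressive traffic shift and automatic rollback described above can be sketched as a small loop. This is a simplified illustration, not Flagger's algorithm or API: the names `Metrics`, `Thresholds`, and `run_canary` are hypothetical, the step size and maximum weight are arbitrary defaults, and rolling back on the very first failed check is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Metrics:
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    latency_ms: float   # observed request latency for the new version


@dataclass
class Thresholds:
    max_error_rate: float
    max_latency_ms: float


def run_canary(
    sample: Callable[[int], Metrics],
    thresholds: Thresholds,
    step: int = 10,
    max_weight: int = 50,
) -> Tuple[str, List[int]]:
    """Shift traffic to the new version in steps and check metrics at each step."""
    weights: List[int] = []
    weight = 0
    while weight < max_weight:
        weight = min(weight + step, max_weight)
        weights.append(weight)
        metrics = sample(weight)  # observe the new version at this traffic weight
        if (metrics.error_rate > thresholds.max_error_rate
                or metrics.latency_ms > thresholds.max_latency_ms):
            weights.append(0)  # route all traffic back to the old version
            return "rolled_back", weights
    weights.append(100)  # every check passed: send all traffic to the new version
    return "promoted", weights
```

The `sample` callback stands in for whatever metrics provider is queried at each step, so the same loop works whether the numbers come from Prometheus, Datadog, or a test stub.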

Wesley Reisz: What do you get out of it? Does it give you the RED metrics to be able to just watch your latency, errors, things like that? What do you get like out of the box when you want to leverage Flagger (as far as the analytics)?

Stefan Prodan: Flagger uses the RED metrics by default. That's because most service meshes and Ingress controllers are already instrumented with Prometheus, right? So, when you configure a canary deployment with Flagger, by default, it can do two things. It can look at the error rate, being HTTP errors or gRPC errors, and it can also look at latency. So you can set some thresholds for latency and error rate. Another thing you can do with Flagger is tell it to run tests, or smoke tests, before it exposes the new version to any user.

So before it routes 1% of your total users to a new version, it can run some smoke tests for you. It can run Helm tests for you and so on, right? So here you can extend it. There are a lot of Flagger users who have extended Flagger. Flagger calls out to some webhooks, and it's up to you in that implementation to determine which tests are run and so on. After those tests are finished, you report the result back to Flagger, and that's how Flagger advances and takes these decisions. Also, for metrics, we have this custom resource (yet again, a Kubernetes custom resource, another API) which is called metric template. You can use custom metrics if you have those, if your application reports these kinds of metrics, and you can define other checks based on those metrics. And it's not only about Prometheus: you can reach out to Datadog or Stackdriver and other offerings that expose all sorts of metrics.

When it comes to Flux CD, what has the project been focused on of late? [10:49]

Wesley Reisz: So you mentioned the road to GA, what's the latest for the project? What are some of the big things that you've been focused on of late?

Stefan Prodan: At the end of last year, the CNCF sponsored a security audit for Flux, which was conducted by Ada Logics and facilitated by OSTIF, which is the Open Source Technology Improvement Fund. And that was kind of revealing for the Flux project. We are on the road to releasing Flux version two as GA, and having this security audit gave us confidence in the direction we are going. It came up with great findings around what parts of the code we should improve, how we should write our documentation better, and how we should approach security in Flux from a different perspective than what we did before. As a result of that audit, we now have an RFC process in place for any kind of security changes. We made a lot of security improvements on our side. Now we have, for example, fuzzing for the Flux source code, and we sign our container images when we push them to our users.

We also publish a software bill of materials with each release. So we are trying to, you know, harden that part where people can actually trust and verify that the instance of Flux that is running on their cluster is the same as what we are delivering. So that's on one side, and on the other side, we made a lot of improvements when dealing with security inside the cluster. You may be aware that there are lots of organizations where different teams collaborate on a single cluster, right? You can think about those teams as tenants. These types of organizations need some kind of isolation between tenants. And we want to offer that with Flux, basically isolating the tenants' desired state from each other. So a tenant could not change another tenant's deployment, could not affect the whole cluster state, could not introduce a new policy or stuff like that.

Wesley Reisz: Let's talk specifically about that security audit just a bit, and also, I guess, the CNCF process of becoming an incubating project and then a graduated project. For those who maybe don't know, if you look at the landscape and you see those large boxes out there, that means they're actually a CNCF project. And things like the security audit that you just mentioned are things that you have to go through to become a graduated project. Can you talk for a minute, for people that maybe aren't familiar with it, about what things you have to go through, like that security audit, to become an incubating and then a graduated project?

Stefan Prodan: Well, I think graduation means different things to different people, but to me it means that our code is as secure as we can possibly make it. And in order to determine that, we could do our own audit, right? You do have that option. You can audit yourself. I know other projects were doing that, but we felt it would be better if we could get a sponsorship and a dedicated team that is used to this kind of analysis and could give us a real feeling about how we are evolving. So that's one side, of course: security. The other side is adoption. You cannot be a graduated project if you are in a very niche zone with just a couple of users. Flux has seen major adoption in the last couple of years: adoption from end users, but also adoption from other vendors. There are cloud vendors like Microsoft, for example, which is shipping Flux in Microsoft Arc, or Amazon, which is shipping Flux in their EKS offering. And so on.

Wesley Reisz: Or as you mentioned before, VMware Tanzu as well. We use some of the Flux controllers as part of Mission Control and Tanzu Application Platform.

Stefan Prodan: Yes. And we are looking forward to new partnerships with other cloud vendors and, you know, to solidifying Flux in this market as the go-to continuous delivery tool.

Flux and Argo CD often come up in conversation as similar tools. What are some of the philosophical differences between them? [14:39]

Wesley Reisz: A lot of times when we talk about continuous delivery and Flux comes up, people really quickly will also mention like ArgoCD. So I'm curious when a developer/architect is considering two different tools, such as ArgoCD and Flux CD, what are some of the philosophical differences between the two?

Stefan Prodan: Flux offers a vast API in terms of how you configure it. Flux, especially version two, extends Kubernetes with custom resources. ArgoCD offers a custom resource which is called Application. But we thought that application is not the right concept when you build a continuous delivery pipeline. We are dealing with so many things, starting from trusted sources: which Git repository do you trust? Or, how many repositories do you trust? What things are you deploying? Maybe you are deploying, I don't know, cluster add-ons, which don't quite fit into the app pattern. Databases, applications, or maybe some type of configuration which affects things outside the cluster, right? So the main difference between Flux and Argo, from the API point of view, is that we have this rich API where you define things like sources, which can be Git repositories, S3 buckets (if you want to store your desired state there), or Helm repositories, and we are now looking at adding OCI as a source type.

Once you define these sources, you can create other custom resources which refer to those. And there you configure how your secrets are decrypted, how you apply those resources on the cluster, and how you do health checking of those resources. And here we have two specialized controllers. This links back to the GitOps Toolkit idea, where we have controllers dedicated to a specific action. For example, you may want to apply a Kustomize overlay on your cluster, right? So we have a thing called kustomize-controller, which understands Kustomize, and it will look at sources which define these Kustomize overlays and apply those on the cluster. Or maybe you package your apps as Helm charts and deploy them with Helm releases. Well, Flux has a HelmRelease custom resource.

And you define that in your Git repo, and that's how that application gets reconciled. At the end of this pipeline, we have another controller called notification-controller, which captures events from all the things that Flux does on your cluster. Then, with a custom resource called Alert, you can filter those events and funnel them to external systems like Slack or Microsoft Teams, or write back to GitHub, for example, and set the commit status: is it green or is it red? What actually failed, right? And because Flux doesn't have a user interface, we rely more on people using the Git SaaS user interfaces, like GitHub, GitLab, and so on, and our CLI. So all our tooling is geared towards reacting to events, taking some decisions, and emitting events that report what decision we took and what happened after that, and writing back to the systems that people expect: a Slack channel, a Git commit status, and so on.
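The filter-and-funnel behavior Stefan attributes to alerts can be sketched in a few lines. This is a hypothetical model in plain Python, not the notification-controller API: the `Event` and `AlertRule` shapes, the two severity levels, and the `commit_status` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Assumed severity ordering for this sketch.
SEVERITY_RANK = {"info": 0, "error": 1}


@dataclass(frozen=True)
class Event:
    kind: str       # kind of the object the event is about
    name: str
    severity: str   # "info" or "error"
    message: str


@dataclass(frozen=True)
class AlertRule:
    min_severity: str
    kinds: FrozenSet[str]  # object kinds this rule is interested in


def filter_events(rule: AlertRule, events: Iterable[Event]) -> List[Event]:
    """Keep only events that match the rule's kinds and minimum severity."""
    threshold = SEVERITY_RANK[rule.min_severity]
    return [
        event for event in events
        if event.kind in rule.kinds and SEVERITY_RANK[event.severity] >= threshold
    ]


def commit_status(events: Iterable[Event]) -> str:
    """Collapse events into a green/red style status for a commit."""
    return "failure" if any(e.severity == "error" for e in events) else "success"
```

The filtered list is what would be forwarded to a chat channel, while the collapsed status answers the "is it green or is it red?" question on the commit.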

What benefit is there to a developer building an API on top of the Kubernetes API using the Kubernetes constructs? [17:49]

Wesley Reisz: Okay. Got it. So it's an event-based, reactive system that operates with cluster and non-cluster resources. Very nice. So let's switch over and talk a little bit about this operator pattern applied to non-clustered resources. When you built Flux v2, you built it on top of Kubernetes, using the Kubernetes API as the foundation to extend and build the Flux API. Why go that route? What benefit is there to a developer building an API on top of the Kubernetes API using the Kubernetes constructs?

Stefan Prodan: In the last two years, Kubernetes made a lot of improvements around extending it. There is the controller-runtime project, which you can think of as an SDK. So you have a collection of libraries and you can use those libraries to build a controller from scratch. It also has codegen features and so on. So it's really nice. And on top of that there is also the Kubebuilder project, which has a nice CLI for bootstrapping your whole project. Back when we started working on Flux v1, five or six years ago, there were no such tools, so we had to build everything from scratch. In version two, we've seen these tools and we immediately wanted to use them and adopt them, because they make development for Kubernetes so much easier. And another big advantage is that, in a way, your controller is developed in the same way and behaves in the same way as a native Kubernetes controller, one that comes with Kubernetes itself, right?

And the main advantage of extending the Kubernetes API, instead of building your own and having an HTTP endpoint inside a cluster, is that you can rely on Kubernetes RBAC and Kubernetes authentication and security when people access your API. That's why Flux can be used for multi-tenancy even if you run a single instance of Flux on your cluster: each custom resource, depending on which namespace it's deployed in, is subject to a different Kubernetes RBAC, and you can limit what users can do with Flux on the cluster without adding another layer of RBAC and access control on top. ArgoCD, for example, has its own RBAC and so on. We rely only on Kubernetes RBAC, so adopting Flux from that perspective should be super easy, because you can create service accounts for each tenant repo.

And then you say, "Oh, when you reconcile this particular repository, for all the applications in here, use this service account, use this identity." So, no matter what the developer places in that repo, let's say a ClusterRoleBinding or something that's at cluster level and could impact everything, even if that person does that, Flux will not be able to apply that change on the cluster, because it is limited by Kubernetes RBAC and by what the cluster admin decided a particular repo should or should not contain, and how it may or may not affect the cluster state.
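The idea of reconciling a tenant repo under a restricted identity can be sketched as follows. This is a toy model, not Kubernetes RBAC or Flux: the permission table, the `Manifest` shape, and `apply_as` are hypothetical, and how a real apply behaves when one object in a repo is rejected is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Manifest:
    kind: str
    namespace: str  # empty string means the object is cluster-scoped
    name: str


# Hypothetical stand-in for RBAC rules:
# service account -> set of (kind, namespace) pairs it is allowed to manage.
Permissions = Dict[str, FrozenSet[Tuple[str, str]]]


def apply_as(
    service_account: str,
    manifests: Iterable[Manifest],
    permissions: Permissions,
) -> Tuple[List[str], List[str]]:
    """Split a repo's manifests into those the identity may apply and those denied."""
    allowed = permissions.get(service_account, frozenset())
    applied: List[str] = []
    denied: List[str] = []
    for manifest in manifests:
        permitted = (manifest.kind, manifest.namespace) in allowed
        target = applied if permitted else denied
        target.append(f"{manifest.kind}/{manifest.name}")
    return applied, denied
```

With a table that only grants a tenant rights in its own namespace, a cluster-scoped object or another team's deployment placed in that tenant's repo ends up in the denied list, which is the isolation property described above.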

How are you able to extend outside the cluster to non-clustered resources? [20:46]

Wesley Reisz: Now what about extending beyond the cluster? Those are the benefits on the cluster itself. How do you, I guess, extend some of that, like the security model, for example, when you're talking about non-clustered resources? How do you deal with that? How do you think about that? What are some of your considerations?

Stefan Prodan: As part of our GitOps Toolkit, we have built a series of libraries and best practices on how to extend Flux with a new controller. So, if you use controller-runtime together with our libraries, it's quite easy to extend Flux to do other things than what Flux does today. Flux, by default, is in charge of dealing with Kubernetes objects and the Kubernetes cluster state. But using our libraries, you can extend Flux beyond Kubernetes. For example, Weaveworks is currently developing a controller called tf-controller, which is a Terraform extension for Flux. What that means is that you place your Terraform definitions in a Git repo. You use the source-controller as before to connect to the Git repo, maybe using authentication, maybe ensuring that whoever committed changes into that repo is actually an authorized person, let's say using OpenPGP and commit signing. Flux has no idea what to do with Terraform.

It only knows Kubernetes YAMLs and Kubernetes objects, but this Terraform controller, built on the same foundation as Flux, reacts to events: "Oh, there is a new change in a Git repo, let me take that Terraform change, create a plan for it, apply it outside Kubernetes to whatever it needs to do, maybe create roles, create S3 buckets, or whatever it has to do, and then report back inside the cluster what changes were made and how." And from there, you can use Flux's notification-controller to funnel those events back out to the end users. There are many other technologies right now which try to use Kubernetes as an API for driving changes outside Kubernetes itself. There are Pulumi and others. There are so many things there. To make all of these work with Flux, either you expose a custom resource and that custom resource tells Flux how to change those things and what type of tool it should use to reconcile stuff, or you build your own controller and you extend Flux in that way.
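The react, plan, apply, and report cycle described for a controller that manages resources outside the cluster can be sketched like this. It is a toy model and not tf-controller or Terraform: `ExternalReconciler` is a hypothetical class, the in-memory dictionary stands in for real external infrastructure, and the plan notation (`+`, `~`, `-`) is only illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExternalReconciler:
    """Toy controller that reconciles external resources from source revisions."""
    external: Dict[str, str] = field(default_factory=dict)  # stand-in for cloud state
    last_applied: str = ""
    events: List[str] = field(default_factory=list)

    def plan(self, desired: Dict[str, str]) -> List[str]:
        """List the changes needed: '+' create, '~' update, '-' delete."""
        changes = []
        for name, spec in sorted(desired.items()):
            if name not in self.external:
                changes.append(f"+ {name}")
            elif self.external[name] != spec:
                changes.append(f"~ {name}")
        for name in sorted(self.external):
            if name not in desired:
                changes.append(f"- {name}")
        return changes

    def on_revision(self, revision: str, desired: Dict[str, str]) -> None:
        """React to a new source revision: plan, apply, then report an event."""
        if revision == self.last_applied:
            return  # nothing new in the source, so nothing to do
        changes = self.plan(desired)
        self.external = dict(desired)  # "apply" the plan to the external system
        self.last_applied = revision
        self.events.append(f"{revision}: applied {len(changes)} change(s)")
```

The `events` list plays the role of the reports that get funneled back to users; a fuller sketch would also re-check the external state periodically, since changes can happen there without a new revision.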

What are your top priorities as you approach GA with Flux v2? [23:03]

Wesley Reisz: So earlier you talked about the road to GA for Flux v2, when do you think v2 will be ready for GA?

Stefan Prodan: We are getting closer. The security audit was a major milestone. We got all these eyes inside the project, we have an RFC process, and we got major adoption from cloud vendors and end users. We are getting closer to GA. What we still want to do is get more feedback on our current APIs. Our current APIs are at v1beta1 and v1beta2. That means that for people who are currently using version two in production, even if we released GA tomorrow, those APIs would suffer no breaking changes. And even if there are breaking changes, we'll offer an automatic conversion from v1beta2 to v1 final, or whatever the API versioning is. So for me, GA is more about having awesome docs on everything that Flux can do, and we are still working on that, and having better observability for everything that Flux does.

What we are working on right now, our top priority, is observability. We offer all kinds of things; for example, we offer Prometheus metrics for everything that we do. Now, what we are trying to do is have a better eventing system: emit events only once and with the right information. Right now we are emitting way too many events, so we need to trim those down and make them as clear as possible for users.

Stefan Prodan: And besides the observability part, we are also consolidating our source code. With so many controllers, they share a bunch of code, but, you know, as we develop, we don't build libraries first. We build the actual projects and then we consolidate the source code into libraries. And that's the last mile of what we are trying to do: consolidate the code, add better observability to it, and, yeah, write developer documentation. We have a lot of people asking us, "Hey, how can I extend Flux?" And we say, "Hey, go to this project and look at the source code." Well, that's not the best approach, right? So we are working right now on having documentation for developers, not only for end users. We have vast documentation for end users and almost none for developers. And I think it shows maturity when you have that. We really want to get to that point before going GA.

Wesley Reisz: Well, Stefan, Flux is an awesome tool. Thank you for taking the time to chat with us and talk a bit about it.

Stefan Prodan: Thank you very much for inviting me. That was really fun.

About the Author

Stefan Prodan

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.