
Software Supply Chain Management with Grafeas and Kritis


Summary

Aysylu Greenberg discusses the goals for Grafeas and Kritis used to secure a company's software supply chain, and concludes with the details of current and future development.

Bio

Aysylu Greenberg is a Sr Software Engineer at Google working on infrastructure. In her spare time, she ponders the design of systems that deal with inaccuracies, enthusiastically reads CS research papers, and dances.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Greenberg: I'm going to talk about two really exciting projects called Grafeas and Kritis. Hopefully, you'll perk up a little bit and enjoy the talk. I'm Aysylu Greenberg, a senior software engineer at Google. I've worked on a variety of infrastructure projects, such as search infrastructure, developer tools infrastructure, and infrastructure for Google Drive. Currently, I work on cloud infrastructure. I'm the Eng Lead of the two projects that we'll talk about today. They're both open-source projects and they're called Grafeas and Kritis. I'm also on Twitter as @Aysylu22. If you're wondering, 22 is not my age.

Let's dive into this. Today the talk will be in four parts. First, we'll talk about software supply chain management and make sure that we're on the same page about what this means. Then we'll talk about Grafeas and Kritis and how they fit into the software supply chain. Then next, we'll talk about the upcoming release, 0.1.0, that we're hoping to release in this quarter and the future of the projects. Let's get started.

Software Supply Chain Management

As you might guess, Google runs on containers, and every week we deploy over 2 billion of them. We have a pressing need to understand what happens to the containers, what happens to the code that gets deployed and runs, and where the code that we just wrote is. We need a lot of observability around it. Software supply chain management is very similar to the food supply. With food, you plant the seeds, you grow them, you harvest them, you make some food, and you deliver it to the dinner table. In the software supply chain, you write some code, and then you as a developer check in the code, then you build the image, the container, the binary, and you test and verify it. Often, this is automated using CI pipelines.

Then you run some QA testing on it - it could be manual, or it could be automated with canary services. Then finally, you deploy it to production. This is often automated using continuous delivery pipelines. Just like with food, we often wonder where it comes from. What country? Is it organic? Is it vegan? Is it gluten-free? We ask the same kinds of questions about software: what happens to the code from the time it's written and submitted to source control to the time it's deployed?

What about third-party dependencies? We know even less about them. We just rely on them to work. Are there any vulnerabilities inside them? Can we trust the code that we depend on? What is the overall chain from the time that we add it as a dependency to the time it gets deployed? Are we compliant with regulations, and so on?

We need a central governance process that doesn't slow down development, the developer velocity. We also need a good look, like X-ray vision, into everything that happens from the time code is written or dependencies are added to the time that we get to production. We use CI/CD pipelines to automate a lot of this, but we also need observability tools around that, not just for testing and deploying but also for how it all relates. Who submitted that code? Why do we believe that we can trust this build, and so on?

This is where Grafeas and Kritis come in, with the open-source-focused ecosystem that we're building. We'll have an engineer who builds and deploys code, and she sends it to the CI/CD pipelines. They do the secure build process, automated tests, scanning for vulnerabilities and analysis, and then the deployment of that code. There are a lot of different vendors that offer good solutions for CI/CD pipelines. The problem is that we want a centralized knowledge base for this information, so, regardless of which vendors you are using, it would be good to have open metadata standards so that you can define what it means to have build metadata and test metadata. That's where the Grafeas project comes in. It acts as a centralized metadata knowledge base; it will have information about vulnerabilities in your artifacts, build information, and so on.

For deploy checks, we want to make sure that they pass based on our policies. We would rather codify our policies in config so we can control and review the changes as they happen. That's where Kritis comes in - Kritis is an admission controller that, when you're deploying to Kubernetes specifically, runs the policy checks that your cluster admin defines, then denies the pod from being launched if it finds very severe vulnerabilities in your image or it doesn't trust the image location. If everything is good, it will deploy it to production. How many of you use Kubernetes to deploy? Two-thirds of the room. You're in the right talk; we'll talk a lot about Kubernetes because that's what Kritis does: deploy-time checks and policies.

Kritis itself doesn't store vulnerability information. It just does the policy checking. It has a lot of logic in it, so it talks to the Grafeas metadata API to find out, given this container, what vulnerabilities are in it. What severity do they have? Given this container, where did it come from? Do we trust it? That's how Grafeas and Kritis fit into the overall software supply chain.

These projects have existed for a while and they're not just a 20% project. They are actually being used in production, and Google has internal implementations of them that are available on Google Cloud Platform. Grafeas is available as Container Registry vulnerability scanning, and Kritis is available as Binary Authorization. They are being used internally and there are internal implementations of this, so we know that it works in production.

Kritis

We talked about software supply chain management. Now let's talk more about Grafeas and Kritis and dig into the details of them.

Kritis was developed open source first. All the code and all the history of the code is available on GitHub under grafeas/kritis. In the software supply chain, it fits right at the very end, at deploy time. When you're deploying to production, it verifies the request against the policies that you have, then chooses to deploy it or reject the deploy request.

Let's walk through an example. Since two-thirds of you are familiar with Kubernetes, a lot of these concepts will be familiar. If you're not, don't worry, because this will be a pretty high-level overview with just the details that are useful for understanding how Kritis works. Imagine we're deploying an eCommerce website, so we run kubectl apply on the site YAML, which defines our pod with the image that we're deploying. Overall it looks like sending a request to Kubernetes. Kritis is installed inside your Kubernetes cluster, and we install it using the helm install command. Now it's just running inside your Kubernetes cluster.

The request comes in and an admission request is sent with the pod spec. It gets picked up by the webhook, which is implemented by Kritis, and the webhook reviews this request against a set of policies that we define. Imagine that we have an image security policy. This kind of policy basically says, "Make sure that the container in the pod that we are trying to launch has had vulnerability checks and satisfies the policy, meaning it doesn't have any severe vulnerabilities in it." Now we'll define a policy for prod and one for QA. If you were in Katarina's [Probst] talk, she talked about namespaces; the policy is namespace-scoped.

Now, the request goes through the image security validator, which validates the request against the policy. It fetches the metadata from the Grafeas API, which has some sort of database backing it. We'll talk more about the pluggable backend storage for Grafeas in a few moments. Imagine the image was built and pushed, but no vulnerability scan was done for it. We want to make sure that the container has actually gone through vulnerability scanning, so we reject this pod and do not launch it. Some time passes, and vulnerability scanning has had a chance to catch up and inspect the container. Then it says, "I found a CVE, a vulnerability." This is how we refer to classes of vulnerabilities that we find. There's an open database of CVEs; CVE stands for Common Vulnerabilities and Exposures. Each one is identified by the year it was found and a number attached to it. The vulnerability is now in our database, so the Grafeas API fetches it and returns it, and again we won't launch the pod because the image has a vulnerability in it.

Then we actually inspect this. Vulnerability analysis is a very hard problem, and often it's better to have false positives than false negatives - it's better to be safe than sorry. It's better for the scanner to find a vulnerability and for you to then say, "Actually, it doesn't apply to me," than for it to miss a serious vulnerability in your container. So we say, this vulnerability doesn't apply to me because of something that I know about my application, and we whitelist it.
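The three outcomes in this walkthrough (not scanned yet, vulnerability found, vulnerability whitelisted) can be sketched in a few lines of Python. This is only an illustration of the decision order described in the talk, not Kritis code: `review_image`, the `scans` dictionary standing in for the Grafeas lookup, the image reference, and the CVE identifier are all hypothetical.

```python
from typing import Optional


def review_image(
    image: str,
    scans: dict[str, Optional[list[str]]],
    whitelisted_cves: set[str],
) -> tuple[bool, str]:
    """Return (allowed, reason) for a pod that wants to run `image`.

    `scans` is a hypothetical stand-in for the metadata lookup: it maps an
    image to the CVE IDs found in it, and has no entry for an image that
    has not been scanned yet.
    """
    found = scans.get(image)
    if found is None:
        # No scan on record: reject rather than assume the image is clean.
        return False, "no vulnerability scan recorded"
    blocking = [cve for cve in found if cve not in whitelisted_cves]
    if blocking:
        return False, "unresolved vulnerabilities: " + ", ".join(sorted(blocking))
    return True, "admitted"


image = "registry.example/shop@sha256:abc"  # hypothetical image reference
scans: dict[str, Optional[list[str]]] = {}

before_scan = review_image(image, scans, set())
scans[image] = ["CVE-2099-0001"]  # made-up CVE ID reported by the scanner
after_scan = review_image(image, scans, set())
after_whitelist = review_image(image, scans, {"CVE-2099-0001"})

print(before_scan, after_scan, after_whitelist, sep="\n")
```

Treating a missing scan as a rejection mirrors the "better safe than sorry" stance in the talk: an unscanned image is not the same as a clean one.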

After we whitelist it, the pod is admitted. Now our application is running, our eCommerce website is up, so let's scale it up. We scale it with a kubectl command and ask for four replicas. The second replica comes up - ok, good, and we're waiting for the third and fourth to be launched. Then a new vulnerability is found. Vulnerability scanning, whether you pay a vendor or do it yourself, is constantly updating, because new vulnerabilities are found on a daily basis. You're constantly checking whether they affect your container or not. Does that mean that we can't launch the third and fourth replicas and we're just stuck? We don't want that. Because we just confirmed that this pod is fine to run, we want to scale up our application and then figure out what happened without disrupting our eCommerce website. What we do here is use attestations, and Kritis is very clever about that.

Taking a step back, the first time we admit the pod, we say: anytime we admit an image, we're going to record that we admitted it. Then anytime you want to scale up, or when pods get restarted because things happen, the image will always get admitted, as opposed to being blocked as soon as vulnerability scanning gets updated. Kritis has an attestor inside it that you specify using the attestation authorities you configured. It writes an attestation through the Grafeas API, which stores it in the database. Anytime a new pod comes up and we find a vulnerability, we just retrieve the attestation from Grafeas and say, "But I did say that this image is admitted, so continue scaling up." That's how we are able to scale up, and then later we'll inspect it.
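A minimal sketch of the attestation idea follows, under loud assumptions: `AttestationStore` is a hypothetical in-memory stand-in for attestations stored through the Grafeas API, and an HMAC from the Python standard library stands in for the key-based signing that an attestation authority would perform. The point is only the control flow: an image that was admitted once stays admissible even after the policy result changes.

```python
import hashlib
import hmac


class AttestationStore:
    """Hypothetical in-memory stand-in for stored attestations."""

    def __init__(self, signing_key: bytes) -> None:
        self._key = signing_key
        self._signatures: dict[str, str] = {}

    def _sign(self, image_digest: str) -> str:
        # HMAC is a simplification; it stands in for real key-pair signing.
        return hmac.new(self._key, image_digest.encode(), hashlib.sha256).hexdigest()

    def attest(self, image_digest: str) -> None:
        self._signatures[image_digest] = self._sign(image_digest)

    def is_attested(self, image_digest: str) -> bool:
        stored = self._signatures.get(image_digest)
        return stored is not None and hmac.compare_digest(stored, self._sign(image_digest))


def admit(image_digest: str, passes_policy: bool, store: AttestationStore) -> bool:
    """Admit if the image was attested earlier, or if it passes policy now."""
    if store.is_attested(image_digest):
        return True  # admitted before: scaling up and restarts keep working
    if passes_policy:
        store.attest(image_digest)  # record the admission for later replicas
        return True
    return False


store = AttestationStore(b"demo-key")  # hypothetical key material
first_replica = admit("sha256:aaa", passes_policy=True, store=store)
# A new vulnerability appears, so the policy check now fails for this image.
third_replica = admit("sha256:aaa", passes_policy=False, store=store)
unknown_image = admit("sha256:bbb", passes_policy=False, store=store)
print(first_replica, third_replica, unknown_image)
```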

Now, do we never look at new vulnerabilities? That would be bad. What if a bug like Heartbleed is discovered? It's a vulnerability you want to know about. We have a background cron job that inspects the running pods periodically and says, "For all the running pods, I'm going to check them against the image security policy." Then it adds labels and annotations to mark a pod as no longer satisfying the policy, even though it's been admitted. Then the cluster admin can react to that.
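The periodic re-check can be pictured as a loop that flags pods without stopping them. The sketch below assumes pods are plain dictionaries and uses a made-up label key, `example.com/policy-violation`; the label and annotation names the real project applies are not given in the talk.

```python
from typing import Callable

VIOLATION_LABEL = "example.com/policy-violation"  # hypothetical label key


def recheck_running_pods(
    pods: list[dict],
    satisfies_policy: Callable[[str], bool],
) -> list[str]:
    """Label every running pod whose image no longer satisfies the policy.

    Pods are only marked, never removed: reacting is left to the cluster admin.
    """
    flagged = []
    for pod in pods:
        if not satisfies_policy(pod["image"]):
            pod.setdefault("labels", {})[VIOLATION_LABEL] = "true"
            flagged.append(pod["name"])
    return flagged


pods = [
    {"name": "shop-1", "image": "shop:v1"},
    {"name": "cache-1", "image": "cache:v3"},
]
# Pretend a new vulnerability was just published for shop:v1.
flagged = recheck_running_pods(pods, lambda image: image != "shop:v1")
print(flagged)
```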

Let's talk a little bit about the terminology of Kritis. It uses the Grafeas metadata API, as we saw, to retrieve vulnerability information and also to store and retrieve attestations for the already admitted images. It also uses custom resource definitions. They are extensions of the Kubernetes open-source API, and they're used to store enforcement policies as Kubernetes objects. They're really cool in how they work, and that's what allows Kritis to run seamlessly inside your Kubernetes cluster. We'll take a look at the definitions of policies in a moment. Another thing that Kritis uses is a validating admission webhook, which is basically an HTTP callback that receives admission requests and decides whether to accept or reject each request while enforcing custom admission policies.
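To make the webhook callback concrete, here is a sketch of the decision step only, with no HTTP server. It follows the general shape of a Kubernetes AdmissionReview exchange: the request carries a `uid` and the pod object, and the response echoes the `uid` with an `allowed` flag. The function name `review_admission` and the `image_allowed` callback are hypothetical, and this is not how Kritis itself is written.

```python
from typing import Callable


def review_admission(admission_review: dict, image_allowed: Callable[[str], bool]) -> dict:
    """Build an AdmissionReview response for a pod admission request."""
    request = admission_review["request"]
    containers = request["object"]["spec"]["containers"]
    rejected = [c["image"] for c in containers if not image_allowed(c["image"])]

    response = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {
            "code": 403,
            "message": "images rejected by policy: " + ", ".join(rejected),
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "1234",  # placeholder request ID
        "object": {"spec": {"containers": [{"name": "web", "image": "shop:v1"}]}},
    },
}
denied = review_admission(incoming, lambda image: False)
accepted = review_admission(incoming, lambda image: True)
print(denied["response"], accepted["response"], sep="\n")
```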

Here's what a generic attestation policy looks like. We have a CRD, and then we have the kind, GenericAttestationPolicy, which we define separately, and then we have the name for it so we can distinguish it from all the other policies that we might have, because we might have many different types of policies for compliance. Then we have the spec, which says, "These are the attestation authorities I trust." Now the cluster admin can say, "If I verified this image myself, then just trust it anytime we launch it."

Attestation authorities look like this. Again we have the kind, AttestationAuthority, then we give it a name, and then we have some private and public key information. Don't worry about the note reference; it's an implementation detail. The most important part is that we have the private key and the public key data stored in there, so we can show proof that the image has been admitted by the right person.

The image security policy is what we looked at in the example. It has the kind, ImageSecurityPolicy, and the name, and then it says that if you are running the nginx image, just whitelist it. We don't care about the vulnerabilities in there; we just trust it to run correctly. The maximum severity we're willing to tolerate is medium, so anything above that we'll reject right away. Then we whitelist some of the vulnerabilities, saying, "We know this doesn't affect us, so it's ok, we can keep running."
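The three knobs described here (a whitelist of trusted images, a maximum tolerated severity, and a whitelist of specific CVEs) can be modelled as follows. The field names, the severity scale, and the CVE identifiers are assumptions made for this sketch; the actual CRD fields appear only on the slide, which is not part of the transcript.

```python
from dataclasses import dataclass, field

# Assumed severity scale for the sketch; higher rank means more severe.
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


@dataclass
class ImageSecurityPolicy:
    name: str
    image_whitelist: set[str] = field(default_factory=set)
    maximum_severity: str = "MEDIUM"
    whitelist_cves: set[str] = field(default_factory=set)


def violations(
    policy: ImageSecurityPolicy,
    image: str,
    vulnerabilities: dict[str, str],
) -> list[str]:
    """Return the CVE IDs that should block `image` under `policy`.

    `vulnerabilities` maps a CVE ID to its severity for this image.
    """
    if image in policy.image_whitelist:
        return []  # trusted image: vulnerabilities are not considered
    limit = SEVERITY_RANK[policy.maximum_severity]
    return sorted(
        cve
        for cve, severity in vulnerabilities.items()
        if cve not in policy.whitelist_cves and SEVERITY_RANK[severity] > limit
    )


policy = ImageSecurityPolicy(
    name="prod",
    image_whitelist={"nginx"},
    maximum_severity="MEDIUM",
    whitelist_cves={"CVE-2099-0002"},  # made-up ID we decided does not affect us
)
trusted = violations(policy, "nginx", {"CVE-2099-0009": "CRITICAL"})
blocked = violations(
    policy,
    "shop",
    {"CVE-2099-0001": "HIGH", "CVE-2099-0002": "HIGH", "CVE-2099-0003": "MEDIUM"},
)
print(trusted, blocked)
```

An empty result means the pod can be admitted; anything else lists what the cluster admin would need to fix or whitelist.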

Grafeas

We talked about Kritis, now let's talk about Grafeas and what it does. Grafeas was also developed open-source first; all the commit history is on GitHub if you'd like to take a look. Where does it fit in the software supply chain? It represents all of the different steps. It's specifically meant to be a universal metadata API, so it can store information about the source code, the deployments, who submitted code and when, the test results, and so on. It's able to represent every single stage in the software supply chain.

You heard me say "artifact metadata API" a lot of times, so let's unpack that a little bit. Artifacts are images, binaries, packages, any of these; we'll just call them artifacts. They are files that are generated as outputs of, for instance, your build process. Metadata is build, deployment, vulnerability, anything that you care to represent and keep track of in your software supply chain. The API allows you to store and retrieve metadata about artifacts.

Let's talk a little bit about the terminology of Grafeas and how we represent and think about this. Notes are high-level descriptions of types of metadata. For instance, we looked at CVEs, common vulnerabilities and exposures; those are represented as vulnerability notes. Every vulnerability that we know about through open databases is stored as a vulnerability note. Occurrences are instances of those notes in a specific artifact. Say you found a vulnerability in an image; you store it as an occurrence of that vulnerability. We also think about providers and consumers, because that allows you to rely on third-party providers to do some analysis for you while you just read the results.

Let's take a closer look. We have Grafeas in the middle, and a provider would be vulnerability scanning. Say you pay a vendor to look through your containers and tell you what vulnerabilities you have. They store vulnerability notes for all the vulnerabilities that are known out there. The scanner also looks through your containers and tells you what vulnerabilities it found in those images, so it stores the occurrences for those containers. Kritis, for instance, would be a consumer in this case: all it does is read vulnerability occurrences for the container and then decide what to do with them. It doesn't reason about how bad a vulnerability is. All of that is done by the provider, the vulnerability scanner. That's stored through the Grafeas API, where it can be retrieved by the consumer.
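The note/occurrence split and the provider/consumer roles can be sketched with a small in-memory model. Everything here is illustrative: `MetadataStore` is a hypothetical stand-in for the Grafeas API, the `projects/.../notes/...` naming is an assumed convention, and the CVE ID and resource URL are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """A kind of finding that can exist, e.g. one known CVE."""
    name: str
    kind: str
    severity: str


@dataclass(frozen=True)
class Occurrence:
    """One instance of a note in one specific artifact."""
    note_name: str
    resource_url: str


class MetadataStore:
    """Hypothetical in-memory stand-in for the metadata API."""

    def __init__(self) -> None:
        self.notes: dict[str, Note] = {}
        self.occurrences: list[Occurrence] = []

    def create_note(self, note: Note) -> None:
        self.notes[note.name] = note

    def create_occurrence(self, occurrence: Occurrence) -> None:
        if occurrence.note_name not in self.notes:
            raise KeyError(f"unknown note: {occurrence.note_name}")
        self.occurrences.append(occurrence)

    def severities_for(self, resource_url: str) -> list[str]:
        """What a consumer reads: severities of findings for one artifact."""
        return [
            self.notes[o.note_name].severity
            for o in self.occurrences
            if o.resource_url == resource_url
        ]


store = MetadataStore()
image = "https://registry.example/shop@sha256:abc"  # hypothetical resource URL

# Provider side: the scanner publishes what it knows and what it found.
store.create_note(Note("projects/scanner/notes/CVE-2099-0001", "VULNERABILITY", "HIGH"))
store.create_occurrence(Occurrence("projects/scanner/notes/CVE-2099-0001", image))

# Consumer side: read the findings and decide; severity was set by the provider.
found = store.severities_for(image)
clean = store.severities_for("https://registry.example/cache@sha256:def")
print(found, clean)
```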

A couple of other terms are useful. Resource URLs are identifiers for artifacts in occurrences, and they are generally unique per component within your software supply chain. For instance, Debian images, Docker images, or generic files will have some sort of resource URL associated with them that you can refer to throughout your system.

We also have kind-specific schemas, which are strict, very structured schemas that allow us to represent the information across all the different vendors in a uniform way. It doesn't matter if you're using one continuous integration pipeline vendor and then you switch to another one; you can still represent the metadata. Or if you're using different vendors for CI/CD pipelines and vulnerability scanning, you can represent it all using Grafeas schemas, using Grafeas metadata kinds.

For instance, a deployment note just has a resource URL inside it to represent what is being deployed, and then the occurrence has the user email of whoever on the team deployed it, the deploy time, the time it got undeployed, and the resource URI that it's attached to. No matter what delivery system you're using, any of them can be represented in this way. This is really meant to be an open metadata standard.
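One way to picture a strict, kind-specific schema is a table of required fields per kind, checked before anything is stored. The kinds and field names below are assumptions chosen to match the deployment example in the talk, not the actual Grafeas schema definitions.

```python
# Hypothetical required fields per occurrence kind.
REQUIRED_FIELDS = {
    "DEPLOYMENT": {"user_email", "deploy_time", "resource_uri"},
    "VULNERABILITY": {"cve_id", "severity"},
}


def missing_fields(kind: str, details: dict) -> list[str]:
    """Return the required fields that `details` lacks for this kind."""
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown kind: {kind}")
    return sorted(REQUIRED_FIELDS[kind] - set(details))


# The same structure works whichever delivery system produced the record.
deployment = {
    "user_email": "dev@example.com",
    "deploy_time": "2019-06-24T10:00:00Z",
    "resource_uri": "https://registry.example/shop@sha256:abc",
    "undeploy_time": None,  # still running, so not yet set
}
complete = missing_fields("DEPLOYMENT", deployment)
incomplete = missing_fields("DEPLOYMENT", {"user_email": "dev@example.com"})
print(complete, incomplete)
```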

If you're interested in contributing to Grafeas, let's talk a little bit about the architecture and how we think about the development of the project going forward. We have the Grafeas API in the middle, in green. It provides the note and occurrence kinds, the schemas for them, and the API methods to store and retrieve them. Below it is the database, the storage backend. Those bindings live in separate projects; they're not part of the API itself. For instance, we provide a Postgres backend as an example, but your team may be using MongoDB or MySQL or something internal; we use Spanner for the internal product. The Grafeas API can be backed by any database you want. If you'd like to contribute a backend, I'm very happy to hear about it and accept it, and it will live outside of the Grafeas project itself. Then clients are used to store and retrieve notes and occurrences, and they're currently provided by the Grafeas team in a separate GitHub project because, again, they're not part of the Grafeas API itself. Then the system will be provided as part of the core Grafeas project because we get into strong access controls.
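The pluggable-backend idea amounts to the API layer depending only on a small storage interface, with each database supplying its own binding. The sketch below uses hypothetical names (`Storage`, `InMemoryStorage`, `MetadataAPI`) and an in-memory dictionary in place of Postgres or any other database; the real Grafeas interfaces are not described in the talk.

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """What the API layer needs from any backend."""

    @abstractmethod
    def put(self, name: str, record: dict) -> None: ...

    @abstractmethod
    def get(self, name: str) -> dict: ...


class InMemoryStorage(Storage):
    """Stand-in binding; a database binding would implement the same methods."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def put(self, name: str, record: dict) -> None:
        self._records[name] = dict(record)

    def get(self, name: str) -> dict:
        return dict(self._records[name])


class MetadataAPI:
    """API layer: owns naming and schemas, delegates persistence."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def create_note(self, project: str, note_id: str, body: dict) -> str:
        name = f"projects/{project}/notes/{note_id}"  # assumed naming scheme
        self._storage.put(name, body)
        return name

    def get_note(self, name: str) -> dict:
        return self._storage.get(name)


api = MetadataAPI(InMemoryStorage())  # swap the backend without touching the API
note_name = api.create_note("scanner", "CVE-2099-0001", {"severity": "HIGH"})
print(note_name, api.get_note(note_name))
```

Because `MetadataAPI` never imports a concrete backend, a new binding can live in its own project, which matches the separation the talk describes.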

To sum up, Grafeas is an open artifact metadata standard. We've had contributions from the industry, from various partners. It's used to audit and govern your software supply chain without slowing down your development process. You throw in all the different metadata you care about, and then you're able to look at what happened throughout the whole process. It's a knowledge base for all your artifact metadata. We specifically focus on hybrid cloud solutions so that you can use it across on-premises and cloud clusters.

Finally, it's an API with pluggable storage backend, so it doesn't matter what your team is most familiar with in terms of storage backend. You can implement the bindings against the API and it would work well, so it's very universal. If you'd like to ask any questions about Grafeas, we have a Google group, Grafeas-users. If you'd like to contribute, we have a Grafeas dev Google group. I call meetings periodically for us to get together as a community, discuss future releases, discuss prioritization. If you're interested in contributing, please join and also, we have a Twitter account that we monitor actively, @Grafeasio if you have any questions.

Kritis & Grafeas 0.1.0

We talked about Kritis and Grafeas and how they fit into the software supply chain. Let's talk about the upcoming release, 0.1.0, which I'm very excited about. What will we add there? It's coming very soon; we're hoping to release it in Q2, and there are three goals for it. The first one is to enable users - you - to start experimenting with Kritis and Grafeas on your desktop or your laptop, to be able to do it on-premises, so that we can gather more community feedback and move towards a hybrid cloud solution. We really want to make sure that you can run Grafeas and Kritis anywhere, whether it's on-premises or combined with any of the cloud providers. It's meant to be an open standard for the industry. Once you are able to experiment with it, we would love to gather feedback from the community, because we would like the community's help to prioritize all the necessary features so that it continues to be most useful for the industry.

The scope is to have standalone Kritis on Kubernetes with standalone Grafeas: to bring up Kritis inside the Kubernetes cluster with a standalone Grafeas server, which talks to Postgres, also standalone, on your laptop. There are two user journeys that we think about. What you can do with this is see how a container is deployed to the Kubernetes cluster, and also see how a container that shouldn't be deployed because it violates some policy is actually blocked from being deployed, so that you know it actually works.

The features that we are going to add to Grafeas are a Helm chart to be able to bring it up as part of Kubernetes, a published image, a standalone Grafeas server with a Postgres storage backend, and basic support for a Go client library. Of course, we know that many of you might be using other languages like Python and Java, and we should definitely talk about how to prioritize them, but so far, the community feedback we've gotten is that Go client support is most needed by the people who voiced their preferences. First, we'll provide a good experience with the basic client library in Go and then expand to other languages, and contributions, of course, are very welcome in this area.

For Kritis, we are adding generic attestation policies, so as a cluster admin you can just say, "This image is good. Just deploy it and trust it," which simplifies some things while you are figuring out what to do for vulnerability scanning. We're also providing a default fallback policy, for when you don't have any of the policies defined, making sure the behavior is well defined. Finally, we're making Kritis configurable - again, to ensure that hybrid cloud support is feasible and easy to use.

If you'd like to learn more and follow along, please take a look at the Github repositories for Grafeas and Kritis. Take a look and join the Google groups that we have for Grafeas and Kritis users. If you're interested in contributing, please join grafeasdev where we'll have more information relevant to the developers. We are also online on Twitter @grafeasio.

I will end this talk with questions. How many of you see potential for using Grafeas and Kritis in your use case? I'm seeing a few hands. And how many of you are interested in contributing to Grafeas or Kritis? A couple of hands. We welcome all your contributions. The goal is to develop this with the industry and make this useful for the whole industry. The community feedback is very important because some things that I think are important might not be as high priority to other teams and companies. Let's get together and build the most useful thing and build the open standards for this.

Questions and Answers

Participant 1: It's about the structure that you have shown us. Which component is the vulnerability scanner in this structure? Can you use any scanner, or is it something inside the architecture? Can I use Nessus or something like that? Or is it something inside these components?

Greenberg: We are thinking about providing a scanning framework. We do have vulnerability scanners, the proprietary products for vulnerability scanning like on GCP, but we don't have that right now for Grafeas. Providing a scanning framework where any vendor can plug in, or where you can implement your own scanner because you have certain information about the vulnerabilities, is in the plans for Grafeas. It's not implemented yet, but it's definitely something that we are considering for the future.
