Transcript
Kumar: Some of these slides will contain drawings by me and some of them will be drawings by my son. This is a drawing by me. Does anyone know what this license is? It's a Creative Commons license. It's a permissive license that allows you to do things to this image. You can distribute it. You can share it with anyone you want. You can modify it. You can commercialize it. You can put it in a meme generator. You can create a meme out of it like I've done over here. You can share this meme if you want. Sky is the limit. You can do whatever you want with this picture.
I am Nisha Kumar. I am an open-source engineer at VMware. Part of my job is to review container images that product teams want to ship. Part of my job is also to define policies on how containers are built and shipped for products. In that vein, I also am the maintainer of a project called Tern. Tern is an open-source project that was originated by VMware to inspect containers for license information. This was drawn by my son. How do you know? I've just told you. I've told you that it has the copyright information and what license it is under.
When I picked out this license, I had a conversation with my son about how he wanted to distribute his image. My son, being eight, didn't really understand that his image might be shared by a lot of people along with the slides that get shared and he didn't understand really the implications of what would happen if somebody else were to get his image. There was a conversation about people sharing images and he had come to the conclusion that he did not want his images to be made into memes. That's why we picked this particular license because it allows you to share the image, but it will not allow you to modify the image or commercialize this image. You share it only with the license that you have here, and you maintain the copyright information. Anyway, that's licensing 101.
In a lot of ways, this particular image represents or illustrates the complexities of license compliance for anything including software in container images. The first step to finding out how you can distribute a container image is to figure out what piece of software belongs to who and how you are shipping it. I've already told you that the image on the left was made by my son and the image on the right was made by me and this is in a slide with a VMware copyright right over there, so it already gets complicated. You start asking questions about, "Can I share this slide? How can I share this slide? Can I take the images in the slide and stick it in my slides?" This is already getting to be complicated.
Building Container Images Using Dockerfiles
Container images are like that. In my day-to-day job, I often have to look at container images by product teams, and then answer a very simple question, "Is this ok for me to ship?" What I found out is that license compliance for container images is hard, but you will also find images for which it is near impossible to do, and this is why.
Container images are built using what's called a copy-on-write storage driver. What happens is that usually a container image builder will take a BaseOS and so that looks like a minimal Linux file system and then it spins up a container image. For this particular purpose, you can think of it as a [inaudible 00:04:45]. Then it runs installation scripts on it, whatever installation scripts you want, it doesn't matter. It's very easy to run any installation script you want.
What happens is that when you are modifying a file, if it already exists in the image in the layer that is underneath when you start off, it'll copy that whole file over to the next one. What you will find is copies of the same file except with your modifications. You're starting off with somebody else's files. Then on top of that, you are adding your files. You've added some files over with your installation stuff but you're still maintaining some of somebody else's files. What happens is that when you do a Docker push to a registry, you are pushing your files and somebody else's files. Depending upon how the registry implements its blob store, you are either responsible for the licenses of the files that got copied over or the licenses of all of the files and all of the layers. At the most basic case, this is hard for license compliance.
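The copy-on-write behavior described above can be sketched with plain dictionaries. This is a toy model rather than how any real storage driver is implemented: the `Layer` type, the helper names, and the sample file paths are all assumptions made for illustration.

```python
from typing import Dict, List

Layer = Dict[str, str]  # path -> file content (stand-in for real file bytes)


def write_file(layers: List[Layer], path: str, content: str) -> None:
    """Write to the top layer only; lower layers are never changed."""
    layers[-1][path] = content


def merged_view(layers: List[Layer]) -> Layer:
    """What a running container sees: upper layers shadow lower ones."""
    view: Layer = {}
    for layer in layers:  # iterate bottom to top
        view.update(layer)
    return view


def shipped_copies(layers: List[Layer], path: str) -> int:
    """How many copies of a path are stored across all layers."""
    return sum(1 for layer in layers if path in layer)


# Hypothetical base layer: somebody else's files.
base: Layer = {"/etc/os-release": "Alpine", "/bin/sh": "busybox"}
layers: List[Layer] = [base, {}]  # the empty dict is the new layer for your changes

write_file(layers, "/etc/os-release", "Alpine (patched)")  # modify an existing file
write_file(layers, "/app/server", "your binary")           # add a new file
```

After these two writes the modified file exists twice, once per layer, and the untouched `/bin/sh` from the base still travels with the image, which is why the licenses of the lower layers matter too.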
Docker Multistage Builds
Let's talk about Docker multi-stage builds because this is a pattern that is usually bandied about as the best pattern to use for building Docker images. What happens here is that you start off with that original image and then you make a new image that has somebody else's files but completely different from the one that you started off with and then you copy over files from one image into the other. That's a bunch of files that you have to meet license obligations for. Maybe this remains hard, like the previous slide that I've shown you, except that Docker throws that whole first container image away. What you're left with is just a bunch of files of unknown type and maybe unknown origin. If you were to just get this image, then it is impossible to find the licenses of those. You can do some searching around on Docker Hub maybe, but it is a lot of engineering effort to meet your license obligations for this kind of image.
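A small model of why the copied files end up with unknown origin, assuming package metadata lives only in the build stage that gets discarded. The `FileRecord` type, the helper names, and every path, package, and license below are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class FileRecord(NamedTuple):
    content: str
    package: Optional[str]  # owning package, if a package manager knows it
    license: Optional[str]  # license recorded for that package, if any


Image = Dict[str, FileRecord]


def copy_from_stage(builder: Image, final: Image, paths: Iterable[str]) -> None:
    """Model of copying files between stages: bytes move, metadata does not."""
    for path in paths:
        final[path] = FileRecord(builder[path].content, None, None)


def unknown_origin(image: Image) -> List[str]:
    """Files in the image that carry no license information."""
    return sorted(path for path, rec in image.items() if rec.license is None)


# Hypothetical build stage with some package metadata available.
builder: Image = {
    "/usr/lib/libexample.so": FileRecord("lib bytes", "libexample", "MIT"),
    "/go/bin/app": FileRecord("app bytes", None, None),
}
# Hypothetical final stage, started from a different minimal base.
final: Image = {
    "/etc/os-release": FileRecord("minimal OS", "base-files", "GPL-2.0-only"),
}
copy_from_stage(builder, final, ["/usr/lib/libexample.so", "/go/bin/app"])
```

Even the library whose license was known in the build stage arrives in the final image as an unlabeled file, so anyone who only receives the final image has to reconstruct that information by hand.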
How many of you are in an open-source program office? How many of you have been assigned to look at this by somebody else? This is typically the case with engineers. Some management person or some senior engineer, says, "You. You go do this." At this point, you might be thinking, "Why me?" Then you'll be thinking, "Why should I care? This stuff is open source. It's free and everybody does this. Why should I think about doing this license compliance stuff?"
At the individual level, I suppose I would talk about the ethical case, which is that you are using somebody else's code and you're going to get away with it because they don't have an army of lawyers coming after you. At a business level, this really becomes a problem. It is an enterprise problem because, guess what, enterprises want to sell you product. They want to sell you the best sandwich, the sandwich for which they know the farmer who made the bread and the farmer who grew the vegetables and the farmer that made the cheese. This is what enterprise customers buy. They buy a high quality, high curated, highly secure product from you. This is why they pay you the big bucks.
What happens when you are consuming containers from open-source or containers that you just build using a Dockerfile is that you don't necessarily give that kind of image to your customers. A discerning customer might come up and ask you, "Why is there a ketchup bottle in my sandwich?" This is the typical situation that we see at VMware when we're trying to evaluate open-source projects that product teams are using. This is typically the state of the container image that you get.
Over the course of working on Tern and using Tern to analyze container images, I have come across three different types of images. One is the panini, which has got all of those copy-on-write layers. It's just layering up with all of those things and then it gets very interesting and large. Then the next one is the taco, which is like the stripped down container image and you get like a BaseOS and then something in it. Then there's the weird candy one which is just like the one binary.
Inspecting Container Images with Tern
At this point, I will end the show and see if I can go back and forth with the demo. Let's get our feet wet. I am going to run Tern. I am going to log it. I am going to ask for a report. I am going to look at alpine:latest. There it goes. It does it really quickly because this layer is cached. Let's look at what we have here. This is the default report that Tern produces. What you will see is that it starts from the base layer and analyzes the image layer by layer. This only has one layer so that's all it's going to tell you about. It tells you what OS is there at the base. It tells you what commands it used to gather all the information.
The reason why it tells you that is because, unlike any other inspection tool, it wants you to be able to make engineering decisions about the data that it produces, and I will explain later. It'll become apparent why you need this information. It tells you that it's using the package manager to do this. It tells you that it doesn't have any information about copyrights, and it doesn't have any information about what the source packages are for these. It tells you a list of packages that are found and it'll tell you the licenses that it found in it. You will see at the bottom a list of licenses that it found.
Let's run something bigger. Let's do tern -l report -i. It's mongo:1.0.1 and then I'm going to output it to output.txt. Off it goes. Ok, it's working. It's going to take some time. Let's go back here and see. We're looking at the one with lots of layers. That's the panini. What you will find is that it will tell you that the layer that it is based on is a Debian Jessie. This is something that you will not know from just pulling this image from Docker Hub. This already tells you off the bat that it's based on a Debian image. It tells you what commands it was using to run it. It tells you the warnings. It also gives you all the packages that are installed. It tells you that it didn't find any licenses.
The reason why it tells you that is because there is no metadata for licenses in Debian-based images but the licenses are available in the copyright text and you can get the copyright text using the other formats that Tern gives you. Let's move on. You will find in this image that there are some layers that use the package manager to install software and so you will get all of the packages that were installed using that. It will also tell you commands that it did not recognize. For example, this one, modify some GPG keys, that's not actually installing any packages. No packages found.
Then there are some places where they copy a file from some external file system. You do not know anything about this just from looking at it. That's what it tells you. Like I said, you can find the license information using the other formats that it has. This one is called SPDX tag value. SPDX stands for Software Package Data Exchange. It's a specification to communicate licenses between developers and product teams and companies, etc. It should give you an overview of the licenses that govern the distribution of your packages.
If you look at this, you will already see off the bat how many layers there are in this image that I am running and there are 13, so they are large. You will also see the copyright texts that I was talking about and over here you will see that they've installed MongoDB, and the license that governs it is AGPL.
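An SPDX tag-value document is plain text with one `Tag: value` pair per line, so pulling out the declared license per package takes only a few lines. The sketch below is a minimal reader, not a full SPDX parser: it ignores multi-line `<text>` values, and the sample document is made up for illustration rather than taken from Tern's output.

```python
from typing import Dict, Optional


def package_licenses(spdx_text: str) -> Dict[str, str]:
    """Map each PackageName to its PackageLicenseDeclared value."""
    licenses: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in spdx_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        tag, value = (part.strip() for part in line.split(":", 1))
        if tag == "PackageName":
            current = value
            # NOASSERTION is the SPDX value for "no information given".
            licenses[current] = "NOASSERTION"
        elif tag == "PackageLicenseDeclared" and current is not None:
            licenses[current] = value
    return licenses


# Made-up sample; package names and licenses are illustrative only.
SAMPLE = """\
SPDXVersion: SPDX-2.1
# Package
PackageName: example-db
PackageLicenseDeclared: AGPL-3.0-only
# Package
PackageName: example-lib
PackageCopyrightText: NONE
"""

found = package_licenses(SAMPLE)
```

Packages with no declared license show up as `NOASSERTION`, which is a useful list of things to look into by hand.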
If you look at output.txt, you will see all the things that I have. There is no page down on this keyboard, so you'll just have to believe me when I say this works. Let me do the Docker images again. Let's look at tern -l. Let us look at that GCR image over here. I copy it. This time I won't print it out, I'll just run it there. This one should go fairly quickly and you will find that it is two layers deep, so only two layers. The first layer is Distroless, but it says there is no package manager for it. There is no package manager for Distroless images. It's only meant for the run time for your binary. It says there's Distroless but I don't know what the packages are. Then it says, "There's another one that got copied from outside, I don't know what this is."
This is the image which is the "Just Enough" image. Really, a container should have just your binary and just enough dependencies for your binary to run. This is the kind of image that we're talking about. In this image, at least at the default level, Tern cannot tell you anything because there is no package manager and it's just a bunch of files. It'll say, "cannot list the packages" and "unknown content included." This is where we have the mystery weird candy area. Over here, this is a functionality that we recently included in our latest release, so I'm excited to show you this. Tern can use an external tool to analyze container images. This is where I kind of need help.
ScanCode is a license scanning tool. You can use it to scan repo binaries as well as source code. We're going to use that to scan this. ScanCode does take a long time and I haven't implemented caching for this yet but what you will find when you run this is that it'll find all of the licenses in your Distroless image but it still cannot tell you what the license of that new binary that you added at the end is. Let me switch back and see what we have.
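The invocations used in the demo can be assembled programmatically, for example inside a build script. The helper below only builds the argument list and does not run Tern. The `-l report -i IMAGE` part comes from the demo; the `-f`, `-x`, and `-o` flags are assumptions based on Tern's documented options in later releases and may differ in the version shown, so check `tern report --help` before relying on them. The function name is hypothetical.

```python
import shlex
from typing import List, Optional


def tern_report_command(
    image: str,
    report_format: Optional[str] = None,  # e.g. "spdxtagvalue" (assumed -f flag)
    extension: Optional[str] = None,      # e.g. "scancode" (assumed -x flag)
    output_file: Optional[str] = None,    # e.g. "output.txt" (assumed -o flag)
) -> List[str]:
    """Build the argv for a Tern report without executing it."""
    cmd = ["tern", "-l", "report", "-i", image]
    if report_format:
        cmd += ["-f", report_format]
    if extension:
        cmd += ["-x", extension]
    if output_file:
        cmd += ["-o", output_file]
    return cmd


# Print the commands so they can be reviewed before anything is run.
print(shlex.join(tern_report_command("alpine:latest")))
print(shlex.join(tern_report_command("alpine:latest", extension="scancode",
                                     output_file="output.txt")))
```

Returning a list instead of a shell string keeps the image name from being interpreted by a shell if the list is later handed to `subprocess.run`.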
Let me go and look at the images again. To take a closer look at what these binaries are, you can actually look at the working directory that Tern operates in. I'm going to try this out on the weird candy type of container image. This one has a weird tag, so I'm just going to copy it and put it in here. I'm just going to run that and it's going to blurt out and say, "I don't know anything about this." But you can go to Tern and the temp directory and in this you will see the actual container image. It is just one layer long. This is the only place that you can go.
Then there's a folder called "contents" which is the untarred layer. You will find one binary that is 856 kilobytes long. It's a small binary but we don't know anything about it other than it saying BPF tool. Maybe some eBPF thing, but you don't know. We know it's a GoLang binary only because you can go to Calico's GitHub repository and see what it is over there. There's some engineering work that's required over here but if you were just to get this image, you cannot tell where it came from.
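Looking inside a layer by hand, as in the demo, comes down to reading a tar archive. The sketch below lists the regular files in a layer tarball with their sizes using only the standard library. The demo layer is built in memory with a made-up file name and payload, so nothing here reflects the actual image from the talk.

```python
import io
import tarfile
from typing import List, Tuple


def list_layer_files(layer_tar: bytes) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) for every regular file in a layer tarball."""
    with tarfile.open(fileobj=io.BytesIO(layer_tar), mode="r:*") as tar:
        return sorted((m.name, m.size) for m in tar.getmembers() if m.isfile())


def make_demo_layer() -> bytes:
    """Build a tiny one-file layer in memory (hypothetical content)."""
    payload = b"\x7fELF" + b"\0" * 60  # ELF magic bytes plus padding
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="mystery-binary")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


files = list_layer_files(make_demo_layer())
```

A listing like this tells you what is there and how big it is, but, as with the single binary in the demo, not who built it or under which license.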
What have we learned so far? We have learned that the panini one with lots of layers, that has a mix of software that is installed in various ways. You know some stuff and you don't know other stuff but, in general, you can reason about it with some engineering effort. There is the second one which is the taco, which is, you know something about where some of the stuff comes from but most of it is mystery. You don't really know. These are the ones that are difficult to reason about. Then there's the weird candy one which is just binary of unknown content and you cannot really discern where this comes from. That's Tern.
Is Tern Foolproof?
Is Tern foolproof? As you can see, it really isn't, but the message that I wanted to put forward with all of these knowns and unknowns is that, really, no tool is capable of giving you all of this information because that's how the binary is built and that's how the software is built. Any tool, whether it's a commercial tool that you're paying money for or an open-source tool, is only going to give you some of the information but not all of the information because that is how the software is built. It depends upon the supply chain.
Speaking of the supply chain, I'm an open-source project maintainer. I was lucky to have gone to All Things Open last year and I saw a keynote by this guy Henry Zhu. What I took away from the keynote was this thing where he said maintainership means burnout and anxiety. As an open-source maintainer myself, I have totally come across this issue where folks have filed an issue saying this needs to be fixed right away but I am just one maintainer so I can only fix it whenever I can fix it. This is what the situation is in general with open-source projects.
Lots of people are using open-source projects but there are very few people that are actually taking care of all these issues. As you've heard from previous talks, we've all been talking about the supply chain and we've been pointing out that 99% of software that we're using is open-source. This is what the supply chain is right now. There are very few people that are very tired trying to fix a lot of issues. License compliance is the least of their worries. The processes that they are using are just the processes that can get the thing done. They just want to get the thing done. They're not interested in license compliance, but enterprises are interested in license compliance. There is this disconnect where you are consuming stuff from a supply chain that is not compliant and you expect it to be compliant, and so this is untenable.
I want to take it back to when my son and I had this conversation. My son is eight, like I said. He doesn't know the world. He doesn't understand how things get used in the world. All he cares about is that people see his drawings and people say, "This is nice. I like this. He's so talented." I, as a more experienced person in the ways of the world, had talked to him about how people use things and he probably needs to think about that. In the same way, I would urge enterprises to get involved in the open-source projects that they consume, to actually have conversations with maintainers and talk about supply chain hardening and license compliance and how they can make their tools better produce deploy-ready software, which is something that they don't think about but I assure you that they would highly appreciate it if you can help them out with it. This is not just filing bugs. This is also submitting pull requests, clear code. I'm sure that they will appreciate it.
Takeaways
Some takeaways. Actionable items. People find creative ways to use containers. You can find container images being used as build stages for CI/CD pipelines like the Tekton talk. You can find container images being used in Makefiles and Makefiles being used in containers. Any kind of permutation and combination is possible because the tool just allows you to do that. Keep in mind that you will get some weird containers in your product.
Analyze any container that you consume and distribute. Like I said before, people find creative ways to use it so you will have to look for them very carefully in the source code that you're consuming. If you were to use containers from an open-source project, find the Dockerfiles for them and see what exactly the Dockerfile is doing.
Here are some tips on how to check Dockerfiles to see whether this is something that would build you a compliant container or not. In the FROM line, think about what exactly that FROM image is, what it is and where it came from and what's in it. In all the RUN commands, look at what exactly it's doing to run, what commands it's running, what is it [inaudible 00:29:14] getting from the internet, all of those things you need to be wary of.
Then for the COPY and ADD parts of it, you have to ask yourself what exactly is it copying into the container image and where does it come from? The ecosystem around the container image is just as important as the container image itself, so you need to inventory that environment as well. This is a lot of environments, so think about the software interactions that happen. What kind of toolchain are you using? Is your software binary statically or dynamically linked? How long are your dependency chains? How many dependencies does the project have? What licenses are in those dependencies? The build stages that containers use? There's the Docker-on-Docker thing that's going, so you have to think about, "What is that Docker build stage that is being used to make another Docker image?" It becomes like an inception thing that you have to worry about now. Think about the SDK that the developer is using to build. This is particularly important for GoLang toolchains because everything is statically compiled.
What you end up having are these binary blobs which you cannot reason about because the whole thing is statically compiled without any kind of information with the binary blob. Then the last one is getting involved in the open-source projects that you consume and help them out with their software supply chain hardening.
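The Dockerfile tips above (look at FROM, RUN, COPY, and ADD) can be turned into a rough first-pass check. This is a sketch under simplifying assumptions: it reads one instruction per line, does not follow line continuations, and treats a reference to an earlier build stage like any other unpinned image. The function name, the wording of the findings, and the sample Dockerfile are hypothetical.

```python
import re
from typing import List


def review_dockerfile(text: str) -> List[str]:
    """Flag Dockerfile lines that deserve a closer look for license compliance."""
    findings: List[str] = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, rest = line.partition(" ")
        instruction = instruction.upper()
        if instruction == "FROM" and "@sha256:" not in rest:
            findings.append(f"line {number}: FROM image is not pinned to a digest")
        elif instruction == "RUN" and re.search(r"\b(curl|wget|git clone)\b", rest):
            findings.append(f"line {number}: RUN fetches content from the network")
        elif instruction in ("COPY", "ADD"):
            findings.append(f"line {number}: {instruction} brings in files of unknown origin")
    return findings


# Hypothetical Dockerfile used only to exercise the checks.
SAMPLE = """\
FROM golang:1.13 AS build
RUN curl -sSL https://example.com/install.sh | sh
COPY ./app /app
"""

for finding in review_dockerfile(SAMPLE):
    print(finding)
```

A clean result does not mean the image is compliant. It only means none of these three patterns showed up, so the questions about where the base image and copied files come from still have to be answered by a person.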
Tern Features
Real quick, some Tern features that are notable and upcoming. Tern supports an option where you can give it a Dockerfile, it'll build the image for you and it'll analyze the image as well. You can use that in your CI/CD pipeline when you are building container images. It has many report formats, so you can not only do the default report but you can also do YAML, JSON and you can make your own reports if you need to plug that into a CI/CD pipeline or an audit system.
It supports external extensions. We have ScanCode. We've also integrated a tool called cve-bin-tool so it is possible for you to integrate security scanners as well, but the formatting is not supported because the metadata is different for security vulnerabilities than it is for licenses. Although, I get the feeling that that's going to change because there's a lot of overlap in these two areas.
Coming soon, there's a Dockerfile freeze option that's coming up that will basically try to pin all of your Docker parts, so the FROM will pin to a digest and all the packages that you're installing will pin to versions. It kind of gets you one step closer to having builds that are reproducible, like actually reproducible, not quite, but better than what it is right now. We work with other projects under the Linux Foundation's ACT project, which is Automated Compliance Tooling. These are projects like FOSSology, Quartermaster. Tern is the third one. We are also working with tools like oss-review-toolkit and ClearlyDefined, and these are other tools that help you meet compliance obligations.
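The FROM half of that pinning idea can be illustrated as a text rewrite. This is not Tern's implementation. It assumes the digest for each image reference has already been looked up and is passed in as a mapping, it does not handle FROM flags such as `--platform`, and the digest shown is a placeholder of zeros rather than a real one.

```python
from typing import Dict


def pin_from_lines(dockerfile: str, digests: Dict[str, str]) -> str:
    """Rewrite `FROM name:tag` to `FROM name@sha256:...` for known references."""
    pinned = []
    for line in dockerfile.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "FROM" and parts[1] in digests:
            reference = parts[1]
            # Only strip a tag from the last path segment, so a registry
            # port such as localhost:5000 is left alone.
            last = reference.rsplit("/", 1)[-1]
            repository = reference[: len(reference) - len(last)] + last.split(":", 1)[0]
            parts[1] = f"{repository}@{digests[reference]}"
            line = " ".join(parts)
        pinned.append(line)
    return "\n".join(pinned)


FAKE_DIGEST = "sha256:" + "0" * 64  # placeholder, not a real image digest
before = "FROM alpine:3.10 AS base\nRUN apk add curl"
after = pin_from_lines(before, {"alpine:3.10": FAKE_DIGEST})
```

Pinning the base image this way fixes which bytes the build starts from; pinning package versions in the RUN lines is a separate step that this sketch leaves out.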
Here are some resources that I have. This is in the slide, I'm sharing the slides with you, but there's some reading about how storage drivers work and the problems with tarballs that create issues with actual build reproducibility for container images. Here are some talks that help you understand license compliance for container images and the usual link to all the projects that I've talked about.
I haven't spoken about the Open Container Initiative but that is a group that is working towards trying to embed software metadata. All of this information about what packages are installed, what licenses, all of those will be embedded in the container image so it gets easier for folks who are using container images to reason about this information.
Questions and Answers
Participant 1: Does Tern act as a gate within your CICD pipeline if it doesn't meet certain enterprise policies that you have set in place?
Kumar: Auditing the report isn't part of the Tern ecosystem. That may change because that's a thing that people often ask us. That might be a complementary tool to this particular project. We need resources to have different GitHub repos for different kinds of projects but it's definitely something that we're considering.
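A complementary gate of the kind asked about could be a small script that runs after the report is produced. The sketch below is not part of Tern. It assumes the report has already been reduced to a mapping of package name to license, and the allow list, package names, and function names are hypothetical.

```python
from typing import Dict, List, Set


def policy_violations(package_licenses: Dict[str, str], allowed: Set[str]) -> List[str]:
    """Return the packages whose license is not on the allow list."""
    return sorted(
        name for name, license_id in package_licenses.items()
        if license_id not in allowed
    )


def gate_exit_code(violations: List[str]) -> int:
    """Non-zero exit code fails the pipeline step when anything is flagged."""
    return 1 if violations else 0


# Hypothetical policy and hypothetical report contents.
ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}
report = {
    "example-lib": "MIT",
    "example-db": "AGPL-3.0-only",
    "mystery-binary": "NOASSERTION",
}

violations = policy_violations(report, ALLOWED)
```

Treating an unknown license the same as a disallowed one is a deliberate choice here: the unknowns are exactly the files that need a person to look at them before shipping.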
Can I ask a question? I did ask a question about how many of you were told by somebody else to figure this out. Do you feel better or worse?
Participant 2: I feel worse because I'm the guy that did the telling. I had to ask one of my guys to go look into this and now I feel really bad because I've thrown him in the deep end.
Kumar: I think this is a problem all enterprises are experiencing right now and they're all trying to figure out what to do about it.
Participant 2: I think it extends a little farther than enterprises though. We're a very small company, 250 people, but we're trying to sell to enterprises. They want to know, "Do you have your shit together?"
Kumar: It's funny, it's not a hint that this is a question that many people will ask. I'm just curious, in what situation is everyone in?
Participant 3: I actually have a question. I think the software Tern is really interesting as it can generate a report. My company actually treats the license compliance as a very important thing. My software, the last release, if there was a little thing that we found out that might have an issue, we actually stop the release and have to fix it before we actually do any release. Then you can actually go to our website and say, "I want to download the license agreement for every single thing that we ship." Regarding the BPF there, the candy thing, I can manually go to their website, identify it, I know the license, now I think it's good. Is there any way that I can tell Tern to say, "From now on, when you see this file again, generate a report nicely for me?" We want it to give me an error message that I need to figure it out later on.
Kumar: There is a cache and you can do CRUD operations on the cache, and by that, I mean you can edit the cache file. It is possible, there's just no API for it. I hope that answers your question somewhat. It's one of those things where we have to say, "Ok, fine. We make the cache and then we create the API for the cache and that's an issue that we will take care of later because the cache is needed now for all of these big images that we're seeing." Yes, it's possible for you to say, "Ok, if you see this file again, this is the license for the file."
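The idea of "if you see this file again, this is the license" can be sketched as a lookup keyed by the file's content hash. This does not reflect the layout of Tern's actual cache file, which the talk does not describe; the helper names, the JSON storage format, and the sample bytes and license are all assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, Optional


def file_key(data: bytes) -> str:
    """Identify a file by the SHA-256 of its content, not by its path."""
    return hashlib.sha256(data).hexdigest()


def remember_license(cache: Dict[str, str], data: bytes, license_id: str) -> None:
    """Record a manually identified license for this exact file content."""
    cache[file_key(data)] = license_id


def lookup_license(cache: Dict[str, str], data: bytes) -> Optional[str]:
    """Return the remembered license, or None if this content is new."""
    return cache.get(file_key(data))


binary = b"\x7fELF hypothetical binary bytes"
cache: Dict[str, str] = {}
remember_license(cache, binary, "GPL-2.0-only")  # license value is made up

# The cache is plain data, so it can be saved, edited by hand, and reloaded.
reloaded = json.loads(json.dumps(cache))
```

Keying on content rather than file name means a renamed copy of the same binary is still recognized, while a rebuilt binary with different bytes correctly comes back as unknown.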