Transcript
Brikman: This talk is called "How to Test Infrastructure Code." I will go through some automated testing practices that we've found work well for tools like Terraform, Docker, Packer, and Kubernetes. There will be a lot of code, so get ready to read code, this is hands-on. I'll try to run some of the code, we'll see how that goes.
First, I'm going to start with a bit of an observation, something I've noticed about the DevOps industry, Ops, sysadmins, whatever you want to call them, and that is that we're all living in a bit of a world of fear. This is the predominant emotion that I'm seeing from most of the people that I work with. They're just living in fear - fear of things like outages, fear of security breaches and data loss, and just generally fear of change. People are constantly afraid to change things because they don't know how late they're going to be up, how bad it's going to be, or are just terrified.
We know what fear leads to. Fear leads to anger, anger leads to hate, hate leads to suffering - the great Scrum Master Yoda taught us these lessons. We all know what suffering leads to. Most teams seem to deal with this in two ways. One, a lot of drinking and smoking. Number two, deploying less and less frequently. It's scary, it's terrifying, so you just avoid it and you do it less and less often. Unfortunately, both of these solutions just make the problem much worse. Your releases get bigger, there's more risk. This actually makes the whole problem a lot worse, and then, you end up in this sort of rule: sixty percent of the time it works every time.
Automated Tests
I don't want to live in that kind of world. I think there's a better way to deal with this constant state of fear and that is automated testing. I don't want to make the claim that this is going to solve all the problems in the world, it's going to make all your fears go away, but automated tests do have one very interesting impact. When you see teams that do a good job with it, this is exactly what you see, which is, instead of fear, you start to see confidence. That's what tests are about. Tests are not about proving that your code works, they're not some perfect thing that says, "Yes, everything's great," they are about confidence. It's about emotions, it's about how you feel about making those changes. That's really important because you can fight fear with confidence. That's really the key.
We do mostly know how to write automated tests for application code. If you have an app built in Ruby, or Go, or Python, or any of these general-purpose languages, we more or less know how to test these things. But how do you test infrastructure code? If you have a whole pile of Terraform code, how do you know that the infrastructure it deploys works the way you expect it to? Or, if you have a pile of Kubernetes code, how do you know that the way it deploys your services is the way you actually need it to work? How do you test these things?
That's the goal of the talk. I'll share with you some ideas, some insights on how to test with some of these tools and we will look at a whole bunch of code. Hopefully by the end of it, you'll at least have some ideas of how to sleep better at night, how to be a little less afraid.
I am Yevgeniy Brikman; I also go by the nickname Jim, which most people find a little easier to pronounce. I'm the co-founder of a company called Gruntwork, and this is where a lot of this automated-testing experience comes from. At Gruntwork, we've built a library of hundreds of thousands of lines of reusable code for Terraform, and Kubernetes, and Docker, etc., and it's used in production by hundreds of companies. The way our tiny company is able to maintain all of that code and keep it working, as the whole world around us is changing, is through a lot of automated testing. We spend a lot of time thinking about this.
I'm also the author of a couple books. There's "Terraform, Up & Running," that's actually the old cover, I need to update this slide as well. Second edition is out, go get it. "Hello, Startup," also talks a lot about the software delivery process.
Here is what we're going to talk about today. We're going to look at the various testing techniques that are out there for infrastructure code, look at static analysis, unit testing, integration testing, end-to-end testing. These are loose categorizations. Some people become very religious about what each of these terms means. These are more of a helpful mental model to navigate the space.
Static Analysis
We've got a lot to cover. We'll get started with static analysis. The idea here is, you want to be able to test your code without actually running the code or, in the case of infrastructure code, without actually deploying anything for real. That's the goal of static analysis, "Look at my code, don't run it, tell me if there's a bug or if there's some sort of issue."
There are a few categories in here. Again, these are not perfect groupings, there's some overlap between them, just a useful mental model for navigating here. The first category is the compilers, the parsers, the interpreters for whatever language you're using. The idea is these things are checking your syntax, the structure of your code. The very basic check: "Does it compile? Is this valid YAML, HCL, Go, whatever language you're using?"
There's a variety of these tools. For example, for Terraform you have the terraform validate command. I'll show you a really quick example of that. I have a little bit of Terraform code. We'll deal with what the code is in a minute. It looks like this, nothing fancy, using a very simple module. In here, I can run the validate command. It tells me, "Everything looks good." Then, if I mess up the code, like I make some silly typo in here and I run terraform validate again, it will give me an error. That is a very basic level of testing that you can do for your code. "Scan it, tell me if the variables that I'm referencing are actually defined. Tell me if the syntax is valid, whether I missed a curly brace." There are similar commands for Packer, and in the Kubernetes world, kubectl has dry-run and validate flags that'll do something pretty similar.
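If you want to wire that same check into a test suite, Terratest - which we'll see much more of later - can shell out to terraform validate for you. Here's a minimal sketch, assuming a Terratest version that ships the InitAndValidate helper; the folder path is illustrative:

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

// Run `terraform validate` from a test so CI catches syntax errors on
// every commit. Nothing is deployed.
func TestHelloWorldAppValidates(t *testing.T) {
	t.Parallel()

	options := &terraform.Options{
		// Path to the Terraform code under test (illustrative layout).
		TerraformDir: "../examples/hello-world-app",
	}

	// Runs `terraform init` followed by `terraform validate` and fails
	// the test if the code doesn't parse.
	terraform.InitAndValidate(t, options)
}
```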
Moving one level up from that, you want to catch not just syntactic issues but also common mistakes. There's a whole series of these tools. By the way, these slides will be available after the talk, so don't worry about all these links, it should be easy for you to grab them. For Terraform, there's conftest, which actually works on more than just Terraform, terraform validate, tflint, etc. A whole bunch of these tools will read your code, statically analyze it, and try to catch common errors. One of the idiomatic examples these tools give is a security group that allows all inbound traffic - in other words, a firewall that's way too open. Something like that can be caught using tools like this, in a lot of cases. These are good to plug into your CI/CD pipeline: they run in seconds, they're going to catch a bunch of common mistakes, which again is better than having no testing at all.
The third group, which I don't have a good name for, I'll just call dry run. Here we actually are going to execute the code, but we're not going to deploy anything, it's not going to have any effect on the real world. We are running the code a little bit here, so it's kind of in between static analysis and unit testing. We're going to get some sort of plan output and be able to analyze that. In the Terraform world, there's a nice equivalent to this, the terraform plan command that I can run here. On this module, I can run my plan command. By the way, this little thing at the front is just how I authenticate to AWS, ignore that. This is the actual command, terraform plan. If I run that, it'll make some API calls and it'll tell me what the code is going to do without actually changing anything in the world. Here's my plan output, it shows me that it's going to deploy some lambda functions, some API gateway stuff, etc. You can analyze this plan as a form of testing.
There are some tools that help you with that. For example, in the Terraform world, there's HashiCorp Sentinel and terraform-compliance, both of them can run terraform plan and statically analyze that thing and catch a bunch of common errors in a static way.
In the Kubernetes world, there's server-side dry run. I think this is an alpha feature, it's pretty new, which will actually take your YAML, your configuration, and send it to the API server. The server will process it, it just won't save the results, and so it's not going to affect the world. Again, this is a good way to check, "Does my code actually work, at least to some extent?"
That's a quick overview of the static analysis tools. What's nice about them: they run fast, they're easy to use, you don't have to learn a whole bunch of stuff. The downside is they're very limited in the kinds of errors they can catch. If you're not doing any infrastructure testing at all, at least add static analysis. It really just takes a few minutes of your time and it'll catch a bunch of these common mistakes. If you can do a little more, let's do a little more.
Unit Tests
That's where unit testing comes in. Now we're going to get a little more advanced. The idea with unit testing is you want to be able to test, as the name implies, a single unit of your code in isolation. In this section, we're going to go through a few things. We'll introduce the basics of unit testing. I'll then show a couple examples for two different types of infrastructure code, so we'll look at Terraform, and Docker, and Kubernetes, and then, we'll talk about cleanup.
The basics - the first thing to understand about unit testing is what's a unit. I've had a lot of people come up to me and say, "I have 50,000 lines of code, it deploys this enormous infrastructure. How do I unit test it?" Well, you don't, that's not a unit. Unit testing with general-purpose languages is on a single method or a single class. The equivalent with infrastructure code is going to be a single module, whatever "module" means in the language and tools you're using. Your infrastructure should be broken up into a bunch of small pieces. If it's not, that's actually step one to being able to unit test it. If you right now have a Terraform file, or CloudFormation, or any other language with 50,000 lines of code, that's an anti-pattern, break it up into a bunch of small standalone pieces. One of the many advantages you'll get is you can unit test those pieces.
Next thing is, with app code, when you're testing those units, when you're testing a single method or class, you can typically isolate away the rest of the outside world - all of your databases, filesystem, web services. You isolate them and you can test just the unit by itself, which is good, because then you can test very quickly, and the tests are going to be nice and stable.
If you actually go look at most infrastructure code - so here's some Terraform code - what's this code doing? All it's doing is talking to the outside world. That's 99% of what your code is doing, whether it's Kubernetes, CloudFormation, AWS. All it really does is talk to the outside world. If you try to isolate the outside world, there's really nothing left to test.
The only real way to test infrastructure code beyond static analysis is by deploying it to a real environment, whatever environment you happen to be using. That might be AWS, or Google Cloud, or your Kubernetes cluster. You actually have to deploy, because that's what the code does: if you're executing it, a deployment is the result.
Key takeaway: there is no pure unit testing for infrastructure code in the way that you might think of it for application code. This means your test strategy looks a little more like this. You're going to deploy the infrastructure to a real environment, you're going to validate that the infrastructure works, and I'll show you a few examples how to do that. Then, at the end of the test, you undeploy the infrastructure again. This is where the terminology gets kind of messy, this is more of an integration test, but we're testing one unit, one module, so I prefer to just stick with the word unit test and just think of it that way.
There's a bunch of tools that can help you implement this strategy, not a comprehensive list, this is just some of the more popular ones. Some of them will do the deploy and undeploy steps for you, some of them expect you to do the deploy and undeploy outside of the tool yourself. Terratest, for example, can do deploy and undeploy, can do validation, and it integrates with a whole bunch of tools, including Terraform, and Kubernetes, and Docker. There's a bunch of other tools, some that are specific to Terraform, some that are specific to checking servers. Definitely check these out, all the links are in the slide deck and you'll have access to that soon. In this talk, we're mostly going to use Terratest, but just bear in mind that the same technique will work with pretty much any tool.
Let's try to write a unit test here. This talk has a bunch of sample code. There's some Terraform code, some Kubernetes, and the automated tests for it. I don't know that that's the best link, I should've gone with a slightly shorter link. It's in the gruntwork-io org, it's called "Infrastructure as Code Testing Talk." I'll tweet this one out, it'll be in the slide deck. All the code I'm showing you here, you can check out after the talk. One of the things you'll find in that sample code is a simple little hello-world application that we can test. Let me actually deploy that little application. I'm just going to deploy this thing in the background, and then I'll walk through the code and show you what this thing is actually doing.
Here's the hello-world app. Its Terraform code looks a little bit like this, very simple code, that's really all there is to it. It's using a module to deploy a serverless application. For the purposes of an example, I'm using AWS Lambda and API Gateway here, just because they deploy quickly, so the talk goes faster if I do this. This module lives in the same repo, here it is. If you're interested in the code, it does more or less what you'd expect: deploy a lambda function, create an [inaudible 00:15:27] rule for it, deploy API Gateway, etc. This code also outputs the URL of this little endpoint at the end, and what we're actually running in AWS Lambda is some JavaScript code. This is basically the hello-world example, so it just says "Hello world," and returns 200 OK.
It's a really simple piece of code, it's deployed in the background. I can now copy and paste this URL, run curl on it, hit Enter, and there we go, we've got our nice hello world. This is a nice thing for us to test and play around with here during the talk. Let me actually undeploy it now just so I don't forget about that. What you notice is, what I'm doing right now is I'm manually testing this thing. What did I do? Deploy, validate, and now here, I'm doing the undeploy. We're going to actually write a unit test that does exactly these steps but automatically, in code. I'll walk through what the code does in the slide deck, and then, I'll show you the actual code snippet in a second, we'll run it and see if it works.
Since we're using Terratest and Terratest is a Go library, we're going to write the test in Go. If you don't know Go, don't panic, it's not a hard language and it's not critical to understanding the talk. It's more about the concept, just to get the mindset right. We create hello_world_app_test.go. This is the basic structure of the test, and I'll walk through it line by line. This is actually almost the entire unit test. The first thing we do is we say, "Ok, here are my options for running Terraform. My code lives in this examples/hello-world-app folder." I then use a Terratest function, terraform.InitAndApply, to run terraform init and terraform apply. This will actually deploy into my AWS account. I'm then going to validate that the code is working, and I'll show you the contents of that in just a second. Then, at the end of the test, we're going to run terraform.Destroy.
This is defer. If you're not familiar with Go, defer basically says, "Run this before the function exits, no matter how it exits." Even if the test fails, it'll always run the deferred code, similar to a try/finally or ensure in other languages. That's the test: apply, validate, destroy, that's really what we're doing. The validate isn't particularly complicated. We're using a Terratest helper to read that URL output. Then, we're using another helper to make HTTP requests to that URL. We're looking for a 200 OK that says, "Hello world." We're going to retry it a few times because the deployment is asynchronous, so it's not guaranteed to be up and running the second apply finishes.
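Put together, the test described above looks something like this. This is a condensed sketch rather than the exact file from the sample repo - the folder path, output name, and expected body are illustrative:

```go
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestHelloWorldApp(t *testing.T) {
	t.Parallel()

	options := &terraform.Options{
		// Where the hello-world-app Terraform module lives.
		TerraformDir: "../examples/hello-world-app",
	}

	// Undeploy everything at the end of the test, even if it fails.
	defer terraform.Destroy(t, options)

	// Deploy: terraform init + terraform apply.
	terraform.InitAndApply(t, options)

	// Validate: read the url output and hit it over HTTP, retrying
	// because the deployment is asynchronous.
	url := terraform.Output(t, options, "url")
	http_helper.HttpGetWithRetry(t, url, nil, 200, "Hello world", 10, 3*time.Second)
}
```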
That's the whole test. Let me run it really quickly, it'll take about 30 seconds to run. I'll jump into the test folder, I'll run go test. This is our hello-world unit test here. I'll let that thing run in the background for about 30 seconds. Let's look a little more at the code. What I'm actually running here: here's my test folder, here's hello_world_app_unit_test, here's the Go code. It's pretty much identical to what I showed you in the slide deck; there's one little piece that I'll explain in a few minutes. The rest is exactly as I said: terraform.InitAndApply, validate, destroy. The validate basically reads the output and does a bunch of HTTP requests in a retry loop.
Speaking of HTTP requests, the reason we're using HTTP is that the infrastructure I'm deploying here is a web service, so it makes sense to validate it by making HTTP requests. Of course, you might be deploying other types of infrastructure, and there are different ways to validate those. For example, if you're running a server that's not listening on any port, then you might want to validate it by SSHing to that server and checking a whole bunch of properties. Terratest has ways to do that; InSpec and all those other tools are really good at that too. If you're running a cloud service, you might want to use the cloud APIs to verify that it works. If you're deploying a database, you might want to run SQL queries, etc. Just bear in mind that validation is very use-case-specific, but for the purposes of this talk, it'll always be HTTP requests.
To run tests, you authenticate to whatever environment you're deploying to; in this case, I'm authenticating to AWS. Then, you run the go test command to actually kick off the test suite. If I jump back to the terminal, it should be done running the tests. That's always good to see, the word PASS. It took about 35 seconds. The log output, unfortunately, is hard to read because the font size makes it wrap around. If you dig through here, you'll see that the test ran terraform init, then terraform apply. Here's the terraform apply log output. It deployed the serverless app, ran terraform output to fetch the URL, then started making HTTP requests, got the response it expected, and ran terraform destroy. In 30 seconds, I can now check that this module is working the way I expect it to. I can run this after every single commit. That's huge, because I just went from a pile of code that, "I don't know, maybe works, maybe doesn't. Who knows? I guess our users will find out," to, "I can test this after every single commit to this code."
That is the unit testing example for Terraform. Just to make the point that this is not something specific to Terraform, let's do a unit test for something a little different. We're going to look at some Docker and Kubernetes code here as well. Let me jump back into my IDE, the sample code is in that same repo. Up here, we have our docker-kubernetes example, and there are really just two files. One is a Dockerfile, and this defines a Docker image for a really simple hello-world server. In the real world, this would be your Ruby app, your Java application, whatever it is that you're building, but for this talk, it's just a really simple hello-world server. The other thing in here is this blob of YAML, which is used with Kubernetes, and it defines a Deployment. If you don't use Kubernetes, this is basically a way to say, "I have this Docker container over here. I want to deploy one copy of it, and I want to stick a LoadBalancer in front of it that will listen on port 8080." Deploy the thing, put a LoadBalancer in front so I can access the thing.
I can run this thing as well. I'll show you how I test this thing manually first, and then, we'll write the automated test for it. I'll jump into the /examples/ folder. First thing to do is build my Docker image so you can do that with the docker build command. That will run pretty quick because it's all coming from cache, I've run this before. If you're running it from scratch, it takes 30 seconds to a minute.
That created a Docker image that I can now deploy to a Kubernetes cluster. I can deploy to any Kubernetes cluster I want, one running in AWS or in GCP. If you have the latest Docker Desktop app, Kubernetes is actually built in. You have one running on your own computer, or you can push a button to turn it on, which is pretty neat, because I can now test with Kubernetes completely locally. What I can do is run kubectl apply on that deployment.yml file. I hit Enter and that thing will deploy my service. We can see if that worked. We can go fetch the pods, so there's my container, it's now in Running status. Then I can do get services. There's the service in front of it, that's that little LoadBalancer, and you can see its external IP is localhost and it's listening on port 8080. Which means I can now curl port 8080 and get a nice little hello world. Ok, we've got a little Docker example, it's running in Kubernetes. Then, of course, at the end, we can also delete it by running the kubectl delete command.
That's how I test manually. How would I test the exact same thing with a unit test, an automated test? As you can probably guess, the structure is going to look very, very similar to what we just did for the Terraform unit testing. I'll walk through it again in the slide deck. We create docker_kubernetes_test.go, and that's the basic structure of the test. I'll go through it. The first thing we do is build the Docker image, and I'll show you the contents of that method in just a moment. Then we say, "Ok, the Kubernetes Deployment is defined in this file. I want to authenticate to my Kubernetes cluster." I'm just using all the defaults, which means it'll use whatever my computer is logged into, which is the Kubernetes running locally. We run kubectl apply using a Terratest helper, we validate, and I'll show you the contents of that in a sec. Then, at the end of the test, using that defer keyword, we run kubectl delete. There's no magic. All I'm doing is taking the exact same steps I was doing manually and writing them down in code. The value Terratest brings is a bunch of nice helper methods for doing this, but you can find similar helper methods elsewhere or write them yourself.
Let's look at the two functions I mentioned. This is the buildDockerImage function. It's using another Terratest helper, docker.Build, and it's basically just telling it where the Dockerfile is located and what to tag it with. Not particularly complicated. Then, the validate function looks very similar. We wait until the service is available - Kubernetes is completely asynchronous, so it can take a few seconds to actually deploy, depending on the cluster you're using. Then, we start making HTTP requests to this thing, just like we did with the hello-world app. The way we get the URL for a Kubernetes service is basically to automate those steps I showed you with kubectl get pods and get services. I just put that into this method, so there's a GetService and a GetServiceEndpoint method.
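Condensed into one place, the whole test looks roughly like this. It's a sketch: the image tag and service name are illustrative, and recent Terratest versions take the namespace as the third argument to NewKubectlOptions:

```go
package test

import (
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/docker"
	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/k8s"
)

func TestDockerKubernetes(t *testing.T) {
	t.Parallel()

	// Build the Docker image defined by the Dockerfile in the example folder.
	docker.Build(t, "../examples/docker-kubernetes", &docker.BuildOptions{
		Tags: []string{"gruntwork/docker-kubernetes-example:latest"},
	})

	// Use whatever cluster kubectl is currently authenticated to
	// (e.g., the Kubernetes built into Docker Desktop).
	options := k8s.NewKubectlOptions("", "", "default")
	path := "../examples/docker-kubernetes/deployment.yml"

	// Undeploy at the end of the test, even if it fails.
	defer k8s.KubectlDelete(t, options, path)

	// Deploy: kubectl apply.
	k8s.KubectlApply(t, options, path)

	// Validate: wait for the service, look up its endpoint, then make
	// HTTP requests in a retry loop because Kubernetes is asynchronous.
	k8s.WaitUntilServiceAvailable(t, options, "hello-world-app-service", 10, 1*time.Second)
	service := k8s.GetService(t, options, "hello-world-app-service")
	url := "http://" + k8s.GetServiceEndpoint(t, options, service, 8080)
	http_helper.HttpGetWithRetry(t, url, nil, 200, "Hello world", 10, 1*time.Second)
}
```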
To run this test, you authenticate to some Kubernetes cluster. As I said, I'm already authenticated to the one running locally. At this point, I can just run that test. Let's do that. Just go test, and there it is, Kubernetes. Hit Enter. This test should run very quickly because it's all running locally. There we go, that took a grand total of 4.69 seconds. What did the test do? The test built my Docker image, so you can see the output there. It's all running from cache, so that runs especially fast. Then, it configured kubectl and ran kubectl apply. You can see it started making HTTP requests, and actually the first one failed, because Kubernetes is asynchronous - that's why we do it in a retry loop. After another try or two, it succeeded, and then it cleaned everything up again at the end of the test. In 5 seconds - you could even add this as a pre-commit hook if you really wanted to, or run it after every commit - you can check that these Kubernetes configurations you're writing are not just syntactically valid, which is good to do with static analysis, but that they actually deploy a working service the way you expect them to.
I showed you the code in the slide deck, but the actual code for that test is very similar: buildDockerImage, here's our namespace - I'll skip the namespacing thing, I'll come back to that in a little bit - and then, basically, here it is: KubectlApply, delete, validate. That is unit testing. A lot of people see this and they're like, "Is that it? There's no magic? There's no magical thing that does this for me?" No, that's it. You're just automating the things you would've done manually. That's the basis of unit-testing infrastructure code: you deploy it for real. For me, this is well worth it, because right now, with these unit tests, I have a lot of confidence in this code. I know that if somebody changes the code and does something silly, these tests will almost certainly fail and will catch it before it makes it to production. That's worth a little bit of work.
I'll mention one more thing about unit testing, which is cleaning up after those tests. Tests for Terraform, CloudFormation, and things like that are spinning up and tearing down all sorts of resources in your Google Cloud, AWS, and Azure accounts. For example, we have one repo that deploys the Elasticsearch stack, an ELK cluster. After every commit, that spins up something like 15 ELK clusters in various configurations, pokes at them for a while, and then tears them all down. That's a lot of infrastructure after every single commit.
You definitely want to have a completely separate "sandbox" account for automated testing. Don't use production - I hope that's self-evident. You might not even want to use your existing staging or dev accounts, where human beings are working, just because the volume of infrastructure coming up and down will be pretty annoying. We usually have a completely isolated account used solely for automated testing.
There's one other reason to do that, which has to do with cleanup. The tests that I showed you all run terraform destroy or kubectl delete - they all clean up after themselves - but occasionally that fails. You might have a bug in your test, somebody might hit ctrl + C, something might crash. You don't want a whole bunch of stuff left over in your testing account. There are some tools out there that can clean everything up; one such tool, for example, is called cloud-nuke. Don't run it in production but, if you have a dedicated testing account, that's a good place to run something like that. You can run it as a cron job and just clean up stuff every day.
Integration Tests
That's unit testing. Let's move along to integration testing. The idea with integration testing is that just because your individual units seem to be working doesn't mean they're going to work when you put them together. That's what you want to find out with integration testing. I'll show you just one example, and once you see it, you'll see the structure is more or less identical to what we've already talked about; there's not a whole lot new to learn. Then, we'll talk about a few other things: parallelism, test stages, and retries.
Here's an example from that same repo where we have two modules that we want to test to see if they work together correctly. We have one called proxy-app and one called web-service. I'll show you the code for those. These are using basically the exact same module, so there's nothing really new here - they're using that same serverless-app module. The only difference is that web-service, instead of a plain hello world, tries to pretend that it's some kind of back-end web service that your company relies on, and it returns a little blob of JSON instead. Then, proxy-app is a very similar thing, another little serverless application. The code that it's running will proxy a URL. You pass in the URL you want to proxy as an environment variable, it'll make an HTTP request to it, and then forward along the results. You can think of one of these as a front-end application and one as a back-end, and you want to make sure they work together correctly.
How are we going to test these things? The first thing to note is the proxy application has an input variable, which is how you tell it what URL you want it to proxy. Our web service has an output variable, which is its URL. We want to proxy that URL, that's our goal. We're going to write a thing called proxy_app_test, another Go file. Here's the structure - hopefully, you're starting to get used to this approach. Going through it line by line, you'll see there's really nothing new here. We're going to configure our web service, and I'll show you what this is doing, but it's that same terraform.Options thing from before. We're going to run terraform.InitAndApply to deploy the web service. Then, we're going to configure the proxy application, passing it information from the web service - this is really the only new thing here, we're passing information from one to the other. I'll show you these methods in just a sec. Then, we're going to run terraform apply to deploy the proxy application, and we're going to validate it works. Then, at the end of the test, in defer, we're going to run terraform destroy on each of those modules. Exact same structure: apply, validate, destroy.
Looking at those methods, here's configWebService. It's just returning one of those terraform.Options structs, and it says, "That's where my code lives." Here's the slightly new thing, configProxyApp. This thing is also returning a terraform.Options, with one new thing: it's going to read in the URL output from the web service and pass it as an input variable to the proxy application. Here we're chaining one module's outputs into the inputs of another module, just by passing them along using whatever variables those modules support. The validate method is completely identical to the hello-world one, it's just doing a bunch of HTTP requests. The only difference is it's looking for a blob of JSON in the response, instead of plain text.
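Here's a condensed sketch of that integration test. The variable and output names (url, url_to_proxy) are illustrative stand-ins for whatever the modules actually define:

```go
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestProxyApp(t *testing.T) {
	t.Parallel()

	// Deploy the back-end web service first.
	webServiceOpts := &terraform.Options{TerraformDir: "../examples/web-service"}
	defer terraform.Destroy(t, webServiceOpts)
	terraform.InitAndApply(t, webServiceOpts)

	// Chain the web service's url output into the proxy app's input
	// variable - this is the one genuinely new step.
	proxyAppOpts := &terraform.Options{
		TerraformDir: "../examples/proxy-app",
		Vars: map[string]interface{}{
			"url_to_proxy": terraform.Output(t, webServiceOpts, "url"),
		},
	}
	defer terraform.Destroy(t, proxyAppOpts)
	terraform.InitAndApply(t, proxyAppOpts)

	// Validate: the proxy should forward the web service's JSON response.
	proxyURL := terraform.Output(t, proxyAppOpts, "url")
	http_helper.HttpGetWithRetry(t, proxyURL, nil, 200, `{"text":"Hello world"}`, 10, 3*time.Second)
}
```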
We can run the integration test. The code for it, by the way, is right here. It's exactly as I said, config the web service, run apply, config the proxy-app, run apply, validate, and then, at the end of the test, run destroy a couple times.
I will let that test run in the background. This will take a little bit longer, and that's actually an important point. That's running in the background and it'll take a few minutes to run, all told. That's important. Integration tests in infrastructure code, as you might expect, take longer than unit tests, just like everywhere else. They can actually take a lot longer, so I'm testing these really simple hello-world lambda functions that deploy quickly. If you're deploying a database, that could take 20 minutes just by itself. These tests can take longer.
What do you do about that? There are a couple of things you can do to speed things up. One is to run your tests in parallel. This of course doesn't make any individual test faster, but at least your whole test suite is only as slow as the slowest test, rather than everything running sequentially. That's useful because these tests can take a while. Telling tests to run in parallel in Go is really easy, you just add t.Parallel() to the top of any test function. Then, when you run go test, all of the tests that have that call will run in parallel. If you go back and look at the actual test code in this example repo, you'll see that every test has t.Parallel() as the very first line of code in the test.
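Here's what that looks like; the body of the test is elided:

```go
package test

import "testing"

func TestHelloWorldApp(t *testing.T) {
	// Mark this test as safe to run in parallel. When you run `go test`,
	// every test function that makes this call is interleaved with the
	// others instead of running sequentially.
	t.Parallel()

	// ... deploy, validate, destroy as before ...
}
```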
There is one gotcha though, which is that you can run into resource conflicts if you're not thoughtful about this. Here's what I mean by that. Your modules, whatever it is that you're testing, your infrastructure code, create resources. For example, here we're creating an IAM role and a security group in AWS, and those resources have names. In this case, AWS actually requires that IAM role and security group names be unique. If you hard-code the name into your code and you run two tests in parallel and they both try to use the same name, you're going to get a conflict and the tests will fail.
What you need to do is namespace all of your resources - in other words, provide a way to override the default name so that you can set it to something unique at test time. I'll just show you a couple of real-world examples of that. If we go look at our serverless-app module, the one I've been using, you can see it creates a lambda function, and it sets the name to this input variable. It does the same thing with the IAM role and basically all the other named resources: the name is configurable. Then, when we're using that code - if we go look at our hello-world app - we set the name to var.name, which has a default, but at test time, we're going to override that default. This is the one piece that I hadn't shown you before. If you look, we pass in a name variable in our test, which we set to include a unique identifier. There's a little function in Terratest that generates a 6-character random string, which has something like 56 billion possible combinations. This gives you a pretty good chance that two names are not going to conflict. If you override all of the names in all of your test code with something pseudo-random like this, then you're going to avoid these resource conflicts.
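A sketch of what that looks like in the test, using Terratest's random.UniqueId helper; the name variable matches the example, the prefix is illustrative:

```go
package test

import (
	"fmt"
	"testing"

	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestHelloWorldAppNamespaced(t *testing.T) {
	t.Parallel()

	// random.UniqueId returns a short, 6-character base-62 string, so
	// two parallel test runs are very unlikely to pick the same name.
	uniqueID := random.UniqueId()

	options := &terraform.Options{
		TerraformDir: "../examples/hello-world-app",
		Vars: map[string]interface{}{
			// Override the module's default name (var.name in the
			// example) so every test run gets its own namespaced copy.
			"name": fmt.Sprintf("hello-world-app-%s", uniqueID),
		},
	}

	defer terraform.Destroy(t, options)
	terraform.InitAndApply(t, options)
	// ... validate as before ...
}
```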
What's interesting is, this isn't just useful for testing. You should actually get into the habit of namespacing resources anyway because you might want to deploy two copies of the Serverless app in a single environment or across multiple environments. Being able to namespace things is useful for production code anyway.
We do something similar for Kubernetes as well: Kubernetes actually has a first-class concept of namespaces. At test time, we generate a randomly named namespace and we deploy all of our code into that namespace to ensure it does not conflict with anything else that happens to be in the same Kubernetes cluster. Namespacing is very important in general, but especially for automated tests that run in parallel.
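A sketch of that pattern with Terratest's Kubernetes helpers; the namespace prefix is illustrative:

```go
package test

import (
	"strings"
	"testing"

	"github.com/gruntwork-io/terratest/modules/k8s"
	"github.com/gruntwork-io/terratest/modules/random"
)

func TestDockerKubernetesNamespaced(t *testing.T) {
	t.Parallel()

	// Kubernetes namespace names must be lowercase.
	namespace := "test-" + strings.ToLower(random.UniqueId())
	options := k8s.NewKubectlOptions("", "", namespace)

	// Create a throwaway namespace for this test run, and delete it -
	// along with everything in it - when the test finishes.
	k8s.CreateNamespace(t, options, namespace)
	defer k8s.DeleteNamespace(t, options, namespace)

	// ... kubectl apply, validate, kubectl delete as before ...
}
```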
One more concept that's pretty useful to know about is test stages. If we take a look at this proxy-app integration test, there are five stages in that test. We deploy the web service, then the proxy-app, we validate the proxy-app, and then we undeploy one and then the other. In the CI environment, you need to run all of these steps, that makes sense, but when you're coding locally, especially when you're first writing the test, you might want to be able to iterate on just some inner portion of it. Maybe you're working out how to validate the app correctly, and you just want to rerun the validate step over and over again without running the rest of the stuff. As the code is written initially, you don't really have a choice, and that's a problem because all those other steps have a lot of overhead. You might want to run a validate step that takes seconds, but the test will force you to pay 5 to 10 minutes of overhead for every single test run. That gets very annoying. You can work around that: whatever test tool you're using ideally supports this idea of test stages.
Here's what it looks like. I'm not going to run this one, I'll just walk through the code really quickly in the interest of time. This was our original test structure: we're deploying the web service, the proxy-app, and validating. What we're going to do is wrap those in functions. It's the same thing - there's a deploy_web_service, deploy_proxy_app, and validate - but you'll see there's this new thing called stage, which I'm using just as an alias so the code actually fits on the slide. I basically wrap all the code with this little function. All the actual deployment code moves into these named functions, and each stage has a name. You can name it whatever you want, as long as it's unique. The point of doing this is that now, if I have a stage called Foo, I can tell Terratest to skip that stage just by setting an environment variable, SKIP_Foo, to any value.
Here's how you might use this. You might run that integration test, and the very first time you run it, you tell it to skip the clean up steps. When you run the test, it's going to run deploy_web_service, deploy the proxy-app, it's going to run validate, but it's not going to clean anything up. Those services will keep running in the background.
Now you can rerun the test, you can skip the deployment steps as well. The next time you run the test, it's just going to run the validate step over and over again. That takes seconds rather than minutes. This allows you to iterate locally much faster. You can also make changes manually, you can inspect things, you can debug things, it's basically as if you're pausing the test in the middle. That's really what we're doing with just some environment variables. Then, when you're done, you can basically tell it to clean everything up again and you're done.
Test stages are very useful. The one thing you have to do to make them work, besides wrapping your code in functions, is handle data sharing. Since we're running these tests in separate processes - we're running go test over and over again, and those are separate processes - if two stages need to share data, they can't just pass it in memory like they were doing before. Whatever data you need to pass, which is usually just these terraform.Options things, you need to write to disk and read from disk. For example, the deployWebService code will store the terraform.Options into a temp folder, and there's a helper to do that, so it's a one-liner. Then, the cleanupWebService code needs those terraform.Options to know what to clean up, so it's going to read them from disk. That allows you to have these completely independent test stages. If you want to see the real version of that, grab that repo, and in here, there's the integration test with stages. Here it is: here's my deploy step, another deploy step, validate. You can see each of these is wrapped in this TestStage thing and they're all loading and saving various things to disk.
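Putting the pieces together, a condensed sketch of a staged test looks like this - the stage names and the folder used for sharing data are illustrative:

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	test_structure "github.com/gruntwork-io/terratest/modules/test-structure"
)

func TestProxyAppWithStages(t *testing.T) {
	t.Parallel()

	// Stages share data via this folder on disk, since each `go test`
	// run is a separate process.
	testFolder := "../examples/web-service"

	// Any stage can be skipped by setting SKIP_<stage name>=true, e.g.
	// SKIP_deploy_web_service=true SKIP_cleanup_web_service=true go test
	defer test_structure.RunTestStage(t, "cleanup_web_service", func() {
		options := test_structure.LoadTerraformOptions(t, testFolder)
		terraform.Destroy(t, options)
	})

	test_structure.RunTestStage(t, "deploy_web_service", func() {
		options := &terraform.Options{TerraformDir: testFolder}
		// Save the options to disk so later stages - and later test
		// runs - can load them back.
		test_structure.SaveTerraformOptions(t, testFolder, options)
		terraform.InitAndApply(t, options)
	})

	test_structure.RunTestStage(t, "validate", func() {
		options := test_structure.LoadTerraformOptions(t, testFolder)
		_ = terraform.Output(t, options, "url")
		// ... HTTP checks as before ...
	})
}
```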
I will personally tell you this simple hack has helped me keep my sanity. Some of these tests take a really long time, and the ability to rerun pieces in seconds, rather than waiting 20 minutes, is huge. It's incredibly valuable.
One other pro tip has to do with retries. Another thing we learned from long experience is that infrastructure in the real world can fail for a whole bunch of intermittent reasons. I don't mean bugs in your code, but things like EC2 giving you a bad instance, or a brief outage somewhere, or some intermittent issue of that sort. If you don't do anything about it, your tests can become very flaky - they will basically fail for reasons that have nothing to do with actual bugs in your code.
The easiest solution for this is to add retries. You already saw that we were doing the HTTP requests in Terratest in a retry loop, but you can actually add retry loops all over your code, and some of them are natively supported by Terratest. In that terraform.Options thing, in addition to saying where your code lives and passing variables, you can also say, "If you see an error that looks like this" - this is actually a very common error you hit with Terraform, these TLS handshake timeouts are very frustrating - "retry up to 3 times with 3 seconds per retry." This will make your tests much more stable.
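Here's a sketch of what that looks like in the options; the exact error regex and retry counts are just examples:

```go
package test

import (
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestHelloWorldAppWithRetries(t *testing.T) {
	t.Parallel()

	options := &terraform.Options{
		TerraformDir: "../examples/hello-world-app",

		// If terraform exits with an error whose output matches one of
		// these regular expressions, re-run it instead of failing the
		// test immediately.
		RetryableTerraformErrors: map[string]string{
			".*TLS handshake timeout.*": "Intermittent TLS error, retrying.",
		},
		MaxRetries:         3,
		TimeBetweenRetries: 3 * time.Second,
	}

	defer terraform.Destroy(t, options)
	terraform.InitAndApply(t, options)
}
```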
End-to-End Tests
There's one more category of tests to talk about, which is end-to-end testing. The idea here, as the name implies, is to test everything together. How do you actually do that? If you have a big complicated infrastructure, how do you actually test that end-to-end? You could try to use the exact same strategy I've been showing you this whole talk, deploy everything from scratch, validate, undeploy, but that's not a very common way to do end-to-end testing. The reason for that has to do with this little test pyramid. We have static analysis, unit tests at the bottom, integration tests, end-to-end tests.
The thing about this pyramid is that as you go up, the cost of writing the test, how brittle the test is, and how long it takes to run all go up very quickly. These are some really rough numbers - obviously, it depends on your particular use case - but typically, static analysis runs in seconds, unit tests take a low number of minutes, integration tests take more minutes, and end-to-end tests from scratch take hours. For most architectures, even completely automated ones, deploying everything from scratch, testing it, and then undeploying it at the end can take hours. That's, unfortunately, too slow.
The other issue is brittleness. You can actually see this by doing a little bit of math. Assume that each resource you're deploying - EC2 instance, database, whatever it is - has a 1 in 1,000 chance of hitting some random intermittent flaky error. I don't know if that's an exactly accurate stat, but it's probably somewhere in the ballpark. You can do the math, a little probability calculation, and see what the odds are of a test failing for flaky reasons based on how much stuff you're deploying in that test. If you have a unit test that's deploying just a handful of resources, about 10, and each one of those has a 1 in 1,000 chance of failing, then your chance of failure goes up to about 1%. If you're deploying 50 resources in an integration test, the chance that you get some kind of flaky or intermittent error is around 5%. If you try to deploy your entire architecture, which has hundreds of resources, we're talking a 40%, 50% chance of something somewhere hitting that 1 in 1,000 chance.
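Spelled out, assuming each resource fails independently, the calculation is:

```latex
P(\text{flaky failure}) = 1 - (1 - 0.001)^N
\qquad
\begin{aligned}
N = 10  &\;\Rightarrow\; \approx 1\%  \\
N = 50  &\;\Rightarrow\; \approx 5\%  \\
N = 500 &\;\Rightarrow\; \approx 39\%
\end{aligned}
```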
You can work around 1% and 5% with retries - that's what the retries help you overcome - but there's not much you can do if 40% of the time your tests are failing for flaky reasons; that's going to be very painful. Unfortunately, doing end-to-end testing from scratch tends to be just too slow and brittle, in the current world, to be useful.
The real way to do end-to-end testing is incrementally. What I mean by that is, you set up a persistent test environment. You deploy everything from scratch once, which will take hours and be annoying, but you do it once and leave it running. Then, whenever you update one of your modules, you roll out the changes to just that module. This is what your commit hooks are doing: they're not deploying everything from scratch, they're just updating an existing architecture with each change, and then validating. Then you can run InSpec or whatever you want to validate that things are still working as expected. This will take approximately the same time as unit testing or integration testing. It's not going to take that long, it'll be reasonably stable, and it'll give you a lot of value in seeing that your entire stack is actually working end-to-end.
As a bonus, you can test not only that the thing works after the deployment, but you can actually write a test that tests the deployment itself. For example, one very important thing is, "Is my deployment zero downtime?" Or, "Every time I roll out a Kubernetes service, do my users get 500 errors for 5 minutes?" You can actually test that, and we have a whole bunch of automated tests around exactly that. This is a really nice way to do end-to-end testing.
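Terratest has helpers for this kind of check, but the core idea is simple enough to sketch by hand: keep hitting the service's URL in a background goroutine while the deployment rolls out, and fail the test if any request errors out. Everything here - URL, timings, thresholds - is illustrative:

```go
package test

import (
	"net/http"
	"testing"
	"time"
)

// checkZeroDowntime hammers url in the background while deploy runs,
// and flags the test if any request fails or returns a 5xx.
func checkZeroDowntime(t *testing.T, url string, deploy func()) {
	done := make(chan struct{})
	errs := make(chan error, 1)

	go func() {
		for {
			select {
			case <-done:
				return
			default:
				resp, err := http.Get(url)
				if err != nil {
					errs <- err
					return
				}
				resp.Body.Close()
				if resp.StatusCode >= 500 {
					// t.Errorf is safe to call from another goroutine.
					t.Errorf("got %d during deployment", resp.StatusCode)
				}
				time.Sleep(100 * time.Millisecond)
			}
		}
	}()

	deploy() // roll out the new version while requests are in flight

	close(done)
	select {
	case err := <-errs:
		t.Fatalf("request failed during deployment: %v", err)
	default:
	}
}
```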
Conclusion
Wrapping things up, here's an overview of all the testing techniques I talked about; I'll summarize them really quickly. Static analysis is fast and easy to learn, and you really don't need to deploy any real resources. You should use it. The only downside is it's very limited in the kinds of errors it catches; just because my static analysis passes doesn't give me that much confidence that my code works. If you're doing nothing, at least do static analysis, but don't stop there. Unit tests tend to run fast enough - they take a low number of minutes - they're mostly stable if you do retries, and they give you a lot of confidence that the individual building blocks you're using work as expected. The downside is you do have to deploy real resources and you do have to write some real code. Integration tests are pretty similar. The only real difference is that they're even slower, which is a bummer, so you're going to have fewer of those. Then, end-to-end tests are a similar thing, but if you do them from scratch, they're way too slow and brittle. Do them incrementally, and then they'll have similar trade-offs to unit tests and integration tests.
Which ones should you use? Correct answer's of course, "All of them." They all catch different types of bugs and you're going to use them roughly in this proportion. That's actually why it's a pyramid. You want to have a whole bunch of unit tests and static analysis catch as many bugs as you can at that layer, then a smaller number of integration tests, and a very small number of high-value end-to-end tests.
Infrastructure code is scary when it doesn't have tests. In fact, I've heard that the definition of legacy code is, "Any code that doesn't have automated tests." You can fight that fear, you can build some confidence in your life, by writing some automated tests.