
Lessons from 300k+ Lines of Infrastructure Code



Yevgeniy Brikman shares key lessons from the “Infrastructure Cookbook” they developed at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code used in production by hundreds of companies. Topics include how to design infrastructure APIs, automated tests for infrastructure code, patterns for reuse and composition, refactoring, namespacing, and more.


Yevgeniy Brikman is the co-founder of Gruntwork, a company that provides DevOps as a Service. He's also the author of two books published by O'Reilly Media: “Hello, Startup” and “Terraform: Up & Running”. Previously, he worked as a software engineer at LinkedIn, TripAdvisor, Cisco Systems, and Thomson Financial.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Brikman: Thank you all for coming. This is a talk, as Jonas mentioned, of the ugly layer beneath your microservices, all the infrastructure under the hood that it's going to take to make them work and some of the lessons we learned. There's not a great term for this. I'm just going to use the word DevOps, although it's not super well-defined. And one of the things I want to share is a confession about DevOps to start the talk off.

Hey, there it is. There's a limited range on this thing. So here's the confession - the DevOps industry is very much in the stone ages, and I don't say that to be mean or to insult anybody. I just mean, literally, we are still new to this. We have only been doing DevOps, at least as a term, for a few years. Still figuring out how to do it. But what's scary about that is we're being asked to build things that are very modern. We're being asked to put together these amazing, cutting edge infrastructures, but I feel like the tooling we're using to do it looks something like that.

Now, you wouldn't know that if all you did was read blog posts and read the headlines; everything sounds really cutting edge. Half the talks here, half the blog posts out there, they're going to be about, oh my God, Kubernetes, and Docker, and microservices, and service meshes, and all these unbelievable things that sound really cutting edge. But for me as a developer, on a day-to-day basis, it doesn't feel quite so cutting edge, right? That's not what my day-to-day experience feels like. My day-to-day experience with DevOps feels a little more like that. You're cooking pizza on an iron with a blow dryer. This is your hashtag for this talk, #thisisdevops. This is what it feels like.

Sometimes it feels more like that where you're like, "Okay, I guess that works, but why are we doing it like this?" Nothing seems to fit together quite right. Everything's just weirdly connected. What's happening here? This is probably your company's CI/CD pipeline. That's what it looks like. So that's what DevOps feels like to me. I feel like we don't admit that often enough. We don't admit the fact that building things for production is hard. It's really hard. It actually takes a lot of work to go to production. It's really stressful. It's really time-consuming. We've worked with a whole bunch of companies and also had previous jobs, and the numbers we found look something like this.

If you are trying to deploy infrastructure for production use cases and you're using a managed service, in other words, something that's managed for you by a cloud provider like AWS or Azure, you should expect that before you're ready to use that thing in production, you're going to spend around two weeks. If you're going to go build your own distributed system on top of that, go run a bunch of Node, or Ruby, or Play apps, build your little microservices, you're going to easily double that to two to four weeks, if those things are stateless.

If they're stateful, if it's a distributed system that needs to write data to disk and not lose that data, now we go up an order of magnitude. Now to get to a production-quality deployment of that, we're talking two to four months for each of these distributed systems. So, if your team is thinking of adopting the ELK Stack or if you're thinking of adopting MongoDB or any of these large complicated systems, it's going to take you months to operationalize them.

And then finally, the entire cloud architecture. You want to go build the entire thing, go to prod on AWS, Azure, Google Cloud, it's six to 24 months. Six months, that's your tiny little startup, and 24 months and up is much more realistic for larger companies. And these are best case scenarios. So this stuff takes a long time. As an industry, I don't know that we talk about this enough. People like to come up here and tell you, "We win," and not tell you that they spent three years working on that thing.

There are some things that are getting better in this industry, though. One of the ones that makes me personally very happy is this idea of managing more and more of our infrastructure as code. As opposed to managing manually by clicking around, we're now managing more and more of our infrastructure as code. You're seeing that across the stack. We used to provision the infrastructure manually; now we have a lot of infrastructure-as-code tools. We used to configure servers manually; now we have a lot of configuration management tools, and so on, and so forth.

All of this stuff is being managed as code. I think that's a game changer. I think that's hugely valuable because code gives you a tremendous number of benefits. Hopefully, if you’re developers, you believe in that. But, things like automation. Instead of deploying something manually over and over again, spending hours, you let the computer do it. Computers are really good at doing the same thing over and over again. You get version control. So when something breaks, the entire history of your infrastructure is in a version control system, and you can just go look at the commit log to find out what changed.

You can code review those changes. You can't code review somebody manually deploying something. You can write automated tests, which we'll talk about a little later, for your infrastructure. Again, you can't write automated tests if you do things by hand. Your code acts as documentation. So how your infrastructure works is captured in code rather than in some sysadmin's head. And you get code reuse, which means you can use code that you've written earlier. You can use code written by others.

That's kind of the genesis of this talk. I work at a company called Gruntwork, and we've built a reusable library of infrastructure code using a variety of technologies. We basically have these prebuilt solutions for a whole bunch of different types of infrastructure. Along the way, we've deployed this infrastructure for hundreds of companies, they're using it in production, and it's over 300,000 lines of code. This is actually a really old number, so it's probably way over that at this stage. We've written a lot of infrastructure code. My goal in this talk is to share basically the things we got wrong, or to share the lessons we've learned along the way. The hope is that as you go out there and start to use your microservices, as you start to deploy all this stuff, you can benefit from some of these lessons and not make the same mistakes that we did.

I'm Yevgeniy Brikman, I go by the nickname, Jim. I'm one of the co-founders of Gruntwork, and the author of a couple of books, "Terraform: Up and Running" and "Hello, Startup," both of which talk quite a bit about this DevOps stuff. If you're interested in learning more, grab yourself a copy. Today, here are the things I'm going to chat with you about. The first thing I'll do is I'll give you guys the checklist. And what I mean by that is, we're going to talk about why infrastructure work takes as long as it does. Then we're going to talk about some of the tools that we use and the lessons from that. We'll talk about how to build reusable infrastructure modules, how to test them, and how to release them. So it's a good amount to cover. Let me get rolling. We'll start by chatting about the checklist.

The Checklist

I've generally found that there's really two groups of people. There's the people that have gone through the process of deploying a whole bunch of infrastructure, have suffered the pain, have spent those six to 24 months and understand it, and then there's the people who haven't, who are newbies. When they see these numbers, when they see that it's going to take them six to 24 months to get live for production infrastructure, they tend to look like that. “You got to be kidding me. It's 2019, can't possibly take that long.” You have a lot of overconfident engineers who are like, "Ah, maybe other people take that long. I'll get this done in a couple of weeks." No, you won't. It's just not going to happen. It's going to take a long time if you're doing it from scratch.

Now, the real question is, why? I'm sure many of you have experienced this. You expected to deploy this thing in a week, and it took you three months. So where did that time go? Why does this stuff take so long? I think there are really two main factors that play into this. The first one is something called “Yak shaving”. How many of you are familiar with this term? So, less than half the room. Okay. For the other half of you, you're welcome. This is my gift to you today. This is one of my favorite terms, and I promise you, you will use this term after I introduce it to you. The other half of the room is smiling because you know why.

What is Yak shaving? Yak shaving is basically this long list of tasks that you do before the thing you actually wanted to do. The best explanation I've seen of this comes from Seth Godin's blog, and he tells a little story like this. You get up one morning and you decide you're going to wax your car. It's going to be great. It's going to be a fun Saturday morning. You go out to the backyard, you grab your hose, and the hose is broken. "Okay, no problem. I'll go over to Home Depot. I'll go buy a new hose." You get in your car, you're just about to head out and you remember, "Oh, to get to Home Depot, I have to go through a whole bunch of toll booths. I need an E-ZPass, otherwise I'm going to be paying tolls all day long. So no problem, I'll go grab the E-ZPass from my neighbor."

You're just about to head to your neighbor's house when you remember, "Oh, wait, I borrowed pillows last week. I should return those, otherwise he's not going to give me the E-ZPass." So you go find the pillows, and you find out all the Yak hair has fallen out of the pillows while you were borrowing them. The next thing you know, you're at a zoo shaving a Yak, all so you can wax your car. That's Yak shaving. If you're a programmer, you know exactly what I'm talking about. You've run into this a thousand times. All you wanted to do was change the little button color on some part of your product, and for some reason, you're over here dealing with some TLS certificate issue, and then you're fixing some CI/CD pipeline thing, and you seem to be moving backwards and sideways, rather than the direction you want to go. So, that's Yak shaving.

I would argue that this, in the DevOps space, is incidental complexity. It's basically that the tools we have are all really tightly coupled. They're not super well designed. Again, remember, we're still in the Stone Age. We're still learning how to do this. And so as a result, when you try to move one little piece, it's stuck to everything else, and you seem to just get into these endless Yak shaves. So that's reason number one.

Now, the second reason is what I would argue is the essential complexity of infrastructure. This is the part of the problem that's actually real, that's there, that you have to solve, and I think most people aren't aware of what most of it is. To share what we've learned in this space, I'm going to share with you what we call our production-grade infrastructure checklist. This is the checklist we go through when we're building infrastructure that is meant to be used in production, infrastructure that you're willing to bet your company on. Because if you're going to go put your company's data in some database, you want to know that it's not going to lose it tomorrow and take your company out of business. So this is what I mean by production-grade.

Here's the first part of the checklist, and this is the part that most people are very aware of. You want to deploy some piece of infrastructure, Kafka, ELK, microservices, whatever it is. You realize, "Okay, I've got to install some software. I have to configure it, tell it what port numbers to use, what paths to use on the hard drive. I have to get some infrastructure, provision it, might be virtual infrastructure in a cloud, and then I have to deploy that software across my infrastructure." When you ask somebody to estimate how long it'll take to deploy something, these are the things 99% of developers are thinking of. These are the obvious ones. But this is page one out of four, and the other three pages are just as important, and most people don't take them into account when they're doing their estimates. So here's where the real pain comes in.

Let's look at page two. Page two has things that I think most of you would agree are equally important. Things like security. How are you going to encrypt data in transit with TLS certificates? How are you going to do authentication? How are you going to manage secrets, server hardening? Each of these things can be weeks of work. Monitoring for this new piece of infrastructure. What metrics are you gathering? What alerts is it going to trigger when those metrics are not where they should be? Logs. You've got to rotate them on disk. You've got to aggregate them to some central endpoint. Backup and restore. Again, if you're going to put some data in a database, you want to be confident that data isn't going to disappear tomorrow. This stuff takes time. That's page two.

Let's go look at page three. How about networking? Especially if you're thinking about microservices here today, I know a lot of you have to think about how are we going to do service discovery and all the other things you're doing with things like a service mesh. How do you handle IP addresses? How do you do all the subnets? How are you going to manage SSH access or VPN access to your infrastructure? These aren't optional. You have to do these things, and this is where the time ends up going when you're deploying a piece of infrastructure. You need to make it highly available, heard a lot about that today, scalability, performance.

Finally, page four, which most people do not get to, things like cost optimization. How do you make it so this thing doesn't bankrupt your company? That seems pretty important. Documentation almost nobody gets to. What did you do? Why did you do it? And then automated testing. So, again, very few people get to this. They just don't think of it. You think, how long is it going to take me to deploy something? You're thinking of page one. Basically install, configure, deploy. I'm done. And you're forgetting about three other pages of really important stuff.

Now, not every piece of infrastructure needs every single one of these items, but you need to do most of them. So the takeaway here is when you go to build something in the future, when your boss says, "How long is it going to take you to deploy X?" Go through that checklist. Go through that checklist and make a very conscious decision that, "We will do end-to-end TLS," or, "We won't." "We will think about SSH access," or, "We won't." Figure those out. Make those explicit. Your time estimate will at least be directionally correct. You'll still be off by quite a bit because of Yak shaving, but at least you'll be a little closer. So, hopefully, that checklist is useful. You can find a more complete version of it on the Gruntwork website; it's in the footer. You can also just search for the production-readiness checklist. That's a really complete version of this. Use it. We use it all the time. It's really valuable.
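One way to make those decisions explicit is to record a yes/no for every checklist item and flag anything nobody has decided on. The sketch below is only an illustration, not something shown in the talk: the item names paraphrase the four pages described above, and the `Decision` type, the week estimates, and the helper functions are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List

# Item names paraphrase the four pages of the checklist described above.
# Grouping them in a dict is an illustrative choice, not the talk's format.
CHECKLIST = {
    "page 1": ["install", "configure", "provision", "deploy"],
    "page 2": ["security", "monitoring", "logs", "backup and restore"],
    "page 3": ["networking", "high availability", "scalability", "performance"],
    "page 4": ["cost optimization", "documentation", "automated tests"],
}


@dataclass
class Decision:
    """An explicit 'we will' or 'we won't' for one checklist item (hypothetical type)."""
    item: str
    will_do: bool
    estimate_weeks: float = 0.0  # made-up number; only counted when will_do is True


def undecided(decisions: List[Decision]) -> List[str]:
    """Return checklist items that nobody has explicitly decided on yet."""
    decided = {d.item for d in decisions}
    return [item for items in CHECKLIST.values() for item in items if item not in decided]


def total_estimate(decisions: List[Decision]) -> float:
    """Sum the estimates of the items the team has committed to doing."""
    return sum(d.estimate_weeks for d in decisions if d.will_do)


# Example: a team that has only thought about part of the checklist so far.
decisions = [
    Decision("install", True, 0.5),
    Decision("configure", True, 0.5),
    Decision("security", True, 2.0),
    Decision("documentation", False),  # a conscious "we won't", which still counts as decided
]
remaining = undecided(decisions)
print(f"{len(remaining)} items still undecided, {total_estimate(decisions)} weeks committed so far")
```

The point of the `undecided` list is that a skipped item should be a deliberate "we won't" rather than something that was never considered.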


Second item is we'll chat a little bit about the tools that we use. Something that a lot of people ask is, what tools do we use to implement that checklist? I know what I need to do, how do I actually do it? I'll go over the tools that we use, but something important to understand is that this is not a recommendation to you. You are going to have your own requirements, you're going to have your own use cases. The reason I bring this up is a little bit different, and I'll cover that in just a second. So, just to answer the question.

At Gruntwork, the tools that we use, we like things that let us manage our infrastructure as code, of course. We like things that are open source, things that have large communities, support multiple providers, multiple clouds, that have strong support for reuse and composition of the code. Composable pieces are important at the infrastructure layer as well. We'll talk about that in a bit. We like tools that don't require us to run other tools just to use those tools, and we like the idea of immutable infrastructure. We'll cover all of that a little bit later.

The tools that we're using as of today look something like this. At the basic layer, all of our general networking, load balancers, all the integrations with services, all the servers themselves, which are usually virtual servers, all of that we deploy and manage using Terraform, which lets us do it all as code using HCL as the language. On top of that, those servers are running some sort of virtual machine images. In Amazon, for example, those are Amazon machine images, and we define and manage those as code using Packer, another open-source tool that lets you manage things as code.

Now, some of those virtual machine images run some sort of Docker agent. That might be something to do with Kubernetes, or ECS, or some other Docker cluster. So those servers form into a Docker cluster, and in that cluster, we're going to run any of our Docker workloads, so all sorts of containers. Docker lets us define how to deploy and manage those services as code, usually using something like a Dockerfile. Then the hidden layer under the hood that most people don't tell you about, but it's always there, is all of this stuff is glued together using a combination of - basically, this is our duct tape. Bash scripts, Go for binaries, Python scripts when we can use them - that's the stack.

But here's the thing. These are great tools. If you can use them, great. But that's not really the takeaway here. The real takeaway here is that whatever toolset fits your company, whatever you end up picking isn't going to be enough. You could pick the most perfect tools, or you could copy exactly what we did, or come up with something different, and it won't matter at all unless you also change the behavior of your team and give them the time to learn these tools. So infrastructure-as-code is not useful just in and of itself. It only is useful when combined with a change in how your team works on a day-to-day basis.

I'll give you a really simple example of where this is absolutely critical. Most of your team, if you're not using infrastructure-as-code tools, is probably used to doing something like this. You need to make a change to the infrastructure. You do it manually and you do it directly. You SSH to a server, you connect to it, you run some command, and you've made the change that you needed to. Now, what we're saying is when you introduce any of those tools, Chef, Puppet, Terraform, whatever it is, you're saying that now we have this layer of indirection. Now to make a change, I have to go check out some code. I have to change the code, and then I have to run some tool or some process, and that's the thing that's going to change my infrastructure. That's great, but the thing to remember is that these pieces in the middle, they take time to learn, to understand, to internalize. Not like five minutes of time, like weeks, months, potentially, for your team to get used to this. It takes much longer than doing it directly.

Here's what's going to happen. And if you don't prevent this upfront, I guarantee this will happen no matter what tools you're using. You're going to have an outage. Something is going to go wrong. And now, your ops person, your DevOps, your sysadmin, whoever it is, is going to have to make a choice. They can spend five minutes making a fix directly, and they know how to do that already, or they can spend two weeks or two months learning those tools. What are they going to choose during the outage? Pretty clearly, they're going to make the change by hand.

What does that lead to? Well, it turns out, with infrastructure as code, if you're making changes manually, then the code that you worked so hard on does not represent reality anymore. It does not match what's actually deployed, what's actually running. So, as a result, the next person that tries to use your code is going to get an error. Guess what they're going to do? They're going to say, "This infrastructure as code thing doesn't work. I'm going to go back, and I'm going to make a change manually." That’s going to screw over the next person, and the next person. You might've spent three months writing the most amazing code, and in a week of outages, all of it becomes useless and no one's using it.
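To make the idea of code no longer matching reality concrete, here is a toy sketch of drift detection: compare what the code declares with what is actually running and report every difference. This is an illustration only, not how any particular tool works. The setting names and values are invented, and `find_drift` is a hypothetical helper.

```python
def find_drift(declared: dict, actual: dict) -> dict:
    """Compare what the code says should exist with what is actually deployed.

    Returns {setting: (declared_value, actual_value)} for every setting that
    differs. A setting missing on one side is reported as None for that side.
    """
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# What the infrastructure code declares (invented values).
declared = {"instance_type": "t3.medium", "port": 8080}

# What is really running after someone fixed an outage by hand (invented values).
actual = {"instance_type": "t3.medium", "port": 9090, "debug_flag": "on"}

drift = find_drift(declared, actual)
for setting, (want, have) in drift.items():
    print(f"{setting}: code says {want!r}, reality is {have!r}")
```

Every entry in `drift` is a place where the next person to run the code will either get an error or silently undo the manual fix, which is the failure mode described above.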

That's the problem, because changing things by hand does not scale. If you're a three-person company, sure. Well, do whatever you need to do. But as you grow, it does not scale to do things manually, whereas code does. So it's worth the time to use these tools only if you can also afford the time to let everybody learn them, internalize them, make them part of their process. I'll describe a little bit what that process looks like in just a second. But if you don't do that, don't bother with the tools. There are no silver bullets. It's not going to actually solve anything for you.


Third lesson we've learned as we wrote this huge library of code is how to build nice, reusable modules. The motivation comes from the following. Most people, when they start using infrastructure as code in any of these tools, basically try to define all of their infrastructure in a single file or maybe a single set of files that are all deployed and managed together. So all of your environments, Dev, Stage, QA, Prod, everything defined in one place.

Now, this has a huge, huge number of downsides. For example, it's going to run slower just because there's more data to fetch, more data to update. It's going to be harder to understand. Nobody can read 200,000 lines of Terraform and make any real sense of them. It's going to be harder to review, both to code review the changes and to code review any kind of plan outputs that you get. It's harder to test. We'll talk about testing. Having all your infrastructure in one place basically makes testing impossible. Harder to reuse the code. To use the code also, you need to have administrative permissions. Since the code touches all of the infrastructure, to run it at all, you need to have permissions to touch all of the infrastructure. Limits your concurrency. Also, if all of your environments are defined in one set of files, then a little typo anywhere will break everything. You're working on making some silly change in stage and you take down Prod. So that's a problem.

The argument that I want to make is that large modules, by that I mean like a large amount of infrastructure code in one place, are a bad idea. This is not a novel concept. At the end of the day, infrastructure as code, the important part here is, it's still code. We know this in every other programming language. If you wrote, in Java, a single function that was 200,000 lines long, that probably wouldn't get through your code review process. We know not to build gigantic amounts of code in one place, in every other language, in every other environment. But for some reason, when people go to infrastructure code, they assume that's somehow different, and they forget all the practices that they've learned, and they shove everything into one place. So that's a bad idea that is going to hurt almost immediately, and it's only going to get worse as your company grows.

What you want to do is you do not want to build big modules. At the very least, to protect against the last issue I mentioned, the outages, you want to make sure that your environments are isolated from each other. That's why you have separate environments. But even within an environment, you want to isolate the different pieces from each other. For example, if you have some code that deploys your VPCs, basically the network topology, or the subnets, the route tables, IP addressing, that's probably code that you set up once, and you're probably not going to touch it again for a really long time. But you probably also have code that deploys your microservices, your individual apps, and you might deploy those 10 times per day.

If both of those types of code live in one place, if they're deployed together, if they're managed together, then 10 times per day, you're putting your entire VPC at risk of a silly typo, of a human error, of some just minor thing that can take down the entire network for your entire production site. That's not a good thing to do. Generally speaking, because the cost of getting it wrong here is so high, you want to isolate different components of your infrastructure from each other as well, where a component is based on how often it's deployed, whether it's deployed together with other pieces, and its risk level. So, typically, your networking is separate from your data stores, is separate from your apps. That's kind of the basic breakdown.

But that only fixes the issue of, "Well, I broke everything." All the other issues would remain if that's all you did. You still have to think about the fact that it runs slower. The fact that it's harder to test, the fact that it's harder to reuse the code. So really, the way I want you to think about your architecture when you're working on infrastructure code is, if this is your architecture, you have all sorts of servers, and load balancers, databases, caches, queues, etc., then the way to think of this as infrastructure code is not, "I'm going to sit down and write it all in one place." The way to think about it is, "I'm going to go and create a bunch of standalone little modules for each of those pieces." Just like in functional programming, you wouldn't write one function. You write a bunch of individual little functions, each responsible for one thing, and then you compose them all together. That's how you should think about infrastructure code as well. I call these modules. That seems to be the generic term in the infrastructure world, but really they are no different from functions. They take in some inputs, they produce some output, and you should be able to compose them together.
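The "modules are just functions" analogy can be sketched literally. In the illustration below, each module is a small function that takes inputs and returns outputs, and an environment is a composition of them. Nothing here creates real infrastructure: the function names, the output fields, and the fake IDs are all assumptions invented for the example.

```python
def network_module(name: str, cidr: str) -> dict:
    """Inputs: a name and a CIDR block. Outputs: IDs that other modules can consume."""
    return {
        "vpc_id": f"vpc-{name}",
        "cidr": cidr,
        "subnet_ids": [f"{name}-subnet-{i}" for i in range(2)],
    }


def database_module(name: str, subnet_ids: list) -> dict:
    """Inputs: where to place the database. Outputs: the address apps should use."""
    return {"address": f"{name}-db.internal", "subnet_ids": subnet_ids}


def app_module(name: str, subnet_ids: list, db_address: str, replicas: int = 2) -> dict:
    """Inputs: placement and dependencies. Outputs: a description of the service."""
    return {
        "service": name,
        "replicas": replicas,
        "subnet_ids": subnet_ids,
        "db_address": db_address,
    }


def stage_environment() -> dict:
    """Compose the small modules; each one's outputs feed the next one's inputs."""
    net = network_module("stage", "10.0.0.0/16")
    db = database_module("stage", net["subnet_ids"])
    app = app_module("stage-app", net["subnet_ids"], db["address"])
    return {"network": net, "database": db, "app": app}


env = stage_environment()
print(env["app"])
```

Because the only coupling between modules is through explicit inputs and outputs, any one of them can be tested, reviewed, or swapped out on its own.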

Typically, what that looks like is you start by breaking down your environments. So dev, stage, and prod live in separate folders. They're separated from a code perspective. Within each environment, you have those different components. As I said, your networking layer separate from your database layer, separate from your app layer. Under the hood, those things are using reusable modules, basically functions, to implement them. So it's not like those are copied and pasted. Under the hood, they're using a shared library that you can build out, and that library itself should be made, again, from even simpler pieces. Again, there's nothing new here that I'm telling you other than just whatever programming practices you would use in your Java code, in your Scala code, in your Python code. Use them in your infrastructure code as well. It's still code. The same practices tend to apply.
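As a rough picture of that folder breakdown, the short sketch below prints one folder per environment and component. The root folder name `live` and the component folder names are assumptions chosen for illustration; the talk only says that environments live in separate folders with networking, data stores, and apps separated inside each.

```python
from pathlib import PurePosixPath

# Environment names come from the talk; the component folder names are illustrative.
ENVIRONMENTS = ["dev", "stage", "prod"]
COMPONENTS = ["networking", "data-stores", "apps"]


def layout(root: str = "live") -> list:
    """Return one folder per (environment, component) pair."""
    return [
        str(PurePosixPath(root, env, component))
        for env in ENVIRONMENTS
        for component in COMPONENTS
    ]


folders = layout()
for folder in folders:
    print(folder)
```

Each of those folders would be deployed on its own, so a mistake made while changing the apps folder in stage cannot touch the networking folder in prod.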

I'll show you a really quick example of what the infrastructure code can look like, just to build your intuition around this. The example I'll look at really quickly here is there's a set of modules that are open source for something called Vault, for deploying Vault on AWS. Vault is an open-source tool for storing secrets, things like database passwords, TLS certificates, all of that stuff, you need a secure way to store it. Vault is a nice open source solution, but it's a reasonably complicated piece of infrastructure.

Vault itself is a distributed system, so we typically run three Vault servers. The recommended way to run it is with another system called Consul, which is itself a distributed key-value store, and you typically run five of those servers. There are all sorts of networking things, TLS certs you have to worry about. So there's a decent number of things to think through. You can look at this later on after the talk if you're interested just to see an example of how you can build reusable code to do this.

I'll show you the Vault code checked out here on my computer. I don't know if you can read that. Maybe I'll make the font a little bigger. Let's try. That's a little better. Well, over here, there are basically three main folders, examples, modules, and test. The rest of the stuff you can ignore. Examples, modules, test. So the modules folder is going to have the core implementation for Vault. The thing to notice here is there isn't just one Vault module that deploys all of Vault, and all of Consul, and all of those dependencies. It actually consists of a bunch of small submodules. In other words, basically smaller functions that get combined to make bigger ones. For example, here, there's one module called Vault cluster. This happens to be Terraform code that deploys an autoscaling group and a launch configuration, basically a bunch of infrastructure to run a cluster of servers for Vault.

Now, that's one module in here, and that's all it does. Separate from that, we have another one, for example, to attach security group rules. These are essentially the firewall rules for what traffic can go in and out of Vault. That's also Terraform code, and it lives in a completely separate module. Even separate from that are other things like install Vault. This happens to be a Bash script that can install a specific version of Vault that you specify on a Linux server. As you can see, it's a bunch of these separate, orthogonal, standalone pieces. Why? Why build it this way?

One thing that's really important to understand is you get a lot more reuse out of it this way. I don't mean just in some hypothetical, “You're not going to need it” approach. I mean, even within a single company, if you're running Vault in production, you'll probably deploy it like that. Three Vault servers, maybe five Consul servers, separate clusters scaled separately. But in your pre-production environments, you're not going to do that. That's really expensive. Why waste? You might run it all on a single cluster, might even run it on a single server.

If you write your code as one giant super module, it's really hard to make it support both use cases, whereas if you build it out of these small individual Lego building blocks, you can compose them and combine them in many different combinations. I can use, for example, the firewall security group rules for Vault, and I can attach those to another module that deploys Consul because they're separate pieces, and I can combine them in different ways. So exactly like functional programming, nothing different here. It's a different language than you might be used to.
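The production versus pre-production contrast above can be sketched with two tiny building blocks combined in two different ways. This is a plain-Python illustration, not the actual Vault modules: `cluster` and `security_group_rules` are hypothetical stand-ins, and the only real-world facts used are the default ports (8200 for the Vault API, 8500 for the Consul HTTP API).

```python
VAULT_PORT = 8200   # Vault's default API port
CONSUL_PORT = 8500  # Consul's default HTTP API port


def cluster(name: str, size: int, roles: list) -> dict:
    """A group of identical servers; 'roles' lists what runs on each server."""
    return {"name": name, "size": size, "roles": list(roles), "open_ports": []}


def security_group_rules(target: dict, ports: list) -> dict:
    """Attach firewall rules to any cluster, regardless of which module created it."""
    merged = set(target["open_ports"]) | set(ports)
    return {**target, "open_ports": sorted(merged)}


def production() -> list:
    """Separate clusters, scaled separately: three Vault servers, five Consul servers."""
    vault = security_group_rules(cluster("vault", 3, ["vault"]), [VAULT_PORT])
    consul = security_group_rules(cluster("consul", 5, ["consul"]), [CONSUL_PORT])
    return [vault, consul]


def pre_production() -> list:
    """The same building blocks, combined onto a single cheap server."""
    combined = cluster("vault-consul", 1, ["vault", "consul"])
    combined = security_group_rules(combined, [VAULT_PORT])
    combined = security_group_rules(combined, [CONSUL_PORT])
    return [combined]


print(production())
print(pre_production())
```

Because the firewall rules are their own piece rather than being baked into a single "deploy Vault" module, the same function serves both layouts without any special cases.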

We have a bunch of these standalone little pieces. Then in the examples folder, we show you the different ways to put them together. Think of it as executable documentation. We have an example in here of how to run a private Vault cluster. Under the hood, this thing is using some of those little sub-modules that I showed you earlier, and it's combining them with some modules from another repo that run Consul and a whole bunch of other stuff. It shows you one way to assemble these building blocks. There are also examples here of how to build, for example, a virtual machine image. This is using Packer. So, all sorts of examples of how to do this stuff.

Now, why bother writing all of this example code? Well, for one thing, this is documentation. This is good for your team to have access to. But the other really critical reason we always write this example code is the test folder. In here, we have automated tests. And the way we test our infrastructure code is we're going to deploy those examples. The examples are how we're going to test everything, which creates a really nice virtuous cycle. We create a bunch of reusable code, we show you ways to put it together in the examples folder, and then we have tests that, after every commit, ensure the examples do what they should, which means the underlying modules do what they should. I'll show you what the tests look like in just a minute, but that's your basic folder structure of how to build reusable, composable modules of small pieces that each do one thing. UNIX philosophy says this, functional programming says this, you should do it in the infrastructure world as well.

We walked through these things. Let me skip forward. The key takeaway is, if you open up your code base for Chef, Puppet, Terraform, whatever tools you're using, and you see one folder that has two million lines of code, that should be a huge red flag. That should be a code smell just like it would be in any other language. So small, reusable, composable pieces. We have a very large library of code, and this has been the route to success to build code that is actually reusable, testable, that you can code review, etc.


I mentioned testing a few times. Let me quickly talk about how that works. One of the things we've learned building up this library, even before we built this library, when we tried to use some of the open source things that are out there is that infrastructure code, in particular, rots very, very quickly. All of the underlying pieces are changing. All the cloud providers are changing and releasing things all the time. All the tools like Terraform and Chef, they're changing all the time. Docker is changing all the time. This code doesn't last very long before it starts to break down. So, really, this is just another way of saying that infrastructure code that does not have automated tests is broken.

I mean that both as a moral takeaway lesson for today, but I also mean that very literally. We have found every single time that we have written a nontrivial piece of infrastructure, tested it to the best of our ability manually, even sometimes run it in production, that taking the time to go and write an automated test for it has almost always revealed nontrivial bugs, things that we were missing until then. There is some sort of magic that when you take the time to automate a process, you discover all sorts of issues and things you didn't realize before. That includes bugs in our own code, which we have plenty of, but we found bugs in all of our dependencies too. When deploying Elasticsearch, we actually found several nontrivial bugs in Elasticsearch because we had automated tests. We found many nontrivial bugs in AWS and Google Cloud itself, and, of course, in Terraform, and in Docker, all the tools we're using. I mean this very literally. If you have a pile of infrastructure code in some repo that doesn't have tests, that code is broken. I absolutely guarantee it. It's broken.

How do you test them? Well, for the general purpose languages, we more or less know how to test. We can run the code on localhost on your laptop, and you can write unit tests that mock outside dependencies and test your code in isolation. We more or less know how to do that. But what do you do for infrastructure code? Well, what makes testing infrastructure code pretty tricky is we don't have localhost. If I write Terraform code to deploy an AWS VPC or Google Cloud VPC, I can't run that on my own laptop. There's no localhost equivalent for that.

I also can't really do unit testing. Because if you think about what your infrastructure code is doing, most of what that code does is talk to the outside world. It talks to AWS, or Google Cloud, or Azure. If I try to mock the outside world, there's nothing left. There's nothing left to test. So I don't really have localhost. I don't really have unit testing. Really, the only testing you have left with infrastructure code is essentially what you would normally call integration testing, sometimes end-to-end testing. The test strategy is going to look like this. You're going to deploy your infrastructure for real, you're going to validate that it works the way you expect it to, and then you're going to un-deploy it.
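The deploy, validate, un-deploy strategy can be sketched in a few lines of Python. This is a minimal illustration, not the test code from the talk (which is written in Go): `run_infrastructure_test` and the fake steps are hypothetical stand-ins for real deployment code. The detail worth noticing is the `try`/`finally`, which makes sure the teardown runs even when validation fails.

```python
def run_infrastructure_test(deploy, validate, undeploy):
    """Deploy for real, validate, and always un-deploy, even if a step fails."""
    try:
        deploy()
        validate()
    finally:
        undeploy()


# Stand-in steps that only record what happened; real ones would call a cloud.
events = []


def fake_deploy():
    events.append("deploy")


def failing_validate():
    events.append("validate")
    raise AssertionError("cluster did not respond")


def fake_undeploy():
    events.append("undeploy")


try:
    run_infrastructure_test(fake_deploy, failing_validate, fake_undeploy)
except AssertionError:
    events.append("test failed")
```

Even though validation fails here, the teardown still runs before the failure is reported, so the test does not leave infrastructure behind.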

I'll show you an example of this. The example that I'm showing you is written with the help of a tool called Terratest. This is an open-source library for writing tests in Go, and it helps you implement that pattern of bringing your infrastructure up, validating it, and then tearing the infrastructure back down. Originally, we built it for Terraform, that's why it has that name, but it now works with Packer, with Docker, with Kubernetes, with Helm Charts, with AWS, with Google Cloud, a whole bunch of different things. It's actually a general purpose tool for integration testing, but the name has stuck. We're not very good at naming things.

The Terratest philosophy is to basically ask the question, how would you have tested this thing manually? Let's look, again, at that example of Vault. Here's my Vault infrastructure. I wrote a pile of code to deploy this. How do I know it's working? Well, probably the way I would test that it's working is I would connect to each node, make sure it's up, run some Vault commands on them, to initialize the Vault cluster, to unseal it. I'd probably store some data. I'd probably read the data back out. Basically, do a bunch of sanity checks that my infrastructure does what it says on the box.

The idea with Terratest is you're going to implement exactly those same steps as code. That's it. It's not magic. It's not easy necessarily, but it's incredibly effective because what this is going to do is, you're going to verify that your infrastructure actually does what it should after every single commit, rather than waiting until it hits production. Let's look at an example for Vault. I mentioned that with Vault, we have all of this example code. Here's this Vault cluster private example that has one particular way to use all our Vault modules. The test code for it lives in here. Vault cluster, private test. This is Go code that's going to test that code. The test consists of essentially four, what we call stages. I'm not going to cover what the stages are, but basically four steps. The first two, if you notice, they use the Go keyword, defer. This is like a try-finally block. This is what you run at the end of the test. Really, the test starts over here.

The first thing we're going to do is deploy the cluster. Then we're going to validate the cluster, and then at the end of the test, we're going to tear it down. So what's the code actually doing? Deploying the cluster is a little piece of code, using a bunch of helpers built into Terratest, that basically says, "Okay, my example code or my real code lives in this folder. I want to pass certain variables to it that are good for test time." Here we do a bunch of unique names and a bunch of other good practices for tests. "And then I'm basically just going to run Terraform init and Terraform apply to deploy this. I'm just running my code just like I would have done it manually using some helpers from Terratest." All of those Terratest helpers will fail the test if there's any error during deployment. Once that cluster is deployed, I'm going to validate that Vault does what it should.
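The "unique names" practice mentioned above can be illustrated with a small Python sketch. This is an assumption about what such test-time variables might look like, not the actual code from the Vault repo: `variables_for_test_run` and the `cluster_name` variable are hypothetical. The idea is that each test run gets its own suffix, so several runs in the same account do not collide on resource names.

```python
import uuid


def variables_for_test_run(example_name):
    """Build variables for one test run, with a unique suffix on resource names."""
    unique_id = uuid.uuid4().hex[:6]
    return {
        "cluster_name": f"{example_name}-{unique_id}",  # hypothetical variable name
    }


first = variables_for_test_run("vault-cluster-private")
second = variables_for_test_run("vault-cluster-private")
```

Two runs of the same example end up with different cluster names, which is what lets many tests run side by side after every commit.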

First, I'm going to initialize and unseal my cluster. The way to do that is basically what I described. We're first going to wait for the cluster to boot up because we don't know how long that'll take. And that code is a retry loop that basically says, "Hey, ping each node on the cluster until it's up and running," because you don't know exactly how long it will take. Yay, microservices, right? Everything's eventually consistent. Then I'm going to initialize Vault. The way I do that is I'm going to run this command right here, vault operator init. This is exactly what you would've done manually. I'm going to do that by SSHing to the server and running that command, and there's basically a one-liner in Terratest to SSH to something, execute a command. Then I'm going to unseal each of the nodes, which is basically just more SSH commands, this time vault operator unseal, etc.
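The retry loop described above can be sketched as follows. This is a minimal Python illustration rather than the Terratest helper itself: `retry`, its parameters, and `ping_node` are hypothetical names, and the fake node simply comes up on the third ping instead of making a real network call.

```python
import time


def retry(description, max_retries, sleep_seconds, action, sleep=time.sleep):
    """Call action() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception as err:
            last_error = err
            if attempt < max_retries:
                sleep(sleep_seconds)  # wait before the next attempt
    raise TimeoutError(
        f"{description}: gave up after {max_retries} attempts"
    ) from last_error


# Stand-in for pinging a node that takes a while to boot.
pings = {"count": 0}


def ping_node():
    pings["count"] += 1
    if pings["count"] < 3:
        raise ConnectionError("node not up yet")
    return "up"


status = retry("wait for Vault node", max_retries=5, sleep_seconds=0, action=ping_node)
```

Bounding the number of attempts matters: a node that never boots should fail the test with a clear message instead of hanging the build forever.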

Hopefully, you got the idea. I am just implementing the steps I would have done manually as code. This code executes after every single commit to the Vault repo. And that's going to give me a pretty good confidence that my infrastructure does what it should. I now know that if I'd made a commit to the Vault repo and it deployed, that that code can fire up a Vault cluster successfully, that I can initialize and unseal it, that I can store data in that cluster, and I can read that data back out. We have fancier tests that redeploy the cluster, that check for zero downtime deployments, that check all sorts of properties, but that's the basic testing strategy.

If you're not doing this, then you're basically just doing it in production and letting your users find out when you have bugs, which sometimes you have to. I mean, to some extent, we all test in production, but you can get a tremendous amount of value by doing this sort of testing ahead of time. I walked through Terratest, and Terratest has a ton of helpers for working with all these different systems. Helpers for making HTTP requests, helpers for SSHing, helpers for deploying a Helm Chart to Kubernetes, etc. It's just trying to give you a nice DSL for writing these tests.

So a couple of tips about tests. First of all, they're going to create a lot of stuff. You're constantly bringing up an entire Vault cluster. In fact, our automated tests deploy all of those examples in different flavors, so every commit spins up something like 20 Vault clusters in our AWS accounts. What you don't want to do is run those tests in production because it's going to mess things up for you, but you don't want to run them in your existing staging or Dev accounts either. You actually want to create a completely isolated account just for automated testing. Then what you want to do in that account is run a cleanup tool. There's cloud-nuke, there's, I forget the name, Janitor Monkey or something like that. There's a bunch of tools that can basically blow away an entire account for you, and you'll want to run that basically as a cron job. Because occasionally, tests will fail, occasionally they'll leave resources behind due to a bug in the test, etc. So make sure to clean up your accounts.
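The core of such a cleanup job can be sketched in Python. This is not how cloud-nuke or any other tool is implemented: `stale_resources` and the resource names are hypothetical, and the age threshold is an assumption added here so that a scheduled cleanup would skip resources belonging to tests that are still running.

```python
from datetime import datetime, timedelta, timezone


def stale_resources(resources, now, max_age):
    """Return the names of resources created more than max_age ago."""
    cutoff = now - max_age
    return [name for name, created_at in resources if created_at < cutoff]


now = datetime(2019, 3, 30, 12, 0, tzinfo=timezone.utc)
resources = [
    ("vault-test-a", now - timedelta(hours=30)),    # left behind by a failed test
    ("vault-test-b", now - timedelta(minutes=10)),  # probably still in use
]
stale = stale_resources(resources, now, timedelta(hours=24))
```

A real job would then delete everything in `stale`, which is only safe because the account is dedicated to automated tests.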

The other thing that I'll mention about testing is this classical test pyramid that you see for pretty much every language, where at the bottom you have your unit tests, then on top of that you have integration tests, and then on top, end-to-end tests. It's a pyramid because the idea is you want to have more unit tests, a smaller number of integration tests, and a very small number of end-to-end tests. Why? Because as you go up the pyramid, the cost of writing those tests, the brittleness of those tests, and how long those tests take to run go up significantly. So it is to your advantage to try to catch as many bugs as you can as close to the bottom of the pyramid as you can. Now, you're going to need all the test types. This is not to say one test type is better than another. You're going to need all of them. But proportion-wise, this is what you ideally have, just because the costs go up.

With infrastructure code, it's the same pyramid. The only slight gotcha is that we don't have pure unit tests as we discussed. You can do some linting, you can do some static analysis. Maybe you can count that as almost a unit test, but we don't really have pure unit testing. So, really, your unit test equivalent in the infrastructure world is to test individual modules. This is why having small little pieces is so advantageous, because you can run a test for an individual module pretty quickly and easily, whereas running a test for your entire infrastructure will take hours, and hours, and hours. So small individual modules, that's a good thing. That's the base of your pyramid.

Then the integration test equivalent is one layer up. That's basically a bunch of modules combined together. Then finally at the top, that might be testing your entire infrastructure. That's pretty rare because the times involved here, just to set the expectations for you correctly, look like this. Your unit tests in the infrastructure world take between one and 20 minutes typically. They're not sub-second, unfortunately. Integration tests can take from five to 60 minutes, and end-to-end tests, that depends completely on what your architecture looks like. That could take hours, and hours, and hours. So for your own sanity, you want to be as far down the pyramid as you can. This is another reason why small little building blocks are great because you can test those in isolation.

Key takeaway here, infrastructure code that does not have tests is broken. Go out there, write your tests, use whatever languages and libraries you prefer. But the way to test it is to actually deploy it and see if it works.


The final piece for the talk today that I'll use to wrap up is just how you release all of this code, how you basically put everything together. Here's what it's going to look like. Hopefully, from now on, the next time your boss says, "Hey, go deploy X," this is the process that you're going to use. You're going to start by going through a checklist because you want to make sure that when your boss asks you for an estimate, you actually know what it is that you need to build and you don't forget critical things like data backups and restores. You're then going to go write some code in whatever tools make sense for your team, and, hopefully, you've given your team the time to learn, and adapt, and internalize those tools. You're going to write tests for that code that actually deploy the code to make sure it works. You're then going to have somebody review each of your code changes, so make a pull request or merge request, and then you're going to release a new version of your code, of your library.

And a version isn't anything fancy. This can literally be a git tag. But the point of a version is it's a pointer to some immutable bit of code. "Here is my code to deploy a microservice for version 1.0, and here's 1.1, 1.2, and 1.3." It's some immutable little artifact. What you can do with that infrastructure code is now you can take your immutable artifact, and, first, you can deploy to some early pre-prod environments, to Dev or QA, and test it there. If it works well, you can take the same immutable artifact and deploy it in your next environment. You can promote it to staging. And since it's the same code, it should work the same way. Then finally, promote it to prod.
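The promotion flow above can be sketched in Python. This is an illustration under stated assumptions, not a real deployment tool: `promote`, `ENVIRONMENTS`, and the `deploy_and_check` callback are hypothetical, and the version string stands in for an immutable git tag. The same version is handed to every environment in order, and promotion stops at the first environment where it does not work.

```python
ENVIRONMENTS = ["dev", "staging", "prod"]


def promote(version, deploy_and_check):
    """Deploy one version to each environment in order; stop at the first failure."""
    deployed = []
    for env in ENVIRONMENTS:
        if not deploy_and_check(env, version):
            break  # do not promote any further
        deployed.append(env)
    return deployed


# Stand-in check: pretend this version breaks in staging.
reached = promote("v1.2", lambda env, version: env != "staging")
```

Because the version never changes between environments, a failure in staging says something about that exact code, and prod is never touched.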

The key takeaway - we went from that at the start of the talk with our pizza slice and iron cooking mechanism, to this where now we have small, reusable modules. We've gone through a checklist, we've written tests for them, we've code reviewed them, and we're promoting them from environment to environment. That's it. Thank you very much.

Questions & Answers

Participant 1: Thanks for that, it was great. I was curious about how are you guys managing your deployments of Terraform code for CI/CD pipeline? Are you automatically running Terraform, or are you running tests on top of that, or checks against the plans, etc.?

Brikman: CI/CD, the whole end-to-end pipeline is a pretty long discussion because a lot of it does tie into the specifics of each company. That's why there are three million CI/CD tools out there, because you're trying to implement your custom workflow. But the general structure to imagine in your head is, at the very basic layer, we start over here with the individual standalone modules, so like that Vault module that I showed you. That lives in its own repo, its own little universe. It's just there, and you have example code and tests for it. Every commit there gets tested, and only if it passes the tests do you then release a new version of that code.

You have a whole bunch of these repos. You have one for Vault, you have one for your VPCs, one to deploy Kubernetes, etc. Each of those is tested by itself. Those are your individual building blocks. Then over here you're going to now assemble those into your particular infrastructure. So that might be a separate repo. Whether it lives in a separate repo or not isn't actually critical, but I'm just going to explain it that way because it just mentally helps to separate them. In a separate repo, you're now going to assemble your infrastructure using those modules. You're going to take Vault over here, and you're going to deploy in this VPC, and you're going to run Kubernetes in this thing, and you're going to connect it with the service discovery mechanism, etc.

That's, again, more code that you write. Now, if those were your individual units that you are unit testing, this way to combine them is essentially the integration testing. So what you want to do, is in this repo, you're going to run a test against it after every commit as well, or maybe nightly, depending how long those take. Now we're getting into the larger chunks of infrastructure, so that's a little slower.

If those tests pass, now you can release a new version of your infrastructure repo. Now you can take those versions, and finally, in our final repo, you deploy them to your actual live environments. You can take version 3.6 that passed tests in this repo, which is using 10 modules from those repos that also passed tests, and you're going to deploy that to Dev. If it works well, you're then going to deploy to staging. If that works well, you're going to deploy that same version to production.

In those live environments, typically you're not going to spin up the entire architecture from scratch. You could, but that probably takes hours and is brittle. So usually what most companies do is they just have some kind of smoke tests that run against the already existing environment. So if you have a dev environment, you deployed version 3.6 of your microservice library in there, and you're going to run smoke tests against that environment that's already standing up and do a bunch of sanity checks. Can I talk to my service? Can I deploy a new version of it? Whatever you need to verify. Then you do the same thing in staging and production. And hopefully, by the time you're in prod, the code has been tested across so many layers that, while you're still going to have bugs, because there are always going to be bugs, it hopefully eliminates a lot of the sillier ones.
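A smoke test run against an existing environment can be sketched like this in Python. It is a minimal illustration, not a real test suite: `run_smoke_tests` and the check names are hypothetical, and the lambdas stand in for real checks such as an HTTP request to the service.

```python
def run_smoke_tests(checks):
    """Run every named check and return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failures.append(name)
    return failures


# Stand-in checks; real ones would talk to the already-running environment.
checks = {
    "can talk to my service": lambda: True,
    "can deploy a new version": lambda: False,
}
failures = run_smoke_tests(checks)
```

Running all the checks and collecting the failures, rather than stopping at the first one, gives a fuller picture of what is wrong with the environment in a single pass.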


Recorded at:

Mar 30, 2019