Infrastructure as Code: Past, Present, Future



Joe Duffy discusses the challenges (and solutions) met while running IaC and how that shapes the future of IaC.


Joe Duffy is CEO of Pulumi, a Seattle startup making it easier for teams to program the cloud. Prior to founding Pulumi in 2017, Joe held leadership roles at Microsoft in the Developer Division, Operating Systems Group, and Microsoft Research. Most recently Joe was Director for Engineering and Technical Strategy for Microsoft's developer tools.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Duffy: Welcome to infrastructure as code: past, present, and future. My name is Joe Duffy, founder and CEO of Pulumi. I'm here to talk to you about all things infrastructure as code. We'll start with a little bit of history. Where do we come from? We'll talk about the current state of infrastructure as code and what tools exist out there, and where we might be headed from here.

Where We Came From

Where did we come from? I think you look at the evolution of server-side management and provisioning servers. At the beginning, of course, you've got hardware. You've got to rack and stack that, move it into a data center. Eventually, virtualization made a lot of this easier. Once you've got a server, it's a stateful machine. If there are dependencies, or pieces of software that need to run on that, of course, how do you install those? The first step is you manually do so. You SSH into a machine. You manually patch it as that software needs to be updated over time. These days it's a lot of pointing and clicking. We've got consoles in all the cloud providers, or even on-prem with vSphere management consoles. All these things have problems. Where we came from was not repeatable. You would go configure some infrastructure, maybe install some packages. What if you needed a second server that looked the same? What if something failed? Repeatability was a challenge with this mode of doing it manually, either through command lines or in UIs. It also doesn't scale. I mentioned going from one to two servers. What if you're going from 1 to 10, or 100, or 1000s? What if you're going from one environment to multiple environments? It's easy to make a mistake. If you're manually running commands, and you fat-finger a command and something fails, now you're in a mode where you have to manually recover from that failure. These sorts of failures, we hear about them all the time. They lead to reliability problems, outages, security problems. There's definitely a better way to go about things. To foreshadow a little bit, that's what infrastructure as code is all about.

It sounds familiar. If you look back in the 1970s, back then we were building software packages. How were we building software packages? Again, running manual command lines. We're compiling software, packaging it up, linking software. From there, we created a tool called Make. Make is an interesting source of inspiration for where we're going to go in terms of infrastructure as code, in that Make is all about taking some set of inputs, performing some set of commands, and then having some outputs as a result: source code in, compiled program out. The way Make works is actually super interesting, because you're using a domain specific language. We'll see that infrastructure as code, many tools have domain specific languages. In either case, Make has to understand dependencies between all of these so-called targets, and then orchestrate the execution of the build process so that those inputs can properly translate repeatably in an automated fashion to those outputs.

You'll see, a lot of these concepts show up in the way that we have approached how to manage infrastructure and how to automate infrastructure. The first tool that I'll point to in this evolution is CFEngine. CFEngine was really the first tool to take an approach similar to what we saw with Make, where dependencies can be described. The process of automating infrastructure is encoded in machine readable textual declarations, hence the term infrastructure as code. Infrastructure as code allows you to automate manual processes. We're not going and SSHing into servers and running manual commands to update packages. We're really encoding those things as repeatable steps, so that as the needs of our software evolve, we can repeat, reuse, and scale that. Also, as the scalability requirements change, we can easily scale that infrastructure without having to manually worry about issuing commands that can fail and recovering from those.

If we fast forward to these days with the modern cloud, we'll see that the complexity of infrastructure these days is in a totally different league than 20 years ago, and so, increasingly, being able to treat infrastructure like software has unlocked a lot of practices and processes that can help us tame the complexity of that infrastructure. It really just eliminates a whole bunch of manual effort, and increases the ability to collaborate and do more with less. If you look at this, this is my attempt at an infrastructure as code family tree. We'll see, Make came out way back in the '70s. CFEngine actually was early '90s. Although, if you looked at CFEngine 1, 2, and 3, those were spaced apart with some pretty significant changes for each major revision. Then we'll walk through actually every other tool on this roadmap, but you'll see a few branches in here that are interesting to trace through, which I'll do in this talk.

Imperative vs. Declarative

We'll first talk about some general concepts, and then we'll get into some specific tools. The first major concept is the notion of imperative versus declarative. What do I mean by that? I'll give you an example just to make it concrete. What if our objective is to create a Python web server listening on port 80, available over the internet? The imperative way is to describe how to go about creating a Python web server. Here are some steps. We might create a new AWS EC2 instance, assuming we're running this in the cloud. We might then SSH into that instance, install Python, copy our web server files from maybe a Git repo or somewhere else that we've got them stashed. Start a Python web server process. Then add a port 80 firewall rule to allow traffic. That's fine but that really speaks to those manual steps that we talked about earlier. What if something fails in between steps two and three, for example, or so on, and we have to manually recover? The declarative approach instead is to literally say to some oracle, create a Python web server listening on port 80. That oracle, the infrastructure as code engine, the infrastructure as code tool that we're using, will then decide how to perform those steps. We've removed ourselves from the messy business of figuring out every precise step necessary to actually provision that, and we allow that oracle to figure out things like, what if it fails? How to actually distill that into the concrete steps. Nearly all the infrastructure as code tools we'll look at have some notion of a declarative approach to them. It varies widely based on the tool.
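
As a rough illustration of that contrast, here is a minimal sketch in plain Python. Nothing in it talks to a real cloud: the step strings, the `declaration` dictionary, and the `expand` function are hypothetical stand-ins for the "oracle" role described above, not any tool's actual API.

```python
# Imperative: we write down every step ourselves, in order.
imperative_steps = [
    "create EC2 instance",
    "SSH into instance",
    "install python",
    "copy web server files",
    "start python web server",
    "open port 80",
]

# Declarative: we state only the outcome we want.
declaration = {"kind": "web-server", "runtime": "python", "port": 80}


def expand(decl):
    """Hypothetical 'oracle': work out the concrete steps for a declaration."""
    return [
        "create EC2 instance",
        "SSH into instance",
        f"install {decl['runtime']}",
        "copy web server files",
        f"start {decl['runtime']} web server",
        f"open port {decl['port']}",
    ]


# The engine arrives at the same steps, but it owns them (and their failures).
print(expand(declaration) == imperative_steps)
```

The point of the sketch is only where the step list lives: in the imperative version the operator maintains it by hand, while in the declarative version it is derived from the stated goal.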


Let me get a little more concrete. I'm going to actually show you a demo of that specific Python web server listening on port 80 in an infrastructure as code tool. This is general concepts. I'll be using Pulumi, the company I work for, just as an example, but the concepts transcend any one tool. Here, we've got just an editor. One of the things about infrastructure as code is often many tools are using general-purpose languages. We can use tools like VS Code, the editor we've got here. We can use package managers and all the familiarities that we have with languages. Again, depends on the tool, some use YAML, some use domain specific languages. In this case, we're using Pulumi. I can use any language. I've chosen to use Python. Really, if you look at what's happening here, we're just setting some objects. Again, this is declaring the desired state of my infrastructure. That concept of desired state is important. If we trace through, we're saying, we need to allow internet traffic on port 80. The way we do that on AWS is to use a security group with IngressArgs that allow incoming traffic over the internet on Port 80. We need to use a Linux Amazon Machine Image, AMI, for EC2. It's effectively just the image that we want the server to run. Then we go ahead and we configure our server and we declare that it's going to use that AMI that we just looked up, and put it in a security group that we created, and run a Python web server. Then at the end, we're going to spit out the automatically assigned addresses.

Notice that this is declaring the state. It's not necessarily talking about exactly how to go about creating that web server, that's left to the declarative engine. With Pulumi, I just say, pulumi up. Most infrastructure as code tools will show you a plan of what's going to happen before actually performing that. Here, we see that it says, we're going to create a security group and an instance. If we want to look at some of the details, we can. It will show us things like the instance type and all the other settings. In this case, we're just going to say yes. That's going to go ahead and start chugging away. Notice that that first step showed me what's going to happen first. This is a really key concept, because although you could go script against the cloud providers, you could go use the SDKs that the cloud providers offer, that still has the problem of not being able to know what it's going to do before you run the script, and what happens if it fails. The benefit of an infrastructure as code tool is you remove yourself from all of that. The infrastructure as code tool can handle things like showing you what it's going to do beforehand, so you can always make sure that there are no unintended consequences of a deployment activity. If something were to fail here, I'd be able to trivially pick up where I left off. This is going to go create the server. It takes a little bit to create the server in Amazon. You'll see here it's done, and it gives us the autogenerated IDs.

The wonderful thing now is if I wanted to take this and deploy it a second time, maybe this is my staging environment and I want to go deploy to production. That's trivial. Or maybe if it's production, East Coast versus West Coast, or production EU, maybe they all have different settings. I can still start from a common code base and repeatedly scale that. Again, if I wanted a second server, a third server, that works too. This demo was creating something from scratch, but infrastructure as code tools can take an existing environment and evolve it over time by basically diffing the current state with the future desired state. If I were to go ahead and curl this HTTP address, I will see, Hello, World, the server is up and running. I'm going to go ahead and destroy this, because we don't want to leave our server sitting around costing us money.
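
The preview step, and the idea of diffing current state against desired state, can be sketched in a few lines of Python. This is an assumption-laden toy, not how Pulumi or any other tool implements it: `preview` is a hypothetical function, and the resource names and properties are made up for illustration.

```python
def preview(current, desired):
    """Return a plan of (action, name) pairs without changing anything."""
    plan = []
    for name in sorted(desired):
        if name not in current:
            plan.append(("create", name))
        elif current[name] != desired[name]:
            plan.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            plan.append(("delete", name))
    return plan


# Hypothetical existing environment and the new desired state.
current = {
    "web-secgrp": {"port": 80},
    "web-server": {"instance_type": "t2.micro"},
}
desired = {
    "web-secgrp": {"port": 80},
    "web-server": {"instance_type": "t3.large"},
    "web-server-2": {"instance_type": "t3.large"},
}

# Show the plan first; a real tool would ask for confirmation before acting.
for action, name in preview(current, desired):
    print(f"{action:>6}  {name}")
```

Unchanged resources produce no plan entry at all, which is why re-running a deployment against an already-correct environment is a no-op. A real engine would additionally order the plan by dependencies rather than by name.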

Expression Language vs. Evaluation Engine

You saw there the basic concepts of infrastructure as code. We provisioned a web server by declaring it. One key concept is declarative versus imperative. Another key concept here is the language versus the evaluation engine. You'll see that I claimed that the previous example was a declarative example, and yet I was expressing the desired state using an imperative language, Python. Many infrastructure as code tools offer different choices. You can do YAML, or JSON. You can use a domain specific language like HashiCorp's Configuration Language, or Puppet script. You can use a general-purpose language like Ruby with Chef or JavaScript, Python, Go with Pulumi. The key concept here is that you're expressing the desired state in one language, and then the evaluation engine is actually carrying out the deployment activities. The deployment engine and the evaluation engine can be declarative, such that it has a goal state, and if something fails, it can resume where it left off. Yet in that expression language, it can be beneficial to have the imperative constructs. In Python, I might take that example I just showed you, put it in a class so that I can have a web server class that encapsulates some of those concepts. I can have a for loop. If I wanted to create 3 servers, I could say, for i in range(0, 3), provision. Yet none of that takes away from the fact that ultimately, at the end of the day, the evaluation engine is declarative. Separating these concepts in your mind as you think about infrastructure as code can be quite helpful when understanding the options available, and which ones are the ones that you want to choose for your project.
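
The separation between expression language and evaluation engine can be sketched as follows. This is plain Python with no real provisioning: the `Stack` and `WebServer` classes are hypothetical names invented for illustration, and the "engine" is reduced to a list of plain dictionaries that it would receive.

```python
class Stack:
    """Hypothetical container for declared resources (plain data only)."""

    def __init__(self):
        self.resources = []

    def declare(self, kind, name, **props):
        self.resources.append({"kind": kind, "name": name, **props})


class WebServer:
    """Hypothetical reusable component: one security group plus one server."""

    def __init__(self, stack, name, port=80):
        stack.declare("security-group", f"{name}-secgrp", port=port)
        stack.declare("server", name, security_group=f"{name}-secgrp")


stack = Stack()
# An ordinary for loop, used only to build up the declarations.
for i in range(0, 3):
    WebServer(stack, f"web-{i}")

# What the engine receives is just data describing the desired state.
for resource in stack.resources:
    print(resource)
```

The class and the loop run once, while the program is evaluated. What they leave behind is a flat description of six resources, and that description is all a declarative engine would need to plan, apply, and resume.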

On the Importance of DAGs

Another key concept, I hinted at this with the Make example. Ultimately, if you looked at that example we just had, we had a web server EC2 instance. The EC2 instance actually consumed, you remember, it referred to the security group that we had declared. That forms a DAG, a directed acyclic graph. This is a concept that is pervasive in many of the infrastructure as code tools, going all the way back to CFEngine. I believe it was CFEngine 2, though it could have been 3. It's a key concept because what it allows the engine to do is parallelize operations and also perform operations in the correct order. If you think back to what we just showed, we create a security group, and then a web server. The web server depended on the security group, so, clearly, we have to wait until the provisioning of that security group is completed before we can go create the server. Similarly, I just destroyed the stack. That means it has to do it in the reverse order. If it tried to destroy the security group first, it would find that the web server depended on that. The concept here is that an infrastructure as code engine not only needs to create this declarative plan, it also needs to understand references between all of the infrastructure components within your stack. This is similar to Make as well. Make needs to build things in the right order. If something consumes the output of a previous build step, it has to do those in the correct order. DAGs are very important. It's an internal implementation detail if you're just using the tool, but it definitely helps to understand this aspect of infrastructure as code as well.
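
A minimal sketch of the ordering idea, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The resource names and the `batches` helper are assumptions made for illustration; a second server is added to the talk's example so that there is something to parallelize.

```python
from graphlib import TopologicalSorter

# Map each resource to the set of resources it depends on.
dependencies = {
    "web-secgrp": set(),
    "web-server-1": {"web-secgrp"},
    "web-server-2": {"web-secgrp"},
}


def batches(deps):
    """Group resources into waves; everything in one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies already done
        waves.append(ready)
        sorter.done(*ready)
    return waves


create_waves = batches(dependencies)
# Destroying walks the same graph in reverse: dependents go first.
destroy_waves = list(reversed(create_waves))

print("create: ", create_waves)
print("destroy:", destroy_waves)
```

The security group lands in a wave of its own ahead of both servers, and the two servers share a wave because neither depends on the other. Reversing the waves gives a safe teardown order.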

Desired State and Convergence

I've mentioned this notion of desired state many times. This is also a common concept amongst many infrastructure as code tools, where, effectively, the infrastructure as code engine has to eventually converge on your desired state. You think of the security group and the web server, we've declared that to be our desired state. Now the infrastructure as code tool's job is to figure out, where are we? Are we starting from an empty environment? Are we starting from an environment that's partially constructed? Is it fully constructed and it's just a minor change to the environment, like we're going from a t2.micro to a t3.large or something in terms of the web server size? Then it can just update one property on that EC2 VM. The idea that this desired state is known by the system and then the system can converge towards it is actually critical to not only infrastructure as code tools, but also modern technologies like Kubernetes. If you're declaring a Kubernetes configuration, the API server and control plane are effectively doing the same thing. This is a critical concept. It's also why infrastructure as code tools can resume in the face of a failure. Because that desired state is known, the oracle can always be trying to converge. If something fails along the way, it can simply pick up where it left off and continue attempting to accomplish that desired state. This is another key concept that you'll encounter with infrastructure as code tools.
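
Convergence and resumption after a failure can be sketched like this. It is a toy model under stated assumptions: `converge` and `flaky_apply` are hypothetical, the "cloud" is a dictionary, and the transient failure is simulated.

```python
def converge(desired, actual, apply):
    """Apply each resource that differs from its desired state.

    Returns True when everything was applied, False if a step failed.
    """
    for name, spec in desired.items():
        if actual.get(name) == spec:
            continue  # already in the desired state; skipped on a retry
        try:
            apply(name, spec)
        except RuntimeError:
            return False  # stop here; a later run picks up from this point
        actual[name] = spec
    return True


calls = []
failures = {"web-server": 1}  # simulate one transient failure


def flaky_apply(name, spec):
    calls.append(name)
    if failures.get(name, 0) > 0:
        failures[name] -= 1
        raise RuntimeError(f"transient failure creating {name}")


desired = {
    "web-secgrp": {"port": 80},
    "web-server": {"instance_type": "t2.micro"},
}
actual = {}

first = converge(desired, actual, flaky_apply)   # fails on web-server
second = converge(desired, actual, flaky_apply)  # resumes; secgrp is skipped
print(first, second, calls)
```

Because the engine compares against the known desired state on every run, the second attempt does not redo the security group. It only retries the step that failed.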

Y2K: Commence Data Center Modernization

Let's talk about now some specific tools. We'll go through the evolution of computing and cloud computing, starting from data center modernization in the 2000s, where we really started moving whole hog from rack and stack and physical servers, and data centers, and manual configuration to software managed VMs and automation with VMware. The great thing is, we went from KVMs and switches where we actually had to physically be located with your server to actually do anything with it, or Telnet into it, or SSH into it, to actually having a software control plane that manages our servers. That's really a key innovation. Because what that did is it meant that infrastructure now is software effectively, it's programmable. Software has an API. vSphere, for example, had that software defined control plane. When you look at AWS, it's pretty incredible, actually, you can curl a REST API to get a server somewhere in some data center. At Pulumi, we like to say, program the cloud. This was the key innovation that happened along the way that allowed us to build increasingly powerful infrastructure as code tools, and other tools and capabilities as well.


Let's talk about one really important concept. We'll talk about configuration-based infrastructure as code and provisioning-based infrastructure as code. These are two primary different families of IaC tools. Configuration, once you created a server, a virtual machine or a physical server, you need to install and configure software on it. You need to copy and edit files, run commands and processes, start daemons and services. Configuration based infrastructure as code was really born in a stateful world, a world of virtual machines where you patch those servers in place anytime they needed to be upgraded. In the early era, these were the initial forays into infrastructure as code primarily with virtual machines. CFEngine even predated these, but you look at Chef, Puppet, SaltStack and Ansible, really what these tools were about was how to install and upgrade and patch software on virtual machines. That was the primary mission. Of course, they have since evolved to a lot more than that. At the time, that was one of the major challenges, how to automate, how to make that repeatable.

Puppet is really an early approach here that builds on some of the momentum of CFEngine, but takes a slightly different approach. Puppet gives you a domain specific language. It's very Ruby like but it's a DSL, where in this case, we're basically declaring some firewall rules that we want to accomplish. Again, look at how declarative this is. This is not a Bash script that goes and runs a set of commands to accomplish this, it's actually saying, firewall for protocol ICMP, make sure it's set to accept, and so on and so forth. Of course, this will translate into some SSH command, some edits to some configuration files on the machine, but we're able to operate at a much more declarative, higher level, more repeatable level of abstraction.

Chef builds on this as well and takes things in a slightly different direction, which is actually using Ruby. It'll look very similar to what we just saw, we're still declaring firewall rules but we've got the full expressiveness of the Ruby language. Again, remember, because of the separation between the expression language and the infrastructure as code engine, although we now have access to all the rich capabilities of Ruby, we're still getting the belt and suspenders of infrastructure as code. Thanks to this, Chef really enabled and unlocked a lot of amazing capabilities, like cookbooks, which were Chef's sharing and reuse mechanism. With programming languages, we're accustomed to sharing and reusing packages and patterns, so that we don't have to copy and paste scripts all over the place. We can actually say, here's a common configuration, define it once and then use it a whole bunch of times. That's a very powerful capability. In addition to that, and the expressiveness of the language, we get things like testing, the ability to actually test our code. This was the first example of using a full-blown general-purpose language, but marrying that with the rock-solid infrastructure as code engine. We'll see this pattern repeats, although from here we go in a slightly different direction with many of the other tools.

Ansible approached this from a different angle, and instead of a DSL, instead of a general-purpose language, actually used YAML, a markup language to express effectively the desired state of the configuration. One of Ansible's major innovations in this space was having a serverless design, not in the way that we talk about serverless and Lambdas today, but not needing an agent, so agentless. Effectively, it could manage any machine that had Python installed on it, so you don't have to worry about running an agent or running some heavyweight processes on the machines that you're trying to manage. Which has made adopting Ansible at scale within organizations, just very seamless. Almost every machine already has Python on it. If it doesn't, it's pretty noncontroversial and easy to get that installed. Again, we've seen DSL, general-purpose language, and YAML, and you've already seen basically the three approaches to expression languages that you'll find in all infrastructure as code tools.

DSL vs. GP vs. ML

Some of the benefits of a DSL are that it's purpose built and it's easier to get started. That DSL can afford to basically make language design decisions that are intended to streamline and make it super easy to use for that one purpose-built situation. A DSL's pros are also its cons, however: it's single use. You learn it for that one thing, but it doesn't transcend that to other use cases. Compare that to Python, where if you learn Python, there are many other tools that use Python, and you've now learned something that is transferable to other domains and contexts. Because of that limited familiarity, somebody coming from a different background, or coming to the space for the first time, has to learn that DSL. Now, because DSLs are often simpler, maybe that learning curve is easier. If you just graduated from college, you know Python, but you're most likely not going to know that specific DSL. Many DSLs are destined to grow up into a general-purpose language. They're just not designed that way from the outset, and so you often end up with funny for loops, or funny if statements that maybe were not designed intentionally, but bolted on after the fact. General-purpose languages, again, are multi use. They give you the ultimate expressiveness, which can be a challenge, but in this context, because infrastructure as code has the declarative engine at the core, it limits the blast radius of how much you can shoot yourself in the foot with that expressiveness. General-purpose languages also have broad familiarity.

The cons are, if you don't know the general-purpose language, it sometimes can be more complex to learn it when you're just getting started. Frankly, marrying that with declarative infrastructure as code is not trivial. We've seen that Chef was able to do that. We've seen that Pulumi was able to do that. There aren't many examples of being able to do that because you do need to narrow down the object model and really design a system from the outset that can work with the best of both worlds. Finally, markup language is really extreme simplicity. JSON and YAML are effectively universal data formats that are used everywhere throughout computer software these days. It also aligns well with that declarative approach. Really, it is data, so you're declaring. There's no compute in that. The catch is that markup languages lack expressiveness. If you do need a for loop, or you do need some level of abstraction, or templating, or you're declaring something like the web server that needs to reference the security group, you have to invent constructs to enable those things, which often feels like you're now jamming a general-purpose language into what was meant to be a simple data format. In fact, you look at some systems like Helm in the Kubernetes ecosystem, they've had to add Go templates to generate the YAML. In many complex situations, you do run into the wall of complexity, and that leads to a lot of copy and paste, a lot of custom tooling built up around this supposedly simple format. There's really not great tooling to help manage those things at scale. All of these are pros and cons. I just wanted to give you the full landscape. Again, the right solution to the right problem domain is typically my guidance for when to pick one over the other.


Along the way, we introduced this concept of DevOps, developers and operations, and really taking those things and working together. DevOps really is a set of practices to harmonize software development and operations so that we're not thinking of these things as, developers go write the code over here. They throw the code over the wall with a ticketing system in between to the operations team, who then goes and clicks a few things. Really, this is the idea of reimagining the software development lifecycle with infrastructure at the forefront, and really helping the two sides of the house, developers, operations, infrastructure collaborate. This really helps us to achieve continuous software delivery where we're shipping daily, hourly, constantly, instead of just quarterly, or some of the ways that we did things 20 years ago. Really, I mention DevOps here, because infrastructure as code was an essential technology to first facilitating DevOps. Then DevOps also helped carry infrastructure as code into the mainstream.

Cattle vs. Pets

Let's keep going on the journey here. I think DevOps predated the cloud, just barely, and then the cloud came on the scene. I love this press release, from AWS's first service launch with S3 back in 2006. This really changed the game completely. One of the things that it introduced was this notion that infrastructure is an API call away, and it's much more repeatable and scalable. We have managed services. Effectively, we can offload a lot of the operations tasks and management of our infrastructure to the AWS control plane, the AWS services themselves. As we hit this inflection point, there's another concept that I want to mention, because it really leads from configuration-based infrastructure as code to provisioning-based infrastructure as code, which we're about to dive into: this notion of cattle versus pets, which is probably a terrible analogy. It's not mine, but it's pretty well known. The idea is, in the world of VMs, those were like pets. Every machine, especially in the world of physical racked and stacked data center computers, every one of those is a special stateful being. They have specific names. They're unique, lovingly hand-raised and cared for. When something goes wrong with them, you try to get it back to health. If something fails on the hardware, you go swap out the SIM card and install something new. When a piece of software gets corrupt, or there's a security vulnerability, you go patch that thing.

Whereas in the shift to the cloud, we move more towards this notion of cattle, which is, we've got a lot of these machines, and they're almost identical to each other. When one fails, you swap in a new one, whether that's because hardware fails, and we just get a new machine, we plug it in. Or if something needs to be upgraded, often, instead of going and patching that and worrying about the fact that it's stateful, which means that if something fails the machine might be in a bad state, we just go and create new ones. Then using things like load balancers, we can redirect traffic from the old to the new. This slide I took directly from the initial introduction of this concept, in 2012.


That leads to provisioning. In a world with cattle instead of pets, it doesn't make sense to really think a lot about patching that virtual machine. In fact, we can just bake images, and we'll see that containers, with Docker and Kubernetes, are the new way we do things. We need to bake images with the latest software in them and then just go provision new versions and update all the old references to the new. What that means is we really don't have to think about state transitions from A to B to C to D. If something fails along the way we can just say desired state, so getting back to this notion of desired configuration, and the infrastructure as code tool can figure out how to get from wherever we are today to where we want to go tomorrow.
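
The "replace, don't patch" idea can be sketched in plain Python. Everything here is a hypothetical model: `provision`, `replace_fleet`, the image tags, and the dictionary standing in for a load balancer are illustrative only and do not correspond to any cloud API.

```python
import itertools

_ids = itertools.count(1)


def provision(image):
    """Hypothetical: create a fresh instance from a pre-baked image."""
    return {"id": f"i-{next(_ids)}", "image": image}


def replace_fleet(load_balancer, new_image):
    """Provision replacements, repoint traffic, return the retired instances."""
    old = list(load_balancer["targets"])
    new = [provision(new_image) for _ in old]
    load_balancer["targets"] = new  # redirect traffic from the old to the new
    return old  # nothing was patched in place; these can simply be terminated


lb = {"targets": [provision("web:v1"), provision("web:v1")]}
retired = replace_fleet(lb, "web:v2")

print([t["image"] for t in lb["targets"]])
print([t["id"] for t in retired])
```

No instance ever transitions from one version to another here. The old machines keep serving until the new ones exist, and then they are discarded, which is what lets the tool reason about a single desired state rather than a chain of in-place upgrades.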

That's led to this middle era of infrastructure as code tools. There are cloud specific tools like AWS CloudFormation, Azure Resource Manager, Google Deployment Manager, and then the cloud-agnostic Terraform. CloudFormation can use JSON or YAML. It looks a lot like the Ansible example we saw earlier, it's just declaring AWS infrastructure here. This is actually basically doing the same thing we saw on our demo earlier, effectively, just spinning up an EC2 instance that runs some commands when it spins up. We'll notice a few things though, that !Sub here and Fn::Base64. Again, because of the limitations of a markup language, CloudFormation has introduced a mini-DSL embedded within it, where you can actually substitute commands to reference other objects or perform computations, like in this case, we're substituting in and Base64 encoding some strings.
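
To show what those two intrinsic functions compute, here is a rough Python analogy using only the standard library. It is not CloudFormation itself: `string.Template` merely mimics the `${Name}` substitution style of `!Sub`, `base64.b64encode` plays the role of `Fn::Base64`, and the `StackName` variable and the startup script are assumptions made up for the example.

```python
import base64
from string import Template

# A startup script with a ${...} placeholder, in the style that !Sub fills in.
script = Template(
    "#!/bin/bash\n"
    "echo 'Hello from ${StackName}' > index.html\n"
    "python3 -m http.server 80\n"
)

# Step 1 (like !Sub): substitute the placeholder with a concrete value.
user_data = script.substitute(StackName="demo-stack")

# Step 2 (like Fn::Base64): encode the resulting text as Base64.
encoded = base64.b64encode(user_data.encode("utf-8")).decode("ascii")

print(user_data)
print(encoded)
```

In a general-purpose language these are two ordinary function calls. In a markup format they have to be expressed as special keys or tags that the engine interprets, which is the embedded mini-DSL the talk refers to.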

Azure Resource Manager has very similar concepts. It's just written in JSON instead of YAML. Terraform, however, takes that approach of a domain specific language. Again, very similar to what Puppet did where it didn't have to be constrained by the limitations of a markup language, and yet didn't have to go and support a full-blown language. Terraform allows you to express yourself in its DSL, the HashiCorp Configuration Language (HCL). We'll see here this var.instance_type, var.instance_key. We can reference the security groups. We have some facilities for programming, but within a simpler declarative DSL. Terraform, again, is multi-cloud, so it can support AWS, Azure, Google Cloud, and also on-prem things like vSphere.

Enter Containers

All of that was early 2010s. Then Docker and Kubernetes came on the scene, and really changed how we think about application architectures, and encouraged us to move beyond lift and shift. I think a lot of the prior technologies were still very oriented around, how do we provision VMs, and how do we configure them? Kubernetes and cloud native, in fact, actually really dovetail nicely with this concept of declarative infrastructure as code, because the Kubernetes object model and way of doing things is all about eventual consistency, assuming things will fail, and really working in terms of desired state, and loosely coupled dependencies. Unfortunately, there's lots of YAML in there, but the concepts at the core of how Kubernetes works align very nicely with the core concepts of infrastructure as code that we've already talked about. At this point, where did the general-purpose languages go? We saw that with Chef. Then, basically, for 10 years, we just moved entirely away from general-purpose languages, which meant that we lost the ability to use for loops and if statements, and we lost abstraction facilities like functions, classes, and modules. A lot of the editor support is now missing. Most editors are not going to understand CloudFormation's way of doing a DSL to reference things. You're not getting statement completion or red squiggles if you make a mistake. Unless somebody goes and implements a specific plugin for it, you're not getting interactive documentation for things like Terraform's DSL. It led to a lot of copy and paste, and a lot of recreating the wheel.

Today's era, we're seeing a resurgence of two things. One, more cloud native technologies, not just Kubernetes and Helm, what people typically mean when they say cloud native, but really embracing managed services like AWS Lambda, or Azure Cosmos DB, offloading some of the heavy lifting of managing services to the cloud providers themselves. Also, seeing a renaissance of, let's use those general-purpose languages. Let's marry those with declarative infrastructure as code. Let's give ourselves the ability to tame the complexity and go from small numbers of services to many services, to many environments, and use software to tame that complexity like it was designed to do. This led to Pulumi's introduction, the company that I founded back in 2017. We already saw this in action, but the idea is, use JavaScript, TypeScript, Go, Python, C#, Java, we actually even support YAML, if you prefer that markup language approach. The idea here is no matter what choice you pick for the expression language, you're still getting all that great declarative provisioning-based infrastructure as code, belt and suspenders, previews, and all of that.
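As a rough illustration of marrying a general-purpose language with a declarative model, the sketch below uses an ordinary Python function and loop to build a list of desired resources. The `Resource` shape and the `web_service` and `desired_state` names are hypothetical and are not Pulumi's API; the point is only that the program declares what should exist, and a provisioning engine would then decide what to create, update, or delete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A declarative description of one piece of infrastructure (hypothetical shape)."""
    kind: str
    name: str
    properties: tuple = ()


def web_service(env: str, replicas: int) -> list[Resource]:
    """Reusable abstraction: one load balancer plus `replicas` servers for an environment."""
    resources = [Resource("load_balancer", f"{env}-lb", (("port", 443),))]
    for i in range(replicas):
        resources.append(Resource("server", f"{env}-web-{i}", (("size", "small"),)))
    return resources


def desired_state() -> list[Resource]:
    """Declare what should exist in every environment; nothing is provisioned here."""
    state = []
    for env, replicas in (("staging", 1), ("production", 3)):
        state.extend(web_service(env, replicas))
    return state
```

Adding a third environment is one more tuple in the loop rather than another copied block of markup, which is the kind of complexity taming the talk describes.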

Summary - Picking an IaC Solution

If you're picking an infrastructure as code tool today, it's very common to see folks using Pulumi with Ansible, or Terraform with Chef, or any combination thereof, because configuration is still super useful if you're doing virtual machines. If you're doing stateful workloads and stateful virtual machines, you're going to have to configure them, you're going to have to patch them. Those tools are very much vibrant and being used widely in the ecosystem today. If you're really going all in on modern cloud architectures, provisioning-based infrastructure as code is basically table stakes. You're going to want to pick Pulumi, Terraform, something in that category. For simple use cases, it's fine to start with the DSL or markup language. We saw that with Pulumi you can pick either of those and still stay within the Pulumi ecosystem. If you want something expressive that can really work in a very complex setting, and that is familiar to developers, general-purpose languages tend to be the way to go. Then of course, if you're Kubernetes only, you've got to pick something that works in the cloud native ecosystem. We saw some examples. Pulumi works great there. Crossplane is another example. Or just the Kubernetes native tools themselves are very infrastructure as code like.

The Future

I'm going to wind down just talking about a few trends that are exciting to me that we can expect to see in the infrastructure as code domain over the next few years. The first is empowering developers. You rewind the clock 20 years, and developers really didn't think much about the infrastructure. They'd write a three-tier application, two virtual machines and a database, and the infrastructure team could easily take it from there. We'd update the applications every quarter. Life was good. It turns out these days, first of all, the line between infrastructure and application code is getting blurrier. AWS Lambda, is that an application concept or an infrastructure concept? Somewhere in between. What about building and publishing a Docker container into a private container registry? That's somewhere in between as well. What we're finding is, increasingly, because developers are creating the business value, letting them move as fast as possible and embrace the cloud as much as possible is really key for the most innovative companies out there, especially those where the cloud is helping them ship faster. These days, all software being built is cloud software. This totally makes sense. However, you still need to have those guardrails in place, for how to provision a network, a cost-effective Kubernetes cluster, reliability, security. That's why infrastructure as code is still an essential tool for empowering developers, where the infrastructure team can define those best practices, set those guardrails, but developers can still go and self-serve some subset of the infrastructure that makes sense.

That gives rise to this notion of a platform team. We're seeing this increasingly. A platform team is the team that sits between the operations and IT organization, and the developers. Oftentimes, as the infrastructure platform team, the goal is to allow developers to be self-serve with guardrails in place, and be the connective tissue between the developers and operations team. Many times, the platform engineering team takes a software engineering mindset. They're building systems. They're putting in place Kubernetes and scaling it up. They're defining the common infrastructure as code components that are going to be used elsewhere, and the policy components. That software engineering mindset is often prevalent within the platform team. You'll find that it's an intersection of software engineering expertise and infrastructure expertise in this group. This is definitely a pattern we see in the most modern organizations.

We'll see that taming complexity is continuing to be on everybody's mind. I think there is essential complexity and accidental complexity. With the cloud today, we've got a lot of accidental complexity. If you're a developer, you just want to spin up a microservice and define a couple containers, a Pub/Sub topic, a queue, a serverless function, there's a lot of accidental complexity in that space. We're seeing already the rise of tools that talk about infrastructure from code, something Pulumi does, and some new entrants in the market. We'll continue to see that level of abstraction increasing over time, so there's less toil, we can focus a lot more on just business logic.

Security is unfortunately today still an afterthought. I think another trend that we're clearly seeing is principle of least privilege, policy as code, scanning early and often, making sure that software is secure by construction. Infrastructure as code has a huge role to play in ensuring that. We've seen a ton of great technologies here, InSpec by Chef. I mentioned the ability for Chef to test code. A lot of that work manifested in InSpec, so you can actually ensure that things are secure by default. We've got things like OPA, the Open Policy Agent, which is doing that in the realm of cloud native. We've got HashiCorp Sentinel, which does this for the Terraform tool. Pulumi CrossGuard, which allows you to do policy as code in Pulumi. Snyk, which is a more general-purpose solution across a lot of these different technologies.

And truly distributed application architectures. I think that's one of the most exciting things about the cloud: the cloud has really made it easy to create software with infinite scale, infinite data. That's a really exciting change in the entire industry, where we went from single computers, to multiple computers connected through HTTP, to multi-core, and now really, truly distributed application architectures, often leveraging containers, and serverless, and managed services. This has unlocked incredible capabilities that are giving rise to entirely new companies and business models and entirely new architectures.

Then, finally, artificial intelligence. When GitHub Copilot first came out, because Pulumi is just code, we cracked it open and started writing some infrastructure as code. GitHub Copilot was able to actually author and complete some of our infrastructure as code for us. With the recent introduction of ChatGPT from OpenAI, you can literally go and say, write a Pulumi program to create an ECS Fargate microservice, and it spits out the code to create a microservice.


We've seen where we come from. A rich history. Many giants whose shoulders we stand upon, starting from Make, CFEngine, and Puppet, Chef, and so on. To today, present day, where infrastructure as code really is table stakes for any cloud infrastructure, whether it's on-prem, hybrid, public cloud, or anywhere in between. We've got many great tools to choose from, a lot of great innovations, thanks to containers, cloud native, and being more developer centric. Then, where we go from here, really moving beyond just building blocks to best practices, distributed application architectures, and having things like security built in from the outset. The good news is we're starting from a really solid foundation.

Questions and Answers

Andoh: Can we define declarative implementation as a very thorough imperative implementation done by somebody else and reused? In some sense, everything is imperative until it's made declarative as a layer above for specific cases that demonstrated the need to be curated.

Duffy: The engine of Pulumi, although the engine itself is declarative, is actually written in Go, which, of course, is an imperative language. In fact, I like to think of programming languages occupying space along a spectrum from imperative to declarative. You look at a language like Haskell, Haskell arguably is declarative, because side effects are explicit and built into the model. Then you look at a language like F#, where it has lots of declarative facilities, and in fact, you can use F# with Pulumi, or CDK. It's actually a nice fit, because it's more of a declarative model, but it's got some imperative constructs in it, as well. I think of these things as more of a spectrum. I do think that's a good way of thinking about it. There's more esoteric solutions, like proof-carrying code, and TLA+, and all these more research and academic languages that really are more declarative in nature. Most of the time, we're trying to approximate declarative using imperative languages.

Andoh: You do have this declarative engine, and especially for Pulumi, as part of your solution to infrastructure as code. You also talk about the tenet of infrastructure as code being declarative, and we're seeing the future is declarative. Are there any downsides for the engine or just the solution to not be declarative native, in the process of bringing your own language constructs to that evaluation engine?

Duffy: I think you could definitely argue it both ways. There's more of an impedance mismatch when you have to take something that fundamentally is not declarative, and map it on to something that is declarative. Yet, at the same time, a lot of programmers are accustomed to imperative languages. We like our for loops. We like our shared mutable state, even though, on paper, it's sometimes not a great idea. The approach that we took at least was to accommodate the different preferences of different end users, who might already have a language that they know, but still give them a way to map that onto the declarative core. YAML is pretty mainstream and popular. That's more of a data format. You look at CUE, which adds some really super exciting capabilities around type checking, and the ability to do things like for loops within the context of a purely declarative data model. The unfortunate thing is there's no really super popular declarative programming language. Haskell is probably the closest, but people like to tinker with it. Certainly, in FinTech, you can go to Wall Street quants, like people are using Haskell like crazy, but more broadly in the industry a little bit less.

Andoh: You mentioned the future of infrastructure as code and you also talked about policy and the rise of policy as code and how infrastructure as code has a huge part to play in that. Could you expound on that a little bit?

Duffy: I lived through an interesting time back in the Trustworthy Computing days where we had all these viruses and issues with Windows when I worked at Microsoft. The thing that we found out was like, you can detect a lot of those through static analysis and analyzing code or analyzing the environment. It's a whole lot better if you didn't get yourself into that situation to begin with. What we did back then was we started moving things more into the programming languages, more static type checking. You look at what Rust has done, for example, eliminating entire classes of common security problems by having a robust type system. I think policy as code is effectively in that realm of static type checking, and sometimes dynamic as well. Detecting errors like, "I didn't mean for this database to be open to the internet. Or, there's a pathway into my entire microservice from the internet. Or, I forgot to encrypt files at rest. Or, I'm using an end-of-life'd version of MySQL." Catching these things, ideally, at deployment time, but worst case, after they've already made their way out. We've seen a lot of great tools there. HashiCorp has Sentinel. Chef had InSpec. At Pulumi we created policy as code. We still have a long way to go because it's still an afterthought. It's not to the point where it's checked as part of authoring my program, and it's just secure by default. I think that's where we all want to get to is secure by default. You'll see supply chain security as well as part of this, companies like Chainguard trying to make it secure by construction. I think as an industry, we'll see a lot more movement in that direction.
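The kinds of checks described in this answer can be sketched as plain functions over resource descriptions. This is a minimal illustration only: the dict layout, the `check_resource` and `check_deployment` names, and the set of MySQL versions treated as end-of-life are all assumptions made for the example, not the format or API of Sentinel, InSpec, OPA, or Pulumi's policy tooling.

```python
# Hypothetical policy-as-code sketch. Resources are plain dicts invented for this example.
PUBLIC_CIDR = "0.0.0.0/0"
END_OF_LIFE_MYSQL = {"5.5", "5.6"}  # treated as end-of-life for illustration only


def check_resource(resource: dict) -> list[str]:
    """Return a human-readable violation for every rule the resource breaks."""
    name = resource.get("name", "<unnamed>")
    kind = resource.get("type")
    violations = []
    if kind == "database":
        if PUBLIC_CIDR in resource.get("allowed_cidrs", []):
            violations.append(f"{name}: database is open to the internet")
        if resource.get("engine") == "mysql" and resource.get("version") in END_OF_LIFE_MYSQL:
            violations.append(f"{name}: MySQL {resource['version']} is end-of-life")
    if kind in {"database", "bucket"} and not resource.get("encrypted", False):
        violations.append(f"{name}: data is not encrypted at rest")
    return violations


def check_deployment(resources: list[dict]) -> list[str]:
    """Run every rule against every resource before anything is deployed."""
    return [violation for resource in resources for violation in check_resource(resource)]
```

A deployment pipeline could refuse to proceed whenever `check_deployment` returns a non-empty list, which moves the check to deployment time rather than after the fact.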

Andoh: Unit tests and integration tests, are they recommended when we implement pipelines with Pulumi?

Duffy: Definitely. I think one of the benefits to using general-purpose languages is you get the whole ecosystem around that language. If you're using Python, you can use the built-in unit testing. Every language has a unit test framework. There's a spectrum of testing. There's unit testing. There's pre-deployment testing. Actually, a common thing is open a GitHub pull request, spin up a whole fresh copy of that infrastructure, run a battery of tests against it, and tear it down. Only pass the pull request if that set of tests passes. That's another form of testing. Then once you actually merge it, there's post-deployment testing and a variety of techniques like A/B testing, canaries, blue-green deployments that are more sophisticated if you want to actually push testing into the post-deployment part of the process as well. We see people doing a lot of everything along that spectrum.
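To make the unit-testing end of that spectrum concrete, here is a small sketch using Python's built-in `unittest` module. The `server_definitions` helper and its dict layout are hypothetical stand-ins for whatever infrastructure code a team writes; no real provisioning library is involved, so the tests run without touching any cloud account.

```python
import unittest


def server_definitions(env: str, count: int) -> list[dict]:
    """Hypothetical infrastructure helper: describe `count` web servers for an environment."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        {"name": f"{env}-web-{i}", "tags": {"environment": env}, "open_ports": [443]}
        for i in range(count)
    ]


class ServerDefinitionTests(unittest.TestCase):
    def test_every_server_is_tagged_with_its_environment(self):
        for server in server_definitions("staging", 3):
            self.assertEqual(server["tags"]["environment"], "staging")

    def test_ssh_is_never_exposed(self):
        for server in server_definitions("production", 2):
            self.assertNotIn(22, server["open_ports"])

    def test_rejects_empty_fleet(self):
        with self.assertRaises(ValueError):
            server_definitions("staging", 0)
```

Saved to a file, these tests can be run with `python -m unittest` on that file, for example as a check on every pull request before any pre-deployment environment is spun up.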

Andoh: You mentioned desired state and looking at state but you didn't mention the word drift. In terms of infrastructure as code, and both the configuration part and the instantiation part, can you talk about drift and the complexities that you're seeing there and solutions?

Duffy: I just got back from re:Invent, and talking to a lot of folks, and I heard a lot of people say we require any changes to infrastructure go through infrastructure as code. There are break-glass scenarios where if there's a security issue, somebody can go in and make manual changes. A lot of people are locking down the ability to make changes. Sometimes either it's standard practice, or occasionally somebody will log in and make a manual change. When they make a manual change, the actual state of your infrastructure no longer matches what you thought you deployed. Infrastructure as code tools like Pulumi, Terraform, others can actually detect that and say, what you thought you deployed is actually different. Maybe you opened port 22, to SSH into a box and do some debugging, and then you forgot to close it afterwards. Drift detection can tell you, "You didn't mean to leave port 22 open." Then it can reconcile the state. It can go either reapply the changes, or if you want to, maybe you added a tag to a server, you can ingest that back into your infrastructure as code. That's what drift detection is. Definitely a very important thing to do. Start from, everything must go through a pipeline and then later on drift detection after that.
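The drift detection and reconciliation described here can be sketched in a few lines. This is a toy model under stated assumptions: state is a flat dict of properties, and `detect_drift` and `reconcile` are hypothetical names, not how Pulumi or Terraform implement it. It shows both resolutions from the answer: resetting an unwanted manual change, and ingesting a wanted one back into the recorded state.

```python
_MISSING = object()  # marks a property that is absent on one side


def detect_drift(recorded: dict, actual: dict) -> dict:
    """Return {property: (recorded_value, actual_value)} for every property that differs."""
    drift = {}
    for key in sorted(recorded.keys() | actual.keys()):
        want = recorded.get(key, _MISSING)
        have = actual.get(key, _MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


def reconcile(recorded: dict, actual: dict, accept: set) -> tuple[dict, dict]:
    """Resolve drift: ingest properties named in `accept`, reset all other drifted ones.

    Returns (new_recorded, new_actual); the inputs are left unmodified.
    """
    new_recorded, new_actual = dict(recorded), dict(actual)
    for key, (want, have) in detect_drift(recorded, actual).items():
        if key in accept:
            # Keep the manual change by copying it into the recorded state.
            if have is _MISSING:
                new_recorded.pop(key, None)
            else:
                new_recorded[key] = have
        else:
            # Undo the manual change by reapplying the recorded value.
            if want is _MISSING:
                new_actual.pop(key, None)
            else:
                new_actual[key] = want
    return new_recorded, new_actual
```

For instance, with a recorded `open_ports` of `[443]` and an actual `[22, 443]` plus a manually added `tags` property, calling `reconcile(recorded, actual, accept={"tags"})` closes port 22 again while keeping the tag.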




Recorded at:

Aug 03, 2023