Transcript
Duffy: I'm really excited to talk about our approach to infrastructure as code and some of the things that we've learned over the last couple of years. One of the things I'm going to try to convince you all of is that infrastructure is not as boring as one might think. In fact, learning how to program infrastructure is actually one of the most empowering and exciting things as developers we can learn how to do especially in the modern cloud era. I'll start by telling you a little bit about my journey, why I came to infrastructure as code, why I'm so excited about it. I'll talk a little bit about current state of affairs, and I'll talk about some of the work that we've been doing to try to improve the state of affairs.
First, why infrastructure as code? When I see the phrase "infrastructure as code," I'll admit I don't get super excited about it even though that is an important technology to have in your tool belt. The reason why, when I break it apart, my background is as a developer. I came to the cloud through very different routes than I think a lot of infrastructure engineers and so, that gave me a different perspective on the space. I started my career and I did a few things before going to Microsoft, but I think the most exciting time in my career was at Microsoft where I was an early engineer on the .NET framework and the C# programming language, so back in the early 2000s.
I did a lot of work on multi-core programming and concurrency, created the team that did the Task framework and Async and ultimately led to Await. Then I went off and worked on a distributed operating system, which was actually eye-opening when I came to cloud because a lot of what we did foreshadowed service meshes and containers and how to manage these systems at scale and how to program them. Then before I left Microsoft, I was managing all the languages teams, so the C#, C++, and F# language services and the IDE. What really got me into cloud was taking .NET open source and cross-platform. The reason Microsoft did that was actually to make Azure a more welcoming place for .NET customers wanting to run workloads on Linux, in particular containerized workloads.
That might not have been obvious from the outside that that was the motivation but today, that's really working well with .NET Core and Azure. In any case, that got me hands-on with serverless and containers and I got really excited. I knew for a while I wanted to do a startup and every year I would wonder, "Is this the year?" I saw that everything with cloud is changing the way that we do software development and so I saw a huge opportunity. I didn't know it was going to be infrastructure as code until I got knee-deep into it, but the initial premise was all developers are or soon will become cloud developers.
When you think of word processors in the 90s, we don't build that kind of software anymore. Even if it's a word processor, it's connecting to the cloud in some manner to do spell checking or AI, machine learning, storage, cloud storage. Basically, if you're a developer today, you need to understand how to leverage the cloud, no matter what context. Even if you're primarily on-prem, a lot of people are looking to use cloud technologies on-premises to innovate faster.
As a developer, if I'm coming to the cloud space, why do I care about this infrastructure stuff? If I go back 15 years, I would go file a ticket and my IT team would take care of it and I didn't have to think about how virtual machines worked or how networking worked or anything like this. It was convenient, but those were simpler times.
If you Google "modern cloud architecture," here's AWS on the left and Azure on the right, and the thing you'll notice is there's a lot of building blocks. All these building blocks are programmable resources in the cloud. If you can tap into how to wire those things together and really understand how to build systems out of them, you can really supercharge your ability to deliver innovative software in a modern cloud environment.
Modern Cloud Architectures
I'll talk a little bit about the current state of affairs. I spent a lot of time contemplating, "How did we get to where we are today, where we're knee-deep in YAML?" For every serverless function you write, you've got an equivalent number of lines of YAML to write that are so hard to remember that you always have to copy and paste them from somewhere or you have to rely on somebody else to do it. I think how we got here was, if you fast forward to the 2000s and virtualization was a big thing, we started to virtualize applications, we took our N-tier applications, we virtualized them usually by asking our operations and IT folks to help configure these things. It was a black box, you didn't have to think about it. It was ok because there weren't too many of them, maybe you had an application and a MySQL back end or maybe you had a front end, a mid-tier, and a database.
Things are usually fixed, they're pretty static, so you would say, "I think I need three servers for the next five years," and then you buy those servers, you pay 2,500 bucks amortized over three years for each of those servers and that's the way IT worked. I'd say the Cloud 2.0, to me, that's really the shift in mindset from static workloads to dynamic workloads, where instead of provisioning things and thinking it's going to stay fixed for five years - things change - suddenly, your application's a lot more popular than you thought it was going to be or maybe it's less popular. Maybe it was some major IT initiative for a new HR system and then suddenly, they decided to go with Google Hire or something and suddenly that thing wasn't necessary, now you've got three VMs just sitting there costing money that are unused.
We moved to this model of more hosted services, more dynamic, flexible infrastructure. We go from 1 to 10 resources in the Cloud 1.0 world to maybe a couple dozen in the 2.0 world. Now, when you think of like a modern AWS application, you're managing IAM roles, you're managing data resources, whether that's like hosted NoSQL databases or RDS databases. You're doing serverless functions, you're doing containers, you're running a container orchestrator if you're using Kubernetes, diagnostics, monitoring. There's a lot of stuff to think about and each one of these things needs to be managed and has a lifetime of its own and is ideally not fixed in scale.
You want to be able to scale up and down basically dynamically as your workloads change. I think the old approach of throw things over the wall and have them configured, we've tried to apply that to this new world and it's just led to this YAML explosion that is driving everybody bonkers.
We used to think of the cloud as an afterthought, it was, "We'll develop our software and then we'll figure out how to run it later." These days, the cloud is just woven into every aspect of your application's architecture. You're going to use an API gateway, you need to think about how routes work and how middleware works, how you're going to do authentication. There are so many aspects that are just deeply embedded within your application and you want to remain a little bit flexible there, especially if you're thinking of things like multi-cloud or if you're thinking of potentially changing your architecture in the future. You don't want to take hard dependencies necessarily, but honestly, for a lot of people that are just trying to move fast and deliver innovation, the right approach is just to embrace the cloud and use it throughout the application.
The Cloud Operating System
There's another area where maybe I'm crazy, I think of things a little bit differently than most people, but I think of the cloud as really it's a new operating system.
Over 10 years ago, when they kicked off the Azure effort inside Microsoft, it was called Red Dog at the time. Dave Cutler, who was the chief architect of Windows NT, used to describe it as a cloud operating system and everybody thought he was nuts, nobody knew what he meant. I think he really saw it this way: you think of what an operating system does. It manages hardware resources, whether it's CPU, network, memory, or disk. It manages lots of competing demands and figures out how to schedule them. It secures access to those resources and most importantly, for application developers, it gives you the primitives you need to build your applications.
It turns out if you just replace the phrase operating system in all of that with cloud, it's basically what the cloud does. It's just operating at a different granularity. The granularity of a traditional operating system is a single machine. Of course, there are distributed operating systems but most of those never really broke out of research, they're fascinating to go study, but most operating systems we think of, whether it's Linux or Mac or Windows, are really single-machine, whereas now we're moving to multiple machines. It used to be the kernel calling all the shots, if you remember the Master Control Program from "Tron"; well, now it's the cloud control plane.
The perimeter is different, now you're thinking of a virtual private cloud instead of just a single machine with a firewall running on it. It goes on and on, processes and threads versus VMs, containers, and functions. That was another perspective when coming to the space from a languages background, you look at async programming and all the things we can do with thread pools and then you look at serverless functions. There's definitely an interesting analogy there where, "Wouldn't it be great if serverless functions were as easy to program as threads and thread pools and async tasks are in your favorite language?"
Managing Infrastructure
That starts to get me into the topic of infrastructure as code. Basically, you move from a world of thinking in your programs of allocating kernel objects. Maybe you're allocating a file or a thread or whatever the objects are in your programming language that you're newing up. Now you're thinking about, "How do I stitch together lots of infrastructure resources to create applications?" Now you're thinking of networks and virtual machines and clusters and Docker containers and all the data stores required to power your application.
I think when you get here, when you say to yourself, "I want to build a cloud application," the initial approach for most people is, "I'll go into AWS, Azure, Google Cloud, DigitalOcean," whatever your favorite cloud provider is, go into the console, point and click to say, "Give me a Kubernetes cluster, give me some serverless functions," all the things you need. First of all, it's really easy to get up and running. If you don't know the things you want in advance, you can go click around, there's good documentation and there's visualizations that guide you in the right direction, but the problem with that is it's not repeatable. If you go build up your entire infrastructure by pointing and clicking in the UI, what happens when you want to provision a staging instance of that? Or maybe you want to go geo-replicated, now you have to have multiple instances of your production environment. Or what happens if something gets destroyed?
Somebody accidentally goes into the UI and thinks things are in the testing account and he says, "Delete everything," and then realizes, "That was actually in production." That happens more often than you would think. It's also not reviewable. You can't go into a team and code review that, you can't say, "Am I doing this correctly?" unless somebody stands over your shoulder. It's not versionable, so if you need to evolve things going forward, you're going in and doing unrepeatable state changes from some unknown state to another unknown state.
What most people will do as the next step is, they'll adopt scripts. All these cloud providers have SDKs, they have CLIs, so you can always Bash script your way out of anything and that's slightly more repeatable. Now you've got a script - and I'll show in a few slides the difference here, but it's not as reliable. If you're provisioning infrastructure and you've got 30 different commands you have to run to get from not having an infrastructure set up to having a full environment, think about all the possible points of failure in there and how do you recover from them? If it fails at step 3, is that the same as if it fails at step 17? Is that automatable? Can you recover from that? Networks have problems from time to time, the cloud providers have problems from time to time, so this is not a resilient way of doing things.
Usually, the maturity lifecycle of an organization starts from point and click, moves to scripts, and then eventually lands on infrastructure as code. In infrastructure as code, you're probably familiar with some technologies here. Most of the cloud providers have their own, so AWS has CloudFormation, Azure has Resource Manager templates, Google has Deployment Manager, and HashiCorp has Terraform which is a general-purpose one that cuts across all of these. I'll talk about Kubernetes a little bit in the talk, it's not a Kubernetes talk, but I'll mention a few things.
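The failure mode described here can be made concrete with a small simulation. This is only a sketch: `FakeCloud` and `provision_script` are hypothetical names invented for illustration, and the in-memory "cloud" stands in for real CLI or SDK calls. It shows a script that dies partway through, leaves a partial environment behind, and then creates duplicates when it is naively rerun from the top.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical); every create makes a new resource."""

    def __init__(self, fail_on_call=None):
        self.resources = []
        self.calls = 0
        self.fail_on_call = fail_on_call

    def create(self, kind):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError(f"network error while creating {kind}")
        self.resources.append(kind)


def provision_script(cloud):
    """Imperative script: three commands, with no record of what already succeeded."""
    cloud.create("security-group")
    cloud.create("ingress-rule")
    cloud.create("instance")


cloud = FakeCloud(fail_on_call=3)
try:
    provision_script(cloud)
except ConnectionError:
    pass  # the script died at step 3; steps 1 and 2 already took effect

leftover = list(cloud.resources)  # partial environment left behind by the failed run
provision_script(cloud)           # naive retry from the top
duplicates = cloud.resources.count("security-group")
print(leftover, duplicates)
```

Recovering correctly would mean knowing which steps already ran, which is exactly the bookkeeping a script does not keep.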
Kubernetes itself also uses an infrastructure as code model in the way that it applies configuration, so infrastructure as code is pretty much the agreed-upon way of doing robust infrastructure for applications.
What really is infrastructure as code? Infrastructure as code allows you to basically declare your cloud resources - any cloud resources: clusters, databases - that your application needs using code. Infrastructure as code is not a new thing, it's been around for a while. Chef and Puppet used infrastructure as code to do configuration of virtual machines, that's more of like the Cloud 1.0, 2.0 period of time if you go back to my earlier slides.
All the ones I just mentioned - CloudFormation, Terraform, etc., not Chef and Puppet - are in the category of declarative infrastructure as code. The idea is to take those declarations and basically use that to say, "This is a goal state, this is a state that the developer wants to exist in the cloud." They want a virtual machine, a Kubernetes cluster, 10 services on that Kubernetes cluster, a hosted database, and maybe some DNS records. That would be a configuration that you then apply, so you basically give that to your infrastructure as code engine, it says, "Ok, this is what the developer wants, here's what actually exists in production and then we'll just go make it happen." It's basically a declarative approach to stating what you'd like.
The nice thing about that is if it ever drifts, if anything ever changes, you always have a record of what you thought existed in your cloud and this will be a lot clearer as we get into a few more examples.
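As a rough illustration of why a recorded state helps with drift, the sketch below compares what was declared against what is actually live. The function `detect_drift` and the resource names are hypothetical, and plain dictionaries stand in for both the record and the cloud; a real engine would query the provider.

```python
def detect_drift(recorded, live):
    """Return {resource: {property: (recorded_value, live_value)}} for every mismatch."""
    drift = {}
    for name, props in recorded.items():
        actual = live.get(name)
        if actual is None:
            # The resource was declared but no longer exists in the cloud.
            drift[name] = {"<resource>": ("present", "missing")}
            continue
        changed = {
            key: (value, actual.get(key))
            for key, value in props.items()
            if actual.get(key) != value
        }
        if changed:
            drift[name] = changed
    return drift


# What we declared and deployed (hypothetical values).
recorded = {"web-sg": {"port": 80}, "web-server": {"size": "t2.micro"}}
# What exists now, after someone edited the security group by hand.
live = {"web-sg": {"port": 22}, "web-server": {"size": "t2.micro"}}

report = detect_drift(recorded, live)
print(report)
```

With point-and-click or scripts there is no `recorded` side to compare against, so this kind of check has nothing to work from.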
Let's say we want to spin up a virtual machine in Amazon and then secure access to it, so only allow access on port 80, HTTP access. The left side is a super simple script, if we just wanted to write something using the Amazon CLI, it would work. You notice we have to take the ID that's returned from one command and pass it to another command and, again, these are only three individual commands so it's not terribly complicated, it's a reasonable thing to do.
As your team grows, as more people contribute, as you have 100 resources instead of just 3, you can imagine how this starts to break down, you end up with lots and lots of Bash scripts. To my previous point, if something fails, it's not always evident how to resume where you left off. On the right side, this is Terraform and this is effectively accomplishing the same thing, but you'll notice it's declarative. Instead of passing the region for every command, we're saying, "For AWS, let's use the us-east-1 region." For the web server, here's the image name, I want a t2.micro which is super cheap and small. I'm going to use this script from here.
Notice here that we're starting to get a little bit into programming languages, but it's not really a full programming language because we're basically saying, "Declare that we're reading in this file." You can read it, it's reasonable. Then, you take this and you give it to Terraform and Terraform chugs away and provisions all the resources for you.
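One consequence of declaring resources instead of scripting them is that "take the ID from one command and pass it to the next" becomes a reference the engine can follow for you. The sketch below assumes the declarations have been reduced to a hypothetical mapping from each resource to the resources it refers to, and uses Python's standard `graphlib` (Python 3.9+) to derive a safe creation order. The resource names are illustrative, not taken from the slide.

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources it references (hypothetical names).
declared = {
    "security-group": [],
    "ingress-rule": ["security-group"],  # needs the security group's ID
    "web-server": ["security-group"],    # launched into the security group
}


def creation_order(resources):
    """Order resources so everything a resource references is created first."""
    return list(TopologicalSorter(resources).static_order())


order = creation_order(declared)
print(order)
```

The security group always comes first; the relative order of the other two is not significant because neither depends on the other.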
How It Works
I'll mention this because effectively all the tools I mentioned for declarative infrastructure as code work the same way. In fact, it's the same way the Kubernetes controllers work in terms of applying configuration also. The idea is you take a program written in some code, whether it's YAML, whether it's JSON, whether it's Terraform HCL, or we'll see later what Pulumi brings to the table is the ability to use any programming language.
You feed that code to an infrastructure as code engine and that's going to do everything I mentioned earlier, so it's going to say, "What do we think the current state of the cloud is and what is the desired state that the code is telling us it wants?" Maybe the state is empty because we've never deployed it before, and so everything the code wants is going to be a new resource. Maybe it's not the first time, so we're just adding one server, so it's going to say, "All the other resources are unchanged and we're just going to provision a new server."
Then, it comes up with a plan and the plan is effectively a sequence of operations that are going to be carried out to make your desired state happen, "We're going to create a server, we're going to configure the server," all the things that have to go on behind the scenes to make that a reality. If you think of the way you're allocating operating system objects, just to tie it back to some of the earlier points, you generally don't think about the fact that there's a kernel handle table that needs to be updated by the kernel when you allocate a new file handle. You're usually just in Node.js saying, "Give me a file," and then there's something magic that makes that happen.
Here, it's a similar idea, it's just stretched over a longer period of time. In a traditional operating system model, you run a program, it runs and then it exits. In this model, you run a program and it never really exits, it's declaring the state. Some fun things, if you ever look at some projects - Brendan Burns has this Metaparticle project which takes this idea to the extreme, which is, "What if we thought of infrastructure as code programs as actually just programs that use the cloud to execute and just ran for a very long time?" It's a thought-provoking exercise.
Because we're declaring this, and this gets to the scripting versus infrastructure as code point, we have a plan, so actually, we can see what's going to happen before it actually happens and that gives us confidence and allows us to review things. If we're in a team setting, we can put that in a pull request and actually say, "That looks like it's going to result in downtime," or, "That's going to delete the database, was that what you meant to do?" You actually have a plan that is quite useful. Once we've got a plan, then usually the developer says, "Yep, that's what I want, let's carry it out." Then, from that point, the infrastructure as code engine's driving a bunch of updates to create, read, update, and delete the resources. That's how it works behind the scenes. In general, you don't actually have to know that's exactly what's happening but that's basically how CloudFormation works, that's how Terraform works, that's how Pulumi works. All these systems effectively work the same way.
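The diff-then-plan step can be sketched in a few lines. This is a simplified model under stated assumptions, not how any particular engine is implemented: state is a plain dictionary of resource name to properties, and `make_plan` is a hypothetical helper that emits one operation per resource.

```python
def make_plan(current, desired):
    """Diff current state against desired state; return a list of (operation, name) steps."""
    plan = []
    for name, props in desired.items():
        if name not in current:
            plan.append(("create", name))
        elif current[name] != props:
            plan.append(("update", name))
        else:
            plan.append(("same", name))
    # Anything that exists but is no longer declared gets removed.
    for name in current:
        if name not in desired:
            plan.append(("delete", name))
    return plan


current = {"server-1": {"size": "small"}, "db": {"engine": "mysql"}}
desired = {
    "server-1": {"size": "small"},
    "server-2": {"size": "small"},
    "db": {"engine": "mysql"},
}

first_deploy = make_plan({}, desired)     # empty state: everything is a create
add_server = make_plan(current, desired)  # later: only the new server changes
print(first_deploy)
print(add_server)
```

Because the plan is just data, it can be printed, reviewed, or attached to a pull request before anything is carried out.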
Days 1, 2, and Beyond
This works for day one, but another area where comparing it to point and click and scripting where it's important is actually being able to diff the states and carry out the minimal set of edits necessary to deploy your changes. This takes many forms. Deploying applications, constantly deploying changes like new Docker SHA hashes that you want to go and roll out in your cluster, maybe from Kubernetes or whatever container orchestrator you've chosen. Upgrading to new versions of things, like new versions of MySQL, new versions of Kubernetes itself. That's another evolution that you’re probably going to do.
Adding new microservices or scaling out existing services. Leveraging new data services, maybe you want to use the latest and greatest machine learning service that Google is shipping. Anytime you want to leverage something like that, you actually have to provision a resource in the cloud. You have to provision some machine learning data lake or whatever the resource is. Then, also evolving to adopt best practices. Sometimes, everything you've done, at some point, you might realize it was all wrong and you have to go and change things. For example, maybe you set up your entire Kubernetes server and cluster and everything and then you realize, "We forgot to properly authenticate everything," and now it's open to the Internet and anybody can come play with your Kubernetes server. It's probably not a great thing but it happens to a lot of people. As you evolve those sorts of things and fix them, you need something that can incrementally apply deltas.
I mentioned earlier, scaling up a new environment. If you're standing up another production environment, one thing that we find a lot nowadays is every developer on the team has an entire copy of the cloud infrastructure for their service and that makes it really easy to go try out new changes, test changes. We even have people that spin up entirely ephemeral environments in pull requests. Maybe you're going to make a change to your application, you submit a pull request, it spins up an entire copy, runs a battery of tests, then scales it all down as though it never existed and in doing that, you get a lot more confidence than if you're trying to simulate those environments on your desktop.
Of course, sometimes you want to do that, running Docker on your desktop is a great way to just really spot check things but at some point, before you do your actual integration, you're going to want to run these more serious tests.
Just One Problem
In any case, the one problem with all of this is infrastructure as code is often not actually code, it's like the big lie here. It's very disappointing once you realize this and what it means is, a lot of times what we're calling code is actually YAML or often YAML with Go templates embedded within it, or you look at CloudFormation, CloudFormation is like YAML or JSON with a mini DSL embedded within the strings, so that you can reference variables inside the strings. The more complex our cloud infrastructure gets, the more this just starts to fundamentally break down.
The thing that was so disappointing for me when I came to this space is, I had heard of all these technologies, but I had never actually used them for months at a time in anger. That was the first thing I did when I started Pulumi, it was, "Ok, for the next three months, I'm just going to build modern cloud applications using best of breed technologies," and very quickly I was just swimming in YAML. It's not just YAML, I think YAML is fine, YAML is good at what it was designed for. I think the problem is we're using it for what it wasn't designed for, we're actually trying to embed application architecture inside of YAML and for that, I'll take a programming language any day. You get abstractions and reuse, functions and classes, you get expressive constructs like loops, list comprehensions, tooling, IDEs, refactoring, linting, static analysis and most of all, productivity - all the things I spent my last 15 years basically working on.
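A small example of what loops and comprehensions buy you over copy-and-paste: the sketch below declares one deployment per entry in a table of services. The declarations are plain dictionaries rather than any real provider's resource types, and the service names, ports, and `registry.example.com` host are all made up for illustration.

```python
# Table of services to run (hypothetical names and ports).
services = {"orders": 8080, "billing": 8081, "search": 8082}

# One declaration per service, generated instead of copied and pasted.
deployments = [
    {
        "name": name,
        "image": f"registry.example.com/{name}:latest",  # placeholder registry host
        "port": port,
        "replicas": 3 if name == "search" else 2,        # ordinary conditionals work too
    }
    for name, port in services.items()
]

for deployment in deployments:
    print(deployment)
```

Adding a fourth service is a one-line change to the table, where the equivalent hand-written configuration would need another copied block.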
It would be great if we could actually use those things and I think when we work with organizations to adopt newer technologies, there's still a divide between developers and infrastructure engineers but at the same time, there's this desire to break down those walls and work better together. I think it's not to say every developer is going to become an expert in networking, certainly I am not, but we should have the tools and techniques to collaborate across those boundaries so that if I want to go stand up a serverless function, I should be able to do that.
So long as my infrastructure team trusts that I'm doing it within the right guidelines and if I want to go create a new API gateway-based application instead of doing Express.js which is going to be more difficult to scale and secure, I should be empowered to do that. We thought that that was the way to go. This is an example from a Helm chart where the thing that's craziest about all of this is you actually have to put indentation clauses in here because that's YAML. This is nested so you need to indent eight spaces to the right, otherwise, it's not well-formed YAML. You've got this mixing of crazy concerns in here which nobody loves.
Remember this slide where I said, "The cloud should not be an afterthought?" It really is still an afterthought for a lot of us. For applications, we're writing in Go, JavaScript, Rust, TypeScript, you name it, but for infrastructure, we're using Bash, YAML, JSON. HCL is the best technology out there that's not a full-blown programming language, it gets close but it's still different. Then for application infrastructure in the Kubernetes realm, everything is YAML to the extreme. Then there are these little Bash things because often, we're using different tools, we're using different languages for applications, multiple bits of infrastructure so, oftentimes, we have to do infrastructure in AWS but then we're doing Kubernetes infrastructure and we're using different tools, we're using different languages, slapping them together with Bash. Heptio has this phrase, "Walls of YAML and mountains of Bash." I think that is the state of affairs today.
Why Infrastructure as Code?
That leads me to the next section, which is, "What if we use real languages for infrastructure also?" Would that help us blur this boundary and actually make the cloud more programmable and arm developers to be more self-sufficient, while also arming infrastructure engineers to empower their teams and themselves to be more productive and really stand on the shoulders of giants? I love that phrase because we've been doing languages stuff for over half a century. Wouldn't it be great to benefit from all the evolution that we've seen there?
That's what led us to the Pulumi idea, which was let's use real languages, we'll get everything we know about languages. I mentioned a lot of these earlier, we'll eliminate a lot of the copy and paste and this jamming of templating languages inside of YAML. At the same time, it has to be done delicately because it needs to be married with this idea of infrastructure as code. I think when I came to the space, my initial idea was, "We should write code, why are we writing all this YAML?" I quickly came to appreciate why infrastructure as code is just so essential to the way that we actually manage and evolve these environments.
I think one thing Terraform did really well was having one approach across multiple cloud providers. We talk to a lot of people that are trying to do AWS and Azure and on-prem and Kubernetes, and having to do that in different tools and languages clearly does not scale.
Demo
With that, I'll jump into a quick code demo and we'll see what some of this looks like. First, this is a pretty basic Getting Started example here. This is a very simple Pulumi program. What it's going to do is create an S3 bucket and then populate that bucket with a bunch of files on disk. Let me see if I can pull this back up.
If I go over here, I'll see that it's basically a simple Node.js project and, of course, it could be Python, it could be Go, and the first thing we'll see is we're actually just using Node.js packages. In a lot of these other systems, the extensibility model is just not clear. Actually, most of them are walled gardens because they're specific to the cloud provider but here, we're actually just using Node packages. Notice we've got Pulumi packages, so it's AWS, there's one for Azure and Kubernetes. We're also using just normal Node packages. We're using typical libraries that we know. We could be using our own private JFrog registry if we're an enterprise that has JFrog, for example.
Another thing that is so easy to take for granted is, I get documentation like examples. I'm benefiting from all the things that we know about IDEs, if I mistype something, it's going to say, "Did you mean 'bucket'? You seem to have mistyped that." Oftentimes in these YAML dialects, you don't find that out until it's too late. You've tried to apply it, now you're maybe partway through a production deployment and you're finding an error. Actually, you are bringing a lot of this insight into the inner loop.
We're basically saying, "Give us an S3 bucket, we're configuring it to serve a static page." Then, here is where it gets a little bit interesting where we're actually just using a Node API, so fs.readdir, and we're basically saying for every file, create an S3 object and upload that to the cloud. Then, we set the policy to enable public access and then we print out the URL. Usually, if you were to do this in CloudFormation, you're talking 500 lines of YAML or something like that. What we're going to do here is we'll just run pulumi up. Remember that diagram I showed earlier? This is doing what that diagram said, where it ran the program and said, "If you were to deploy this, you're going to create these four objects."
If I want to, I can get a lot of details about it but pretty standard stuff. If I say yes, now it's actually going out to Amazon and it's deploying my S3 bucket. It uploaded these two objects, notice here it's saying it's uploading from the www directory. It's actually uploading an index file and a favicon, but it could be any arbitrary things. This is an extremely inexpensive way to host any static content on S3 but notice at the end, it printed out the URL. If I want a custom URL, all the cloud providers are just amazing with all the building blocks they provide, so if I want to go and add a nice vanity URL like duffyscontent.com or something, that's just another 10 lines of code to go configure the DNS for this. You can start simple and get complex when you need to. Here, notice that it printed the URL so I'm actually just going to open the URL and then we'll see that it deployed.
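The "for every file, create an object" step is ordinary directory iteration. The sketch below is a Python analogue of that idea rather than the demo's Node.js program: `bucket_objects` is a hypothetical helper that turns each file in a site directory into a plain-dictionary object declaration, guessing the content type with the standard `mimetypes` module. A temporary directory with two placeholder files stands in for the www directory.

```python
import mimetypes
import os
import tempfile


def bucket_objects(site_dir):
    """Build one object declaration per file in site_dir (hypothetical helper)."""
    objects = []
    for filename in sorted(os.listdir(site_dir)):
        path = os.path.join(site_dir, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories in this simple sketch
        content_type, _ = mimetypes.guess_type(filename)
        objects.append({
            "key": filename,
            "source": path,
            "content_type": content_type or "application/octet-stream",
        })
    return objects


# Stand-in for the www directory: two placeholder files in a temp dir.
with tempfile.TemporaryDirectory() as site_dir:
    for name in ("index.html", "favicon.png"):
        with open(os.path.join(site_dir, name), "w") as handle:
            handle.write("placeholder")
    objects = bucket_objects(site_dir)

keys = [obj["key"] for obj in objects]
print(keys)
```

In a real program each declaration would become a bucket object resource; the point here is only that the file system API and the resource declarations live in the same language.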
Notice here, interesting thing, I didn't actually update the infrastructure. I updated some of the files but it's going to detect that, it's going to say, "This object changed, do you want to update it?" Yes, I want to update it. The flow is the same if I'm adding or removing resources. Now I can go and open this up and we'll see, "New content." That's a pretty simple demo.
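One plausible way an engine can notice that a file changed even though the infrastructure declaration did not is to fingerprint file contents and compare fingerprints between runs. The sketch below assumes that approach; `fingerprint` and `changed_objects` are hypothetical helpers, and the byte strings stand in for files on disk.

```python
import hashlib


def fingerprint(files):
    """Map each file name to a SHA-256 digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def changed_objects(previous, current):
    """Names whose content digest is new or differs from the previous deployment."""
    return sorted(name for name, digest in current.items() if previous.get(name) != digest)


# Contents at the last deployment versus now (placeholder bytes).
before = fingerprint({"index.html": b"<h1>Hello</h1>", "favicon.png": b"\x89PNG"})
after = fingerprint({"index.html": b"<h1>New content</h1>", "favicon.png": b"\x89PNG"})

to_update = changed_objects(before, after)
print(to_update)
```

Only the edited file shows up as an update; the untouched one is left alone, which matches the minimal-edits behavior described earlier in the talk.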
If I want to get rid of everything and just clean up everything that I just did so Amazon doesn't bill me for the minuscule amount of storage there, probably one cent a year or something like that, I can go ahead and delete it and it's done. I should mention also, I talked earlier about standing up lots of different environments. We have this notion of a stack and so if you wanted to create your own private dev instance, you can just go create a new stack and then start deploying into it, so pretty straightforward.
This is a more interesting example, where this is building and deploying a Docker container to Amazon Fargate, which is their hosted clustering service. It's not Kubernetes but it's pretty close. What we did here was we took the Docker getting started guide, so if we go to Docker and download Docker, it walks you through how to stand up a load balancer service using Docker Compose, this is the moral equivalent but using Fargate. It’s got an elastic load balancer on it, listening on port 80. It's actually going to build our Docker image automatically. If I look here, we've got just a normal Dockerfile. What this is going to do is, it's going to actually do everything. It’s going to stand up the ECS resources, it's going to build and publish the Docker container to a private registry within Docker itself. It's also creating a virtual private cloud, so it's secure.
Notice this is doing a fair bit more than our previous example and I'm going to say, "Yes." This is the power of abstraction and reuse, we can actually have components that hide a lot of the messy details that frankly, none of us want to be experts in. If you were to write this out by hand - look, here this is almost 30 lines of code. Each one of these resources, if you're doing CloudFormation or some of the other tech that I mentioned earlier, you'd have to manually configure every single one in excruciating detail down to the property level whereas here, we can benefit from reuse, which is what you know and love about application programming, and know that in doing so, we're benefiting from well-known best practices.
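The stack idea amounts to running the same program against different, independently tracked configurations. The sketch below models that with plain dictionaries; `site_program`, the stack names, and the config keys are hypothetical, and a real stack would also carry its own recorded state.

```python
def site_program(stack, config):
    """The same declarations for every environment; only the configuration differs."""
    return {
        f"{stack}-bucket": {"region": config["region"], "public": config["public"]},
    }


# One configuration per stack (hypothetical values).
stacks = {
    "dev": {"region": "us-west-2", "public": False},
    "production": {"region": "us-east-1", "public": True},
}

# Each stack gets its own independent set of resources from the one program.
states = {name: site_program(name, config) for name, config in stacks.items()}
print(states)
```

Creating a private dev instance is then just another entry in `stacks`, and tearing it down does not touch production.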
This may not be obvious, that's actually the Docker build and push. Depending on the bandwidth here, the push may take a little bit longer, so I'm going to jump back to the slides and then we'll come back in just a minute.
I mentioned multiple languages, so this is basically accomplishing the same thing infrastructure as code gives you, just with any language. We have others on the roadmap like .NET and actually, there's somebody who's working on Rust right now which will be super cool, and somebody who did F# as a prototype, which is actually a really cool fit for infrastructure as code because it feels more declarative and pipeline-y. More languages are on the way but right now, these are some great options.
I mentioned lots of different cloud providers, so every single cloud provider is available as a package. This is a stretched picture of turtles because it's turtles all the way down. That's the magic of languages: you can build abstractions. Then, Kubernetes - I hinted at Kubernetes along the way, but what we see is to actually do Kubernetes, you have to think about security so you have to think about IAM roles, you have to think about the network perimeter, you have to set up virtual private clouds. You think about, "If I'm in Amazon, I want to use CloudWatch, I don't want to do this Prometheus thing maybe."
You really have to think holistically about how you're integrating with the infrastructure in whatever cloud provider you're going to. It's not as simple as, "Give me an EKS cluster or GKE or AKS," all those things are really important. Also, if you're starting with Kubernetes, did you do that because you just wanted to deploy containers and scale up container-based workloads? Probably. You probably didn't do it because you wanted to manage your own Mongo database with persistent volumes and think about the backups and all that, or MySQL, when you're coming from a world of hosted databases. Most customers we talk with are actually using RDS databases, S3 buckets, Cosmos DB if you're on Azure, so not managing the data services inside of the cluster itself but actually leveraging the cloud services that are hosted and managed.
What this shows is, in this example, the one on the left is basically provisioning a new GKE cluster, like an entire cluster, and then deploying a canary deployment in Kubernetes to that cluster. Then it exports the config so you can go and access your new cluster, all in one program that fits on a page. The one on the right is showing the latter thing I was just talking about, where Azure Cosmos DB has a MongoDB compatibility mode, so rather than host your own MongoDB instance, you can spin up some containers and just leverage that. Here, we're basically creating a new Cosmos DB instance, setting up some geo-replication capabilities of it, then here's where it starts to get interesting, we're actually taking the authentication information for that instance and putting it in a Kubernetes secret in this program.
Usually, you'd have to take that, export it, store it somewhere secure, fetch it later on, whereas here, we can actually just take it, put it in a secret, and then consume it from a Helm chart. Now the Helm chart is going to spin up some application that accesses that database.
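To make that hand-off concrete, here is a rough sketch of the shape of the data involved, written as plain Python dictionaries rather than with any real SDK. A Kubernetes Secret stores its `data` values base64-encoded; everything else here is an assumption for illustration: the `make_secret` and `chart_values` helpers, the secret name, the placeholder connection string, and the `existingSecret` values key, which is hypothetical and not taken from any particular chart.

```python
import base64


def make_secret(name: str, values: dict) -> dict:
    """Build a Kubernetes Secret manifest; `data` values are base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


def chart_values(secret: dict) -> dict:
    """Values a chart could use to reference the secret by name (hypothetical key)."""
    return {"mongodb": {"existingSecret": secret["metadata"]["name"]}}


# Placeholder standing in for the authentication info the database would return.
connection_string = "mongodb://user:password@example.invalid:10255/?ssl=true"

secret = make_secret("mongo-credentials", {"connectionString": connection_string})
values = chart_values(secret)
print(values)
```

The credential never leaves the program: it flows from the database declaration into the secret, and the chart only ever sees the secret's name.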
If I go back, my application here should be done, this is the Fargate example. If I do this, it's saying, "Hello, world," so it's just the application up and running which is, again, just our application here. I should mention, just like I showed updating the file for the S3 bucket, if I want to deploy a new version of the container, I just go and change the Dockerfile, run pulumi up, it'll deploy the diff and now I've got the new image up and running.
This is the Kubernetes example, which is effectively just a very basic NGINX deployment object followed by a service, so it's a standard Kubernetes thing. If I go and deploy it, it's going to look very much like what I had shown before, with one caveat. This is neat: it's actually going to show me the status of the deployment as it's happening. I don't know if people have ever experienced deploying something with kubectl in Kubernetes. You often find yourself tailing logs and trying to figure out, "Why isn't this thing coming up? It turns out I mistyped the container image name," or any number of other things. Whereas here we're actually saying, "Why are we sitting here waiting? Well, it turns out GCP is now allocating the IP address because we used a load-balanced service," and so we can see those rich status updates.
Even cooler than that is if we go back here, we can actually just delete all that code and replace it with something like this. Again, the magic of abstraction here. This is, "I encountered this pattern hundreds of times in my applications, I'm sick of typing it out manually by hand, so I'm going to create a class, an abstraction for it." It turns out services and deployments often go hand in hand and they have a lot of duplicative information in each of them, so why not create a class to encapsulate all that and give it a nice little interface to say, "An image, replicas, ports." That's just showing you a little bit of a taste of what having a language gives you: instead of copying and pasting the YAML and then making slight edits everywhere, you can use abstraction.
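A sketch of what such a class might look like, using plain Python dictionaries in place of a real SDK. The class name `ServiceDeployment` and its fields are illustrative assumptions rather than the code shown on screen; the manifests follow the standard Kubernetes Deployment and Service shapes.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDeployment:
    """Hypothetical abstraction pairing a Deployment with its Service."""

    name: str
    image: str
    replicas: int = 1
    ports: list = field(default_factory=list)

    def labels(self) -> dict:
        # One source of truth for the labels both objects must agree on.
        return {"app": self.name}

    def deployment(self) -> dict:
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": self.name},
            "spec": {
                "replicas": self.replicas,
                "selector": {"matchLabels": self.labels()},
                "template": {
                    "metadata": {"labels": self.labels()},
                    "spec": {
                        "containers": [{
                            "name": self.name,
                            "image": self.image,
                            "ports": [{"containerPort": p} for p in self.ports],
                        }]
                    },
                },
            },
        }

    def service(self) -> dict:
        return {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": self.name},
            "spec": {
                "type": "LoadBalancer",
                "selector": self.labels(),
                "ports": [{"port": p, "targetPort": p} for p in self.ports],
            },
        }


# The whole NGINX example collapses to the three inputs that actually vary.
nginx = ServiceDeployment(name="nginx", image="nginx", replicas=2, ports=[80])
manifests = [nginx.deployment(), nginx.service()]
```

The duplicated information (names, labels, ports) is written once, so the selector on the service cannot drift from the labels on the pods.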
Obviously, it's easy to go overboard with abstraction but what we find is there's a level of abstraction that makes sense. Some people use abstraction to make multi-cloud easier, some people just do it to make some common task in whatever cloud you're going to itself much easier.
Automated Delivery
I should mention in terms of that maturity lifecycle of going from scripting to infrastructure as code, usually, the next step is continuous delivery, so using something like GitOps where you want to actually trigger deployments anytime you check in a change. That allows you to use pull requests and all of the typical code review practices you would use for your applications, you can now use it for infrastructure and it's just code so it's similar, you can apply the same style guide, you can apply the same pre-commit checks that you would use for your application code.
That includes testing - it's actually one of the things I did not envision being as popular as it has become. A lot of people are adopting this approach because now they can actually test their infrastructure. It seems so obvious in hindsight but until people started doing it, it wasn't obvious. For example, "What if you want to make sure that your servers are tagged appropriately or aren't open to the internet by accident?" The possibilities are endless. Now, you can use your favorite test framework, so in this case, I'm just using Mocha. If you're in Go, you can use the built-in Go test framework or, in Python, the built-in Python framework.
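As a sketch of those two checks in Python's built-in `unittest` framework: the resources below are hypothetical plain dictionaries standing in for whatever a real program declares, and the helper names `missing_tags` and `open_to_internet` are made up for this example.

```python
import unittest

# Hypothetical resource descriptions, standing in for what a program declares.
server = {
    "name": "web-server",
    "tags": {"Name": "web-server", "Owner": "platform-team"},
}
firewall = {
    "name": "web-firewall",
    "ingress": [
        {"from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
        {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/16"]},
    ],
}


def missing_tags(resource, required):
    """Return the required tag keys that the resource does not define."""
    return set(required) - set(resource.get("tags", {}))


def open_to_internet(resource, port):
    """True if any ingress rule exposes the port to every IPv4 address."""
    return any(
        rule["from_port"] <= port <= rule["to_port"]
        and "0.0.0.0/0" in rule["cidr_blocks"]
        for rule in resource.get("ingress", [])
    )


class InfrastructureTests(unittest.TestCase):
    def test_server_is_tagged(self):
        self.assertEqual(missing_tags(server, {"Name", "Owner"}), set())

    def test_ssh_is_not_public(self):
        self.assertFalse(open_to_internet(firewall, 22))


# Run the suite explicitly so the result can be inspected afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InfrastructureTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because these are ordinary tests, they can run in the same pre-commit and pull request checks as the application's own test suite.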
Doing this with YAML is possible but, definitely, you have to learn a set of new tools and it's just not the same and, by the way, if these things integrate with your IDEs, you just benefit from that as well.
Cloud Engineering – Test Your Infrastructure
This level of testing spans lots of different domains. There's unit testing like the kind I just mentioned, integration testing, so actually running tests after a deployment happens to make sure that the application is functioning. There is even more exotic testing like injecting faults, fuzz testing, scale testing. The idea here is, "Let's apply all the engineering best practices we know and love to all aspects of how we're doing cloud engineering," and, of course, testing has to be one of those.
The Future of Cloud Architectures
That’s it in a nutshell. I'll say if I look to the future, some of the things I'm really excited about, I think Kubernetes is a platform-level abstraction that definitely seems to be the right level of abstraction that we can start building on top of. That allows us to basically abstract away a lot of the underlying details of how compute runs in all the different cloud providers, there seems to be almost universal agreement. Too early to call it but we'll see in three years, it seems like a safe bet at this point.
Serverless is really exciting because you really do focus a lot more on application code. The thing I'll say is, after having been in the trenches for a while on this, I think the key thing with serverless is it works in certain places and serverless means different things to different people. I think if you look at AWS Lambda or Azure Functions or Google Cloud Functions, for event-driven workloads, it's wonderful. You don't need to spend money on compute that's just sitting there idle and you only pay for what you use. It's incredibly powerful, and things like Amazon API Gateway just really bring that into the application domain for doing things like RESTful API servers. Definitely worth checking out if you haven't already, but I would caution that not everything is going to be serverless. I think stateful workloads have a place on this earth, they're not going anywhere anytime soon.
I think there's a lot of folks focusing on no-code solutions. I think we'll see a lot more of those and the thing I'm excited about is as we raise the level of abstraction, infrastructure does fade into the distance. The reason why it's so exciting to me is because it's the thing that powers everything on top, and so you're actually able to program at the level of the infrastructure and build new architectures. I mentioned supercharging; I really do think it supercharges what you can do with the amazing cloud resources that are around us. We're trying to enable this virtuous cycle of building new things and I think the builders who are figuring that aspect out are really going to enable all of us to focus on higher and higher level application-level patterns.
In summary, it's the power of real programming languages with the safety and robustness of infrastructure as code. For the next wave of cloud innovation, I think you're going to see this happening a lot more. We already see other folks - Amazon has their project that they're gearing up to make a bit more noise about later this year called the CDK. We've been working with them and talking with them, it shares a lot of heritage there. Some of the Helm work that's going on for Helm 3 is going to incorporate more general purpose programming languages. HashiCorp just launched Terraform 0.12 almost a month ago, and that incorporates for loops and some basic programming language features. My guess is they'll constantly be moving that in the direction of general purpose languages as well, so if you're into infrastructure, this is where it's at. It's a lot more fun, which is probably the most important part.
Questions and Answers
Participant 1: You showed policy enforcement through the unit tests, but if we think about the code, wouldn't static code analysis be a better place for policy enforcement? For security issues, you need to write unit tests to enforce it, but it would be nice if a compiler or something else could tell you, "By the way, you're provisioning the infrastructure and you have some issues that you might want to fix," and show it as a warning when you compile it.
Duffy: Yes, I 100% agree. We're actually working on a new policy as code framework that we're going to launch in the next couple of months, which will basically bring that into the IDE so you get red squiggles, you get little light bulbs for auto fixing. There are some things you can't do statically, unfortunately, they require a little bit of dynamic analysis, but for those that you can, I completely agree, it's better than having to write tests for them.
Participant 2: I think one of those ideas is really cool and I really think, "Why didn't I come up with it?" because it's really simple but it makes so much sense. I have one question to the demo that you showed, I think the answer will be yes, but do you have a way of defining dependencies? For example, there was no Async/Await because it showed up that it's tested right, but I think the first two jobs were running in parallel and the rest of the three were waiting for the first two - at least it looked like that - at the finish. Do you have a way to actually Async/Await or something else to say, "I need this thing to finish," and then I can, for example, pipe the URL into the job?
Duffy: That was probably one of the more difficult parts of building the system. At the heart of it, all resources have properties on them, and the properties are like fancy promises, except the promises carry dependency information along with them. If you're programming it, the system automatically tracks all the dependencies so that it can parallelize things and then destroy things in the right order. In the event that it gets it wrong or you have like an external dependency that's not obvious in the program, there's a way to just basically say, "Depends on," and then you can pass an array of other resources that things depend on. You can control the DAG if you need to but in general, it tends to get it right by default.
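The idea of properties that carry dependency information can be sketched in a few lines. This is a toy model under stated assumptions, not the system's actual implementation: the `Output` and `Resource` classes, the `depends_on` parameter name, and the resource names are all hypothetical, and Python's standard `graphlib.TopologicalSorter` stands in for whatever scheduler the real engine uses.

```python
from graphlib import TopologicalSorter


class Output:
    """A property value that remembers which resource produced it."""

    def __init__(self, value, source):
        self.value = value
        self.source = source


class Resource:
    """Hypothetical resource that records dependencies from its inputs."""

    def __init__(self, name, props=None, depends_on=()):
        self.name = name
        # Explicit dependencies, for cases the program's data flow does not reveal.
        self.deps = {r.name for r in depends_on}
        self.props = {}
        for key, val in (props or {}).items():
            if isinstance(val, Output):
                # Implicit dependency: consuming an output links the two resources.
                self.deps.add(val.source.name)
                self.props[key] = val.value
            else:
                self.props[key] = val

    def output(self, key):
        return Output(f"<{self.name}.{key}>", self)


def create_order(resources):
    """Return names so that every resource comes after its dependencies."""
    graph = {r.name: r.deps for r in resources}
    return list(TopologicalSorter(graph).static_order())


bucket = Resource("bucket")
database = Resource("database")
# Piping bucket's URL into site is what creates the bucket -> site edge.
site = Resource("site", props={"origin": bucket.output("url")})
# job uses site's URL (implicit) and also names database explicitly.
job = Resource("job", props={"target": site.output("url")}, depends_on=[database])

order = create_order([job, site, database, bucket])
destroy_order = list(reversed(order))
print(order)
print(destroy_order)
```

Resources with no edge between them, here `bucket` and `database`, are free to be created in parallel, and reversing the creation order gives a safe order for destroying things.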
Participant 3: I have two questions, the first one is, does it support C# right now?
Duffy: I was showing TypeScript and there's no C# yet, but that's on the roadmap. Right now it's TypeScript, JavaScript, Go, and Python.
Participant 3: Also, does it support transactions, meaning that if a middle step failed, would it automatically roll back and clean up the resources in the cloud?
Duffy: We don't automatically roll back, but you can roll back explicitly. The reason we wanted to leave that to the end user to decide is that sometimes rolling back is not the right thing to do, and so we decided, "Making that decision on behalf of the end user could get us into trouble," but you can easily roll back in the face of failure.
Participant 4: Can you describe how this would be applied for on-prem infrastructure?
Duffy: We have a lot of different providers: we have a VMware vSphere provider, OpenStack, F5 BIG-IP if you're going to manage firewall rules using code. All of it works for on-prem, and we have people using it for Kubernetes on-prem in addition to just VM orchestration. There are a lot of different on-prem technologies, so we definitely don't support everything. We have a Terraform adapter layer, so if there's ever a Terraform provider that exists already, we can easily adapt it into the system and project it into all the languages. That's like an hour of work, so it's pretty straightforward.