
AWS Cloud Development Kit (CDK)



Richard Boyd looks at how users can create infrastructure with CDK and some best practices for creating reusable components.


Richard Boyd is a software engineer by trade in a developer advocate role. He works on AWS’s Developer Tools. Previously, he was a software engineer on Alexa, Amazon Robotics, and iRobot. Prior to working in the tech industry, he spent about a decade in the defense industry working on statistical models and path-planning software.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Boyd: My name's Richard Boyd. I'm a software engineer by trade; I've been doing software engineering for about eight years in the private sector, and about 10 years in the defense industry before that. I've just recently moved into a developer advocate role, so I still do some software engineering, but more of my time is spent meeting with actual developers, hearing what their pain points are, and taking that back to the service teams so that we can improve the product.

The next hour we're going to cover a tour of the CDK, which is a new software development framework for defining your cloud Infrastructure as Code. We're going to have roughly four parts today. I will start by giving you a little bit of context on this Infrastructure as Code journey, and then we'll talk about some of the challenges developers face and some of the best practices that have emerged from those challenges. Then we'll quickly give an overview of the main core concepts of CDK so that we're all using the same taxonomy when we're talking about constructs or stacks or apps. Then finally, I'll do a demo: I'll start with a small demo app, and then we'll slowly add a little bit of complexity to it based on how much time we have, just demonstrating some of the concepts as we go. Then we'll wrap up with some recommendations on where you can go to learn more about the CDK. Let's hop right into it.

Our Infrastructure Management Journey

Before we get into the actual journey itself, I want to take a small detour and talk about an outage I caused a few years ago. I used to work on Amazon's Alexa project. Everyone's heard of Alexa, I assume. A few years ago, I couldn't assume that, so it's much better now. Specifically, I was on the personalized model building systems. We built these personalized models for each user just to make Alexa more accurate for that particular user. Amazon's a customer of AWS just like any other customer, just much larger. We had a fleet of EC2 instances that had to share some state across all of these hosts, and it was very important that the state was consistent across all of the hosts. We had this two-stage deployment process where we were very careful and methodical about how we updated that, because we just stored that state on the actual hosts themselves. If you've got 70 or 80 hosts and some text file, essentially, that you want to keep consistent, you have to be careful about how you make changes. I thought I was going to be very clever and move the shared state off the hosts and put it into a DynamoDB table, so just one shared state; that's the purpose of a database.

Since we only had this one table, we were doing a test one day and like any good test fixture, it tore down all of the infrastructure when it was done, including our production table. Nobody could enable any new skills for a couple hours during this outage. As Amazon's fairly well known for, we do these retrospectives afterwards to figure out what caused this outage. One of the things that came out of this is that we needed not just separate tables for development and testing and prod, but we needed separate AWS accounts, whole new accounts just for prod, dev, test and also for each of the regions that we were operating in: North America, South America, EU, Asia Pacific. This ended up with about 16 individual AWS accounts.

Has anybody ever tried to create infrastructure in 16 different accounts by using the console all at once? It wasn't fun. Just making the accounts, I was, "There has to be a better way to do this." Also, I thought there was a better way to do this, but after I made all these accounts, I was, "It's done now. We don't have to make any changes. Why are we going to invest in automating it if we've already done it this one time?" We just stayed that way. We had this runbook that was really big that said, "Here's how you add a new table, if you want to spin up a local environment or a local account to do your own testing." There were some machine learning scientists who made models that we later consumed as part of this table process. They wanted to be able to create a version of our microservice so that they could test it before they delivered it, just to speed up iteration on delivering those models. If you've ever worked with data scientists before, if you ever tell them, "Just go poke around in this console until it works," it'll never work and they'll be mad at you for a long time.

We also wanted to start leveraging more and more of the services that AWS was offering, not just tables but SQS queues, S3 buckets, all these other services where for each resource you create, you have to create 16 more manually in the console, and that wasn't a very good experience. I realized that we had more story points in [inaudible 00:04:55] allocated to updating this runbook (or as I called it, the runcyclopedia) than to the actual features we were trying to launch. The administrative overhead of all of these accounts was swamping us.

I did as any good software engineer would do. I waited until I was on call next and I replaced all the contents of the runbook with a script. I think the one shown here is in Ruby, just as a demo. The one I made was in Python; it used the Boto3 library and it would just go and create all these resources for you. It was a huge improvement. Where we had been spending a whole bunch of time maintaining what was essentially a wiki document, we now just had a few hundred lines of code to maintain.
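The shape of that runbook-replacement script can be sketched in a few lines of plain Python (all names here are hypothetical; the real version made boto3 calls such as `create_table` against live accounts):

```python
# Sketch of the runbook-replacement idea: one loop over every stage/region
# pair instead of a wiki page of manual console steps. All names are
# hypothetical; the real script used boto3 against real AWS accounts.
STAGES = ["dev", "test", "prod"]
REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-northeast-1"]

def provision_all(create_table):
    """Call create_table once per stage/region pair."""
    created = []
    for stage in STAGES:
        for region in REGIONS:
            created.append(create_table(stage, region))
    return created

# Stand-in for the real call, e.g.
# boto3.client("dynamodb", region_name=region).create_table(...)
def fake_create_table(stage, region):
    return f"skills-table-{stage}-{region}"

tables = provision_all(fake_create_table)
print(len(tables))  # 12 in this toy layout; the real setup had 16 accounts
```

The point is exactly the one in the talk: creation is now one loop, but the script still knows nothing about modifying or deleting what it created.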

We traded one problem for another, and this problem is that you only ever create infrastructure once. Then you modify it tens, dozens, maybe a hundred times or so, and then eventually it's deleted, hopefully; it doesn't just linger forever. When I wrote the script, all it knew was how to create the infrastructure. If you wanted to mutate that infrastructure, set a flag on one of the buckets or add some property to a Dynamo table, that wasn't really supported. It became so difficult to update the infrastructure that it was easier to just tear it down and start over again. That's what a lot of people did, until someone accidentally tore down a production table and we realized this wasn't going to work for us.

We moved on to these resource provisioning engines, which are CloudFormation and Terraform. What I really liked about these is that you just give them a desired state. You say, "This is the state I want my infrastructure in," and the provisioning engine will go look at what state your infrastructure is in now, figure out what the difference is, and then move your infrastructure into the desired state with the appropriate API calls. It handles rolling things back if there's any mistake, it handles updates, it handles everything for us, so we can focus on just describing the desired state of our infrastructure. It really feels like magic the first time you use it, because you just say, "Here's what I want," you give it to CloudFormation or Terraform, and what comes back is exactly what you wanted, hopefully.
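The "figure out the difference and move to the desired state" step can be sketched as a toy dict comparison (an illustration only, not CloudFormation's or Terraform's actual planning algorithm):

```python
def plan(current, desired):
    """Toy desired-state diff: return the create/update/delete actions a
    provisioning engine would apply to converge current onto desired."""
    actions = []
    for name, props in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != props:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions

current = {"Table": {"billing": "PROVISIONED"}}
desired = {"Table": {"billing": "PAY_PER_REQUEST"}, "Queue": {"visibility": 300}}
print(plan(current, desired))  # [('update', 'Table'), ('create', 'Queue')]
```

The user only writes `desired`; the engine owns the diffing, ordering, API calls, and rollback.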

What I found with these is that they're very verbose. I describe it as a very evil genie: when you say, "This is what I want," it gives you exactly that. You have to be very clear about "I want this, and I want this property set this way," and you end up with this very large CloudFormation YAML document specifying all these properties. There are no opinionated defaults, there are no abstractions. It's just verbatim, this is exactly what I want, and it'll get you there, but you have to be very clear about what it is that you're doing. I found that I was spending most of my days just copy-pasting JSON and YAML, or converting between JSON and YAML because I'd found a sample that I wanted but it was in JSON and the rest of my template was in YAML. One time I tried to just paste it in there; it kind of worked, but then it didn't get through code review and I was really upset by that. I was hoping that they just wouldn't put me on call anymore, but it didn't work out.

After copy-pasting all this YAML enough times, we thought, "There has to be a better way." I tell people I did some Googling, but actually a friend of mine just pointed me to these Document Object Models. They let you generate CloudFormation templates from a DOM structure. It felt closer to what we wanted to do. I wanted the power of CloudFormation, which is declarative and handles the provisioning engine portion of this, moving my infrastructure from one state to another so I don't have to think about all that. I also wanted to just be very [inaudible 00:09:18] when I say it. This is what I want. I have no preferences for how many capacity units are on my Dynamo table. I just want a table that works right now and I'll worry about that later. This is what the DOMs gave us. They gave us an abstraction on top of CloudFormation and allowed us to deploy our infrastructure. This is the first time we get to define infrastructure in what some people call real code; I call it familiar code. It looks very much like YAML. I think the demo shown there is Troposphere, which is in Python, so you get all your Python features from your IDE.

You can use PyCharm or whatever IDE you use, and you get tab completion and inline documentation, all these great things you like about using IDEs, and the reason people don't just code in [inaudible 00:10:18]. They offered an abstraction, but an abstraction wasn't a first-class citizen in these object models. You couldn't wrap up everything you did as an abstraction to give to someone else. If someone else wanted to do this, they would just copy and paste your Python code instead of copy and pasting your YAML. We were only slightly better off than we were with our previous provisioning engine models.

AWS Cloud Development Kit

This is where AWS stepped in. Coincidentally, this was right as I left AWS to go to iRobot. I moved to a different team as this was being launched. Those two were unrelated. I'm not saying I left because of the launch; that would be a terrible thing to say on camera. AWS offers all of these services. We have hundreds of services, and I should probably know the exact number, but I don't. We want to use more and more of these, and as more launch, they make services that are more specific to individual use cases. You went from just a general-purpose database to now you've got a graph database, you've got a document database, you've got a key-value database, and a SQL database, if that's what works for you. As we get more and different types of resources, it becomes harder to manage them all; it becomes very complex. The CDK team was formed to create an abstraction that helps developers manage the complexity of this infrastructure, to minimize that complexity and make it easier [inaudible 00:11:47].

I'll quickly cover some core concepts, then we'll jump right into writing some demo code in Python or TypeScript for the rest of the demo. One concept that is very core: the CDK is a software framework for defining infrastructure for your cloud applications, and the code that you write in this framework is called your application. I know the word application is used quite a bit, but when I refer to a CDK app or CDK application, I'm referring to this specific framework feature, which is just a collection of your stacks. You could have one stack or hundreds of stacks, or maybe thousands depending on what CloudFormation gives you and how you spread them out. If you have a thousand stacks, I'd really like to talk to you in the hallway afterwards. Then each stack contains what we call a tree of constructs, where a construct is an AWS cloud resource with some configuration attached to it, or no configuration depending on how you use it. We'll just say that no configuration is a special case of some configuration. Then the key to this is the abstraction. The CDK programming model allows you to bundle up these constructs as abstractions that you can pass around and publish, which we'll talk about in a moment.
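The app / stack / construct-tree relationship can be sketched with a toy model (the real classes live in the `aws_cdk` packages; this stand-in only illustrates the scope-and-id wiring described above):

```python
# Toy model of the CDK object tree: an App contains Stacks, and each Stack
# contains a tree of Constructs. Every construct is attached to a parent
# scope and has an id, which together give it a unique path.
class Construct:
    def __init__(self, scope, construct_id):
        self.id, self.children = construct_id, []
        if scope is not None:
            scope.children.append(self)

    def walk(self, prefix=""):
        """Yield the path of this construct and all of its descendants."""
        full = f"{prefix}/{self.id}" if prefix else self.id
        yield full
        for child in self.children:
            yield from child.walk(full)

class App(Construct):
    def __init__(self):
        super().__init__(None, "App")  # the root has no scope

app = App()
stack = Construct(app, "MyStack")
Construct(stack, "MattsQueue")
Construct(stack, "MattsFunction")
print(list(app.walk()))
# ['App', 'App/MyStack', 'App/MyStack/MattsQueue', 'App/MyStack/MattsFunction']
```

In the real framework, `cdk synth` walks exactly this kind of tree and renders each node into CloudFormation resources.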

Also, this allows you to keep using the IDE resources that you had before, so you still get inline documentation, and you can still use the testing tools you're used to. For Python, you can use pytest; for JavaScript or TypeScript, you can use Jest to test these. You can use your IDE to do tab completion like I said before, type hints, things like that. I'm not sure how many people memorize the standard library for their applications, but I normally just press the dot and then look at all the methods that are available and look for the one that's similar to the one I need. Or, if I'm creating a class, I'll type the opening parenthesis and then I'll see all the things that it expects, and hopefully it tells me which ones are required or optional. If anyone writes an API and doesn't put documentation in those, I don't want to talk to you in the hallway after this. We're not friends.

Because we're still working on top of CloudFormation, we're still able to leverage that provisioning engine. You still don't have to think about how you're going to move your infrastructure from one state to another. You just give it your desired state. We're just using the CDK Construct Library and the CDK core application, the software framework, to create this CloudFormation representation for you so you can focus on just the code, and it lowers the number of mental cycles you have to spend on CloudFormation syntax, so you can spend more time thinking about what your application is doing.

Did anybody attend the last talk in this track upstairs, the Pulumi one? The talk was about how the line between infrastructure and application is blurring, and it's true; that's why he's giving a talk about it. How your infrastructure is configured greatly affects the way your application performs, or whether it performs. The more time you can spend thinking about that, and the less time you spend thinking about the underlying implementation details of mutating some resource, the better the applications you'll produce, which is generally true. The more time you spend thinking about something, the better you are at it. If that's not true, I'm sorry.

We have this thing called the Construct Library, which I believe is one of the last resources we'll discuss here before we start writing code. I really want to start writing code, but they said I had to talk first. We have this Construct Library, I think this is just a snapshot of the cover. We have modules and constructs. Each AWS service more or less is composed into its own separate library so it does its own versioning. Some are marked as stable and GA, some are still experimental based on how much development has been done on them individually, or how confident we are in them. We also have a few extras that we've added as helper functions.

Has anyone ever tried creating a CloudFormation custom resource? Nobody? It's kind of hard. Don't tell anyone I said that, but it takes a while. It takes trial and error quite a few times. Usually, you use a custom resource because you just want to do a single AWS SDK call that's not supported in CloudFormation, so we added this AWS custom resource package where you can just describe the SDK call that you would make in your code, and it will deploy that as a CloudFormation resource, which unlocks a huge amount of potential. Even though we're still built on top of CloudFormation, we're able to offer a lot of features that CloudFormation doesn't quite support yet. The ACM custom certificates use it; I just recently used it for Athena tables, which don't quite have as much CloudFormation coverage as I'd like.

Luke [Hoban] mentioned constructs as L1, L2 or L3. L1 constructs, because they're built on CloudFormation, mean that if you really wanted to, you could write CloudFormation in Python or TypeScript, whatever language you wanted to use. I don't want to say it would be difficult, but it wouldn't look very nice, and you wouldn't accomplish much more than just directly writing the CloudFormation itself. It does give you some benefits, because CloudFormation will say, "This is supposed to be a string or this is supposed to be a number," and then your IDE will say, "This is supposed to be a string or a number." But if it's supposed to be an ARN and it's just represented as a string, the IDE is not going to know whether that's a correctly formatted ARN; it'll just say, "Sure, I'll take that," and then when you deploy it, it'll fail. The L1 constructs are just your base CloudFormation resources.

Then L2 constructs are those resources but with opinionated defaults built in. The example I like to give is a VPC. If you've ever created a VPC in CloudFormation, it's about 350 lines of CloudFormation. You have to create NAT gateways, private subnets, public subnets, route tables, internet gateways, and CIDR address blocks for all of this. With the L2 construct, if you tell us, "I don't care what resources and what configuration the VPC has, just give me a VPC. I need a VPC to put an EC2 instance in for something, I'm going to tear it down in a minute. I don't care about any of its settings, just make it work," it's a single line of CDK, and it's just new VPC. Then we will give you, I think, 2 public subnets, 2 private subnets, and 256 IP addresses in each one. I think those are the defaults. We provide a lot of opinionated defaults, and then you can slowly peel back that abstraction if you want to. If you say, "I'm a little more picky. I want to have exactly one public subnet," then you supply just that to the constructor and that's what you'll get. I describe this as peeling an onion, whereas some of these other tools are more like cracking an egg: as soon as you deviate from the opinionated default, you have to do everything yourself. Once you've opened the egg, there's no going back.
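The "peeling the onion" behavior can be sketched as opinionated defaults merged with selective overrides (a toy stand-in; the real `ec2.Vpc` construct has different parameter names and far more options):

```python
# Toy model of an L2 construct's opinionated defaults. Parameter names are
# illustrative, not the real ec2.Vpc API.
VPC_DEFAULTS = {"public_subnets": 2, "private_subnets": 2, "ips_per_subnet": 256}

def make_vpc(**overrides):
    """Return the effective VPC config: defaults, with any overrides applied."""
    unknown = set(overrides) - set(VPC_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown options: {unknown}")
    return {**VPC_DEFAULTS, **overrides}

print(make_vpc())                   # the one-liner case: all defaults
print(make_vpc(public_subnets=1))   # peel back just one layer; the rest stays
```

The key property is that an override replaces only the layer you name; you never fall off a cliff into configuring everything by hand.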

Also, this project is open-source, so we have it on GitHub. I made a few contributions while I was in Amazon before, and then while I was at iRobot, and then I made a couple since I came back. It's actually easier for me to make contributions when I wasn't working for AWS just because with Amazon, AWS technically being separate companies, I had to go through more reviews. In iRobot, "Yes, just do whatever you want." I did it on my own time. It's not like iRobot was paying me to do it.

Demo: Build an AWS CDK App

Let's get into some code. We're just going to make a directory. What are we going to name this application? You in the orange, what's your name?

Participant 1: Matt.

Boyd: Ok, MattsNewApp. One of the benefits of CDK is that you can use a language that's familiar to you. We have general availability support for Python, TypeScript, and JavaScript. We have developer preview for Java and C#. The user experience for those two isn't quite where we want it yet before we're comfortable saying it's general availability, but it does work. The CDK is originally written in TypeScript and then uses JSII, which transpiles it into these other languages; Java developers don't want to have to know about using NPM and doing these other things.

Every time we do this, people complain, because if we do Python, they will say, "I was hoping it'd be TypeScript." If we do TypeScript, they say, "I was hoping it'd be Python." I'll let you pick. Just raise your hand if you want this demo to be done in Python. Ok. Raise your hand if you want it done in TypeScript. Python it is.

We'll do cdk init. This command by itself won't do anything. It will just tell me how I messed up, hopefully. This gives you your options: you can do just a regular app, which is a blank project structure; a library, which we'll cover in a moment; or a sample app. We're going to do a sample app. You say which template you want, so we'll say sample-app. It creates a virtual environment for us. I've been asking the team to have it automatically source the virtual environment so I don't have to do that, and they said that was too lazy. That was a bridge too far. They wouldn't support that. They do give you the command so you can just copy and paste it.

We're going to use this virtual environment. Then we're going to pip install the requirements for this. Because you all chose Python, I hope you're familiar with this. This is the requirements file. It'll go through this and pull in these dependencies. By default, this is the application that we give you; we'll look at it real quick. We can ignore this hello_construct for now. We import iam, sqs (Simple Queue Service), sns (Simple Notification Service), and then a subscription thing for sns. Then you get an sqs.Queue.

Every object is a construct, and constructs have to have a reference to the scope where they want to be put. We always start with self, saying that we're going to add it to this MyStack. You could, if you wanted, pass in some other stack to attach it to, but that feels like violating some weird computer science software engineering principles about encapsulation and data hiding. You can technically do it; I'm not saying don't do it, but don't do it.

Then you have to give it a name; if you've used CloudFormation before, this is very similar. You'll pass in these options. Our Queue has a visibility_timeout, that's a common option, and there's an sns.Topic. This is a construct that we're actually going to get rid of. We're going to get rid of all of this and create things more or less from scratch. I like this because it gives us a lot of the templating. We're going to create an sqs.Queue that messages come into, and then there's a lambda function that takes those messages and writes them off to CloudWatch Logs, mostly because you can write to CloudWatch Logs just by printing. That's the cool thing about lambda: it allows me to be lazy.

The first thing we will do is import lambda. This is a problem that you have when you're writing a thing that compiles into many other languages. If you look at the reserved keywords across Java, JavaScript, Python, and C#, there are not that many words left. If you try to use lambda, Python treats it as the lambda keyword, and it's like, "Sure, that can be a function," and then, "That's not at all what I want." The idiomatic way I see people doing this inside of CDK is they'll just add an underscore to either the beginning or the end, or they'll abbreviate it. Sometimes, I'll import these with just one character. I did this a lot for the code tools because those are the tools I focused on: CodeBuild, CodePipeline, CodeCommit. I just call them cb, cp, and cc. I get yelled at all the time, "This is unreadable." I'm, "That's the way I like it."
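You can verify the underlying constraint in any Python REPL: lambda is a reserved word, so it can never be a module name, which is why the `lambda_` aliasing idiom exists:

```python
import keyword

# "lambda" is a Python keyword, so `import aws_lambda as lambda` is a
# syntax error; hence aliases like lambda_ or single-letter names.
print(keyword.iskeyword("lambda"))  # True
# The common CDK import idiom (the aws_cdk module name, aliased):
#   from aws_cdk import aws_lambda as lambda_
for reserved in ("lambda", "import", "class"):
    assert keyword.iskeyword(reserved)
```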

This tells you the resources that are available inside the lambda service. Anything that starts with Cfn is an L1 construct. If you really wanted to write CloudFormation in Python, you could do it with these Cfn classes. I'm not going to give you all the choice to do that today because I know you'll make me do it. Then we say self, saying where to attach it, and we give it a name, mattsfunction. This is saying that code, handler, and runtime are undefined, which means this will fail if I try to compile it, because those are required properties.

We do offer a lot of opinionated defaults, but there are some things we just can't guess. I can't guess what your code is going to be. If I could, I'd be making a lot more money and I wouldn't be working for someone else. The first thing I'm going to do is the runtime: say lambda_.Runtime. Then, here are the runtimes; it's like an enum. I love this because when I try to use regular CloudFormation, there are different versions of Node that are described in different ways. I can never remember the syntax. I always have to go and look them up. This just lists them as an enum of what it accepts.

This thing that just popped up here, these are all the possible options you could give it. We saw that there were the three required options, but we do still support things like adding layers or adding a description, which I never do, so that's documentation for you. If you want to add some events to this, or create a custom role and assign it, you can also do that. If you don't supply a role, we'll give you the basic lambda execution role.

index.main is the handler. If you've created a lambda function before, you'll know that you have to specify an entry point into the function. We're going to do what's called an inline lambda function, where you put the actual code in the CloudFormation document itself, and in that case, there's no file name, which is normally the first word here. It's just called index. You just have to know that from the docs; I'll show you. code=lambda. See, I just did it there. You could say from_asset. I think there's a from directory. You can say from_asset and you can say, "This whole directory is a lambda function," and it will handle zipping that up, uploading it to S3, and adding in the correct keys for you.

We're just going to do from_inline, and I'm also going to do something very gross where I write the Python in one straight line. This is going to be our function code, similar to what the guy in the Pulumi talk did earlier. We're going to say def main, then \n, and of course you have to indent it because Python likes whitespace. print(event). I always say whitespace is the best form of scoping.
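The one-liner he types can be checked locally by exec-ing the same string you would hand to the inline-code constructor (the exec here is just a local sanity check; at deploy time, Lambda loads it as index.main):

```python
# The inline handler from the demo, as the single string you would pass to
# something like lambda_.Code.from_inline(handler_source). Exec-ing it
# locally just confirms the one-liner is valid Python.
handler_source = "def main(event, context):\n    print(event)\n"

namespace = {}
exec(handler_source, namespace)           # defines main() as Lambda would load it
namespace["main"]({"Records": []}, None)  # prints {'Records': []}
```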

Now we just have a lambda function. Let's come back here. We can do cdk synth to synthesize our template. I think there's an extra space there. Hopefully it doesn't break anything. That was anti-climactic, I'll make it better. This is the app I referred to earlier. It actually creates two stacks just to demonstrate this, but I'm going to delete this one because nobody cares about us-west-2, and we'll make this the best region, us-east-1. We'll try this again.

If you have more than one stack, it'll just output all of them to a folder and say, "Ok, I'm done." If you only have one stack, it will print it out for you on the synthesis command; if you were doing this in a production environment, outputting into the folder would probably be a better approach. Then this is the CloudFormation that was created. It creates mattsfunction and the role for mattsfunction. It creates the function itself. This is the syntax it uses for this hopefully correct Python code, then it references the role. Then we add in this metadata, which helps with debugging. If you wanted to see where a specific resource came from, you could dive into this and see where it came from or what version numbers were being used. Inside the cdk.out folder, we also produce this manifest file, which describes the resources and stack traces for each of those. Whether they fail or succeed, you still get the stack traces to help debug if you're not sure where a resource came from, because if you have layers of abstraction on top of layers, it's really hard to tell where a thing came from. Sometimes resources will just show up as peers in the template.
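The synthesized template looks roughly like the following trimmed fragment (the logical IDs, hash suffixes, managed policy ARN, and runtime string are illustrative, not the exact cdk synth output):

```json
{
  "Resources": {
    "mattsfunctionServiceRole1234ABCD": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": { "Service": "lambda.amazonaws.com" }
          }],
          "Version": "2012-10-17"
        }
      }
    },
    "mattsfunction5678EFGH": {
      "Type": "AWS::Lambda::Function",
      "Properties": {
        "Code": { "ZipFile": "def main(event, context):\n    print(event)" },
        "Handler": "index.main",
        "Runtime": "python3.7",
        "Role": { "Fn::GetAtt": ["mattsfunctionServiceRole1234ABCD", "Arn"] }
      }
    }
  }
}
```

Note how the few lines of CDK expanded into both the function and its execution role, with the inline handler embedded as ZipFile.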

cdk deploy. I'm going to use a profile; it's just a profile I use for demos. I just don't want to have to open up my config file and show everyone my credentials on a recorded screen. This is a Richard bug. What happened was, earlier I deployed this stack with the same name as a demo. It gets this name, hello-cdk-1. That was the default name, I didn't change it, and I made a couple of changes to that thing I created earlier. Now we're doing this again, and it's trying to overwrite that, but it's telling me, "You're going to remove this policy if you do," which is the sqs.Queue policy, which makes sense. By default, anytime you do a cdk deploy, this diff will show you any IAM role or policy differences, because those are the ones people tend to care about most. You can force it to say, "Show me all the differences." Let's say no for now. cdk diff. This will show you the actual differences, if it runs.

It's adding a new role and a new function. I'm not sure why it's not mentioning getting rid of the queue; I'll have to look into that. If you want to deploy this, it'll prompt you for a yes or no. If you want to live dangerously, you can say force, I believe the flag is, and it'll automatically just say yes to these things, or you can just tell it to never prompt you if you really want to live on the edge. There's probably a vendor here who will sell you a tool that will scan it for you so you can just auto-push it, and if not, maybe there will be next year. This is creating the change set; let's hop over into CloudFormation.

Let's update it. It's saying "Update in progress." I totally messed this up. I should have deleted the stack first, and I didn't do that. It would normally say "create in progress." You can see these events and the resources it has created so far. Let's let this deploy, or not. It might fail, we'll see, but let's go back to the code.

I spoiled it with the cdk diff, but let's add a queue: sqs.Queue, mattsQueue, and that's it. We could put a visibility timeout, like the other one had originally, but we'll just leave it off; more of those opinionated defaults. See what this is doing: it's deleting old resources. That's why it's taking so long. We'll go back real quick.

Right now, we just have a lambda function and we have a queue. There's no connection between those two. You can say grant_consume_messages, and that has just created the scoped-down IAM permissions for the lambda function to be able to consume messages from that queue. You don't have to worry about whether it's the right action string that you'd have to dig out of the IAM docs. You don't have to worry about any of this. It gives you the correct permissions on this queue for that lambda function. That allows the lambda function to consume from the queue, but what we also need to do is give the queue permission to invoke the lambda function.
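The generated policy looks roughly like this fragment. The action list reflects what the SQS consume-grant attaches in CDK as I understand it; treat the exact set and the logical ID as illustrative:

```json
{
  "PolicyDocument": {
    "Statement": [{
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:ChangeMessageVisibility",
        "sqs:GetQueueUrl",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Effect": "Allow",
      "Resource": { "Fn::GetAtt": ["mattsQueueABCD1234", "Arn"] }
    }],
    "Version": "2012-10-17"
  }
}
```

The point of the grant methods is exactly this: one call emits the least-privilege statement, scoped to that one queue's ARN, attached to the function's role.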

You just say add_event_source. We're going to add_event_source; this is looking for this thing called IEventSource, which is very small on screen and no one can read it, so you just have to trust me. I have not earned very much trust in this session, and I always forget what this is called. I think it's event_sources. It's going to say it doesn't know what that is because I haven't imported it, so the way I generally prefer to do it is just to copy this. That's another reason Python is better than JavaScript and TypeScript: I can have these trailing commas and it doesn't throw an error. There it is; all I needed was the name, so I'm not going to wait. It's still red because, although I have it here, I still have to actually execute the pip install command.

While that's installing, we will look at the CloudFormation. This is done. There it is, mattsfunction, way at the bottom. Especially in TypeScript, sometimes I can get it to work: I'll get through the build and enough of the install that it works, and then it'll throw a bunch of errors and I just suppress those.

CDK 1.16 was released this morning, and there was a regression in one of its dependencies that was then fixed, but it broke my demo. I fixed the demo for the regression, and it looks like they put in a patch, so they unfixed it. I'll try it again. I think we have to call it there, because normally this would work. I apologize. This is a terrible look, I get it. The point of it was that you could say add_event_source, and then you could say sources, and it would list all the possible lambda event sources. I believe 47 is the number Andy Jassy last publicly mentioned, and we've implemented a dozen of those. It would be an SQS event source, and then you'd point it at the queue. Normally, that would work if a better engineer were up here. I'll call it a day on the demo.

Next Steps

We have a workshop where hopefully you'll have better luck than I did getting this to work with the versions. The versions are [inaudible 00:44:44] in that one, so they won't have the same issues. There's a Gitter channel, and on the GitHub repository itself, you can post issues. Sometimes, I'll just post an issue and it's "I don't like this" and nothing else, and it'll stay open for a while. They're going, "That's very good feedback," so I don't know. We also have sessions at re:Invent, which is coming up early next month.




Recorded at:

Feb 10, 2020