
Lessons Learned from Reviewing 150 Infrastructures



Jon Topper presents a structured review of the architectural and operational choices of 150 platform teams. Topper explores several themes, talks about common mistakes, and gives advice on how to avoid these. The review tool used is part of the AWS Well-Architected program.


Jon Topper runs The Scale Factory, a team of cloud infrastructure and DevOps experts based in London, UK. He's worked on infrastructure problems for Fortune 500 companies and startups across a range of market sectors.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Topper: My name is Jon. I run The Scale Factory. We're a DevOps and cloud infrastructure consultancy based here primarily in London. We're a remote-first team. We have colleagues around the country. Before that, I came through the startup mill on a mobile messaging app that was a bit like WhatsApp, but nowhere near as successful. Then before that, I was in the ISP industry, building infrastructure for web designers to deploy their websites on, in the PHP 3 era. I've been around this area a while. I've got a beard so you know you can trust me and my sysadmin credentials. I have a lot of stickers on my laptop, which means you can trust that I'm a DevOps practitioner.

The Scale Factory is an Amazon consulting partner. We've been going since 2009. We joined a program at Amazon called Well-Architected as an early access partner in April 2018. Since we joined that program, we've reviewed, at this point, around 180 distinct workloads on the Amazon cloud. We've embedded this review framework deeply into our consulting practice. To me, this is a fairly unique and valuable perspective, looking at the ways that a lot of people are doing things in the cloud, and being able to make assertions about what you ought to probably be doing in the cloud based on that data. Hopefully, the things that we've learned will be useful to you to take away and look at in your own practices. One caveat: I should say that the majority of the reviews that we've done have been for small and medium businesses, with not a lot of high-end enterprises. We don't really work very actively in that space. If your environment is enterprise, this might not directly apply, but actually, on the basis of the enterprises I have spoken to, it's probably still true in those organizations as well. It's just that the data I have doesn't come from that world.

I'm going to talk a bit about what Well-Architected is, what a Well-Architected review looks like, and the things that we found through running that review process. There are two things that I hope you'll get out of this. One, hopefully you realize that you're not alone. We'll pick out some trends here. You'll be like, "We do that as well." The fact that other people are also making those same mistakes might make you feel a little better about yourself. I'm also hoping that some of you go away and go, "Actually, we're doing this pretty badly as well in our infrastructures. We should probably look at that."


What is Well-Architected? Well-Architected being an Amazon framework means that this is most directly relevant to you if you're using AWS. Who's on AWS? Anybody on GCP and Azure? Any other cloud infrastructure? A few Baidu cloud, or Ali cloud people in the middle, presumably. The framework itself asks questions in a platform-agnostic way. When we're reviewing infrastructures, it's mostly Amazon-using customers. We're mostly looking for Amazon-shaped answers. A lot of this is probably still valuable to you even if you're not entirely Amazon users.

Well-Architected was originally collective internal knowledge within Amazon's solutions architecture organization. As a solutions architect at Amazon, you would go out and talk to customers about how they build their infrastructure and ask them questions about what's working for them and what isn't. Over a number of years, they built up this tribal knowledge of what good practice on the AWS cloud looks like. Then they took that learning and codified it into a set of documentation and a review framework so they could share it across the organization with a level of consistency. In April 2018, they came to the realization that they wanted to review more Amazon workloads than they had solutions architects available for, and so they shared it with the partner network, which is how we came to get access to this. It's platform agnostic insofar as the questions it asks don't apply specifically to Amazon. You can run this with an on-premises workload. You can run these questions with an Azure or a GCP workload. You're more likely to do it if you're an Amazon user. Don't feel like you can't use this if you're not.

Today, Well-Architected is a set of white papers. These are fairly Amazon specific. They go through the design principles and recommendations of the framework on a per-area basis, and talk about how to apply those principles using Amazon's tools. There's also a review tool which exists in the console on the AWS platform, which is basically a multiple choice questionnaire. This is the questionnaire that our data comes from, the one we've used with customers over the last couple of years.

The Well-Architected framework covers these five pillars, as they're called. They go in order of operational excellence, security, reliability, performance efficiency, and cost optimization. Most organizations care about multiple of these. If you're a bank or a financial services company, or you're doing things with pharma, then you might care a little more about security than some of the other industries. Maybe cost optimization is more important if you're a startup, those sorts of things. There's a white paper per pillar that you can go and read on the website.

There's also what are referred to as lenses. A Well-Architected lens is a collection of questions that relate to a specific type of workload. If you're building serverless components, or serverless applications, or you have an HPC, or an IoT workload, there is, right now, white papers specific to those types of workloads and a number of additional questions in the review tool that apply to that. Right now, the serverless lens is the only one that's available in the online tool. The white papers contain the questions for the other two at the moment.

How do we use this? How can you use Well-Architected as a framework? It's a really good tool for gap analysis. A lot of the opinions in the review framework can be used to identify areas where you're not doing things particularly well. You can use that to roadmap what you're going to do over the next 3 to 12 to 36 months to address those. It's a great teaching tool. When we run these with teams, invariably, even the person who thinks they're the Amazon expert goes away having learned a bunch of new stuff, whether that's new product launches that they hadn't really heard about or something that they weren't using quite correctly. In larger organizations, it's really good for team alignment. As a governance practice, it's a really good way to make sure that all of your teams are doing things in a consistently high-quality manner. Some of the enterprises that we have worked with, the bigger companies, use it to help keep their teams moving in the same direction, pretty much. You can probably use it in these ways too.

What Is a Well-Architected Review?

A review itself is where the data that I've got here comes from. It's 46 questions in the standard review tool. The other lenses add additional questions. It takes up to four hours if you're doing that in a partner led way. We use it as a conversation starter a lot of the time as we go through ticking boxes. It's very qualitative. When you're doing a review of this stuff, you don't need to gain access to the Amazon accounts. It's very much asking questions about, are you doing this practice? Are you doing that practice? Do you have this consideration? It's quite high level. It's fairly technical, as far as it goes. It's not a direct analysis of the things that you've set up. It's a lot more about your operational practice and the way that you've made certain architectural decisions.

The total number of questions is pretty big. If you were to be running an IoT workload, there's another 11 security questions and 10 performance questions to look at. Those cover specific considerations for that workload. The IoT stuff covers quite a lot of X.509 certificate management stuff, because there's a lot of that in that world. Serverless covers a bunch of stuff around the things that make operating a serverless platform more difficult than a regular containerized or EC2 workload.

How Do You Determine What Your Priorities Are?

I'm going to show you an example question and how each of those questions is scored. Then I'm going to take you through some of the data that we've got for some of the notable questions in that review as far as our findings are concerned. Then we'll talk a little bit about the implications of that. What does that actually mean? What can you do with that information? The question I'm going to use as an example is the first question in the operations pillar. The question, as asked by the framework, is: how do you determine what your priorities are? What we're looking for here is that you're making decisions based on actual context. You know what your customers need, whether they're internal or external customers. You're familiar with or aware of your compliance landscape. You're thinking about your threat landscape as well, from a security perspective. You're making trade-offs based on those things, and using those trade-offs to drive your priority decisions. You're looking at the benefits and risks of each of those situations.

This is a fuzzily laid out question. It's annoying that it's the first one on the framework, because it sounds a bit weird. Really, the question is, are you thinking about what you should be doing based on what your company needs? The way that the scoring of the framework works is that each of the answers is given a rating. That rating is either Well-Architected, needs improvement, or a critical issue. You get an overall score for this question based on the boxes that you tick in the tool. If you ticked none of these because you're not making priority decisions based on any context, then quite rightly, that's flagged as a high risk. That question we consider high risk for that pillar. If you were to tick all of the Well-Architected boxes, then that would be flagged as a medium risk, because there's still some of the stuff to work on. That would remain a medium risk until you tick every box in there. At which point you get a Well-Architected score. Reasonably straightforward. The way this is rendered in the console tool is quite nice that each of the questions has links off to pieces of information and videos about how you could meet the goals of that question. You do that 46 times, basically.

Common Review Findings

Let's talk about some of the things that we found. I'm going to start with the things that are good, the places where most people are doing a pretty good job. That question that I used as an example is actually the question that more teams have got a Well-Architected mark on than any of the others. This dataset is actually about 115 rather than 150. The majority of teams score well on this because they are taking into account their compliance landscape, their customer needs, and so forth, as they make their architectural decisions, or their operational decisions. It's likely that in your businesses, you are also making reasonably good decisions driven by your business priorities. If you're not, you should probably just pack up and go home anyway. I don't know how you come to operate in that vacuum way.

How Do You Select Your Storage Solution?

The second highest number of Well-Architected ticks by question, at 70%, is around selection of storage. There are a number of questions in the performance pillar about how you've made selections of specific types of infrastructure. You'll be asked about how you select your storage. How have you selected your database? How have you selected your compute? How have you selected your networking options? All of these really are around understanding what the characteristics of those services are in relation to the type of data you're processing, or the type of app that you're building. Seventy percent of people have chosen their storage solutions based on an understanding of the characteristics of the thing that they do. Really, what this means, I think, is that we're using S3, because S3 is the storage option that most people doing anything on the Amazon platform are using. In some cases that might also be recognizing that using EFS, for example, is a good way of achieving an NFS shared storage layer within that application rather than doing some other thing.

The questions around architectural selection also usually mention, as in here, available configuration options. Most of the teams that score pretty well on their understanding of the requirements don't always score as well on their understanding of configuration options. Every component in the ecosystem typically has a number of things that you can tune and configure about it that you might not necessarily recognize. For example, if you're attaching block storage in the AWS landscape with EBS, you may not be aware that the number of IOPS, the I/O operations per second that are possible with that storage volume, relates directly to its size in some circumstances. You may not be aware that there are other things that you can tune about EFS, for example, to perform one way or another depending on the type of files you're storing on it, that kind of stuff. Most people are doing a good job of choosing their storage solution.
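To make the EBS point concrete, here is a minimal sketch of how baseline IOPS can depend on volume size. It assumes the gp2 volume type and AWS's published gp2 figures (3 IOPS per GiB, a floor of 100 and a ceiling of 16,000); the talk does not name a volume type, other types such as gp3 or io1 behave differently, and the function name is made up for illustration.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume, assuming 3 IOPS per GiB,
    a minimum of 100 and a maximum of 16,000."""
    if size_gib < 1:
        raise ValueError("size must be at least 1 GiB")
    return max(100, min(16_000, 3 * size_gib))


# A small volume sits on the floor; a large one hits the cap.
for size in (20, 100, 1000, 6000):
    print(f"{size} GiB -> {gp2_baseline_iops(size)} IOPS")
```

Under these assumptions, a team that sizes a volume purely for capacity may end up with less I/O headroom than it expected.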

How Do You Implement Change?

Question 5 in the reliability pillar, 63% of the workloads that we've reviewed get a Well-Architected tick for this question. This is about planning change, basically. Eighty-three percent of people are making changes in a planned manner, which is there. Sixty-seven percent of those people are deploying changes with automation. Not everybody has fully adopted automated change management at this point. A worrying 6% of people are neither planning nor automating their changes, and presumably going through life just randomly deploying stuff on a whim into production, which hopefully none of you are doing. Those are the top three things that teams are doing pretty well.

How Do You Plan For Disaster Recovery?

Let's have a look at some of the things that are going badly. Question 9 in the reliability pillar. Seventy-nine percent of the workloads I've reviewed score a high risk in this area. Eighty-seven percent of teams score either high risk or needs improvement in this area. That is planning for disaster recovery. Very few teams seemingly are making decisions about how they should be responding to disaster events based on, for example, business goals, and so forth. The first thing that you'd be expected to do if you're thinking about disaster recovery is to define what those objectives are. Your recovery time objective. Your recovery point objective. How much data are we prepared to lose? How long are we prepared to be out for in order to meet the needs of our customers? Only a third of the teams that we reviewed the infrastructure for have actually thought it through in those terms.

I think if you're looking at the Google SRE type nomenclature, these are your SLOs. The things that you're trying to hit. Lots of teams are not really thinking about that in terms of disaster, when things go wrong. A third of people are also not defining what those recovery strategies are. Some of them aren't thinking about how much data they're willing to lose or how long they're willing to be out for. Some teams are not actually thinking about how they would get back on their feet in the event of an outage. Only a quarter of teams are actually testing any of these plans at all, which is worrying. Sixteen percent of people are doing some automated recovery. That's usually the use of things like autoscaling groups to bring dead hosts back to life. Or, maybe people who are using Kubernetes, where the pod controller is just going to reschedule those things. The thing that most people are doing the worst is actually thinking about disaster recovery planning.
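One way to turn recovery objectives into something checkable is sketched below. It is an illustration only: the class and function names are hypothetical, and it assumes that worst-case data loss is roughly the interval between backups and that restore time has actually been measured in a rehearsal.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # recovery point objective: how much data we can lose
    rto: timedelta  # recovery time objective: how long we can be out for


def check_plan(objectives: RecoveryObjectives,
               backup_interval: timedelta,
               measured_restore_time: timedelta) -> List[str]:
    """Return a list of ways the current plan misses the stated objectives."""
    problems = []
    # Assumption: worst-case data loss is about one full backup interval.
    if backup_interval > objectives.rpo:
        problems.append("backups are too infrequent to meet the RPO")
    # Assumption: restore time comes from a real rehearsal, not an estimate.
    if measured_restore_time > objectives.rto:
        problems.append("restores take too long to meet the RTO")
    return problems


objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
print(check_plan(objectives,
                 backup_interval=timedelta(hours=24),
                 measured_restore_time=timedelta(hours=2)))
```

Here nightly backups fail a one-hour RPO even though the restore itself is fast enough, which is the kind of gap that only shows up once the objectives are written down.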

How Do You Respond To a Security Incident?

Related, this is the second ranking highest risk issue for teams that we have reviewed. Seventy-five percent of people are not doing a good job of responding to security incidents. This is about planning ahead for things that are going on in the security landscape. Only half of the teams surveyed actually know who they would go to and involve in the event of a security incident, which is worrying. Twenty-seven percent of teams would know what tooling they would use to either mitigate or investigate a security incident. Forty percent of people have an incident response plan, which I don't think really matches the rest of the data. No teams at all have any automated containment. That's where you might quarantine a host or a container, keep it running but stop its access to external resources, which is something that you can set up. Eleven percent of people have identified their forensic capabilities. They know how they would get onto those compromised instances and figure out what went on, whether that's through log analysis or using other security tools.

Only 10% of teams are actually pre-deploying those investigative tools to their hosts so that they would be available in the event of an outage. The model here is that if you're going to quarantine hosts that have been compromised, then you need to have on those hosts access for the people who would do the investigation and tooling for them to do that with before that quarantine happens. Otherwise, you risk unquarantining the host in order to get those tools on there. Then you're probably in more trouble. Three percent of teams, which is probably two people that we interviewed, actually rehearse security incidents by running game days. Thirty-five percent of teams are not doing any of this. There's very little thinking about how to go about dealing with a security incident were it to happen. Based on these stats, it's likely that three-quarters of you are in that same boat.

When we're talking about security incidents, we're talking about exploited frameworks. People getting to run untrusted code on your machines. We're talking about having failed to set up adequate access controls on storage services. Leaking data from an S3 bucket that's been made public, that kind of thing. Or, as happened to one of our clients, they accidentally made some of their API keys public. A developer checked some dotfiles into a Git repo and immediately made those keys available to the world, which was not good for them. They got away pretty lightly by only having run up some big Bitcoin mining bills rather than having all of their data deleted, which would have basically screwed them.

How Do You Classify Your Data?

The third question that most people are doing badly at relates to security, at 75% of teams: data classification. This is something that I think has got a bunch more attention recently with GDPR becoming noisier. Most teams don't really have a good idea of where their sensitive data lives. Ideally, you want to be able to say, this service has personally identifiable data in it. Or, this service over here has card data in it, or whatever, and make good access control decisions, and good decisions about how you back that data up and how you ensure that it doesn't leak, based on the sensitivity of those classifications. I think some of this is because, historically, all the teams have used one big relational database for everything. I think if you're building microservices, and you're allowing those microservices to have their own datastores, then you might be in a better position because you can identify a particular service or a particular store that holds sensitive data. Many teams are not thinking about what data lives where, and that's likely to be a problem.
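A data classification inventory does not need to be elaborate to be useful. The sketch below is a hypothetical example: the classification levels, the datastore names and the helper functions are all invented for illustration and are not taken from the review tool.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PERSONAL = "personal"  # personally identifiable data
    CARD = "card"          # payment card data


# Hypothetical inventory: which datastore holds which class of data.
DATASTORES = {
    "product-catalogue": Sensitivity.PUBLIC,
    "build-metrics": Sensitivity.INTERNAL,
    "customer-profiles": Sensitivity.PERSONAL,
    "payments": Sensitivity.CARD,
}

SENSITIVE = {Sensitivity.PERSONAL, Sensitivity.CARD}


def sensitive_stores(inventory):
    """Stores that need tighter access control and backup handling."""
    return sorted(name for name, level in inventory.items() if level in SENSITIVE)


def unclassified(all_store_names, inventory):
    """Stores that exist but that nobody has classified yet."""
    return sorted(set(all_store_names) - set(inventory))


print(sensitive_stores(DATASTORES))
print(unclassified(["payments", "legacy-reports"], DATASTORES))
```

The second function is arguably the more valuable one: it surfaces the stores where nobody has yet answered the question of what data lives there.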

How Do You Evaluate New Services?

There's a question in the cost pillar that 71% of teams are not scoring well on, which is basically about keeping up with the new products that Amazon are launching. A lot of teams are reviewing and implementing new services in an unplanned way, which is to say, someone discovers one day that there is a service version of this thing that used to be running on EC2 instances and moves to it. This is quite an Amazon-serving question. I think that they would like you to be, on a regular cadence, looking at the new things they've released and deciding whether you can use them or not based on your workload. Most teams don't have a real practice for this, which is fair because Amazon are releasing hundreds and hundreds of things every quarter, and it is difficult to keep up with. Most teams are not really keeping an eye on that landscape. A lot of teams can save money or save long-term engineering operational time by adopting a service version of something that they were running themselves. Historically, if you were running Elasticsearch clusters, you'd be burning a lot of time and energy on that. You can now use an Elasticsearch service from the provider, for example.

How Do You Test Resilience?

Then the fifth and final question I'm going to talk about in the bad section: 67% of teams have high risk in questions around resilience testing. That's, basically, pre-thinking how things might fail, and building plans for what you would do in that event, which relates very closely to some of the other things that we've already talked about. Seventy-three percent of teams are conducting root cause analysis in the event that things do go down, which means that they are at least learning from the things that go wrong. There's very little upfront planning for those incidents, which is not great. If the first time you try this stuff is in an actual production outage, then you're going to do less well at it than if you've rehearsed some scenarios as well. Those are the good and the bad.

How Do You Reduce Defects, Ease Remediation, and Improve Flow into Production?

I've also picked out a few questions where I thought that the results were not necessarily either good or bad, but they were notable for some other reason. Normally, the percentages down the sides are skewed in an interesting way. This was interesting to me. Question 3 in operations, this is about, are you using continuous integration, continuous delivery, basically? How do you reduce defects, ease remediation, and improve flow into production? It seems to be the case, today in 2020, that most teams are using version control for stuff. The 10% that aren't, I worry about, but most of you are sensible people using Git, or some version of Perforce, or any of those things to manage your source control. Eighty-seven percent of teams are claiming to test and validate their changes, running automated testing in their CI. There's nothing in this framework to ask about coverage so it might well be that these are very basic smoke tests, or a smattering of unit tests. Most teams have some testing involved in their delivery lifecycle, which is good.

Seventy-eight percent of people are using config management systems. Great. That's good to see, finally. Most are using build and deploy systems, so presumably Jenkins, CircleCI, those sorts of tools. Only 37% of teams are doing anything that looks like patch management, which is a security thing. People are not updating software as new versions become available from vendors or from upstream, which is a bit of a problem. Fifty-seven percent of people are sharing design standards, which means that within teams, they have a mechanism to ensure that there's a level of consistency across what they're delivering. You would score a tick in that box if you're, for example, using Terraform across a number of team members to deploy the same cloud infrastructure within your environment.

I think the thing that really stands out to me here is making frequent small, reversible changes, at least 63% of people. Although we seem to have adopted the CI/CD tooling, we're still delivering large batch changes rather than incremental smaller changes that can be backed out of. It's good to see that in the data because, anecdotally, I felt that was true. It's also notable that only 52% of people are fully automating the deployment of those changes. We're not really living continuous delivery. There might be continuous integration going on, but half of the teams probably have a manual gate or some manual testing involved. It goes to show, I think, that continuous delivery practices, trunk based development, those things are not really very evenly distributed across teams. I think we could all stand to do a bit better in that arena.

How Do You Understand The Health Of Your Workload?

Question 6 in the operations pillar: 46% of people got a Well-Architected tick on this question, which is, how do you understand the health of your workload? This is a question about monitoring and telemetry. In this review framework, we differentiate between application monitoring and telemetry, so user behavior monitoring and real user monitoring versus workload behavior. What resources am I consuming? The thing that I find notable about this particular question is that a lot of teams are collecting data. Not very many teams are doing anything with it. There's a lot of, we know what our KPIs are. Fifty-five percent of us know what our KPIs are. We've defined what metrics we are interested in looking at. Only half of the teams have established baselines, an understanding of what normal looks like. A similar number of people are learning what expected patterns of activity look like. Then the numbers drop from there in terms of people putting alerts on those monitoring items. Thirty-seven percent of teams are doing anything around going back to the KPIs they originally set out and learning whether or not the metrics they're measuring meet those objectives. We can stand to do better at monitoring. This has been a theme forever. It's been interesting to me. We established the consultancy in 2009, which is when DevOps as a word was first being thrown around. It's interesting to me that some of the things that we were talking about back then in terms of technology outcomes, so CI/CD, config management, monitoring, have made it to fairly common usage. Things like monitoring are still pretty poor across the board.
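The step from collecting data to doing something with it can be quite small. As a minimal sketch, assuming a metric whose recent samples are a fair picture of normal, a baseline and an alert condition might look like the following. The three-standard-deviation threshold and the function names are illustrative choices, not something the framework prescribes.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarise what normal looks like for a metric."""
    return mean(samples), stdev(samples)


def is_anomalous(value, samples, threshold=3.0):
    """Flag a value that sits more than `threshold` standard
    deviations away from the baseline mean."""
    mu, sigma = baseline(samples)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > threshold * sigma


# Hypothetical request latencies in milliseconds from a quiet period.
normal_latency_ms = [100, 102, 98, 101, 99]

print(is_anomalous(101, normal_latency_ms))
print(is_anomalous(150, normal_latency_ms))
```

A real workload would need seasonality and a longer history taken into account, but even a crude baseline like this gives an alert something to compare against.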

How Do You Control Human Access?

Security question number 2: 47% of teams have high risk on this one, which is about human access control. What do you allow your team members to get at in your environment? Most teams claim to understand what their requirements for access are. Only 58% of people are granting least privileges, which means that 58% of the people are probably sat with a root equivalent account on your cloud platform that they don't need, day-to-day, for the most part. Thankfully, most teams are using unique identities on a per-person basis. We're not sharing root passwords around anymore, which is great. Only 70% of people are bearing lifecycle in mind when they're managing credentials. If you fire somebody from your organization, do you have a good mechanism for revoking their credentials? You might want to achieve that through granting access through roles or federation, so using single sign-on tooling. There's room for improvement there. Automation of credential management is more or less nonexistent.
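For a sense of what granting least privilege can look like in practice, the sketch below builds an IAM policy document that allows read-only access to a single S3 bucket rather than blanket administrative rights. The bucket name and the helper function are hypothetical; the JSON structure follows the standard IAM policy format.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document allowing read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reading objects applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
        ],
    }


# "example-reports-bucket" is a made-up name for illustration.
print(json.dumps(read_only_bucket_policy("example-reports-bucket"), indent=2))
```

Someone who only needs to read reports would be attached to a policy like this, with broader rights reserved for a role they assume when they genuinely need it.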

How Do You Control Programmatic Access?

Another security one that I found notable: 57% of workloads scored high risk on this. Eighty-nine percent scored high risk or needs improvement. This is about programmatic access, which is what you're doing with API keys. For the most part, if you're using a cloud platform, as a piece of running compute, you will have access to some service that allows you to present tokens to other APIs and demonstrate that you are that service. In the Amazon world, it's instance profiles. There's a local instance metadata service, so that if I do something with the Amazon SDK and I don't specify any API credentials myself, it will pick them up from the instance. The instance can be given privileges, or that role can be given privileges over certain types of operation against the API. Very few teams are doing this. That's the automated credential management part of this. Dynamic authentication is the other side of that. This is really worth looking at. I think, anecdotally, we see a lot of teams who are still managing API keys and secrets, baking them into config for their applications rather than using Secrets Manager or HashiCorp Vault, those sorts of things. This is an area that could be improved as well. I find that fairly notable.
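A cheap first check for keys baked into config is to scan for strings shaped like AWS access key IDs. The sketch below is a heuristic only: it assumes key IDs are 20 characters long and start with AKIA (long-term keys) or ASIA (temporary keys), and the sample config is invented, using the placeholder key ID that appears in AWS's own documentation.

```python
import re

# Assumption: access key IDs are "AKIA" or "ASIA" followed by
# 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_access_key_ids(text: str):
    """Return anything in the text that looks like an AWS access key ID."""
    return ACCESS_KEY_ID.findall(text)


# Hypothetical application config with a key baked in.
sample_config = """
region = eu-west-2
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
log_level = info
"""

print(find_access_key_ids(sample_config))
```

A check like this could run before commits or in CI. It only catches the key ID, not the secret itself, so it complements rather than replaces moving to instance profiles or a secrets store.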

Teams Are Ok At Choosing the Right Services

There are some major themes here. The thing that this data tells me is that, for the most part, teams are ok at choosing the right services. You're choosing compute options that suit the type of compute that you're doing. That might be traditional EC2 compute, or it might be you know that your use case is a bit more bursty or done on a much more SAS basis, in which case Lambda and serverless computing is a good thing to be looking at. Most of you are choosing good database options. If your data is relational, you're choosing relational datastores. If it's very document based, you might be looking at DynamoDB. Generally, as a rule, teams are making good choices about which parts they use. They're good at the architecture piece, the design piece. The place where teams get compute choices a little bit wrong is usually around sizing. There's a plethora of different instance types, instance classes, and instance sizes. Often, we see that compute is overprovisioned, or maybe you've selected, in the Amazon world, a T3 instance class, which is geared up towards bursty workloads, but actually, you're doing quite a lot of consistent computation. There are times when that instance is throttled as far as compute goes, so maybe M5 is a better option in that case. In general, that's the area that could use some improvement: have I chosen the right class of compute nodes? Have I matched that to my requirements? In general, teams are good at choosing the components that they're building with.

Teams Are Ok At Making Software Changes

Most teams, seemingly, from this data, are ok at making software changes. You're using automation tools, even though you're not necessarily automating everything. Full continuous delivery seems to be a little out of reach for some people. Often, anecdotally, it's been my experience that that's a fear or risk thing. Teams are unwilling to do full CD because there's still a lot of manual testing required in their pipeline, or they're still releasing large batch size changes. It's risky to allow those things to flow straight into production. Change batch sizes are, as a rule, I think, too big for most teams. I think those two are related. If you can make your batch sizes of change smaller, then you'll probably be able to adopt full CD more readily. This is from the Accelerate State of DevOps Report, 2019. If you haven't read "Accelerate" as a book, that's Gene Kim, Jez Humble, Nicole Forsgren. Nicole's done all the work on that one. High performing teams are typified by the fact that they can deploy on-demand, basically, multiple times a day. They can do so with a short lead time. They can restore service quickly in the event of an outage, so that mean time to recovery stat is important. And they make fewer changes that cause failure. A lot of the stuff that I think I've seen in these stats points at most teams probably being medium to low performers in this landscape. You can download this doc from Google. There's a lot of other good stuff in there. I like to go to these stats as good organizational health metrics as far as DevOps practices are concerned.
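The four measures mentioned here (deployment frequency, lead time, time to restore and change failure rate) can be computed from fairly basic deployment records. The following is a rough sketch under stated assumptions: the `Deployment` record and its fields are hypothetical, lead time is taken as commit to deploy, and time to restore is measured from the failing deploy to recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None  # set when a failure was fixed


def summarize(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four delivery metrics over a reporting period."""
    if not deployments:
        raise ValueError("no deployments in the period")
    n = len(deployments)
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.caused_failure]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": n / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / n,
        "change_failure_rate": len(failures) / n,
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


# Two made-up deployments over a two-day period.
history = [
    Deployment(datetime(2020, 3, 2, 9), datetime(2020, 3, 2, 11)),
    Deployment(datetime(2020, 3, 3, 9), datetime(2020, 3, 3, 13),
               caused_failure=True, restored_at=datetime(2020, 3, 3, 14)),
]
print(summarize(history, period_days=2))
```

Tracking these over time gives a team a way to see whether smaller batch sizes are actually translating into shorter lead times and fewer failed changes.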

Teams Are Bad At Thinking about Failure Modes

Teams, generally, are bad at thinking about failure modes. That's what this data tells me. There's a lack of consideration of business requirements. How long can you be down for? What have you agreed with your customers? What's in your contracts? Do you know that? We've certainly had conversations with a number of CTOs who have either not contracted anything specifically with their clients, or have basically made ludicrous promises about availability that don't really match their architectural designs. Walking teams through a better understanding of what they're committing to is a good thing to be doing. There is almost no risk analysis of failure modes, really. We're not thinking about, how could this go wrong? I have some hypotheses about why that might be. When we were building this stuff on-premises, there was more stuff to get wrong. We were responsible for more things. We were maybe a little bit more thoughtful about failure modes, clustering, failover. I think there's a bit of an assumption that by putting this in the cloud, you've made this somebody else's problem. While that's true for some aspects of your infrastructure, that's not true across the board. Thinking about failure modes is something that you need to be doing within the context of your app and your business requirements.

For the most part, teams aren't documenting anything. They're certainly not writing good documentation to follow in the event of an incident of some sort. If you're not doing incident planning ahead of time, then you are cognitively impaired during an incident and you don't have anything to crib from, so you're making it up as you go. If you can rehearse this stuff in advance, you are more likely to succeed and make good decisions during an incident. Thinking through failure modes, how you will identify them, and how you would resolve them is something that pays off. It doesn't take an awful lot of time, really. You've just got to make the opportunity to do it. There's very little attempt to rehearse those outages as well.

This is an example of something that we do with teams from time to time. We worked with a team last year, they'd built an application, they were about to advertise on television. They were expecting a large amount of traffic. They were not convinced their platform was going to be able to withstand it. We did some performance tuning work, but we also did this, which was basically a risk register of the platform. What we've done here is enumerate all of the components in the environment. For each of those components, think about the things that could go wrong with it. In this case, there was a lot of Lambda in here. We've got manual deployment error, somebody deploying the wrong stuff is a risk. Cold start delay, which is a concern for Lambdas within VPC environments in the event of a scaleout. Lambda concurrency limit reached, so we're now executing too much compute, we can't handle any more requests. For each of those risks, we've identified how likely they are, what the impact is likely to be. That's just a low, medium, high activity. We've thought about how we would identify it. What we could monitor in order to know that that thing had happened. We've also highlighted a mitigation. If we can solve that problem ahead of time, either by building some automated remediation, or taking that problem away entirely by reconfiguring something, or something of that nature, that's what goes there. Then a runbook action. What we would do in the event of that thing in order to get things back up and running. That column is the least fleshed out. That informed a larger set of documentation for how we'd follow those instructions. In the final column is a list of Trello cards that relate to work that we would do to mitigate those risks.
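The columns described above map naturally onto a small data structure. What follows is a minimal sketch of such a risk register, not the actual spreadsheet from the talk: the three risks are the ones named above, but the likelihood and impact ratings, the detection and runbook text, and the idea of ranking by likelihood multiplied by impact are all assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Low/medium/high mapped to numbers so risks can be ranked (an assumed scheme).
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Risk:
    component: str
    description: str
    likelihood: str            # "low", "medium" or "high"
    impact: str                # "low", "medium" or "high"
    detection: str = ""        # what we would monitor to know it happened
    mitigation: str = ""       # what we can do ahead of time
    runbook_action: str = ""   # what we do when it happens
    cards: List[str] = field(default_factory=list)  # tracking card references

    def score(self) -> int:
        """Simple ranking value: likelihood multiplied by impact."""
        return LEVELS[self.likelihood] * LEVELS[self.impact]


def prioritise(risks: List[Risk]) -> List[Risk]:
    """Return risks with the highest score first."""
    return sorted(risks, key=Risk.score, reverse=True)


# Ratings and free-text fields below are placeholders, not data from the talk.
register = [
    Risk("Lambda", "Manual deployment error", "low", "high"),
    Risk("Lambda", "Cold start delay in VPC during scale-out", "medium", "medium"),
    Risk(
        "Lambda",
        "Concurrency limit reached",
        "high",
        "high",
        detection="Alarm on throttled invocations",
        runbook_action="Request a limit increase; shed non-critical traffic",
    ),
]

for risk in prioritise(register):
    print(risk.score(), risk.component, risk.description)
```

A spreadsheet does the same job; the point is that every risk carries its own detection, mitigation, and runbook entry.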

That actual risk planning exercise was probably just two days' work. That was a couple of our people and some of the application developers. Answering questions like, if the database goes away, what happens to your application? Does it reconnect? Does it reconnect properly? Is it re-resolving the DNS name? Those sorts of things. In some cases, we'd come up with questions that needed answering. Those would go on a backlog as well. Answer the question, if the database goes away, how does the application behave? If the answer is I don't know, then going away and doing some work to get to that understanding is a good thing to do.
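To make the reconnect question concrete, here is a minimal sketch of a connection helper that looks the hostname up again on every attempt, so that a failover to a new address is picked up rather than retrying a stale one. It uses only the Python standard library at the TCP level. The function names, retry counts, and delays are assumptions for illustration, and a real application would normally rely on its database driver's own reconnect handling.

```python
import socket
import time
from typing import Any, Callable, Optional


def resolve(host: str, port: int) -> str:
    """Look the name up afresh and return the first address found."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]


def connect_with_retry(
    host: str,
    port: int,
    attempts: int = 5,
    base_delay: float = 0.5,
    resolver: Callable[[str, int], str] = resolve,
    connector: Callable[..., Any] = socket.create_connection,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Try to connect, re-resolving DNS before each attempt and backing off
    exponentially between failures."""
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            address = resolver(host, port)  # never reuse a cached address
            return connector((address, port), timeout=3.0)
        except OSError as exc:  # covers DNS failures and refused connections
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

The resolver, connector, and sleep function are parameters so that the behaviour can be rehearsed in a test without a real database, for example by simulating a primary that disappears and comes back on a different address.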

Teams Are Bad At Monitoring For Failure Modes

As well as being bad at doing anything about failure modes, and thinking about failure modes, we're bad at monitoring for them. We're collecting data in some cases, but not really doing things with it. I think that's missing an opportunity, really. I think this is related to the question about whether you're thinking about failure. Unless you're thinking about failure modes, you don't know what to monitor for. Most of the monitoring that's going on is probably just high-level resource consumption metrics that never get looked at until something starts going wrong. Interestingly, almost no tracing is going on. This is like request tracing through a stack. In the AWS world, that's X-Ray, maybe it's also Honeycomb, if you're a Honeycomb user. There's not a lot of that going on. Even in teams where you're building microservices, or building distributed Lambda-based platforms, there's not a lot of that happening. That's just flying blind. You don't know what the hell is going on in your platform if you are not tracing. The more you decompose your architecture into smaller pieces, the more that's important.

Teams Need To Do Better At Security

Teams need to do better at security in general. I think there's still poor hygiene around patching, which is pretty inexcusable in 2020. There's not a lot of data classification, which is something that I think we need to get better at as an industry. Human access control isn't great. Most people have more privileges than they need. Most software components that need to access services are also not being very well access controlled. There's not a lot of adoption of security monitoring tools, which goes back to thinking about failure modes as a way of driving and understanding what you need to monitor. One such failure mode might be there's been a massive security breach. How do you know that that's happened?

Top Breach Causes

From the OWASP Top 10 breaches, I think this is 2017 data. I got this from a Snyk blog post. The top causes of security breaches in 2017 versus the OWASP list were using components with known vulnerabilities. That's not patching your application. It's not patching your infrastructure. People are doing badly at that. Please start patching things. Security misconfigurations. Accidentally making your S3 bucket full of private data available to the world, or bad security group rules. Injection is the third. You can mitigate that with a web application firewall, to some extent. Weak auth and session management, and missing function access control. All that stuff is potentially dangerous to your application. Really, most teams are not eating their vegetables. They're not patching their stuff. One good way of looking at this from an architecture perspective is actually if I'm currently running my workload on regular compute instances, so EC2 instances, maybe if I were to migrate to Fargate, or a containerization platform, I remove the need to do a lot of patching because I no longer have responsibility for the operating system that my stuff runs on. All I have to worry about then is the app. You're reducing your give a shit surface to things that actually concern your application and your business case.

I think the primary takeaway here is that everyone is better at building platforms than they are at securing or running them. I think this is the real lesson. There's a very high chance that this applies to you and your teams.

What Next?

Where you can go from here, if you're an Amazon user, go and read the white papers for Well-Architected. The white papers are a really good, dense way of consuming information about AWS. There are a number of product white papers that are pretty interesting as well. If you're studying for any of the AWS exams, the white papers are the things you really want to be spending your time on. Run your own reviews. That Well-Architected review tool exists in your console today. It's not in all the regions, but it's probably in the regions that you're in. You can go and start working through those questions, following the links to more information. Understanding what your own practice is like. Consider engaging a partner for this as well. The third party perspective is probably one of the things that's been the most beneficial to the people who we've run these reviews with. There are also funding programs available if you're an Amazon customer and you use an Amazon partner to do your Well-Architected review. There is $5,000 of credit funding available to help fix some of the things that are found. There's benefit financially as well as knowledge-wise to engaging an external party to do these things. We're an Amazon partner. We can do these things. There are other partners as well. If you are currently working with an AWS partner, ask that partner if they're Well-Architected review capable, if they haven't offered you that already.

Being Cloud Agnostic

Participant: My company is moving a large project into the cloud at the moment. One of the mandates is to be cloud agnostic. We're in GCP and Amazon. One of the things that has come down from the higher-ups is that we should try, wherever possible, to not use a cloud-specific service. Can you speak to that at all? If you want to be able to jump between GCP and Amazon, if you want to effectively use them as two different regions, what can you say about that?

Topper: What I can say about cloud agnosticism is that it's bullshit. Basically, you can't do anything in any cloud without coming up against something that is cloud specific. Anything that you do in AWS, the security model is controlled by IAM. IAM is an Amazon specific service. As soon as you do anything there, you've touched something Amazon specific. When I'm having these broader conversations, I think the question needs to be, why? Why is cloud agnosticism that important? Is it economic? Is it you want to use one vendor to beat the other up, that thing? I'm an Amazon partner, I know most about that ecosystem. Amazon's prices drop every year. No prices are going up. No one's locking you into that ecosystem. If it's an economic argument, I don't personally think that holds any weight.

If you're building something that you want to be cloud agnostic, the lowest common denominator you can get to is there's some compute, there's some storage, maybe there's a managed database service, Kubernetes, probably. As soon as you start consuming anything beyond those services, you look at it and go, "We want to use Elasticsearch. We're going to build in both clouds, our own Elasticsearch on our compute services." Now we're running our own Elasticsearch platform. We've got to patch it. You're not going to. We've already proven that. You're going to spend a lot of time and manpower on running all those additional things in order to protect yourself from this weird, unlikely future world where you have to negotiate price between two cloud vendors. I don't think it makes sense. I wouldn't go at a cloud agnostic strategy for that reason.

I think the place where you can do that is if you're in media delivery, for example. Let's say you're a television company, I've done a bunch of work with one such company, and you are delivering video. You can deliver your video assets into multiple CDNs, multiple cloud platforms, and deliver from there and spread the load dynamically, consistently over time. Content pricing and content delivery pricing is something that is more negotiable, generally, across vendors than other things. The idea that you'd want to make everything entirely cloud agnostic sounds to me like a big waste of engineering hours that could be spent on application and stuff that your users care about. It isn't that, I promise you that.




Recorded at:

Jul 22, 2020