

How to Build a Successful Cloud Capability on a Heavy Regulated Organization


Summary

Ana Sirvent discusses their cloud capability journey, highlighting lessons learned and best practices on culture, processes and technology.

Bio

Ana Sirvent is the AWS Practice Lead and a Principal DevOps Engineer at KPMG UK. She has more than 14 years of experience leading, developing, and delivering full enterprise projects from discovery, through design and implementation, to production, lately focusing on cloud-native solutions using serverless and microservices architectures.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sirvent: I would like to start the presentation by going through an example of a business on a typical cloud adoption journey. Usually what happens is that a first team starts using the cloud, typically because there is a proof of concept or an innovation project, a greenfield project that wants to take advantage of cloud features. Once that first team starts using the cloud, the news spreads about all the cool features and possibilities that the cloud offers, and more teams get onboarded onto the cloud. With the onboarding of new teams come the problems of varying skills across different people. You will obviously have different people in your organization, and different levels of skill across those teams. That means that while one team may be very good at, say, securing their infrastructure, other teams may not have the skills to properly configure and run that infrastructure. The risk of being non-compliant in security increases.

After you have this number of teams onboarded and using the cloud, there comes what I call the unsustainability trigger. There is a point in time where the organization realizes that it cannot manage the cloud estate the way it is doing it right now, because it's not sustainable. This trigger can be more or less dramatic, but it always happens. If you are running dozens, hundreds, or even thousands of workloads in the cloud, you cannot rely on a manual way of knowing what you are running. Once this unsustainability trigger happens, the first step is to urgently fix any obvious security risks in your cloud infrastructure that could lead to a security breach. Once that is fixed, you start to realize that you need to gather the data and have a centralized way of knowing what is going on across the whole cloud estate. You do some gap analysis on what data you have and what data you need, and build a centralized view so you can see it all through a single pane of glass.

After you have all the data, the question is: what do I know, and what is the governance around that? Usually, you establish the governance around security. You identify all the policies that you need as an organization and as a business. At this point, if you didn't already have a center of excellence, you create one, which basically promotes best practices and makes sure they're followed throughout the organization. Then you have ongoing compliance, where you have a security culture in place, and the security risk of being exploited and having breaches decreases a lot. All the clients that I've worked with during my career have gone through this cloud adoption journey. The unsustainability trigger could be more or less dramatic: it could be that you've actually been breached, that an actual security incident triggered all this. Or it could be that an engineer, a director, or someone inside your organization realized that some things needed to be fixed. Regardless of how dramatic it was, if you went through a cloud adoption journey, you almost certainly went through this kind of journey.

KPMG Cloud Capability Timeline

What I want to do is talk through the journey that we went through at KPMG, and how we built a successful cloud capability in a heavily regulated organization, as KPMG is. I'm Ana Sirvent, the AWS Practice Lead at KPMG UK. I would like to talk a little bit about our timeline in terms of cloud capability. You can see that we have the three major cloud providers there. We started with AWS, then Azure, and then GCP. In terms of milestones, 2020 was also a big one. Coming from KPMG and all the audit work, hiring was really focused on industry alignment. For example, we were looking for engineers with a background in the government or financial services industries, but that really didn't match what we needed. You want an engineer who wants to do things; it doesn't really matter whether they have a financial services background or not. We shifted to hiring aligned to technology instead, and it went really well. Now, in 2023, we are almost 300 cloud DevOps engineers, and we manage and run an annual cloud spend of more than $25 million per year across those three providers combined. We have a pretty big estate to manage there.

Working in a Regulated Environment

Obviously, working in a regulated environment can be challenging. The main challenge we have: if you work in a regulated environment, you know all about the Three Lines of Defense model for effective risk management and control, a model published in 2013. It defines three lines of defense for any organization. The business, who own and manage risk. Then risk management and compliance, who oversee risk, or specialize in compliance and risk management. Then internal audit, who provide independent assurance on risk management and control. Our main challenges in regulated environments were with the first and second lines, because it was a really difficult journey to convince them to move from a point-in-time check to more automated, ongoing compliance. There was a lot of resistance to change. Maybe because it is what their people are used to, and they don't want to change their ways. Also, it introduces complexity into their processes: it is no longer a piece of paper going round and a box being ticked. There's a lot of infrastructure and complexity in building all the automation to make sure that we have ongoing compliance rather than a point-in-time check. We also used the data to make the case that, as technology evolves so rapidly, a one-off check at the point a project is delivered is no longer valid in our world; ongoing compliance needed to be put in place. Internal auditors never had any problem, because once we explained all the automation to them, they really loved it.

Again, this doesn't only have a bad side; working in a regulated environment also has a good side. The first one, obviously: in this type of organization you have direct collaboration with regulatory experts. That means you can tap into inside knowledge from your company, from people who have a really good understanding of all these regulations. Then, you actually understand the intent behind the control. Controls are usually written in a very vague way, and it's not always obvious what they really mean. When you connect these people with the technology people, and the technology people realize what the intent behind a control is, it really speeds up the journey that I was talking about before.

Technology and Security Compliance

In terms of technology, I want to go a little bit deeper into how we put all the technology in place to achieve security compliance in our organization. Obviously, I'm going to be a little biased in terms of technology here. I'll mainly talk about AWS, because it's my field. This is an example of how we are currently gathering data across our AWS organization. The centralized account that you see here is fully serverless; this is a high-level design on AWS. It gathers all the data needed for security across our AWS organization. You can see that there are roles deployed in all the different accounts. Those roles are assumed by Lambda functions, the collectors, which gather all the data. Those collectors are time triggered: they're scheduled to run daily.

All the data is streamed into Amazon Kinesis Data Firehose, and then stored in an S3 bucket. The S3 bucket has triggers that compress the data into a more efficient form, the Parquet format. That data is then registered in AWS Glue databases and tables, which are consumed by Amazon Athena. What that allows us to do is, at any point in time, query through Amazon Athena and know the state of the different accounts in terms of security compliance. Then, finally, we have another Lambda function that runs several Athena queries and sends emails to the relevant people who own and manage our accounts. For vulnerabilities, we use Kenna Security: we send all the vulnerability data from our VMs, our EC2 instances, and our containers to Kenna through its APIs. That's mainly what we do regarding AWS and gathering the data. As you can see, it's all serverless, and it was deployed using the AWS Serverless Application Model, SAM. It's really well architected and very cost efficient.
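The collector pattern described here can be sketched in a few lines. This is an illustrative sketch, not KPMG's actual code: the role name, stream name, and record shape are invented. The only fixed constraint reflected here is Firehose's `PutRecordBatch` limit of 500 records per call, and the convention of newline-delimited JSON so Athena can read the stored objects line by line.

```python
# Sketch of a daily collector Lambda: assume a read-only audit role in a
# member account, gather findings, and ship them to Kinesis Data Firehose.
# Role/stream names below are hypothetical examples.
import json

MAX_FIREHOSE_BATCH = 500  # PutRecordBatch accepts at most 500 records per call

def to_firehose_records(findings):
    """Serialize findings as newline-delimited JSON (so Athena can parse the
    stored objects) and split them into PutRecordBatch-sized batches."""
    records = [{"Data": (json.dumps(f) + "\n").encode()} for f in findings]
    return [records[i:i + MAX_FIREHOSE_BATCH]
            for i in range(0, len(records), MAX_FIREHOSE_BATCH)]

def collect_account(account_id, stream_name, gather_findings):
    """Assume the audit role in one account, collect, and push to Firehose.
    `gather_findings` is whatever per-service collection logic you plug in."""
    import boto3  # imported here so the pure helper above has no dependencies
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/security-audit-readonly",
        RoleSessionName="daily-collector",
    )["Credentials"]
    session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    firehose = boto3.client("firehose")
    for batch in to_firehose_records(gather_findings(session)):
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
```

The pure `to_firehose_records` helper keeps the batching logic testable without any AWS credentials.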

In terms of where we are actually gathering data from: there are quite a few AWS services that we use, and this is a subset of them. Mainly, all the data regarding EC2 instances, for example from Systems Manager, or SSM, and all the patching data on the VMs; data regarding AMIs, different OSs, Elastic IP addresses, and load balancing. Also everything from Amazon Inspector regarding vulnerabilities, again on EC2 instances and also on containers in Amazon ECR, the Elastic Container Registry. We have a lot of containers running on ECS too, so we have a lot of data from the task definitions and services; data regarding IAM, and also CloudFront and hosted zones; and probably the most important source for security compliance, the AWS Config service.
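Once the collected data lands in Glue tables, a point-in-time compliance question becomes a SQL query against Athena. A hedged sketch of what that might look like follows; the table, column, and partition names are invented for the example, while the execution helper uses the standard boto3 Athena flow.

```python
# Sketch of querying the aggregated compliance data through Athena.
# Table/column names (config_compliance, rule_name, snapshot_date, ...)
# are hypothetical, not the real schema.
def non_compliant_query(table, rule, snapshot_date):
    """Build an Athena query for resources failing one Config rule
    on a given daily snapshot (assumed partitioned by date)."""
    return (
        f"SELECT account_id, resource_id FROM {table} "
        f"WHERE rule_name = '{rule}' "
        f"AND compliance = 'NON_COMPLIANT' "
        f"AND snapshot_date = '{snapshot_date}'"
    )

def run_query(query, database, output_s3):
    """Kick off the query; Athena writes results to the given S3 location."""
    import boto3  # local import keeps the query builder dependency-free
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```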

Processes - What Good Enough Looks Like

Once we have all the data, what do we do with it? You usually have to have different processes in place, and we do; we created some of them once we started to gather all the data. What you need to do here is make sure that you understand what good enough looks like for you. You can never be 100% secure. You will always have zero-day vulnerabilities. You basically need to manage and balance security risk against cost effectiveness. That's something you need to do in your cloud capability. This, for example, is what good enough looks like for us on AWS. You need to reach a given level of compliance on patching, vulnerabilities, and Config rules to meet the security standards we need to comply with. If you go below those levels of compliance, you need to go to a weekly security call, where you will be asked why you are below the level of compliance, and whether there's any kind of exception that your team needs for whatever reason. You go there and explain your reasons. Then, if the exception isn't signed off, the team needs to fix the non-compliant resources as soon as possible.
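The "good enough" gate described above can be sketched as a simple threshold check: compare each team's compliance percentages against the minimum levels, and flag anyone below for the weekly security call. The threshold values here are made up for the example; the transcript doesn't give KPMG's actual targets.

```python
# Illustrative sketch of the weekly-security-call trigger. Threshold
# values are invented for the example, not KPMG's real compliance levels.
THRESHOLDS = {
    "patching": 95.0,         # % of instances patched within SLA
    "vulnerabilities": 90.0,  # % of findings remediated within SLA
    "config_rules": 98.0,     # % of Config rule evaluations compliant
}

def teams_for_security_call(team_metrics, thresholds=THRESHOLDS):
    """Return {team: [failing metric names]} for every team that falls
    below any minimum threshold, i.e. who must attend the weekly call."""
    flagged = {}
    for team, metrics in team_metrics.items():
        failing = [name for name, minimum in thresholds.items()
                   if metrics.get(name, 0.0) < minimum]
        if failing:
            flagged[team] = failing
    return flagged
```

Teams that clear every threshold simply don't appear in the result, which mirrors the process: only the teams below the line have to explain themselves.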

In terms of Config rules, we have a set of rules that need to be 100% compliant; those are called KPMG Required, and they are a must. Examples of these: we don't allow any S3 buckets or RDS database instances to be public, and we make sure that all data at rest in databases and other services is encrypted. Those rules are currently in the KPMG Required subset; they always need to be in compliance. We then have other rules, called Best Practices, which teams should be following, but where you are able to not be 100% compliant. For every new rule, we first put it in KPMG Beta for a couple of weeks, and then place it wherever it belongs, Required or Best Practices. That's the current process we have for making sure all the teams in our cloud capability are aware of the level of compliance they need to meet, and of the processes.
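The three-tier rule process could be modeled like this. The rule names below echo the examples given (public S3 buckets, unencrypted data at rest), but the exact names and the evaluation shape are hypothetical.

```python
# Sketch of the rule-tier model: "required" rules must be 100% compliant,
# "best-practice" rules are advisory, and new rules sit in "beta" for a
# trial period before promotion. Rule names are illustrative.
RULES = {
    "s3-bucket-not-public": "required",
    "rds-instance-not-public": "required",
    "data-at-rest-encrypted": "required",
    "ebs-volume-encrypted": "best-practice",
    "new-experimental-check": "beta",  # trialled for ~2 weeks, then promoted
}

def hard_violations(rule_results, rules=RULES):
    """rule_results maps rule name -> compliant fraction (0.0 to 1.0).
    Only required rules below 100% count as hard failures; best-practice
    and beta rules are reported elsewhere but never block."""
    return [name for name, fraction in rule_results.items()
            if rules.get(name) == "required" and fraction < 1.0]
```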

Should I Use Security Auto-Remediations?

Because we're talking about a lot of AWS Config rules, engineers, and automation: obviously there are a lot of new features coming to the cloud, some of them about auto-remediating security issues. My recommendation on whether you should use security auto-remediations or not is based on these three points. You need to know your team's maturity and security culture. If you have a team or organization that is mature enough, and the security culture is so embedded that when you tell them, you need to fix this, they go and promptly do the fix, then maybe you don't need them so much. If, on the other hand, you have an organization where teams are informed that they have these security issues and then basically nothing happens, you may need to put some auto-remediations in place if you want to avoid security breaches. The technology landscape is also important: what type of architecture do you have? That's very linked to the third point, because some types of architecture, like containers, or VMs, anything that can go down, spin up again, and end up in an endless loop, need to be treated with caution. If you have a trigger for the auto-remediation that is Config based, you attach it to a container, and you have a pod or a service that is constantly failing and reprovisioning, you can end up with a really big bill. Just be aware of the type of triggers for the remediation, your technology estate, and how mature your team is around it.
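One way to defend against the endless-loop scenario just described is a rate-limit guard in front of the auto-remediation: if the same resource keeps triggering within a short window (the signature of a failing container being endlessly reprovisioned), stop remediating and escalate to a human. This is a generic sketch of that idea, not a feature of AWS Config; the limit and window are illustrative.

```python
# Loop guard for auto-remediation: allow a remediation only if the same
# resource hasn't already triggered too many times in a recent window.
# max_triggers/window values are illustrative defaults.
import time
from collections import defaultdict, deque

class RemediationGuard:
    def __init__(self, max_triggers=3, window_seconds=3600, clock=time.time):
        self.max_triggers = max_triggers
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self.history = defaultdict(deque)  # resource_id -> trigger timestamps

    def should_auto_remediate(self, resource_id):
        """True if remediation may proceed; False means this looks like a
        crash/reprovision loop and should be escalated to a human."""
        now = self.clock()
        events = self.history[resource_id]
        while events and now - events[0] > self.window:
            events.popleft()  # drop triggers outside the window
        if len(events) >= self.max_triggers:
            return False  # looping: alert instead of remediating again
        events.append(now)
        return True
```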

Culture

As we're talking about culture and team maturity, let's talk about culture a little bit. I want to highlight a quote from the Accelerate State of DevOps report from 2022. The Accelerate State of DevOps report has been run for the past 8 years, with more than 33,000 professionals taking part in it. It is the largest and longest running research of its kind. Last year, it focused especially on software supply chain security practices, because that's a big topic in the cloud: it seems there's a skills problem around how to properly secure things in the cloud. They focused on two initiatives, Supply Chain Levels for Software Artifacts, or SLSA, and the NIST Secure Software Development Framework, or SSDF. The main factor they found that allowed companies to follow security practices was not technology at all; it was cultural. Again, if we're talking about DevOps, culture is always the most important and most challenging thing to implement. In this case, it was the same regarding good security practices.

Culture in KPMG

In terms of culture, and this needs to be the most important area in your organization, I just want to highlight our culture at KPMG. These are the four mantras that all our engineers live and breathe. The first one is: use the right tool for the right job. We empower the teams to be creative and use whatever they think is best to solve their problems. We don't prescribe any set of tools for them to use. Second: build it and run it. If you build it, then you run it. We don't have a support team as such. If a team builds a project or a workload, they also run it in production. This makes the teams responsible for their own code. One of the main benefits is that it reduces operational toil and technical debt. It also drives better client outcomes, because when a team builds and runs all its workloads, they don't want to be called at 3 a.m. because there is a security breach in one of their products. They make sure that everything is well built and well maintained. Very interlocked with this build-it-and-run-it mentality is continuous improvement: because teams own the whole software lifecycle, they're responsible for the security, reliability, performance, and operational excellence of the project. We tend to put everything through a Well-Architected lens. They're continuously improving the workloads that we have in our production environments.

The last mantra is about automation: automate as much as you can. For our engineers, if a job is going to be done more than twice, they think about how to automate it and save time in the future. Those are our values in terms of engineering culture. I want to highlight an example here, because it's not only how you build the culture, it's how you maintain it. This is the AWS repository for the Config rules that we have. I just want to highlight the number of contributors: you can see there are around 30 people who actively contribute to this repository. It's not only the responsibility of the DevSecOps team or the security team; people across the different teams are making security their own, including new joiners and experienced hires. One thing that we did that was really successful: we encouraged all the new junior people joining the teams to contribute to this repo, and to build and code new Config rules. That made them aware of the process, how things are built, and how security compliance works. That is really important too.

The last one is about paved paths. Obviously, we are not prescribing anything per se, but one thing we do is put some paths out there for our engineers so they don't have to reinvent the wheel in everything they do. I think we engineers are lazy by nature. If we have the option of an AMI or some code that we know is security compliant, then we're going to use it. Why build something from scratch? These are examples of what we have in terms of paved paths. We have curated AMIs in AWS, for example, with the latest security patches, software, and configuration, ready to use in workloads that are more VM heavy. We also have CI/CD pipelines and workflows available for the teams to use. For example, we heavily use Terraform for infrastructure as code, and we have a default Terraform CI/CD pipeline, with a number of steps, for any team that wants it: OIDC authentication for the pipeline, checkov for security, an Infracost step that tells them how much this infrastructure change is going to cost, and then the Terraform plan and apply. All that is built for you, so you don't need to work out the best practices for the pipelines yourself. We also have centralized Terraform modules: shared libraries for the engineers, ready to consume in case they need them.
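The default Terraform pipeline steps just listed can be sketched as an ordered step list with a sanity check that the security scan and cost estimate always run before anything is applied. This is a schematic in Python (the example language used throughout), not the real pipeline definition; step names are illustrative labels.

```python
# Schematic of the default Terraform pipeline described above, plus a
# check that scanning/cost steps always precede apply. Step names are
# illustrative, not the actual pipeline configuration.
DEFAULT_TERRAFORM_PIPELINE = [
    "oidc-authenticate",   # short-lived cloud credentials via OIDC
    "terraform-init",
    "checkov-scan",        # static security analysis of the Terraform code
    "infracost-estimate",  # predicted cost delta of this change
    "terraform-plan",
    "terraform-apply",     # typically gated on approval of the plan
]

def validate_pipeline(steps):
    """Reject pipelines that would apply infrastructure before the
    security scan and cost estimate have run."""
    apply_at = steps.index("terraform-apply")
    return all(steps.index(s) < apply_at
               for s in ("checkov-scan", "infracost-estimate"))
```

Encoding the ordering as a checkable invariant is one way a platform team can keep the paved path paved as teams customize their pipelines.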

Lastly, our Open Developer Platform is a product for developers: a self-service template for software engineers. The product abstracts away the whole infrastructure layer, so the software engineers don't need to think about how things are provisioned. They just focus on what they do best, which is writing code and building features. For example, if a .NET developer wants to create a new .NET application, they just go there and select the .NET service. Then, under the hood, a new GitHub repository is created with the pipelines in place and all the infrastructure deployed and ready to consume, so the software engineers can focus on what they want to do, and do best. That's what we have out there.

Summary

Summing up the main points on how to build a successful cloud capability in a regulated environment, these are the takeaways I want to highlight. First, if you are in a regulated environment, then you most likely have internal regulatory knowledge; tap into it and take advantage of it, because it's going to be a real accelerator for the journey. Secondly, you need to have all the data gathered and aggregated, with a centralized view of it, so you are able to query it however you need. You need to be able to see all the relevant data at the current point in time, centralized through a single pane of glass. Next, you need to establish what good enough looks like. If there are no processes around it, you may need to create new processes to support what your good-enough security compliance is. Next is culture, probably the most important thing in your organization. Culture eats strategy for breakfast. Make sure that you have a good culture, and that you actually follow it across your whole organization. Also, and very related to culture, make security everyone's job. Make sure that all the new talent you bring into the organization follows that mantra too. Security cannot depend only on your DevSecOps or CloudSecOps team; it needs to be everyone's, because otherwise it's not maintainable. Finally, use paved paths as much as possible. If there is an easy route to security compliance, engineers will use it. Make sure you have that type of option available for them.

 


 

Recorded at:

Nov 10, 2023
