
Getting Developers into F1 Driver Seats with Security?


Summary

Henry Tze discusses a platform to develop on, automation, a perimeter to safeguard the best assets, and a user-centric container foundation.

Bio

Henry Tze is a Lead Cloud Security Engineer at Virgin Media O2. He focuses on building a user-focused security paved road at scale so that developers, engineers, and builders can maximize value creation at pace in AWS and GCP.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Tze: My name is Henry Tze. I am a Lead Cloud Security Engineer at Virgin Media O2. I'm here to talk about getting developers into F1 driver's seats with security. Why an F1 driver's seat? F1 drivers are usually associated with high performance. We want all of our developers to be high performing, efficient, and productive. Let me start by asking, how does an F1 team keep the driver driving as fast as possible on the track? As you can see from the top right corner, there are about 20 cars in a race at one time, which increases the number of unexpected events. It forces the team to work very closely together in real time to win one race. Bear in mind, winning a championship requires consistency across 22 races.

The F1 Team

Moving on to F1 team winning principle, what does it take for an F1 team to be successful? They need to build a good car. A strong team that can really work together, having a top driver, also continuous improvement. Driver is only one out of four elements in this formula. Look at the diagram I created. The driver might be very good at driving but might not be expert in other area. Testing and software development, maybe not. At the end of the day, F1 team is hiring a driver because they can drive fast. F1 team require a lot of people with different skill sets to have the opportunity to win. Talking about people behind the scenes, I show you the unsung heroes. Let me show you three of my favorite unsung heroes in an F1 team. First of all, mechanical design engineer, the people who build the car. Without a car, there's no race. A typical F1 car is made up of 15,000 individual components. Crafting and putting everything together requires a specialist skill and years of experience. An F1 car must be built to a very high safety standard to keep the driver safe. Why is that? The average F1 training costs for a driver can be as much as $10 million. This doesn't guarantee you can train everyone to be a good driver, so keeping them safe, you still got a chance to recoup some of the money you spent on the training.

Next one up, race engineers. They're the link between the team and the driver. The single point of contact for the driver whenever he's in the car. They're like a second set of eyes and ears for the driver. For example, if there's a car crash in front of the driver, the race engineer will be informing the driver to be extra cautious, so they don't crash into the car in front maybe in the next corner. That communication is very important and it can make or break the race. Finally, the most famous, pitstop crew. Take a look on how fast they can change all 4 tires and refueling within 10 seconds. If you look closely, there are about 18 people there and working in harmony. It just fascinates me every time I see them working just like that, perfection. Any second spent at the pit stop are bad for the race and the crew know what they're doing. Talking about the whole F1 team, an F1 team consists of more than 20 key roles. I've shown you three, but there are plenty more roles there to build a whole F1 team. Each F1 team can have a maximum of 4 drivers per season, it requires 1000 people to keep the driver in the race. There's a strong motivation and incentive behind the scenes. The market value of F1 currently is worth about $15 billion. That is why every year there are at least 10 F1 teams competing for the championship.

Digital Transformation for Telco

Digital transformation for telco. There's also a lot of money to be made if a telco is successful in digital transformation, especially if the telco already has a large customer base as a competitive advantage. There are four ways that a telco can transform itself into a Techco. Achieving these four factors requires a lot of new products to be created. New products require a lot of people to build them. Those people must have many different skill sets, which results in a wave of hiring new blood and contractors, and skilling up existing engineers and developers. The longer it takes to onboard a developer or engineer to the point of being productive, the more money is wasted. A common problem for any telco, and any enterprise-size organization, is that it is usually branded as having limited agility. Everything usually moves very slowly in a large company, including the many processes that need to be followed, and many people need to get involved to get one thing done. They usually have very high operational costs. Every task that requires more than one team adds another cost center to charge. Also, because there are many teams, service level agreements need to set out how long it takes to fulfill one ticket. Last week, I raised a ticket internally for changing a firewall rule in the on-prem data center. When I looked at the SLA, I said, really? It takes about 5 days to change a firewall rule on-prem, but the SLA ranges from 3 days to 15 days depending on the request. It is so hard to work with. Also, each team is very much siloed. Usually, each team does one job very well, but they only do one job. To put that in a project context, it takes many teams to fulfill one single project at the moment.

Let me tell you some of my lifetime story. A couple years ago, I was trying to request a couple of the virtual machine on-prem, it took me seven months to get it provisioned. Why is that? I tried to map out the whole ticket backward and forward process, they got a really good ticketing system, so I can track every single person, every single team touches on a ticket. There's at least four teams involved to get me two virtual machines. There will be the security team who manages security and the architecture side, the Unix team, the network team, and also finally the network architecture team because it's on-premise on network. Also, they request so many backward and forward question and answer, like ping-pong. That's seven months to provision two virtual machines. The second example, I think one-and-a-half years ago, I needed to go live with one of the frontend applications. For a project to go live, it requires seven different teams involved to get my product live. I remember when I first engaged, the first team is called GSOC. I need to fill in a very long form and also including all the architecture diagrams and the whole shebang documentation before I can start my project. I engaged them. They have a weekly call with another seven teams that they can pick out which are the teams that need to get involved for my project. Finally, after one or two weeks I got an email back from GSOC, they're telling me to engage all seven teams. I was quite shocked at that time. Before I need to actually get my hands on to create my frontend application, I need to go through seven teams to actually get there. It was shocking.

Finally, I think this is something that happened years ago, like I need to increase my group's permissions in AWS, so I need to make a ticket to change one of the permissions for the groups. The first thing I need to get signed off from my manager is this action of changing permission requires, I think it's £3000 to update the permissions. When I think of that compared to what we're doing nowadays, it's supposed to be, if I need a permission, I go find the relevant repo, I make the change on the code, I submit a merge request, and get someone to approve. That should be under minutes. The job should be under minutes, and it shouldn't cost about £3000. That's a typical problem on a large enterprise company.

The Five Ideals Underpinning Effective Technology Teams

Moving on to "The Unicorn Project." I'm not a book person, but I actually read "The Unicorn Project" after so many recommendations from my colleagues, and especially my manager. It really describes how legacy organizations, if they don't go through digital transformation successfully, will become the next Blockbuster and soon disappear. This book helps you change your mindset, and it teases out what technology teams really need to be effective. Here, I've translated the five ideals from "The Unicorn Project" into my five elements of platform success. First of all, all five elements are in the form of everything as code. At the organization level, everything that we create in the platform, or in fact everything, is code. It is purposely visible to all the relevant users. This is to encourage users to learn and contribute back to the internal community. Just like the example I gave before, if users want to change permissions, they should be able to create a feature request themselves: go learn it, go read the code, make the code change, submit the merge request, and finally get approval. This should be a job of just minutes. Everything as code is a common language between developers, engineers, data engineers, and security. If they can learn how to code, they can make changes to any of our organization's resources. Of course, we've got change approval in place. The possibilities are endless once you start using everything as code.

Five Key Elements of Platform Success

Let me talk about my five key elements of platform success. First, we need to provide a solid platform for development teams to develop on. The second, power at your fingertips. It is focus, flow, and joy. If you provide the power to the development team as a controlled format, they will have better focus to do their job better. The flow is nicer because they control their own destiny. Finally, getting the fun out of the power. Everyone loves power to a degree. Third, peace of mind automations. This is strongly pushing or improving the daily work. If we create enough automations that remove the boredom, or the tedious task that development teams need to face every single day, their daily work is happier. They will produce better work. At the end of the day, the real customer will benefit from this. The fourth is related to psychological safety. We work very hard to provide them a perimeter to safeguard their data and infrastructure. If they don't feel safe on where they're publishing their code, deployments, infrastructure, they won't really trust the platform. Finally, customer focus because we are moving on to a container-first strategy nowadays, and they need a big container, so we provide them a more user-centric container foundation for them to build their container on, make their life easier.

1. A Solid Platform to Develop On (Locality and Simplicity)

The first key element of platform success is providing a solid platform to develop on, with locality and simplicity. For simplicity, you don't want to provide too many tools, which confuses developers, especially someone who has just joined your company. If you give them 10 or 20 tools, they will get confused very quickly. Just provide the right number of tools to do the job, and provide a clear vision of how to use them and why. I have selected five main tools to create the GEVGO platform. GEVGO stands for GCP, env0, Vault, GitLab, and Okta. Okta is used for authentication, SSO and MFA. GitLab is for source control and complete CI/CD. For self-service we use env0. Secret management is Vault. Finally, our primary cloud at the moment is GCP. For locality, all of these are SaaS tools. All the tools must be easily accessible to our developers, with SSO. You don't want them to keep putting in their password all the time, so we're going passwordless as well. A self-service capability allows developers to build at any time they want, no restriction. Of course, there's a security boundary, but there's no real restriction on developing their application within it. The end goal for a solid platform is to ship code into cloud environments safely with no downtime for our customers.

On top of this, to encourage the adoption of the tools, I've created the D.O.P platform to deliver all this to the developer. D.O.P stands for Department Onboarding Platform. This platform ensures that developers can manage their own Okta groups and users; they don't need to submit tickets to update group membership, which is tedious. They get the ability to control which user is in which group, and then grant them the relevant permissions to do their job. D.O.P also provides a GitLab group to store all the code for the development team's product. You also need to store secrets for your application: API keys, whatever keys, they're all secrets. We keep them in a place in Vault, isolated from everyone else, specific to their product. env0 is used to create and manage GCP projects on the fly, so they can modify and self-service their GCP project to their own specifications. Finally, with all of this in place, they can deploy to GCP.

2. Power at your Fingertips (Improvement of Daily Work)

Every day, there are many repetitive tasks which bore developers to death, literally, especially for the ones that require a lot of steps to complete, and some of them could be very much manual. The idea of a vending machine is to improve the daily work for our developer. This is trying to simplify the most tedious tasks from 100 steps to 1 step. Everything is abstracted away at the code level, so anything we've required twice, we can write once and use many. Now I can have a shelf of a vending machine item, I can just click, I get it. Click one I get another kind of cookie. Click, I get a bag of crisps, a bag of sweets, they're all doable. In our context, we'll be creating a GCP project, creating container repos that store the images. There are pipeline modules, what we call Lego blocks for the development team to use to embed to their pipeline. The Lego blocks can be considered something very simple, just versioning system, or different type of security scan that we want them to scan, SAST, DAST, SCA, IaC scan, and secret detection, or some of the pipeline modules could be just deployed in Terraform into GCP. They're all individually bundled. They're all individual and the development team can just lift and they can use it. Some of the development team or user is not really competent with pipeline writing, then we're giving them bundled pipeline template which is a purpose-built template, either building Docker container, scanning it, checking it, and push to a central repository, all in one single pipeline fashion. They don't need to pick out any of the pipeline module they want. Everything we want is embedded to the pipeline. That's one of the examples for the bundled pipeline template.

We also give them a lot of management capability. I want them to manage the GCP project themselves, including the IAM permissions they grant to their users, everything as code, so I give them management capability for that. A product usually consists of many different types of repos, such as infrastructure, application, utility pipelines, and the images for the container. We give them full management capability over each repo's permissions, which they grant to their own users: who can push, who can merge, and who can control each environment's deployment. Moving on, we have the common infrastructure blueprints, for development teams that are not really strong at developing infrastructure. That's usually an infrastructure engineer's or DevOps engineer's work, but we created some common infrastructure blueprints so they can get started as soon as possible. Usually, the infrastructure blueprint includes the pipeline infrastructure. All that's left for the development team to do is plug in the application code. The seventh point is the unlimited scaling of GitLab Runner. We use spot instances a lot in our GKE cluster, which provides the GitLab Runner capability. It is cheaper to run: because it's a spot instance, it's about 80% cheaper than the normal price of an on-demand instance. We provide them GitLab Runner to run any of their pipelines anytime they want, no restriction. Finally, lots of documentation. I also record a lot of video tutorials for certain functions, because people are different: they may like watching videos, following documentation, or just asking for help, but we want them to self-serve as much as possible. Hence, I'm writing a lot of documentation, and especially a lot of troubleshooting guides. For every single error that I expect them to see during the development cycle, they will get documentation to resolve it.

One of the principles, whenever I create a new vending machine item, is that we must maximize what a user gets from their input. If they put in one thing, the vending machine item should give them more than just one thing for that slight investment. If I invest a pound, I want to get £10 back. That's not the case every time, but for creating a GCP project, shown on this screen, I've actually done it. Whenever a development team creates a GCP project, we give them a security baseline for how the GCP project is set up. If they want to connect back to on-prem from their resources, or across departments, we give them network connectivity: they can pair into our shared VPC easily with just a toggle. For sandbox environments, we lock it down every hour. I've vowed that every single sandbox environment should have only 30 days to live. After that, everything is automatically destroyed, so we don't waste any money. Cloud wastage is a big problem. Next, keyless deployment. I don't want our development teams creating access keys, secret keys, or service account keys in GCP and storing them on their machines or pipelines, or even worse, putting them in source control. I've bundled this ability with Vault. Every time they want to deploy to GCP, a dynamic credential is generated that is only available during the pipeline phase, not anywhere else, so the key is never handled by a human.
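The 30-day sandbox lifetime described above can be sketched as a small expiry check. This is only an illustration: the talk does not show how the cleanup is implemented, and the `SandboxProject` type, its fields, and the `expired_sandboxes` function are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The 30-day lifetime comes from the talk; everything else is assumed.
SANDBOX_TTL = timedelta(days=30)


@dataclass(frozen=True)
class SandboxProject:
    project_id: str       # hypothetical identifier
    created_at: datetime  # timezone-aware creation time


def expired_sandboxes(projects, now=None):
    """Return the sandbox projects that are older than the TTL."""
    now = now or datetime.now(timezone.utc)
    return [p for p in projects if now - p.created_at > SANDBOX_TTL]
```

A scheduled job could pass the result of a check like this to whatever tooling actually destroys the projects, so that expiry is a policy in code rather than a manual review.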

Moving on to the control deployment IAMs. The development team can control whatever permission for them to deploy infrastructure application into GCP. All they need to learn is what role they provide to the deployment credential. Moving on to binary authorization. This is quite a nice feature that I like about GCP. I think other clouds have it as well, but they've done it very easily for the end user. When you enable binary authorization, you can tell this project can only consume images from a certain repo. You can lock it down to a central repo or their own product image repo, very easy to use. Managing API. Every single service you want to consume on your GCP project requires enabling the API. If you want to use compute instance or virtual machine, you need to enable compute API. If you want to use KMS key to encrypt your data, you need to enable a KMS API, which is [inaudible 00:25:33], but I built that into my automation so they can choose to enable which service they want, or API they want to run their workload. Finally, the project label is very important. During the project creation process, we're actually creating it, department is already filled in for them, the cost center is already filled in for them, they need to declare who is operating it, and which team is operating it. All of this information is being put to the Terraform template, and finally, render the project label. This is super good when we're doing auditing and also cost optimization and insight. This is what I want to get out of every single vending machine, you put one in, you get many out.
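As a rough illustration of the label rendering step, the sketch below turns the four pieces of information mentioned above into a label dictionary. The talk says this happens in a Terraform template; the Python here only models the idea. The function names are hypothetical, and the character and length rules reflect GCP's label requirements as I understand them (lowercase letters, digits, underscores, and hyphens, up to 63 characters), so check the current documentation before relying on them.

```python
import re

MAX_LABEL_LENGTH = 63  # assumed GCP limit for label keys and values


def to_label_value(text: str) -> str:
    """Lowercase the text and replace unsupported characters with '-'."""
    cleaned = re.sub(r"[^a-z0-9_-]", "-", text.strip().lower())
    return cleaned[:MAX_LABEL_LENGTH]


def render_project_labels(department, cost_center, operator, team):
    """Build the label set for a new project.

    In the talk, department and cost center are prefilled and the
    requester declares the operator and the operating team.
    """
    return {
        "department": to_label_value(department),
        "cost_center": to_label_value(cost_center),
        "operator": to_label_value(operator),
        "team": to_label_value(team),
    }
```

Normalizing labels in one place keeps them consistent across projects, which is what makes later auditing and cost reporting by department or cost center practical.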

3. Peace of Mind Automations (Focus, Flow, and Joy)

Moving on to the focus, flow, and joy, the peace of mind automations. These are two examples I put on the screen, the security visibility and source control protections. We provide a lot of Lego blocks or pipeline modules to our development team, but it requires them to take that module, put that into the pipeline to enable it. We find that's very tedious, so we're trying to create a better way of working for them, so Scout is born. Scout is an automation that embeds all the baseline scanning tools, and they're automatically enabling on every single pipeline in GitLab. The development team don't even need to lift a finger to have those scans in place because chasing them to enable scan is so bad from a security team. The user experience is not great. That's why we created Scout to enable the scan for them. Also offer related training whenever they see certain CVEs, vulnerabilities, there's some training specific for those CVEs for them to go through and learn more.

Secondly, source control protections. There are many ways to protect a GitLab project. If you create a GitLab project from a blank template, only minimal security settings will be on. Of course, for us, there are certain security settings we want development teams to enable whenever they create a GitLab project. We created Oliver Pro and Oliver Pro Max. The difference is that Oliver Pro protects one GitLab project at a time with all the settings that we recommend, while Oliver Pro Max is for people who need to control multiple GitLab projects in a group, so it can apply at scale. Let me jump back to Scout. If we ask the development team to just include all the scans in their pipeline, they will not be happy, because every single scan takes time. We work very hard to change the way the scans work, making sure we're only scanning the things that need to be scanned. We also created a GitLab Runner optimized for the scans. We optimize the Kubernetes cluster itself. We're using GCP image streaming to ensure every single GitLab job comes up as fast as possible. We also do a lot of GitLab Runner configuration optimization, making sure we git clone only to a certain depth: not all the way to a git depth of 20, but only to the level that we want. We do a lot of optimizations to make sure the security scans finish as soon as possible.
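The talk does not say how Scout decides what to scan. One plausible way to scan only what is needed is to select scans from the files a change touches, as in the sketch below. The suffix mapping, the scan names, and the `select_scans` function are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to the scans that care about it.
SCANS_BY_SUFFIX = {
    ".py": {"sast", "sca"},
    ".js": {"sast", "sca"},
    ".tf": {"iac"},
}

# Assumed baseline: secret detection runs on every change.
ALWAYS_RUN = {"secret-detection"}


def select_scans(changed_files):
    """Return the sorted scan names relevant to the changed files."""
    scans = set(ALWAYS_RUN)
    for path in changed_files:
        p = PurePosixPath(path)
        if p.name == "Dockerfile":
            scans.add("container")
        scans |= SCANS_BY_SUFFIX.get(p.suffix, set())
    return sorted(scans)
```

With this approach, a change that only touches Terraform files would trigger the IaC scan and secret detection, and skip the application scans entirely.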

4. Perimeter to Safeguard Your Data (Psychological Safety)

Psychological safety. We need to make our developers feel safe about their applications and data in the cloud. The number one cause of data breaches is stolen credentials. We deployed VPC Service Controls last year, and it made all departments very happy and feeling safe, especially our data department. When you think of cloud, it's very accessible to people around the world, so attackers love it more than you think, as they can steal data from anywhere if you don't have the correct security controls. How VPC Service Controls works is that every single Google API is protected within the perimeter, within our GCP organization. If any credential has been stolen, and a hacker from North Korea or Russia or anywhere in the world tries to use the key or credential to access our GCP organization's resources, they will be blocked. The service control works on identity, IP address, and device. If the three of them don't match, you won't be able to access the GCP environment. This also lets our development teams carry on with their day-to-day work. The biggest challenge in rolling out VPC Service Controls is understanding how it works, and also doing everything as code. I built everything as code to understand it and break it down into very digestible chunks, and also got every single department involved during the process. Because at the end of the day, they need to work with the repo so they can make changes to the perimeter to allow the traffic they want into the GCP environment.
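The identity, IP address, and device check described above can be pictured with a small conceptual model. This is not how VPC Service Controls is configured or evaluated internally. It only illustrates the idea that a stolen credential alone is not enough. The `AccessRequest` and `PerimeterPolicy` types and the example addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AccessRequest:
    identity: str         # who is calling
    source_ip: str        # where the call comes from
    device_trusted: bool  # whether the device meets policy


@dataclass(frozen=True)
class PerimeterPolicy:
    allowed_identities: frozenset
    allowed_networks: tuple  # CIDR strings

    def allows(self, request: AccessRequest) -> bool:
        """Allow only when identity, source IP, and device all match."""
        ip_ok = any(
            ip_address(request.source_ip) in ip_network(cidr)
            for cidr in self.allowed_networks
        )
        return (
            request.identity in self.allowed_identities
            and ip_ok
            and request.device_trusted
        )
```

In this model, a request that presents a valid identity from an unknown network is still rejected, which is the property that makes stolen keys far less useful to an attacker.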

5. User-Centric Container Foundation (Customer Focus)

Moving on to the customer focus. One thing I've found is that if you want developers to do something different, you need to provide them with multiple solutions that make their life easier. We are going container first in our cloud strategy. It's essential to provide as much support as possible for developers to build their applications using containers. I've got three examples here. One is my PIUG automation. It stands for Public Images Utilization Guardian. I just call it PIUG. We got a lot of demand from development teams wanting to use images directly from Docker Hub, or GCR, or Azure Container Registry. We all know that public images are not safe and can be poisoned at any time. We could not cope with the demand, so we decided to create an automation where all you need to do is run this pipeline, put in the image you want to use, and it gets scanned instantly. If the image is all good, is clean, then it is pushed to a central repository. No security approval required. As soon as the image has been pushed to the central repository, they also have a self-service capability to grant whichever service account they choose permission to consume the image from it. Once we released this capability, it actually drove a lot of security awareness, because people can now take any Docker image, run it through, and the pipeline will show them all the vulnerability results. It helps them understand the dangers of using images that have something like 1000 critical vulnerabilities.
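The promote-or-reject step of a pipeline like PIUG can be sketched as a simple gate on the scan result. The talk only says that a clean image is pushed to the central repository. The `ScanResult` type, the severity fields, and the zero-tolerance thresholds below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    image: str     # e.g. "nginx:1.25" (hypothetical example)
    critical: int  # number of critical findings
    high: int      # number of high findings


def promotion_decision(result, max_critical=0, max_high=0):
    """Decide whether a scanned public image may be promoted."""
    if result.critical > max_critical or result.high > max_high:
        return (
            f"reject {result.image}: "
            f"{result.critical} critical, {result.high} high"
        )
    return f"promote {result.image} to central repository"
```

Returning the finding counts in the rejection message mirrors the awareness effect described above: the person who requested the image sees why it was refused.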

Next one, if the Docker image is no good to them, they need to build from baseline images. We have Paragon. One of my colleagues building all these images into 18 different flavors, and getting dependencies and all the packages each individual department needs, and we got 18 images so far in different flavors. All of them have zero vulnerabilities. It's just amazing that he can do that. That's his job. Or, you want to build from scratch, of course, we offered a way for that, is this prebuilt pipeline template. All you need to feed in is your Dockerfile, and also your application. You can use the base image from the PIUG or Paragon to build the images from scratch. Of course, those are the bundled templates that include all the security scans and all the permissions and everything that they need to push to either department GAR or the product GAR or the central GAR. It is quite a few different repositories they can push to.

Summary

Digital transformation requires a lot of people, getting them to be productive is not an easy task. With the five key elements of platform success that I showed before, which matches the five ideals from, "The Unicorn Project," they were able to start within days, transforming them into unicorn without them knowing. Suddenly, a developer can build infrastructure using CI/CD template with all the recommended practices from security. Also, DevOps engineers and data engineers, they are all very good at what they do, but when the culture needs to shift left, and the team is not expanding enough, like usually we do Two-Pizza team, eight to nine people in a team to do a project, we cannot pour enough people that have all the skill sets into it. They need to use the automations to become the unicorn that we want them to be.

Now I have a developer who can create infrastructure, a data engineer who can create a container image, DevOps engineer actually can start writing applications, not just secure applications, but as well based on the scan. I have developers already good at writing code, but they can write better code with the platform. We have achieved a lot during this digital transformation so far, and have the ability to onboard new people on-demand, and give them the ability to develop and transform organizations.

Key Takeaways

Create a platform that is secure by default with guardrails, and essential bells and whistles that improve developers' daily work. Remove handoffs by empowering individual teams with self-service and automations. Build everything in code. This really helped us to build a community of practice and encourage everyone to contribute. Everyone can make changes. Everyone can learn about every single aspect of our organization in the cloud. Create more self-service capability. This is an opportunity, your company can provide freedom within a controlled environment so the user won't go off the security paved road. If you don't provide enough empowering and self-service capability, they will always find a corner, a workaround to your problem, and eventually increase your security risk. Listen then implement. Create a product that has real demand. Now in my backlog, I got a lot of feature requests, people wanted extra self-service capabilities, customized way of pipelines, and so on. Infrastructure that they wanted to build, they wanted everything in there, so we listen all the time, and then implement to suit the real demand. I'm part of the security team and I really want to make security into a workforce second nature. This should come naturally. Everyone wants to build beautiful infrastructure, wonderful application, they should build everything with security in mind. I want this to be their second nature. Finally, make everything fun. My first value on my job is fun. If I don't really get my fun in my job, I won't have the passion and motivation to go further. Anyone can be James Hunt or Niki Lauda on a solid platform.

 


Recorded at:

Feb 14, 2024
