Building Trust & Confidence with Security Chaos Engineering

Summary

Aaron Rinehart shares his experience with security-focused chaos engineering, used to build trust and confidence by proactively identifying and navigating security unknowns.

Bio

Aaron Rinehart, while at UHG, released ChaoSlingr, one of the first open source software releases focused on using chaos engineering in cybersecurity to build more resilient systems. Rinehart recently founded a chaos engineering startup called Verica with Casey Rosenthal from Netflix and is the O’Reilly author on the topic as well as a frequent speaker in the space.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Rinehart: I'm going to talk about building trust and confidence with security chaos engineering. I'm going to talk about complexity in modern software. I'm going to talk about what chaos engineering is. I'm going to talk about the application of chaos engineering to cybersecurity. I'll cover some use cases it can be applied to, an example security chaos experiment, and a framework you can use in building your own experiments.

Background

My name is Aaron Rinehart. I am the CTO and co-founder at Verica. Before I was at Verica, I was the chief security architect for UnitedHealth Group, like a CTO for security for the company. I also have a background in safety and reliability with NASA. For most of my career I was a software engineer and builder, before I got into security. I'm a speaker in the space. I also wrote the security chaos engineering content for the main body of knowledge for Chaos Engineering for O'Reilly. Kelly Shortridge and I wrote the O'Reilly Report on Security Chaos Engineering. I was the first person to apply Netflix's chaos engineering to cybersecurity. I wrote a tool called ChaoSlingr.

What Are We Doing Wrong?

The crux of the problem is that no matter how many new things we keep doing, we don't seem to be getting much better at the problem. Breaches and outages seem to be happening more often. We should wonder why. One of the reasons I think this keeps happening is complexity. It's not for lack of trying. It's that our systems are really complex adaptive systems, which is a term from applied science. One characteristic is that the outcomes in complex adaptive systems are nonlinear rather than linear. That's due to the acting and reacting of different components throughout the system, causing a magnification effect, and also making the outcomes unpredictable for humans.

Furthermore, complex adaptive systems cannot be modeled by humans in our own brains. It's difficult for us to conceptually keep an idea of what is happening from a system understanding perspective, as the system evolves at a rapid speed. In the end, what has happened is our systems have evolved beyond our human ability to mentally model their behavior. How can we continuously change something that we can't really drive an adequate understanding of, especially when it comes to security? If we don't understand what the system is doing, how do we expect the security to be much different? Security is very context dependent, so if we don't know what it is we're trying to secure, we can't really deliver good security as a result.

Speed, Scale, And Complexity of Modern Software Is Challenging

What has happened in today's evolution of modern systems engineering is that we've increased the speed, scale, and complexity of what we're dealing with. We've never ever had speed, scale, and complexity at the rates we have them today. If you look at the image on the left here, you see what a lot of folks know as the Death Star diagram. Every dot is a microservice, and every line is a connection between them. Microservices are not independent; they're dependent upon each other. It's very easy for microservices to sprawl from hundreds to thousands, pretty quickly.

Where Does It Come From?

Where does all this complexity come from? There are a couple of different schools of thought. There's accidental complexity, and there's essential complexity. Essential complexity comes from things like Conway's Law, where organizations are destined to design computer systems that reflect the way they are as a business, or really the way they communicate as a business. It's hard to change that complexity without changing the nature of the business itself. The second area where complexity comes from is something called accidental complexity. Basically, that is the way in which we build software. There are some people who believe you can actually reduce the overall complexity by reducing the accidental complexity. Most people believe that, really, you're just moving the complexity around, because to take a complex system and make it simple, you have to change it. There's an inherent relationship between changing something and making it complex. What are some things we're doing today that make it complex? Look at all these things on the screen: you see cloud computing, service mesh, CI/CD, and circuit breakers. You see all these techniques that are helping us to deliver value to market in better and faster ways. They're also increasing the amount of complexity we have to deal with as humans, in terms of how the software operates at speed and scale.

Software Has Officially Taken Over

I'm not saying these things are bad. We should also recognize that software now has officially taken over. If you look at the right, you'll see the new OSI model: it is software. Software only ever increases in complexity, because software is unique in that what makes it valuable is the ability to change it. That also creates an inherent level of complexity in the nature of its construct. What we're really trying to do with chaos engineering is learn to proactively navigate the complexity, learn where it is, and give engineers better context, so they know where the boundaries of their systems are and can build better and more reliable systems.

How Did Our Legacy System Become so Stable?

For example, let's consider a legacy system. Whether you're a cloud native or digital native company, or a large enterprise that relies on a mainframe, everyone has some form of legacy system. A legacy system usually means it is business critical. One of the characteristics of a business critical legacy system is that it becomes known as somewhat stable. Engineers feel confident in how it works, and somewhat competent in their ability to run and operate it without having too many problems. What's interesting is the question: was the system always that way? Was that legacy system, that critical system that we rely on, that makes all the money for the company, always so stable, and were we really so competent with it? The answer is, no, it wasn't. We slowly learned, through a series of surprise events, the difference between what we thought the system was and what it was in reality. We learned through outages and incidents that the system didn't work exactly the way we thought it did. We discover, and we remediate the system. We continuously react and respond and fix, and improve the system over time. Through that process, we're learning about the system, ways to improve it, and how it operates.

Unfortunately, that's a very painful exercise, not only for the engineers involved, but also for establishing trust and confidence with customers, because they're also encountering your pain. You can think about chaos engineering as a proactive methodology for introducing the conditions under which you expect the system to operate, and asking, computer, do you still do the things you're supposed to do? If you think about things like retry logic, failover logic, circuit breakers, or even security controls that are supposed to prevent or detect under certain conditions, that logic almost never gets exercised until the problem itself manifests or occurs. With chaos engineering, we introduce the conditions that logic is supposed to fire upon, to ensure that it still does what it's supposed to do. What happens is that we design that logic at some point, early in the system, but the system itself has evolved through several cycles beyond that logic, and it may not work. That's why we need to proactively make sure the system is going to do what it needs to do under the conditions we've designed it for. That's what we're doing. We're being proactive. We're proactively identifying some of these issues in the system before they manifest into pain.
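
As a minimal sketch of exercising logic that normally stays dormant, the Python below injects a failing dependency and checks whether a circuit breaker opens as designed. The `CircuitBreaker` class, `flaky_dependency`, and `run_experiment` are hypothetical names written for this illustration; they are not taken from the talk or from any particular library.

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then fails fast."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, func):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result


def flaky_dependency():
    """Injected fault: stands in for a dependency that is down."""
    raise ConnectionError("injected failure")


def run_experiment(breaker, attempts=5):
    """Inject failures, then report whether the breaker opened as designed."""
    for _ in range(attempts):
        try:
            breaker.call(flaky_dependency)
        except (ConnectionError, RuntimeError):
            pass  # failures are expected; we only care about the breaker state
    return breaker.open
```

The hypothesis here is "after repeated failures the breaker opens". If `run_experiment` returns `False`, the protective logic has drifted from what the design assumed, and that was learned without a real outage.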

Systems Engineering is Messy

One of the reasons this is so prevalent is that we often forget as engineers that the process is pretty messy. In the beginning, as humans we have to simplify things so our brains can keep it all straight. In the beginning we love to think that there's this easy plan with the time and the resources. We've got the Docker images, the code. We have secrets management taken care of. We've got different environments for staging and production. We've got a plan there. We've got a nice 3D diagram of what it's going to look like. In reality, a system is never this simple or this clear. It's never as clear as the understanding that we've derived. We learn this because what happens after a few days or a few weeks is that there's an outage on the payments API, and we find a hard-coded token. Of course, you should go back and fix that. There's a DNS issue. It's always DNS. There's a DNS resolution issue, or somebody forgot to update the certificate on the site. These are things we slowly discover that we forgot to do. We're human.

What happens is, over time, our system slowly drifts into this state where we no longer recognize it, because no one has actually a full complete understanding. Usually, you're part of a team that is looking at a different lens of the system. You're looking at the security pieces of it. You're in the payments microservice. You're looking at the payment microservices code and delivering that. Let's say you're on the reporting microservice. One team and one group of individuals typically focus on that. What I'm seeing is everybody has a different lens of what the system is, but nobody has a complete picture. What happens is over time those gaps start to surface as problems and we wonder why. In the end, what you take away from this is our systems have become more complex and messy than we remember them.

Cybersecurity is Context Dependent

What does all this have to do with security? I'm getting to it. Cybersecurity is a context dependent discipline. As an engineer, I need the flexibility and convenience to change something. That's my job. My job is to deliver value to a customer through product, and that product is [inaudible 00:11:36] software. I need the convenience and flexibility to change something because I'm not sure what permissions and security I need just yet. I'm constantly changing the environment, changing what I need in order to do that. Security itself is context dependent. You've got to know what you're trying to secure, in order to know what needs to be secured about it. By the nature of how security works, we're forced into a state of understanding the context before we build the security. Meanwhile, the engineers are constantly changing things and delivering value to the market via product. We start to get this misalignment, or the need for recalibration. What happens is we don't know the security no longer works, until it no longer works. The problem is that if we can't detect that fast enough, an adversary can take advantage of that gap before we can actually fix it. With chaos engineering for security, what we're doing is proactively introducing the conditions we originally designed for, to ensure that the controls can still do the things they're supposed to. What's interesting is you'll find that it's not often that they actually do anymore. It's just that we're changing things so fast, and we're not recalibrating often enough.

Instrumenting Chaos Testing vs. Experimentation

Where does chaos engineering fit in terms of instrumentation? It's a very loose definition but I'm a huge believer in instrumentation, and data, and feedback loops when it comes to engineering. Testing is verification or validation of something we know to be true or false. We know we're looking for it before we go looking for it. In terms of security, it's a CVE. It's an attack pattern. It's a signature. We know what we're looking for, we go looking for those things. With experimentation, we're trying to proactively identify new information that we previously did not know. A greater understanding about contexts we're unaware of, by introducing exploratory type of experiment. That's where chaos engineering for security fits in.

How Do We Typically Discover When Our Security Measures Fail?

How do we typically discover when security measures fail? We typically discover there's some footstep in the sand, so some log event, some observability material. It could be a log event. It could be an alert. It could be that it can no longer phone home to get a manifest. It could be that it can't access a certain resource because a port's been blocked. We typically discover them through observable events. Usually, those observable events cause some security incident. The point I want to recognize here is that often security incidents are not effective measures of detection, because often we're not actually traditionally very good at detecting nefarious actors in our systems. We have to start using good instrumentation to identify the problem before it becomes a problem. Because when there's a security incident happening, it's already too late. It's not a very good learning environment for engineers to discover really the things that led up to it.

What Happens During a Security Incident?

Let's look at what happens during a security incident. In reality, people freak out. People are worried about the blame, name, shame game. They're worrying about, "No, I shouldn't have pushed that code. Somebody's going to blame me for this. I'm going to lose my job. This is all my fault." On top of that, people are really worrying about figuring out what went wrong. Within 15 minutes, there's some executive, or the CEO's on the phone saying, "Get that thing back up and running, we're losing money." This context is too much cognitive load for any human to work under. It's not a good way for engineers to learn. We don't do chaos engineering here. Chaos engineering is a proactive, not a reactive exercise. This is the reactive world that we typically live in today, which is chaos. We don't do chaos engineering here. We do chaos engineering here. There is no problem. We think the world is going our way. We're going to proactively verify that the system is what it was supposed to be. We do that by asking it questions in the form of experiments and hypotheses. In summary, we don't do chaos engineering here, we do it here. We do it here because it's about learning, about deriving better context from an engineering perspective.

What Is Chaos Engineering?

Chaos engineering, what is it? Chaos engineering in terms of Netflix's original definition, is this discipline of experimentation on distributed systems. We want to build confidence in the system's ability to withstand turbulent conditions. Another definition I use is, this idea of proactively introducing failure or faults into a system to try to determine the conditions by which it will fail, before it actually fails. It's about proactively building trust and confidence, not about creating chaos. We're deriving order from the chaos of the day to day reactive processes in our system. Currently, there are three publications on chaos engineering. There's the original O'Reilly Report from Netflix. There's this book that Casey Rosenthal and Nora Jones from Netflix wrote. I wrote one chapter in there. There's the report on Security Chaos Engineering.

Security Chaos Engineering

Security chaos engineering is not a whole lot different from chaos engineering. Really, it is chaos engineering as applied to security use cases, to proactively improve the way we build and deliver secure solutions. The point I want to make about chaos engineering and engineers, and I've always believed this, is that we don't believe in two things. We don't believe in hope, and we don't believe in luck. They're not effective strategies. We believe in good instrumentation, in sensors. We believe in feedback loops that inform us whether it worked or didn't. Ok, what do we need to fix? Hoping that our security works the way we think it does is not a very effective strategy. It works in Star Wars, but it doesn't work in what we do. What we're trying to do is proactively understand our system security gaps, before an adversary does.

Use Cases

Some use cases, these are not the end all be all of use cases for chaos engineering for security. These are the ones I liked. These are also documented in this O'Reilly Report. A great use case is incident response. A great place where most people start is proactively validating that security controls and measures and technologies work the way they're supposed to, by introducing the conditions that they're supposed to trigger upon. Another way is, because we're being proactive, we're introducing the signal proactively, we can actually monitor that signal. We can look at whether the technologies actually provide good observability data, because during an active incident, we're not looking at the log data, we're looking at the log quality. Because we're doing this proactively, we can say, that firewall or that configuration management error didn't give us enough context to derive the problem that we introduced. Had this been a real world problem, we would not really know where to go or what to do, we would have been scrambling for finding good context.

Because we're proactively introducing these conditions in the system, we know when it started. It's not a real failure. We can see how the event unfolds and how it performs, and derive better context. Prima Virani at Pinterest has a great use case in this book. You can read about how she started applying this at Pinterest. The last thing that's important, is all chaos experiments, availability, stability, or security based have compliance value. Basically, you're proving whether the technology or the security worked the way it was supposed to. You should keep that output in a high integrity way, by hashing it or whatever way you would like to achieve that, and overlaying it with some control framework. It could be PCI. It could be NIST. Don't lose the good audit data that you can derive there.
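
One way to keep experiment output "in a high integrity way" is sketched below, assuming the evidence is a small JSON-serializable record. The function names and the control labels are placeholders invented for this illustration; a real mapping would use identifiers from whichever framework (PCI, NIST, or another) applies.

```python
import hashlib
import json


def _digest(body):
    """SHA-256 over a canonical JSON encoding, so key order cannot change the hash."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal_record(record, controls):
    """Return a copy of the record with control mappings and a digest attached."""
    sealed = dict(record, controls=sorted(controls))
    sealed["sha256"] = _digest(sealed)
    return sealed


def verify_record(sealed):
    """Recompute the digest and compare it with the stored one."""
    body = {key: value for key, value in sealed.items() if key != "sha256"}
    return _digest(body) == sealed["sha256"]
```

Calling `seal_record({"experiment": "port-injection", "result": "not blocked"}, ["EXAMPLE-CONTROL-1"])` yields a record whose digest no longer verifies if any field is edited afterwards. A bare hash only detects accidental or careless changes; storing the digest somewhere the record's editors cannot write, or signing it, would be needed for stronger guarantees.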

Incident Response

The problem with incident response is the response. You're constantly responding and being reactive to an event. The problem is that security incidents are a subjective thing. At UnitedHealth Group, when I was directly over 1000 people in security, we spent a lot of money, like a billion dollars, on security controls. It wasn't for lack of trying. No matter how much we prepare, how much we do, or how many things we put in place, we still don't know what they're trying to do, why they're getting in, how they're going to get in, what they're trying to achieve, or where it's going to happen. We just don't know these things. We put all these things in place, hoping that when those things happen, we're ready. When you're being reactive, it's very hard to measure and manage how effective you are. When you're proactively inducing the signal into the system, you can start to see, did we have enough people on call? Were the runbooks correct? Did the security technologies or the load balancer, whatever, give us the right information to understand what was happening? We can proactively improve that context and understanding. We can also measure how long it took, and where we can improve. There's a lot of great opportunity with incident response, and sharpening that sword with chaos engineering for security.

ChaoSlingr - An Open Source Tool

One of the first tools ever written for security chaos engineering was ChaoSlingr. The tool is now deprecated on GitHub, since I'm no longer at UnitedHealth Group. If you go to the GitHub repo, though, you can still find a good framework for writing experiments. There are three major functions: Generator, Slinger, and Tracker. What Generator does is identify where you can actually run an experiment, based upon AWS security tags, a tag of opt in or opt out. A lot of chaos tools have this. For example, you may not want to introduce a misconfigured port on the edge of your internet in AWS. Then there is Slinger. Slinger actually makes the change, in this case Port Slinger. It opens or closes a port that wasn't already open or closed, to introduce that signal, that condition, to the environment to understand how good our security response is. The last thing is Tracker. What Tracker does is keep track of what happened in terms of the events with the tool. It reports that to Slack so we can respond and understand it in real time.
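
The three roles described above can be sketched as follows. This is not ChaoSlingr's code: it runs against an in-memory list of dictionaries instead of AWS, the `chaos_opt_in` tag name is an assumption, and the tracker appends to a list where the real tool reported to Slack.

```python
import random


def generator(groups, rng):
    """Pick one target that has opted in, or None if nothing has."""
    eligible = [g for g in groups if g["tags"].get("chaos_opt_in") == "true"]
    return rng.choice(eligible) if eligible else None


def slinger(group, port):
    """Flip the port state: open it if closed, close it if open."""
    if port in group["open_ports"]:
        group["open_ports"].remove(port)
        return "closed"
    group["open_ports"].add(port)
    return "opened"


def tracker(log, group, port, action):
    """Record what was changed so responders can correlate it later."""
    event = {"group": group["id"], "port": port, "action": action}
    log.append(event)
    return event


def run_once(groups, port, log, rng=None):
    """One experiment run: select a target, make the change, record it."""
    if rng is None:
        rng = random.Random()
    target = generator(groups, rng)
    if target is None:
        return None  # nothing opted in, so nothing is touched
    return tracker(log, target, port, slinger(target, port))
```

Keeping target selection separate from the change itself means the opt-in check cannot be bypassed by the code that modifies the environment, which is the safety property the tagging scheme is meant to provide.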

Example - Misconfigured Port Injection

Back to this example of Port Slinger in context. When we released ChaoSlingr at UnitedHealth Group, we needed a good example experiment that everyone could understand, whether you're a software engineer, a network engineer, an executive, or a security engineer. You need to understand what we're trying to do and the value we're trying to achieve. This was actually a really valuable experiment that we ran. For some odd reason, misconfigured ports seemed to happen all the time. We've been solving port problems for 20 years, but it still happens. I don't mean maliciously, I mean by accident. Accidents occur. It could be that somebody didn't understand flow, because network flow is not intuitive. It could be that somebody filled out a ticket wrong, and the change got applied incorrectly. It could be that an administrator entered the change wrong. Lots of different things. Our expectation at the time was that our firewalls would immediately detect and block that activity. It'd be a non-issue. We're proactive, so we started running this on all of our instances. We found that the firewall caught and blocked it only about 60% of the time, but we expected 100% of the time. That's the first thing we learned. What we learned was there was a configuration drift issue between our non-commercial software and our commercial software environments. There was no incident. There was no problem. We proactively discovered we had an issue here, and fixed it.
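
A rough sketch of how results like these could be compared against the hypothesis "every injected misconfiguration is blocked" appears below. The environment names and outcomes are made-up illustrative data, not figures from the experiment described in the talk.

```python
from collections import defaultdict


def block_rates(results):
    """Map each environment to the fraction of injections that were blocked."""
    totals = defaultdict(lambda: [0, 0])  # environment -> [blocked, runs]
    for environment, blocked in results:
        totals[environment][1] += 1
        if blocked:
            totals[environment][0] += 1
    return {env: blocked / runs for env, (blocked, runs) in totals.items()}


def failing_environments(results, expected_rate=1.0):
    """Environments where the observed block rate falls short of the hypothesis."""
    rates = block_rates(results)
    return sorted(env for env, rate in rates.items() if rate < expected_rate)


# Illustrative data only: (environment, was the injected change blocked?)
sample = [("env-a", True)] * 5 + [("env-b", True)] + [("env-b", False)] * 4
```

With this sample the overall block rate is 6 out of 10, but splitting by environment shows that one environment accounts for every miss, which is the kind of pattern that points toward configuration drift rather than a control that is uniformly weak.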

We were very new to AWS at the time, so we needed a way of understanding whether what we were doing was right or wrong. We figured out it wasn't as good as we thought. The second thing we learned was that the cloud native configuration management tool caught it and blocked it almost every time. That was like, this thing we're not even really paying for is performing better than the firewall. That's the second thing we learned. The third thing we learned concerned the SIEM, a Security Information and Event Management tool, like a central logging tool for security. We didn't really have one, so we built our own. Because we were new to the cloud, I wasn't confident that it was actually going to drive alerts and send those alerts to the SOC in a meaningful way. It actually happened. The log data was sent from the firewalls and from the configuration management tool in the AWS environment, and it correlated an alert. The alert went to the SOC. SOC stands for Security Operations Center. The SOC analyst got the alert, but they couldn't tell which AWS instance it came from. As an engineer, you can say, Aaron, you can map back the IP address to figure out where it came from. Yes, you could. If this were a real world incident, that could take 15, 20, or 30 minutes. If SNAT (source NAT) is in play, which intentionally hides the source address, it could be an hour.

When I was at UnitedHealth Group, it was very expensive every minute we were down. What was great is that there was no loss. There was no war room. We proactively understood that this wasn't working the way it was supposed to. They didn't have the right information to understand it. All we had to do was add metadata pointers to the log event. We fixed it. Nobody was worried about being fired here. Nobody was freaking out. There was no real problem. We proactively identified all these issues with the ability to fix them. That's what's great about this. This practice can really improve the way you build and deliver reliable systems, in a confident, trustworthy way.
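
The "metadata pointers" fix could look something like the sketch below, which attaches source identifiers when the log event is created so an analyst does not have to map an IP address back by hand. The field names (`meta_account`, `meta_instance`) and the helper functions are assumptions made for this illustration, not the format used in the experiment.

```python
import json
from datetime import datetime, timezone


def build_log_event(message, source_ip, metadata):
    """Serialize a log event that carries pointers back to its source."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "source_ip": source_ip,
    }
    # Prefix metadata keys so they cannot collide with the core fields.
    event.update({f"meta_{key}": value for key, value in metadata.items()})
    return json.dumps(event, sort_keys=True)


def missing_pointers(raw_event, required):
    """List the required metadata fields that are absent or empty in an event."""
    event = json.loads(raw_event)
    return [field for field in required if not event.get(field)]
```

A check such as `missing_pointers(raw, ["meta_account", "meta_instance"])` can itself be part of the experiment: after injecting the signal, assert that the resulting alert carries enough context to locate the source without a manual lookup.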

Here's the link to download the report on Security Chaos Engineering. It is verica.io/sce-book. On the right, you see my co-author, Kelly Shortridge. She is amazing. We're really proud of what came out of this. We're actually writing the main O'Reilly book right now on security chaos engineering.

Summary

We're trying to proactively validate our system security by navigating complexity. That's what security chaos engineering helps us do, and what chaos engineering does in general. You can use the practice of security chaos engineering to build confidence that the things you actually built, and your security, continue to work. You can start off by experimenting with these misconfigured port injections. We're trying to observe and understand how well the system works. Once the experiment succeeds, and your security does what it was supposed to do, then you can continue to run it like a regression test, and continuously verify that your security works the way you think it does.

We've got a great burst of information in the O'Reilly book. We're trying to be proactive, versus reactive. I know it sounds simple, but you'd be surprised what you can find, and you'll love the value you get from practicing this. Lastly, I'd like to leave you with some words from John Allspaw that really resonate with me, and with what we're trying to do. John Allspaw is one of the fathers of resilience engineering for software. It's really about this: stop looking for better answers, and start asking yourself better questions.

Questions and Answers

Westelius: The first thing that comes to mind is that in the past few years, we've seen more investments in security with shifting left, essentially security moving closer and earlier in the SDLC to reduce investment later. From your perspective, do you see this as another step in security shifting left, or is it engineering shifting right? Is it a tool for engineers to do security experimentation, or is it something for security to drive in an organization? What does the collaboration model with engineering look like?

Rinehart: Really, the problem we're addressing is post deployment. It's post-deployment complexity. It's not necessarily in the build. Doing chaos engineering, or chaos engineering for security, we're not changing. We're not saying you don't do smoke tests. You don't do unit tests. You still do all those things. You still do all your traditional scanning and fuzzing and things like that. What's happening is the world of post-deployment is getting super complex and large in scale. We have so many different groups of humans working in that pool. We're not all responsible for everything in that pool. We are collectively.

Let's say you have 10 microservice applications. Microservices are dependent upon each other and they're not independent. They depend upon functionality for the services. Immediately out of the gate, for each microservice, you got a team of humans. Rarely do you have one team doing multiple, but that could happen. Still, not every team of humans has an idea of all of the services, all the dependencies, and so we're all delivering at maybe the same schedule, maybe different schedules. That's all ending up in that post-deployment system of hundreds to thousands of microservices. Nobody has a good mental model of really what's going on out there. We're continually changing things like DevOps and even CI/CD. We're just changing things so rapidly. It's hard. If we don't have a good understanding of the system itself, how is it ever possible that the security is any different? It's moving context left by instrumenting right. What's been interesting is that people adopting chaos engineering for security, and also reading the books are mostly software engineers, and software security folks. I found that really interesting.

Westelius: I think it is super interesting, because as we experience more security issues in general at large, how all these systems become more complex, I think automation and scale is hugely important. I think this is a huge part of it. How do you foresee this? When does it stop being chaos? You said testing versus weird science. When does chaos engineering turn into regression testing or standardization? Where does that feedback loop fit into the process? When is it experimentation? When does it turn into something else?

Rinehart: There's the experimentation phase. Chaos engineering is not about chaos. There's too much BS on the internet about breaking things in production. Netflix never really said that. It's about fixing them in production. What we're doing is proactively verifying that the system, after all those changes, all those events, all those things, still does what it's supposed to. What we're doing in the beginning is introducing the conditions we expected. Let's say there's a security control to detect an issue, or to prevent some problem. We introduce the conditions that would trigger that. I started off with the Port Slinger example, which is ChaoSlingr's main example, opening or closing a port that wasn't already open or closed. I sent that signal to the system to see whether what happened was what we expected.

In the beginning, what we're doing is we're learning. We're doing it slowly. We're experimenting on the system to make sure it does what it's supposed to do. I have rarely seen a chaos experiment succeed the first time. We're almost always wrong about how we think the system works. That's not just the security, that's the system in general. Once you've validated that the system did exactly what it was supposed to do, it could be a cron job. Let's say you wrote some Bash, or you wrote a Python script; that's all you really need to do. You don't need a fancy tool to really do it. That's all chaos engineering is, just a bunch of Python and Lambdas. You can schedule it via cron, or run it from Spinnaker or a pipeline, periodically, and just verify that it's still doing what it's supposed to do. That's automation.
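
A small script of the kind described might be shaped like the sketch below. The `simulated_check` function is a stand-in that returns a fixed value; in practice it would inject the condition and observe how the control responded. All names here are hypothetical.

```python
def verify(check, expected):
    """Run the check once and compare the observation with the hypothesis."""
    observed = check()
    return observed == expected, observed


def main(check, expected="blocked"):
    """Return a process-style exit code: 0 if the hypothesis held, 1 otherwise."""
    ok, observed = verify(check, expected)
    print(f"expected={expected} observed={observed} ok={ok}")
    return 0 if ok else 1


def simulated_check():
    """Placeholder for 'inject the condition and see what the control did'."""
    return "blocked"
```

Ending the script with `sys.exit(main(simulated_check))` would give cron or a pipeline stage a non-zero exit status whenever the control stops behaving as expected, which is what turns a one-off experiment into a recurring regression check.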

Westelius: One of the questions was, how is this different from penetration testing by fuzzing inputs? Do you see, essentially, security chaos engineering as a framework in which you can do all of these different kinds of testing? What does that look like for you? Would you foresee this being a tool used by people who do penetration testing when companies don't invest in this locally?

Rinehart: As a former penetration tester myself, I fear putting that label on it. I actually have a whole section on this because I get this question a lot, especially how is it different from purple teaming? How is it different from red teaming? What is pen testing versus purple, versus red anymore, and adversarial testing, and breach and attack simulation? Where does it fit in all these AppSec acronyms? Really what we're doing is fault injection. We're injecting fault conditions, or you can think of them simply as conditions. We're being very targeted, very specific, from an engineering perspective.

We're also trying to account for distributed systems. In distributed systems, problems magnify quickly. The example I gave was a Node.js app where you're doing a Node.js Mass Assignment attack. In that kind of test, I'm making so many changes and sending so many signals to the system that it's difficult to sift out what worked and what didn't. That's the point of what we're doing: we're not trying to overwhelm the system. We're not trying to break in. That's not the goal. Don't mishear me either. There's value in purple teaming. There's value in red teaming, however you see the definitions. I subscribe to the view that red teaming was supposed to evolve into purple, but it didn't. With red, there was a wall of confusion, like Dev and Ops. There was blue and red. Then it was like, "We got in. You guys suck. Here's a PDF file." Purple was supposed to evolve into more of a cohesive, collaborative exercise, but it evolved into something else.

I don't know if pen testers will use it. I've talked to a lot of pen testers who see value in it, and can probably do it. I'd love to see where this goes. I am seeing a lot of software engineers pick this up. I see a lot of people building programs around this who are in red teaming or in the architecture function. A lot of chaos engineering and SRE are blurring with architecture in an interesting way, because you're trying to validate that those diagrams actually represent something in the system.

Westelius: I think to some extent, it's almost a representation of how software and systems engineering have developed, and we have infrastructure as code now. I think that the lines between all those different responsibilities have definitely blurred.

When do you think it is worthwhile making the investment into security chaos engineering as an organization? Is there a point where you shouldn't?

Rinehart: Here's how I see the world. I don't think you're going to have a choice in the future. We're asked to do so many things in security. It's almost an impossible task, if you really think about it. If you're going to spend the time and effort to launch a new commercial solution, or write a bunch of code to do a thing, to satisfy a requirement, don't you want to make sure it works? Don't you want to have some feedback loop so that, as the engineers are building and delivering product, your tool is still providing the value you promised them? That we're delivering on the value? What I'm saying is, now, yesterday, we need to be expanding this as a field. There are probably 20 or 30 different companies now experimenting with this. Just because these are the areas I started in doesn't mean you can't take it in a new direction. Not having these feedback loops post-deployment is going to kill us in security. The complexity is what's causing a lot of our problems, in my opinion.

Recorded at:

Sep 04, 2022

Aaron Rinehart
