
Quantifying Risk


Summary

Markus De Shon talks about the Netflix risk quantification that they introduced in their highest impact areas, and are gradually expanding across the enterprise. De Shon shares his experience and approach to defining appropriate loss scenarios.

Bio

Markus De Shon has worked in security since 2000 at SecureWorks, CERT, Google and Netflix, mostly on problems in Detection Engineering. He has a passion for developing a comprehensive framework to guide the engineering of detection and response systems, an effort that he has written about and continues to work on today.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

De Shon: To start things off, I would like to show you something that's really scary. Are you ready? There it is. It's a red box. I made this box red because I'm scared about something, and I would like to communicate to you that I'm scared about it. Hopefully, you're just as scared as I am, but how would you know? Do you know how scared I am? Are you more scared than I am, or not? What if I put another red box next to this one? Is that one scarier? Or is this one scarier, and by how much?

Imagine you're the CEO of my company, and I come to you and I show you a red box, and I say, "Give me $1 million, and I will buy tools and hire people to turn this red box into a yellow box." Are you willing to give me the money? Sure. Even if you don't give me the money, I'll do it for free.

There it is. That's a lot better, now you feel more relaxed. How much did I improve things, actually? I told you I'd turn it from red to yellow, but how much did I improve it? Did I cut the risk in half, did I cut it into one-fourth? You don't know. You just know that I turned the red to a yellow. Imagine you're the CEO again, but I come to you with a different proposal. I again say, "Give me $1 million. I'll buy tools or hire people. This time, what I'm going to do is take a risk that's costing us $10 million a year, and I'm going to turn it into a loss of $3 million a year." Now, you have something you can put your hands on. That seems like a pretty good investment for $1 million.

What does it mean to have a $10-million-a-year loss in the area of risk? That could mean that it's something that you're actually losing money all the time, and it's maybe a lot of small losses, and they add up to $10 million a year. Or it could mean that once every 10 years, we're going to suffer a $100-million loss. Either way, you can kind of annualize it into a yearly loss of $10 million, and I'm telling you, "By taking these actions, I'm going to reduce that to $3 million." Now you have a sense of the magnitude of the risk, you have a sense of how much I'm going to improve it, and you have a sense of the return on investment. "Do you agree that I've made the company $7 million a year by taking these actions?" Now it's something that I can actually make a good argument about.

My name is Markus De Shon, I work at Netflix in information security. My specific area is detection engineering, but one of the first things I did when I came in was say, "What should I work on?" I took some of the highest-risk areas, and I started to break things down and to quantify the risks. In the process of doing that, I learned a lot, and so, I'd like to share some of those lessons with you today.

The first book I read about quantifying risk was this one: "How to Measure Anything in Cybersecurity Risk." That's a pretty good one. It's part of the broader "How to Measure Anything" series. It's a good book for arguing why things need to be quantified, and for convincing other people why they need to be quantified. That aspect of it is really good, but it uses a simpler quantification than the other standard that I'm going to talk about. It's not really the methodology you should adopt, but it's a really good book for convincing yourself and other people.

Then, the second book is the one that lays out the FAIR methodology. "FAIR" stands for "factor analysis of information risk." It's a way of breaking down the risk into components and quantifying each piece, and then, rolling them up into an overall risk figure. I'm not going to go into extreme detail on this. I'm actually going to do the very top-level version of this, but I'm also going to be going into some things that FAIR doesn't really teach you about, which is how to start and how to identify the loss scenarios that you want to work on. I had the pleasure of seeing Jack Jones speak the other day. He's a very smart guy, and he has a lot of new ideas that he's working on. The whole FAIR Institute is an organization built around this. That is, to teach people about FAIR, and also, to evolve the standards. It's a good thing to take a look at, and also get involved with.

What Is Risk

What is risk? There's a lot of definitions of risk that have been floated over the years, but under the FAIR standard, there's a very specific definition that's an actual quantity. It's a product of two quantities. The first one is the frequency, which is, how often is a loss going to occur? This gives a prediction. You might be basing it on past data, if you have past data. Or you might have to just use your expert opinion of your security engineers. Either way, it's like, "How many times a year do I expect this to happen," and we'll get into the details of that.

Then, the other number is a magnitude of loss, expressed in money - US dollars, in our case. We'll go into a lot more detail on that. This is risk under FAIR, and I would argue that this is the definition that we should be using, and that we should quantify our risks.
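To make the arithmetic concrete, here is a minimal Python sketch of that definition. The figures are the illustrative ones from the example above - a $100 million loss expected roughly once every 10 years - not numbers from any real analysis.

```python
# Risk under FAIR: loss event frequency (events per year) times
# loss magnitude (dollars per event) gives an annualized loss.
def annualized_risk(frequency_per_year, magnitude_dollars):
    return frequency_per_year * magnitude_dollars

# A $100M loss expected about once every 10 years...
print(annualized_risk(1 / 10, 100_000_000))  # 10000000.0 -> $10M/year

# ...annualizes to the same figure as many small losses that add up
# to $10M each year.
print(annualized_risk(1000, 10_000))         # 10000000.0 -> $10M/year
```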

I mentioned that we're measuring a magnitude of a loss, but what is a loss? A loss is anything that actually impacts our critical data or our ability to operate, or somehow impacts our revenue, impacts our expenses. Whatever way it impacts us as an organization, it has an actual impact. There may be certain steps along an attack tree where there's no loss yet. The specifics of the attack tree may be very important for determining how often this may occur, because there may be steps that were more or less difficult for an attacker to take, and by making them harder, we reduce the frequency.

The loss has not actually occurred until they reach some kind of critical asset. When you're defining your loss scenarios, you need to really keep this in mind. As an example, if you think of a laptop getting infected with adware, this is not a loss. It's not a loss of the magnitude that I'm talking about. It's something that maybe is an operational annoyance, and there may be a small cost associated with cleaning up the laptop. Hopefully, you minimize that cost, but it's not going to be critical to the organization. Even a compromised laptop wouldn't necessarily be a loss, unless it leads to something else. The sooner you catch it, maybe the better it is, because you could prevent a loss from occurring.

However, if you think about an open S3 bucket that has a bunch of customer home addresses in it with their names, that's a loss. That was then made public, and anyone in the world could then access that data - that's a loss. That now becomes something where you'll not necessarily suffer a primary loss, which is something that directly impacts your business right away, but secondary losses. There might be regulatory fines, there might be lawsuits. That's a loss.

Risk Analysis

We want to be able to define certain scenarios of loss, but how do we get there? First, what we need to do is to understand, what are our critical assets in the organization?

A lot of times for my realm, information security, these are going to be data assets - we're collecting some kind of private data, or there's some business data associated with the company. They have different levels of worth, depending on what kind of impact occurs. Let's say we lose access to a particular piece of data, how does that affect our business? We may not be able to make recommendations to people, because we don't have their data. Other kinds of data may have a direct financial impact, if there's a confidentiality breach. First, though, we have to understand, what are the assets?

Then, we need to understand our architecture, because this often will determine how someone would traverse the infrastructure and get to the data. This is a more traditional kind of architecture that you would think of, like a systems architecture or a software architecture, but at a pretty high level, because you don't want to get too specific too quickly.

Then, we need to define what I call - or not just what I call, this comes from a different book - something called a control architecture. The idea here is that you have processes within your system, and then, you have controllers that affect those processes. I'll go into a lot more detail about that, but essentially, think of it as a hierarchy of components of your overall system, and the controllers are acting upon the processes that are lower down, and things cascade down to where your critical data is stored. By understanding this control architecture - which may include systems, it may include groups of people and other things like that - you have something you can use to think about, "Ok, how can this go wrong?"

Once you have the control architecture defined, then you can start to think about, "What are loss scenarios? What are situations where I would actually have an impact on the business?"

This is all very abstract. I think what we need is some kind of a concrete scenario, so that we can walk through it, and you can see what all the steps look like, and you'd be able to replicate this in your environment. I also don't want to make an example that is an actual computer system, because then, we'll get bogged down in details, there'll be lots of complications, everybody'll have opinions. I want to start with a scenario that I like to use. It's pedagogically useful, but unfortunately, it's based on a very popular cartoon character. So as not to infringe on anyone's intellectual property, we're not going to talk about that character. We're going to talk about his third cousin, twice removed.

He's in this picture. Do you see him? He's right there. That's right, it's Sam the Sponge. Everybody loves Sam the Sponge. Here's his friend, Peter. He works for his boss, Mr. Prawn. Mr. Prawn owns a restaurant called The Proud Prawn, and at the Proud Prawn, they make a very delicious burger called the Prawn Patty. Everyone in town loves the Prawn Patty. In fact, they love it so much, they can't live without it. It's critical that the flow of Prawn Patties continues.

The Prawn Patty has a secret recipe, and this is our critical information resource. This is the thing that we now need to be able to protect. That makes it pedagogically useful, because it's an actual physical object. A simple version of a controls architecture for this secret recipe is that there's only one copy. It's a physical copy, there's only one of them. It's not memorized by anyone. No one seems to be able to memorize it, or they just haven't, for some reason. It's kept in a safe in Mr. Prawn's office, and only Mr. Prawn has the combination to the safe. He does give access to the recipe to certain trusted handlers when they're making the Prawn Patties.

Now, we understand a little bit about how this critical information resource is handled. Most of the time, it's secure in the safe. Some of the time, we take it out, and we give it to trusted people, and they use it, and then, they bring it back, and we put it back in the safe. That's kind of our overall controls architecture, and how information flows through this system.

We need to be able to protect this recipe, and we have a number of different risks to the recipe. The first class of risk is confidentiality risk. We don't want anyone else to have access to the recipe, because then, they can make the Prawn Patty, and that's not good for our business. Who would want to get the Prawn Patty recipe and make it for themselves? A competitor, obviously, somebody who would want to make it and go into business. Potentially, if it got released to the general public, that would be bad. Then, you might have many competitors, or everyone could just make it at home, and your business would suffer. Either way, we don't want the recipe to fall into a competitor's hands or to the general public.

There's also integrity risks. Given that there's only one copy of this recipe and no one's memorized it, if it was somehow changed or destroyed, then there's no way, really, we'd be able to recover the original recipe. That's bad.

Then, third, there's availability risk. That's kind of a unique feature of this particular scenario, which is that if the recipe's unavailable, we literally can't make the Prawn Patty. That will have negative effects, as I'll show.

Threats

Any of these losses could occur deliberately or accidentally. Under FAIR, deliberate actors and accidental scenarios are lumped together, to some extent. They call them both threats. I think that it's important to distinguish those cases. They do, of course, mention that yes, they're different, and you have to handle them differently. I like to make that a little more obvious by calling the first one, the deliberate ones, threats.

This is a person who wants to harm your business, or some sort of threat actor. It could be an outsider, it could be an insider. In any case, they have deliberate intention to harm you. The reason that's important is because when you start thinking about frequencies, they have a directionality to their actions. They're trying to get to the critical resource and compromise it. Whereas, in a hazard scenario, that is a more accidental thing that happens. The directionality is much more random: maybe you have a lot of internal users, they're taking a lot of different actions all the time, somebody hits the wrong button, and three buckets are public.

It's not that they were intending to cause a problem. If they intended to, they could do it very easily. Instead, it was just something that happened, and our controls architecture allowed it to happen. In other words, we haven't instituted controls that would prevent that accidental action from taking place. The two types of controls could be very different: one where you want to prevent a deliberate action, and the other where you want to prevent somebody from just doing something not smart.

Now that we know we can worry about threats and hazards, I'm going to focus just on threats. In other words, deliberate threat actors, just in the interest of time. There's two threat actors in our scenario. The first is Tardigrade. Tardigrade owns a restaurant nearby, and they would really like to compromise the Prawn Patty recipe to make it in their restaurant, and be able to make lots of money. The other threat actor is Patty Pirate. I'm really proud of that one. Patty Pirate's really far away on land, but she's tired of just running a juicing stand, and she wants Juicy Burgers instead. She would maybe want to come and compromise the Prawn Patty recipe and add the Juicy Burger to her menu.

Let's talk about just the scenarios where one of these two threat actors wants to steal the recipe. We'll focus just on those, and we'll try to quantify those risks.

Frequency and Calibration

Remember that to quantify the risk, we first need to estimate the frequency of loss. Frequency of loss is really a security-engineering question, so it depends on your security people understanding the architecture. Once you've put that together, and you have the controls architecture, you sit down with your security engineers and you start hashing out, what are all the ways you could attack the system and gain access to the recipe, or your critical assets.

The security engineer might be yourself. If you're working on risk, you might be a security engineer. That's fine. Your opinion matters, at that point. Frequencies are expressed as a rate per unit time. It's literally the number of times per year that this happens. It's defined on a range from zero to infinity. Zero would mean, literally, it's impossible, this loss cannot occur. We've engineered it out of the system, or there's no way it can happen. If you can conceive of the loss scenario, the rate is probably not zero. I'll just put it out there. Then, infinity means it's just continuously happening.

If the loss per event is non-zero, and there's an infinite number of them, then you're probably going to go out of business pretty quickly. Just a note that you can have more than one per year. If you think of a step on an attack tree within your infrastructure, let's say it's pretty trivial, and maybe they have to figure out something about your infrastructure, but it doesn't take very long. Then, you could say, "That could happen a thousand times per year." In a couple hours, they'll figure it out. One step on the attack tree, the rate might be 1,000. Then, some other step in your tree, hopefully, is much smaller, less than once per year. Otherwise, you're not rate limiting, and you have a problem.
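As a rough illustration of that rate-limiting idea - this is not the FAIR methodology or anything from the talk's slides, and the step rates are made up - you could model each attack path as a list of per-step rates, let the hardest step rate-limit the path, and let the easiest end-to-end path drive the overall frequency.

```python
# Each attack path is a list of per-step rates (successful attempts/year).
# The path as a whole is rate-limited by its hardest (lowest-rate) step.
def path_rate(step_rates):
    return min(step_rates)

# Paths are alternatives, so their rates add up; in practice one easy
# path usually dominates the total.
def overall_frequency(paths):
    return sum(path_rate(p) for p in paths)

paths = [
    [1000, 5, 0.1],  # a trivial recon step, then two harder steps
    [50, 0.01],      # a much harder alternative path
]
print(overall_frequency(paths))  # 0.11 -> dominated by the easier path
```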

Things that are happening all the time, though, we do have things like that. They're usually called operational risks. Think of things like account fraud, you could have a constant rate of account fraud going on. You do whatever you can to reduce it, but you're probably never going to reduce it to zero. That's a good kind of risk, also, to keep in mind.

It's important, when you're estimating frequency or having people estimate frequencies, to calibrate them. This is also true for estimating magnitudes, but we'll get to that. Some people will be overconfident, some people will be under-confident about how often things happen. Just to be able to get people in the right mind-set, you might mention to them that a rate of 0.1 is once every 10 years - that's since Obama was inaugurated. I know it feels like 50 years, but it's only been 10 years. Then, 0.01 is once every 100 years. That's the end of World War I. You think about, if your business was operating since the end of World War I, and the control architecture was the way it was that entire time, this would only happen once. When you calibrate people like that, they're usually reluctant to go to 100 or above. Then, 0.001 is once every 1,000 years - that's Brian Boru defeating the Vikings in Ireland in 1014. You really want to say something is only going to happen once in 1,000 years? Ok, really? Since the Vikings? It's just a way to psychologically calibrate people.

What do we think the frequencies are, of Tardigrade stealing the recipe? Tardigrade actually tries to steal the recipe all the time and never succeeds. Somehow, our current controls architecture seems to be pretty robust to whatever Tardigrade is trying. Tardigrade doesn't have a large brain; maybe it's not really capable of coming up with good plans.

Actually, using past data to estimate frequency is just a little beyond the scope of this talk. There's all kinds of great research about how to use the data to estimate that, based on what's happened, and also, how to combine it with expert opinion. If, for example, you believe that the future is somehow different from the past, you might need to incorporate an expert opinion, rather than just data. That's a huge field, and so, I'm not going to go into that. I'm just going to say that based on past data, I'd say that Tardigrade's going to steal it once in a hundred years. He just seems to be really bad at stealing the recipe. Anyone want to take issue with that?

Participant 1: [Inaudible 00:23:26]

De Shon: That's fair, yes. That actually brings me to an interesting point, which is, what actually are we estimating? I'm not actually saying that this isn't going to happen in the next 100 years. What I'm saying, really, is, this has a one-percent chance of happening in the next year. It's pretty unlikely. Yes, we're going to have to re-estimate these frequencies pretty frequently, because the infrastructure's changing, the risks are changing.

Participant 2: [Inaudible 00:23:59]

De Shon: Black swans are things we have not anticipated. Hopefully, by going through this kind of an exercise, thinking about your architecture, and thinking about the kinds of things that could go wrong, you might actually anticipate something that would be unlikely, but would have a very large impact. You're much more likely to think of it by going through this process systematically, than not doing it.

What about Patty Pirate? She would like to steal the recipe, as well. She actually has never tried before. However, she's a human being. She has a much bigger brain than Tardigrade, so likely, her plans would be much better. Let's say that she might succeed once in every 10 years, 0.1.

There's a lot more systematic ways you could do this. I mentioned attack trees, where you could actually build out, what are all the steps that could happen? Do we have data on, let's say, how often someone's able to compromise an SSH server, if that's one of the steps? There are ways you could decompose the problem further, and break it down and do more estimates. There are also ways to have a distribution of frequencies. Not to say it's just a number, like 0.1, but that it's some range.

Participant 3: When you talk about the attack tree, this exercise is for every node, or for the end goal?

De Shon: This frequency is for the whole thing. This is from beginning to end, until the loss occurs. The attack tree would be a way to decompose it into steps, but you'd have to be kind of comprehensive. The first time around, I just basically was like, "What do you think the overall rate is, given all you know about the infrastructure, all the different ways?" Typically, what a skilled security engineer will do is, they'll think of the easiest path. They'll be, "If I took this path, that would be pretty simple, and then, the rate's high."

That may be the dominant path. Even if you've mapped out the whole attack tree, if everything else is more difficult than that easiest path, it's going to be the dominant one for this rate. That's still, for me, an open area. I'm still working on, "How could we build an attack tree of all of the paths to all of our critical resources," and then figure out, "Where are the choke points where we can most effectively reduce frequency by improving the security?" That's still extra work to do. Other questions?

Participant 4: How do we even know what these rates would be? There are so many different ways someone could get in. There's software vulnerabilities, there's someone injecting themselves into the dependency path and putting malware onto your system. There's maybe upstream network attacks that could happen, all kinds of stuff.

De Shon: Absolutely correct. The expert opinion that we're using to estimate should be informed by all of those factors. That's why we go to security engineers, not to the system owner, because they're not used to thinking about those types of things. If a security engineer knows what the threat landscape looks like, they know who the threat actors are. If you have a good threat intel program, you might even know who's targeting you and what their capabilities are. That may be the dominant thing that would drive your frequency, "They're really targeting us. They're focused on our infrastructure. They tend to do things these ways, and one of those is a weakness for us." It's a guess. It's better than not doing it. The thing you do when you put a number on the page is, you get people to argue with you. That's what we at Netflix call "farming for dissent." That's a really good process, because everybody's bringing their ideas, "No, that's stupid because of this." You can wise up, "What information do you have that I don't have?" Your estimates get better over time.

Participant 5: Why don't we evaluate, perhaps, every six months?

De Shon: Yes, you'd want to re-evaluate on a regular basis. Particularly, if you can get really dynamic about it, if you have that attack tree, or if you know which applications are processing which kinds of data - let's say you have an inventory of all the dependencies of that application, and one of those libraries has a vulnerability that's remotely exploitable. If you have that all in a database, you could now say, "This application is now dominating how likely it is that this data will get compromised," and that'll bubble that patching up to the top. If you can get to that point where this is all automated, that's great. At the beginning, though, you can build it manually and think things through.

Magnitude

The next thing is to estimate the magnitude of the loss. As I said, we're trying to do it in dollars. Who do we need to talk to? Probably the asset owner, in this case, Mr. Prawn. Mr. Prawn knows how much money is made per day by selling Prawn Patties. He knows how much the ingredients cost, all those kind of things, and operational costs for the business.

Then, you might also talk to, in your organization, incident responders who would know how costly it is to respond to an incident of this kind. That's a kind of primary loss. You could talk to Legal, to see what your legal liabilities might be, if this happened. Finance or business development: "Would we lose partners? Would we lose business associates if this occurred?" Sales: "Would this impact our ability to sell?" You'll probably decompose the problem into a bunch of sub-components, each of which might be a cost. I'll show you an example, including the primary losses, as I said, and also later, secondary losses.

What we want to do is estimate a low and a high. This is where we're getting into details of the FAIR methodology. The low scenario is, my controls worked, my detections worked. I did a pretty good job of finding that this was happening, and so, we responded quickly, and we cleaned it up. That is a best-case scenario. The high is, everything went wrong. We didn't detect it, our controls didn't stop them, and they maybe got everything. How bad would that be? This should be a 90% confidence interval. That's where some calibration also comes in, getting people to understand what it means to be 90% confident of something. The idea is that you have a 90% chance that the actual number would be in that range that you give.

Then, we need to put it in terms of money. What we do with that low and high estimate is, we actually put it on a lognormal distribution. Lognormal has a couple of good qualities. One is, it's never less than zero. If you use a normal distribution, it can go less than zero, and you would be gaining money by having a loss. That's not a good model. Then, also, it has this long tail. You can think of that as "What if this happened, and it got on the front page of the 'New York Times,' and everyone started suing us, and there's a PR disaster?" That's the kind of blow-out loss that you need to be able to take into account, that it might just be higher than you thought. That long tail kind of captures that.
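Concretely, one common convention - the transcript doesn't spell out the exact parameterization used at Netflix - is to treat the low and high estimates as the 5th and 95th percentiles of the lognormal. A minimal sketch of that fit, with illustrative numbers:

```python
import math

Z_95 = 1.6448536269514722  # 95th percentile of the standard normal

# Treat (low, high) as a 90% confidence interval, i.e. the 5th and 95th
# percentiles of a lognormal loss distribution.
def lognormal_params(low, high):
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
    return mu, sigma

# The mean sits well above the median exp(mu) because of the long tail.
def lognormal_mean(mu, sigma):
    return math.exp(mu + sigma ** 2 / 2)

mu, sigma = lognormal_params(10_000, 1_000_000)
print(lognormal_mean(mu, sigma))  # about 266,000
```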

Why Money?

Why money? Why should we put everything in terms of money? Money's not everything. The best things in life are free. You can't put a value on everything, so putting things in terms of money just seems dirty. Why money? Because I think what you need is a couple of really important qualities in this quantity. One of them is that it needs to be composable. Dollars - you can add them together, you can subtract them. It's not something like, what is red plus yellow? What are two reds minus two greens? Then, sometimes, people use a categorical scale, like one to five. That is not actually quantitative, because you can't also say, "Two plus three equals five." That doesn't actually work. It's just really misleading, and you're going to make mistakes in your analysis if you do that.

You also want them to be comparable. You can say something's greater than another. Red's greater than yellow, but by how much? If you have 10 reds, which one's greater than the other ones? With dollars, you're going to end up with a range of stuff. Even with a bunch of scenarios, there's going to be ones that are bigger and some that are smaller.

Then, you want it to be interpretable by your business. That's a really important quality. That's ultimately why you should be quantifying things, because as I said, you're going to be going to your business leaders and saying, "Give me money to fix this problem." If you don't have numbers and don't have dollars behind things, it's going to be really hard for them to justify it to themselves or to anyone else.

Problems: what if it's something that's priceless? Human life. Obviously, you don't really want to put a value on human life. I would say, however, that you actually already do. In business, you're taking certain actions or not taking certain actions, maybe because they're too expensive, and there may be a safety impact to somebody. It may not even be an internal impact. Here, I'll give you an example. If you hold personally identifiable information and there's a breach, there are going to be people who are in domestic violence situations whose safety is impacted by that. If we're not protecting that data adequately, we are actually valuing their lives at a rate probably much lower than we would want to. Let alone our own employees. If we're talking about a safety loss, that's impacting somebody that we actually know and work with every day. We are implicitly valuing these things. I'm just suggesting, let's make it more explicit, so that you know what value you're putting on them.

Then, what about things that are intangible, like reputation? This is probably the most popular thing that an information security team will tell you, "This is bad." "Ok, why is it bad?" "It's going to hurt our reputation." "Ok, what will that do?" "I don't know. People hate us." Ok, what may happen then is, fewer people will buy your product. Or other companies won't want to work with you, because they're thinking, "We want to keep that at a distance."

There are actually impacts for reputation that are business impacts. It may be difficult, as a security engineer, to think about those, but remember, I was talking about talking to the asset owner and to other parts of the business. Talk to Marketing, talk to PR, "How much do you guys invest in improving our reputation? If suddenly, we lost that, all that investment would be gone." That's minimally what you would lose. Then, in addition, all the benefits of that higher reputation would be gone. That's something that potentially, people will have some quantities about. I'll admit, I haven't done this yet, but I think it's going to work. Any questions about any of that?

Participant 6: [Inaudible 00:37:05]

De Shon: His point was that there are a lot of cases in history where companies made choices, and there were actual impacts - people died. The whole core business of some businesses was killing their customers. Obviously, I'm not trying to hold that up as a standard. Instead, what I'm suggesting is that by making the potential impacts explicit, and by trying to quantify how dangerous a particular outcome is, we actually are taking that into account more than in those examples that you're talking about. It's all a question of what value you ascribe to it. Let me walk through my examples, and then, we can see.

Let's talk about the Tardigrade example. The recipe becomes unavailable, sales stop. That's a primary loss. Because we don't have access to the recipe, we can't make Prawn Patties. Let's say that we make $10,000 a day on selling Prawn Patties. We'll say that we get it back within a day, that's a $10,000 loss. Then, the high scenario is, somehow, Tardigrade is able to hide the recipe from us for a long time, 100 days. That's $1 million. Meanwhile, Tardigrade is making knock-offs. Maybe we lose some customers permanently. Maybe they're, "I like it better the way Tardigrade makes it, so I'm just going to go there from now on."

Let's say we lost 10 customers, and we valued each one as $100 of future revenue. That would be a $1,000 loss. Then, on the high end, let's say we lost 1,000 people, so just 10 per day for 100 days, and that's $100,000. The nice thing is, because these are composable numbers, we can say the low is just the sum of the low scenarios, and the high is the sum of the high scenarios. You could get more sophisticated about that and think about them as two different distributions and draw randomly. Really, what we're doing here is, we're saying, "What happens if it's one day, and what happens if it's 100 days?" That's really the distribution we're using, and then, the losses scale with the number of days that we don't have the recipe. That's really what's behind this. That's one way in which you can decompose it and put it in terms of the business. If you just go to them and say, "How much will we lose if the recipe's gone for so much time," they'll be, "I don't know. How many days would it be?" It's, "What if it was 10 days? How much will we lose?" By driving it by something that's sort of easy to measure, then they can figure out what the revenue impacts are.
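Because the sub-losses are all in dollars, rolling them up really is just addition. A tiny sketch with the numbers from the Tardigrade example:

```python
# (low, high) dollar estimates for each loss sub-component in the
# Tardigrade scenario, as described above.
tardigrade_losses = {
    "lost sales while the recipe is unavailable": (10_000, 1_000_000),
    "customers lost permanently":                 (1_000, 100_000),
}

low = sum(lo for lo, _ in tardigrade_losses.values())
high = sum(hi for _, hi in tardigrade_losses.values())
print(low, high)  # 11000 1100000
```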

The thing is, Tardigrade's really bad at stealing the recipe. If we push this into a lognormal distribution and get the mean, multiply it by 0.01, it's really small, less than a day's revenue. We can just have fun making Tardigrade fail over and over again.

What about Patty Pirate? Patty Pirate's much smarter. We're also going to lose sales, but she is further away. Probably the minimum for recovering it might be 10 days, but we're still losing 10K per day. Then, let's say that on the high end, it's still 100 days. The problem here, though, is, now there are no Prawn Patties anywhere in town. It turns out people love Prawn Patties so much that this leads to the immediate collapse of civilization, and everything's on fire, and there's a dystopia. That's a secondary cost, in a way. Really, it's an external cost, what economists would call a negative externality.

Traditionally, businesses have not worried about negative externalities. They're just, "That happens to somebody else. That doesn't impact our top line or bottom line." I would like to argue to you today that we should actually internalize, when we're doing these risk calculations, some of that external loss. The reason why we should do that is, number one, it's the right thing to do. Number two, because I think there are long-term benefits for your business for being a good citizen, and for doing things in an ethical way. Especially if you let people know, "We made this change because we think this would impact you more than it would impact us. We felt like this is the right thing to do."

Let's say that we don't like dystopia, and we value that as a million-dollar-a-day loss. We don't like to see our neighbors' houses burning down and the whole village on fire. Even if they didn't burn down our business, we're going to say it's a million-dollar-a-day loss. That immediately puts this way up there. Add to that that Patty Pirate's a lot better at this, and that makes our loss actually about $4 million per year. That means it bubbles right up to the top of the priority queue. We might want to think about, "Is there something we could do, so that the recipe would not be inaccessible, in the case of it being stolen by someone far away? Maybe we should have a second copy somewhere."

This is how you uncover what would be, essentially, intolerable losses, and then, do something about it. If we hadn't internalized that cost, then this wouldn't even be high-priority. It would just drop right to the bottom. Or it might be a little higher than Tardigrade, but still.

Hazards

The last thing I want to talk about is hazards - I mentioned them at the beginning - because this involves the control architecture that I mentioned. This book here, "Engineering a Safer World" by Nancy Leveson at MIT, is a really great book. It's actually about safety engineering, but all of the ideas are really applicable to information security as well, and to designing secure systems.

It's a systems approach. You're actually analyzing the system as a whole, and trying to understand how the controls can break down. What she does is show that an individual part of the system - some process in the system - is the thing at the bottom, the thing that you're trying to control. Through a sensor, it gives data to a controller. Then, the controller has a model of what the system is doing, and using that model, projects, "If I take this control action, I'll improve the situation." It'll take some action through the actuator and change how the process is happening.

This is a safety-driven thing, but it actually is both more effective and less detailed than what people actually do with, for example, analyzing a nuclear power plant. It's better at uncovering problems. First of all, there's a very systematic set of things you can think about, of how different parts of this can fail. I'll just point one thing out, that your process model could be wrong. It could be incomplete, it could be inconsistent, it could be incorrect. That means your model of what's happening, due to the sensor data, is wrong, and you're taking bad control actions.
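As a rough sketch of that control loop - the names and fields here are illustrative, not from the book or the talk - you can write it down as a small structure and then ask, field by field, how it could fail: a missing sensor signal, a stale or incorrect process model, a control action that never reaches the process.

```python
from dataclasses import dataclass, field

# One control loop in the systems view: a controller observes a process
# through sensors, keeps a (possibly wrong) model of that process, and
# acts on it through actuators.
@dataclass
class ControlLoop:
    process: str          # the thing being controlled
    controller: str       # who or what issues control actions
    sensors: list         # how the controller observes the process
    actuators: list       # how control actions reach the process
    process_model: dict = field(default_factory=dict)  # controller's beliefs

recipe_handling = ControlLoop(
    process="storage and use of the secret recipe",
    controller="Mr. Prawn",
    sensors=["checking the recipe is back in the safe"],
    actuators=["safe combination", "handing the recipe to trusted handlers"],
    process_model={"recipe location": "in the safe"},  # could be wrong
)
print(recipe_handling.controller)
```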

You can get real deep and think through all of your control architecture this way. This is one tiny piece. This is kind of the idea of an overall control architecture. You've got critical data there at the bottom with some internal application that processes it, and you've got different kinds of users. You've got an application user, and you've got an administrator. You also have the system that it runs on, which can take control actions on the application, and it's able to affect how the application works. That's run by a system administrator. All of them are subject to the directives and culture of the corporation.

Then, the corporation itself is subject to laws and regulations, and also the way customers look at the company, and whether they want to do business with them or not. There could be other parts to this as well. By drawing this out, you immediately get away from this "Blame the admin. He made the wrong choice." Yes, he made the wrong choice, because there's a systematic under-investment in creating a good controls architecture and having processes that are robust. We want to get away from blaming the poor admin, and get to the systems-level view, where we understand all the moving parts.

This is the process, in a nutshell. Identify your assets, study the architecture, define the control architecture. That includes everything, including people. Identify your loss scenarios, estimate frequencies, low and high magnitude of loss, and then, calculate your expected losses. When these slides are online, I have a little bit of Python code here. From the low and high, it gives you the actual statistical distribution. There was some math involved to get it right, so I hope to let you have the benefit of that. That's it, in a nutshell.
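The speaker's own Python isn't reproduced in this transcript, but a minimal sketch along the lines he describes - again assuming (low, high) is read as a 90% interval on a lognormal magnitude - looks roughly like this, and it reproduces the rough figures from the two scenarios:

```python
import math

Z_95 = 1.6448536269514722  # 95th percentile of the standard normal

def expected_annual_loss(frequency_per_year, low, high):
    """Annualized expected loss: fit a lognormal to a (low, high) 90%
    interval on loss magnitude, then multiply its mean by the loss
    event frequency."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
    mean_loss = math.exp(mu + sigma ** 2 / 2)
    return frequency_per_year * mean_loss

# Tardigrade: once in 100 years, $11K-$1.1M per event
print(expected_annual_loss(0.01, 11_000, 1_100_000))       # ~$2,900/year

# Patty Pirate: once in 10 years, $10.1M-$101M per event
# (with the "dystopia" externality internalized)
print(expected_annual_loss(0.1, 10_100_000, 101_000_000))  # ~$4.1M/year
```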

Questions and Answers

Participant 7: I have a simple question. You mentioned the involvement of security experts within the company. What do you do if you don't have that?

De Shon: We have the benefit now. We actually now have two risk engineers who are very familiar with the FAIR framework and how to quantify things, so that's wonderful. Somebody has to be thinking about how the system can break down. If that's you, if you're the one who cares about the risks, then you're building a role for that in the company. There's often some kind of an audit board, or somebody that reports to the board of directors about the risks of the business.

There may be a document already that enumerates at a very high level, all of the risks to the business. That's a really good starting point. You could try to go up your reporting chain and say, "Is there a document like this? I'd love to see it. I'd love to start drilling down into things." Maybe at the point when you have your controls architecture mapped out, you could maybe hire a contractor in temporarily, and just be, "Help me think through this. Am I missing some problems?" Or just to come in and estimate frequencies.

Participant 8: During your talk, you mentioned you do the 90%, and then, you went through two scenarios. Why not just go to worst case? "This is the worst case that can happen." Because it is a possibility.

De Shon: The worst case is always a possibility. If you just have no range, if no one's willing to give you a range, then that can be a starting point. The goal of this is to be able to show improvement and to say, "We've reduced risk by making it ..." Let's say that the minimum time to recover was 10 days, but you can move that down to 2 days, but you haven't changed the worst case. That still moves the average. The worst case can sometimes be seen as an exaggeration. It's better to have a range and to understand the amount of uncertainty that you have around it, but it is a starting point.

Participant 8: Even with the worst case, no matter what you do, you're never going to mitigate that worst case. There would be an improvement, anyway.

De Shon: However, it may be unlikely to reach that worst case. For example, let's say you have a database with critical data. If it's all unencrypted, and then, the attacker gets there, then, they're just going to take the whole thing. Let's say it's encrypted at rest, and they have to watch it as it's being processed and decrypted. There's a time component. It's going to take time to get all of the data. That means that the worst case may not happen. You may be able to catch them before that happens.

 


 

Recorded at:

Jan 14, 2020
