
Stop Talking & Listen; Practices for Creating Effective Customer SLOs


Summary

Cindy Quach discusses some of the common pitfalls that arise from collecting and analyzing service data such as only using 'out-of-the-box' metrics and not having feedback loops. She discusses some practical tips for reducing noise and increasing effective customer signals with SLOs and analyzing customer pain points.

Bio

Cindy Quach is a Site Reliability Engineer at Google. She has worked on various teams and projects such as Google's internal Linux distribution, mobile platforms and virtualization. She currently works on the customer reliability engineering team helping large scale GCP customers learn about and implement SRE principles.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Quach: My name is Cindy Quach. I'm a Site Reliability Engineer at Google. I've been an SRE at Google for about seven years now, working on various projects and teams, and dealing with infrastructure of varying sizes and scales. Last year, I decided I wanted to do something a little different. I wanted to talk to people. Not just any people, I wanted to talk to customers. I can feel you cringing in your seats as I say that. I've been doing that ever since, and I've learned a few lessons along the way. The team I'm on at Google is called the Customer Reliability Engineering team. We are a group of Google SREs who have taken the knowledge and expertise we've gained working on internal Google products, and we go out and talk to our Google Cloud customers to help them operate more reliably in the cloud by implementing the same practices and principles that we use.

A common pattern I see with customers, no matter what industry they're in, whether startups, retail, healthcare, or finance, is how they monitor their service reliability. We tend to see comments like, "When I go on call, there's way too much noise. I can't really figure out the quickest way to troubleshoot or isolate an issue. Developers are pushing releases that keep breaking us. We have lots of outages, lots of false positives, a high volume of alerts," and so on. We tell them, here's this concept of SLOs that we use at Google. We use it to measure the service reliability of all of our internal products. Why don't you try this and let me know how it goes? We've had varying degrees of success. In fact, in the most recent SRE book that Google wrote, the SRE Workbook, chapter 3 includes two case studies of large customers that have implemented SLOs and how well it has worked for them. Today, I want to talk to you about this common pattern I see with our customers, show you how SLOs can help you, and actually go hands-on and create SLOs.

Customer Focused SLOs

First, customer SLOs. SLOs, or service level objectives, are just the target reliability of our service. I'm going to talk about that in depth. True or false: if my monitoring isn't screaming at me, everything is fine. Judging by the laughter, yes, it is absolutely false. You want to ask yourself, how many of your customer outages are caught by your monitoring: 30%, 70%, 100%? Chances are you have gaps somewhere. When your customers are experiencing outages using your service, and you have monitoring gaps that aren't alerting you to the issue, the first time it happens you'll probably get a couple of memes on Twitter. Or, worst case, it makes some news outlet and people will just be unhappy. If it keeps happening over and over again, they'll just leave your service and go to a competitor. We want to try to avoid this situation.

Symptoms vs. Causes

The most common pattern I see is cause-based monitoring. I want to help folks rethink monitoring in this way. With cause-based monitoring, you have tons of data about your service. It tells you everything you want to know. I generally see people measure things like how full their server disk is, or how high their CPU usage is. They correlate these things, so high CPU usage means our users are experiencing slowness or lag on our servers. That makes sense. Think about it: you're on your laptop, you're working on something, and something's eating up your CPU resources. Everything's fighting to get a bit of the resources. Everything is just slow and chugging along. That makes sense to us. This cause, the high CPU usage, is what I see people monitoring. What happens if your customers are complaining to you that they're seeing an outage or something is super slow, and nothing shows up on your CPU monitoring? Or, vice versa, you see something on your CPU monitoring, and customers are silent. Nothing's happening. Nothing's affecting them.

We want to avoid this, because we tend to find ourselves monitoring for every single cause of an issue, and it may or may not give us the right signals. In fact, do your users even care that your server's disk is full? Do they care that your CPU usage is high? Not really. In fact, they don't even know those things exist. If you're running a storage service, they care that they can get their pictures from your storage. They care about how fast they can retrieve them, and whether it's safe and secure. These are the things that they care about. We want to pivot this way of thinking from cause-based to symptom-based: causes being the problems, such as a full disk or high CPU, and symptoms being the actual pain that our users are seeing, such as user data not being available, or images being really slow to retrieve.

Instead of making these assumptions about user experience, correlating CPU to slowness, for example, we actually just measure the experience. We measure how long it takes for them to actually retrieve the image. I'm not saying causes are not useful. In fact, you should still have them. They provide an extremely useful data point when you're troubleshooting issues. When you get an outage, you're going to figure out the symptom sooner or later, so you might as well reduce some redundancy and aim to catch the symptom first.

What are the symptoms to look out for? That depends on your customer. If your customers are unhappy, we could potentially make the news, and we don't really want that. We base our monitoring on our customer SLOs, or service level objectives. This is the target reliability of our service. We have a rule of thumb with SLOs: your SLO should capture the performance and availability levels of your service that, if barely met, would keep most customers happy. If it were any less reliable, you'd no longer be meeting their expectations, and they would start to become unhappy. In theory, if you meet your target SLO, that should mean happy customers; if you are missing your target SLO, that should mean sad customers. A key thing to note: not all customers will use your service in the same way. You want to make sure that your SLO covers the typical user, and not the user that's using your spreadsheet service as their company's database, for example.

How do we get this information? If you like this way of pivoting from cause-based to symptom-based, and thinking about everything from a customer's perspective, what do we do? Are you ready for this? You actually have to go and talk to someone. I know as engineers, we scurry away at the notion. It's true: in order to get the best customer signals, you actually need to sit down and learn what the user experience is for your customer. Fortunately for us, as engineers at companies, we have business folks. We have product managers who know how customers use our service and what is important to them. If your customers are people, it's pretty easy. You can figure that out with user surveys, research studies, things like that. If other services are using your service for their customers, it turns out that behind every machine is just another human, so you can talk to them as well. There's really no excuse.

Mobile Game Example

Let me actually show you this process of creating SLOs. I'm going to do it with an actual example. This is the mobile game that we're going to walk through. Basically, we have a mobile strategy game where users get to build settlements, recruit people, upgrade weapons, and battle other settlements. There are leaderboard rankings for top settlements. We have a basic architecture for our mobile game here. We have our user, who accesses it on some client machine and hits a load balancer, which directs to website API servers that talk to back-end game servers. It's a pretty simple architecture. We'll go ahead and walk through creating SLOs for this game example.

Step 1: Critical User Journeys

Step one is critical user journeys. A critical user journey is a set of steps that a user takes to accomplish some goal on your service. For example, if you're a retail company, maybe they want to buy something, post a topic, or pay a bill. Your product managers or business folks already know how your users are using your service. If not, someone somewhere needs to be having these discussions. The first thing you want to do is list all the critical user journeys for your service. Then create an ordered list and rank them by business impact. Why do we choose business impact? At the end of the day, we're all engineers for our business, and our business has to make money to keep the lights on. We think this is a pretty important way to prioritize a user's experience.

Going back to our game example, I've listed on the left side all the critical user journeys for this game. We have a view profile page, where they can change the settings and update information about their profile. Buy in-game currency: people can exchange real money for online money, buy some weapon upgrades, and strengthen their settlements. The app launch: they'll obviously launch the app. Manage settlement: they can recruit people, and add, join, or remove settlements. Then in the middle column, I sort them by business impact. The first one is buy in-game currency. We are a business and we want to keep the lights on, and buying in-game currency is how we make the money to keep the lights on. The next one is app launch, then view profile page, and manage settlement.

For the rest of these exercises, I want to talk specifically about buy in-game currency. Once we've sorted our critical user journeys by business impact, on the very right here, we make a sequence diagram of how the user interacts with your service. Let me break it down. We have three actors: our client, our server, and the Play Store. The user opens up the buy stuff UI. The client fetches the SKUs, and they get returned. Then it talks to the Play Store and gets those details returned. The dotted line here indicates that the call is going external, outside of your service. Then they choose a product, they launch this billing flow, and it returns all this information. Then you do some extra verification to make sure you actually bought it. At the update account step, we do a database write that says, "You bought it. Now it's in your account."

Step 2: Implement SLIs

Step two of creating SLOs is to implement SLIs. SLIs, or service level indicators, are a measurement of some reliability aspect of your service, such as availability, latency, or correctness. These are all SLIs. We represent SLIs as a ratio: a ratio of good events over valid events. We multiply this number by 100 so we get a nice percentage, and we can use that percentage to standardize how we talk about reliability. For example, let's say a user expects your app launch to work. We can measure the availability by the number of successful responses we get from it. Good events would be successful responses. Valid events would just be all the responses that you get back. You want to choose which SLIs align with your user expectations. One thing to note is that you also want to think about how you implement it and where you actually measure this. We also recommend starting out pretty simple with SLIs; one to three per critical user journey is always a good starting point. They can get pretty messy and look like spaghetti afterwards. We want to avoid that when you're first starting out.
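As a rough illustration of that ratio (not from the talk; the counts below are hypothetical), the app launch availability SLI could be computed like this:

```python
# Hypothetical counts pulled from a metrics system.
successful_responses = 982_340   # good events: app launches that returned success
total_responses = 983_107        # valid events: all app launch responses

# SLI = good events / valid events, multiplied by 100 to get a percentage.
availability_sli = 100.0 * successful_responses / total_responses
print(f"App launch availability SLI: {availability_sli:.3f}%")  # ~99.922%
```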

Measuring SLIs: we want to figure out where we're actually measuring our SLI, and break it down. It boils down to five basic ways to measure SLIs: logs processing, application server metrics, front-end infrastructure metrics, synthetic clients, and client instrumentation. For logs processing, you process server-side logs of requests or data. One pro is that, if you have existing logs, you can probably start backfilling some SLIs. One con is that application logs don't contain requests that never reached your servers. Then application server metrics: these export SLI metrics from the code that serves requests from users or processes their data. A pro is that it's often pretty fast and cheap, in terms of engineering time, to add new metrics. A con, as with logs processing, is that application servers aren't able to see requests that don't reach them.

The next one is front-end infrastructure metrics: utilizing metrics from your load balancing infrastructure. You most likely already have these metrics and recent historical data for them, so this option is probably the least engineering intensive to get started with. One con, though, is that it's not really great for measuring complex requirements. Next, we have synthetic clients, or as we like to call them, probers, where we build a client that sends fabricated requests at regular intervals and validates the responses. It's like black-box monitoring, where you can't really see inside what's happening, but you want to validate that you're getting what a user would see if they were using your service. A pro is that synthetic clients can measure all the steps of a multi-request user journey. A con, though, is that you start to build lots of probers and then you get into the gray area of integration testing. We want to avoid that as well.

Finally, the last one is client instrumentation, which is adding observability features to the client the user is interacting with and logging the events back to your serving infrastructure. This way of measuring SLIs is the most effective and measures your user experience most accurately. One con is that you'll sometimes see highly variable factors involved in the whole process; you might measure things that are out of your direct control to actually fix and resolve. You want to think about these trade-offs when you're figuring out where you want to measure your SLIs from.

Going back to our mobile game SLI example. We've listed our critical user journeys. We've broken down our sequence diagram. Now we're going to break it up even further. I have about five request/response pairs here: fetch a list of SKUs from the API server, fetch SKU details from the Play Store, the user launches the Play billing flow, send the token to the API server, and verify the token with the Play Store. This journey basically has two parts. The first part, steps one and two, fetches the SKU information. Thinking about this, the user might not actually buy an item here, so we don't really care about this part that much. Then steps three, four, and five: at step three, they're telling us they have an intent to buy the item. This is where we want to focus our user journey.

We've broken out what we want to look at. What SLI should we measure: availability, latency? What are you thinking? Availability is a pretty good one to start with. We can specify our SLI as the proportion of valid requests served successfully; here, the proportion of launch billing flows served successfully. Which requests are valid? Considering step three is where the user actually launches the billing flow with the intent to buy the product, I'd say this is where we want to look. What do I mean by served successfully? All the interactions have to be successful. They have to go through all these steps and get the status codes that we deem our success criteria. For step three, they have to get a successful purchase token. For step four, they have to get a good status code from their account being updated. For step five, they must receive a valid token.

Our SLI is starting to look pretty meaty. This is what we end up with. We say the proportion of launch billing flows where the billing flow returns all of the status codes I just mentioned, and these API calls return the correct criteria that we mentioned. Also, don't forget to list where you're actually measuring the SLI. We say it's measured on our game client and reported back asynchronously. That's an example for availability, and now I want to show you one for latency, because it does change a bit.
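Before moving on to latency, here is a minimal sketch (not from the talk) of how this client-side availability SLI could be evaluated, assuming hypothetical event records that capture the three success criteria described above:

```python
# Each record is one launch-billing-flow attempt reported back by the game client.
# Field names are hypothetical; the success criteria follow the SLI above.
flows = [
    {"purchase_token_ok": True, "update_account_status": 200, "token_valid": True},
    {"purchase_token_ok": True, "update_account_status": 500, "token_valid": True},
    {"purchase_token_ok": False, "update_account_status": 200, "token_valid": False},
]

def is_good(flow):
    # A flow counts as a good event only if every step met its success criterion.
    return (flow["purchase_token_ok"]
            and flow["update_account_status"] == 200
            and flow["token_valid"])

good = sum(1 for f in flows if is_good(f))
availability_sli = 100.0 * good / len(flows)
print(f"Buy in-game currency availability: {availability_sli:.1f}%")
```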

Similarly, for latency we want the proportion of valid requests served faster than a threshold. Which requests are valid this time? You might be thinking, maybe we'll use the same one as for availability, step three, the billing flow. Not really. In fact, what I really want to measure is step four, sending the token to the API server. Why do I want to do that? Because when you launch the billing flow and measure how long it takes someone to go through it, it's pretty variable. There's a lot of time spent poking at the device. The user can decide, "I don't really want to buy this," and continue to think about other purchases. They can think about getting their credit card set up. If you're measuring the time from launching the billing flow to actually purchasing, you're adding in all those variables that you don't need to. When we think about latency, we want to think about these types of things.

We can start off with the proportion of API/complete purchase requests served faster than a threshold. What do we consider fast enough? We can do some rough estimates. Let's say we want our verify purchase token calls to return a status code in 500 milliseconds or less, and the database write to update the account to take 200 milliseconds or less. Rounding that up, we get 1000 milliseconds. We can say the proportion of API/complete purchase requests served within 1000 milliseconds. Then finally, don't forget to add where you measure this. Our final SLI looks something like this: the proportion of API/complete purchase requests where the complete response is returned to the client within 1000 milliseconds, measured at the load balancer.
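A sketch of that latency SLI, assuming hypothetical per-request latencies exported by the load balancer for the complete-purchase endpoint:

```python
# Hypothetical request latencies (in milliseconds) from load balancer logs
# for the complete-purchase endpoint.
latencies_ms = [420, 730, 950, 1810, 640, 990, 1200, 860]

# "Fast enough" budget: 500 ms token verification + 200 ms database write, rounded up.
THRESHOLD_MS = 1000

fast_enough = sum(1 for ms in latencies_ms if ms <= THRESHOLD_MS)
latency_sli = 100.0 * fast_enough / len(latencies_ms)
print(f"Complete-purchase latency SLI: {latency_sli:.1f}% within {THRESHOLD_MS} ms")
```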

Step 3: Look For Coverage Gaps

We've listed all our critical user journeys. We've made some SLIs. They are really good. We're all set. We're all done. Not quite. The next step in creating SLOs is to look for coverage gaps. What happens if the last step here, update account, fails? That means we just took the user's money, and we're not going to give them a product. Buying in-game currency is pretty business critical for us, and it has important side effects. We want to make sure we get this right. In this case, we want to measure the correctness of this user journey. Specifically, for update account, we can add a synthetic client, or prober, where we send a mock request to update the account and validate it against an expected response. The thing with probers is that you continually probe with this call at some frequency. If it fails enough times in a row, there's probably a problem, and that alerts you that either the write isn't correct or something is broken in the pipeline.
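A minimal prober sketch, under assumptions not in the talk (the endpoint, test account, interval, and alerting hook below are all hypothetical): it sends a mock update-account request on a fixed interval and raises an alert after enough consecutive failures.

```python
import time
import urllib.request

PROBE_URL = "https://game.example.com/api/updateAccount"  # hypothetical test endpoint
PROBE_INTERVAL_S = 60
FAILURE_THRESHOLD = 3  # consecutive failures before alerting

def probe_once():
    # Send a mock update for a dedicated test account and validate the response.
    req = urllib.request.Request(
        PROBE_URL,
        data=b'{"account": "prober-test", "delta": 0}',
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def run_prober():
    consecutive_failures = 0
    while True:
        if probe_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                # Hook this into your paging system instead of printing.
                print("ALERT: update-account correctness probe is failing")
        time.sleep(PROBE_INTERVAL_S)
```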

Step 4: Set SLOs

Now we've looked at our critical user journeys, listed them out, and ordered them by business impact. We've created some SLIs for availability and latency, and they're pretty good. Then we walked through our user journey again and saw that we could use correctness to cover some coverage gaps. The next step is just setting your SLOs. How do you decide which target to use? You could throw a random number out there: 99.9%, 99.99%, maybe 10% if you want to actually reach it and say, "We did reach our SLOs." You probably shouldn't do that, though, but you can. The best way to start is actually just using past performance. If no one's yelling at you, your current reliability is probably an ok target for now. You really want to work towards what your business goals are for reliability, because chances are they'll have a stricter, higher threshold that you aren't reaching now but would like to get to eventually. You want to work towards those SLOs so that the numbers eventually converge.
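One hedged way to turn past performance into a starting target, assuming hypothetical daily good/valid counts over a trailing window:

```python
# Hypothetical daily (good_events, valid_events) pairs over a trailing window.
daily_counts = [(99_421, 99_600), (98_310, 98_900), (99_876, 99_950), (99_105, 99_400)]

good = sum(g for g, _ in daily_counts)
valid = sum(v for _, v in daily_counts)
past_performance = 100.0 * good / valid
print(f"Past performance over the window: {past_performance:.3f}%")

# A reasonable starting SLO sits at or just below this number; tighten it over time
# toward the business's reliability goal as the service improves.
```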

Recap: Steps to Develop SLOs

A recap of the steps to developing SLOs. We have our critical user journeys, where we list all user journeys and rank them by business impact. We implement our SLIs, one to three per critical user journey, choosing the ones that best align with our user expectations. We walk through the journey again and look for coverage gaps. Then we set our SLO targets, with past performance as the starting point.

Now that I've told you how to develop SLOs, and we've chatted about it for the last 25 minutes or so, there are still some common pitfalls people fall into, and these habits die hard. I want to bring them to the front of your mind so that when you do develop SLOs, you can think about these things.

Out of the Box Metrics

Out-of-the-box metrics are really great for getting started, but you still need to add some elbow grease. I've seen a lot of companies and teams say, "This is really easy. We already have all these metrics, even though the customer signal isn't super high. It's going to cost us more engineering time to get better metrics. It's not worth it for us." If your customers experience outages and the reliability signal you have isn't high enough, you want to fix this. We always want to think about things from the user's perspective. We want to make the effort to implement tooling to get the best customer signal.

Over-monitoring: this is my favorite, because everyone does it. I've been on teams where we do the same thing. When it comes to monitoring, I always say less is more. I know people will probably argue with me on that, but hear me out. The reason we say that is this: you have tons of monitoring, and you want to measure everything, because you think, "We want data. We want to measure this. We want to look at it. We want to know everything there is about our service." That makes sense. In practice, it doesn't work out too well, because then you'll have teams that decide, "I don't want to use Grafana. I want to use Datadog. I want to use Prometheus," and all these other things. Now you have three separate monitoring platforms, with multiple different teams measuring their own random things. When you get an issue, because your tech stack is complex and has many layers, you can't really isolate where the issue is coming from. You have multiple alerts firing for the same thing, coming from different origins.

When we do this with regular, cause-based monitoring, it gets even more cluttered, but I do see people do this with SLIs and symptom-based monitoring as well. It's, "I want more. I want more coverage. I want more SLIs to cover everything. I like it. It's going to measure my user experience." You run into the same issues when you have overlapping SLIs: you start to add clutter and noise, and you won't be able to tell quickly where an issue is occurring. [inaudible 00:25:03] monitoring, we're not there to sit and look at it because it looks pretty. We're there to open it up once we have an outage and see, that's where the problem is. We want to be able to quickly see where an issue is.

If you look at these graphs, if you're on call, which one would you rather see? The blue shaded section is where your customer is happy, and the red is where an outage started. On the top one, there's a ton of variance, and you aren't able to pinpoint exactly what's causing the issue. When you have focused metrics and SLIs that you're monitoring, you tend to see the second graph, where you can quickly see the HTTP load balancer reporting that customers are getting 400s when accessing the website. This is a clear indicator of an issue. It's why we use good events over all valid events: you're measuring an aggregate of all these things so you can quickly see whether there's a problem.

When you start out monitoring for every nook and cranny of your system, you'll over-complicate things. In addition, we want to periodically clean out monitoring data that you collect but don't really use. You may also want to group functions together to monitor overall functionality, versus adapting things for specific use cases. If you were running a website that sold things, users could search for a product, click through some categories to get to the same product, or do some extra browsing through whatever other mechanism you have. You can lump all these functions together and say, "I want to measure what browsing looks like, and whether or not the browsing experience is good for my customer."

Uptime = Availability

Another thing I see when folks look at SLOs is that they confuse them with uptime. Maybe back in the day, when you were running your service on one server, uptime was probably a good indicator. Now you have hundreds if not thousands of machines, and having a handful of them go down probably isn't going to affect your users all that much. Planning for downtime is something I see folks doing with SLOs. It doesn't really make a lot of sense, because if your user is not actually using the service at that time, there's no customer affected. Instead of looking at uptime, we want to measure the availability of whatever our customer is using, represented as the ratio of good events over valid events. If you have periods of downtime and maintenance, maybe you can put up a maintenance page that lets you collect this information: users hit it during your planned downtime, and you say, "These are requests that I can't serve. We consider these bad requests." Aggregate that over the entire duration of how they're using your service, if that's how you want to account for downtime. There are many other ways to do it. Basically, TL;DR, distributed computing is difficult, and it makes things difficult to monitor and observe.
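A sketch of how planned-downtime requests could be folded into that same good-over-valid ratio, assuming a hypothetical counter of maintenance-page hits:

```python
# Hypothetical counts over a reporting period.
served_ok = 1_240_000          # good events: requests served successfully
served_errors = 3_100          # requests that failed while the service was "up"
maintenance_page_hits = 9_500  # requests that arrived during planned downtime

# Users who hit the maintenance page still wanted service, so count those requests
# as valid-but-bad rather than excluding them as "planned downtime."
valid = served_ok + served_errors + maintenance_page_hits
availability = 100.0 * served_ok / valid
print(f"User-facing availability, including planned downtime: {availability:.2f}%")
```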

Iterating On Your SLOs

This last section is iterating on your SLOs. Just a reminder: if your monitoring isn't screaming at you, everything is not necessarily fine. Just because we have this monitoring for our customer SLOs now, we're not out of the woods yet. Customers will still experience outages that we don't have any monitoring for. We can't say, "Monitoring says it's fine, therefore there's no problem." Your monitoring doesn't determine the reliability of your service. It's your users that do.

The way we do this is with regular SLO reviews. SLOs are never set in stone. You always need to iterate on them over time. This doesn't mean that when there's an outage, you tweak the knobs and the numbers so it looks pretty to your business stakeholders. Don't do that. Create feedback loops, every 6 to 12 months, where you say, "We're all going to get in a room. We're going to review these SLOs. Then we're going to update and iterate on them so they're better." We want to make sure that they represent our users' expectations of our service.

When you're first starting out with SLOs, you want to do these reviews pretty frequently. You don't want to be six months or a year down the line and realize, "The initial SLOs we built aren't that good and they're way off." A lot can change in a year. Your business could completely pivot to new markets, with different expectations and experiences. We want to periodically review SLOs. There are multiple ways to do this. You can track how often they go into violation, and you can look at outages that your SLOs aren't catching.

We want to analyze our outage and support gaps. If your company practices a postmortem philosophy, which it should, then when an outage occurs, you write a report where you try to learn everything about the outage. You include a ton of details: what products were affected, how long users were affected, all these things. I'm sure you know how many outages you have. We take this data and say, we know the outage occurred at this time and was mitigated at this time; let's see how that was reflected in our monitoring. I have this red band here that shows when an outage occurred and when it was recorded in the postmortem. As you can see, the SLO isn't covering everything. Something needs to be fixed here. We want to make sure that when we do these SLO reviews, we can update them and make a plan to iterate on our SLOs as well. For folks that don't have outages, maybe support tickets are more the thing that your company sees. Analyzing support tickets can be pretty useful too, but you want to be cautious if they're not availability or performance related, because then you'll start to get into the weeds. In that case, you could set tags on your support tickets so you can collect this data somehow.
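A small sketch of that review step, with hypothetical postmortem outage windows and SLO-violation windows, to surface outages that never showed up in your SLO monitoring:

```python
from datetime import datetime

# Hypothetical outage windows taken from postmortems and SLO-violation windows
# taken from the monitoring system, as (start, end) pairs.
outages = [
    (datetime(2020, 3, 4, 10, 0), datetime(2020, 3, 4, 11, 30)),
    (datetime(2020, 3, 9, 2, 15), datetime(2020, 3, 9, 2, 45)),
]
slo_violations = [
    (datetime(2020, 3, 4, 10, 20), datetime(2020, 3, 4, 11, 0)),
]

def overlaps(a, b):
    # Two windows overlap if each one starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

for outage in outages:
    if not any(overlaps(outage, v) for v in slo_violations):
        print(f"Coverage gap: outage {outage[0]} - {outage[1]} never registered as an SLO violation")
```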

Surveys and Other Feedback Channels

You can also ask your customers: do some surveys, and allow your customers to actually reach you and leave comments and feedback. It's not great if the customer has to talk to you every time there's an outage, but it is a useful data point. I've seen companies where sometimes that's their only signal, so it's not great on its own. Another thing you can do is use web trend analyzers or scrapers to see how many folks are mentioning your website's reliability on social media, and adjust accordingly. Also, not everything has to be on fire for you to do these things. You can proactively reach out to your customers and ask them how their experience with reliability on your service has been. Whatever method you choose, it doesn't matter if you collect outages, look at support tickets, or gather feedback, just make sure you have some data. If you have these postmortems, adding tags that help you categorize where an issue occurred is also extremely useful.

You don't tweak your SLOs just because everything's on fire. In fact, if you're doing a rockstar job, your SLOs haven't been violated, and people aren't complaining, they're really happy with your product, you can also loosen your SLOs a little bit. That gives some breathing room to your engineering team and allows your developers to push out more releases, even though they could potentially break things with the increased frequency of new releases. We want to do all these things because, in the end, we care about our customer. We care that they're happy with our service, and that they don't leave our service and join a competitor's. It also makes us happy as on-callers and engineers to not have to wake up in the middle of the night to deal with these things, to not rip our hair out trying to figure out where the seg fault is occurring. Happy customers, happy engineers.

Questions and Answers

Participant 1: [inaudible 00:34:38] sees in its normal functionality. Then how will you set up SLIs and SLOs on those?

Quach: So, if you had a service that doesn't have an explicit customer, but everyone uses it, and it could break everything, how do you set SLIs for that? Network is really tricky. It's one of those things where the application teams and the smaller teams who depend on your network have to account for the failures the network might face, because the network is a huge dependency for them. Those would be your customers that you would chat with. Everyone knows what network outages do. It's not pretty. They would work with you to account for your network team being a dependency.

Participant 1: Can I somehow use the outage data? Suppose I am getting customer problems, I count 100 of them in a month, and I set an objective to reduce that to 90 the next month, then 80. Can the reported problems be used as an objective, an SLO?

Quach: Absolutely. Some companies don't have outages in that traditional response model, but they do have support tickets, and the way they measure their success, whether or not they've produced toil, whether or not they've reacted to reliability problems, is by measuring the amount of support tickets they actually get. That's absolutely a valid approach.

Participant 2: You mentioned a couple times when the numbers of the SLOs aren't positive. I'm probably not alone in that there are cultures where perfection is expected. Do you have any guidance or tips on how to break into an SLO driven approach in that transparency that's needed and required when there is that resistance to, I don't want to show bad numbers?

Quach: It's difficult. You have to get buy-in from your top leadership. They have to buy into the idea that your service will fail; you cannot avoid that. The best way to counteract that is to make sure that you are measuring the right things and responding in the right ways. They don't like that answer. Because we work with banks and such, everything needs to work all the time, as flawlessly as possible. That's not how the real world works. We pitch them this idea of why SLOs are necessary and how reliability works in that sense. You do need top-level leadership buy-in to bring SLOs into your organization.

Participant 3: How would you change the approach that you described when you apply SLOs to microservices? As an example, data processing pipelines doing the same type of data processing on different datasets: would you still measure a single set of SLOs, or one set for each individual microservice?

Quach: It really depends on what makes sense for your organization. We tend to gravitate towards the whole user experience. Do you have separate teams that are responsible for each microservice?

Participant 3: Same team, different customers as a user base.

Quach: You can create your SLOs in a way that integrates what you're measuring across the microservices, like, "It's hitting this other microservice, and we want to get this response back." You can measure the whole thing holistically. I have also seen cases where people measure pretty specific functionality.

Participant 3: How do you integrate the dependencies on third-party services? For example, you're forwarding the data into Elastic, and it's down, but it's not your service that's down. How do you incorporate that into the SLOs?

Quach: Dependencies are tricky. For starters, you can't build a five nines service on a three nines dependency. If you know what the dependency's reliability is, you want to make sure you account for it. Maybe you run your service in other regions that don't depend on that dependency, or things like that.

Participant 3: Let's say our service is not meeting the SLOs, and the third party is not improving, and they're also not changing what the SLO currently is. You present your numbers, and you constantly end up explaining to someone that it's not really your numbers that are down, it's that third party?

Quach: There's a huge chunk of this presentation that I normally give that I left out, because I wanted to focus on the technical aspects. A lot of it is about culture. You have to get buy-in from the executives who manage these teams, with top-down mandates, so everyone follows SLOs and works together to make them work. That's a whole other thing. Essentially, you do need leadership buy-in to get everyone to play nicely.

Participant 4: If you're monitoring a globally distributed service [inaudible 00:41:09]. Do you focus on the service level then say, I don't [inaudible 00:41:15]. Do you focus on service levels of each individual site, or on aggregating the service as a whole, or …?

Quach: How do you monitor global services while also making sure that the smaller sites are taken into account? You might have a large site that most of your users go to, but maybe you have something in Alaska that only a few people access. If that goes down, that's a 100% outage for only 3 people. How do you account for all these different scales of users? We do both, depending on the business impact. Maybe you want to add weights to these different sites and set the percentage of requests that we care about at that weight. That's another method you can use, but you should definitely do both.
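As a sketch of that weighting idea (the sites, counts, and weights below are all hypothetical), a weighted global SLI could be aggregated like this:

```python
# Hypothetical per-site good/valid counts and business-impact weights (weights sum to 1.0).
sites = {
    "us-central": {"good": 990_000, "valid": 1_000_000, "weight": 0.9},
    "alaska":     {"good": 60,      "valid": 100,       "weight": 0.1},
}

# Weighted average of per-site SLIs, so a small site's outage counts according to its weight.
weighted_sli = 100.0 * sum(s["weight"] * s["good"] / s["valid"] for s in sites.values())
print(f"Weighted global availability: {weighted_sli:.2f}%")
```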

Participant 5: You mentioned that historical data is a good basis for coming up with SLOs. What if you're on a fairly new project that doesn't have that historical data yet? What would you suggest for coming up with reasonable SLOs? My initial instinct is to talk to your customer. The problem with that is maybe they don't give you a realistic answer, all things considered.

Quach: You could probably find out pretty quickly once you launch the service, and then they're like, "Wait a minute." I've never actually had that question before. I've never seen that. You can gauge it as a user yourself: if it takes 10 seconds for me to get a response to this query, I'm going to be unhappy. Maybe some users are ok with that. That's something you can do. Talking to the customer, you should absolutely do that and see what makes sense for them. As a user of the service, put yourself in their shoes as well.

Participant 5: I figured you could probably just reevaluate once you have more data that you can go off of.

Participant 6: A question related to the ratios and valid requests. One of the challenges I have for our services is that we are the ingress point for all requests. If that ingress point fails, even though we have synthetic monitoring and so on, you don't know how many valid requests you had, because you failed at the ingress point. The way we work around it is using time as an approximation of requests: seeing how long the outage was, and what the typical traffic metrics were during that period. Is that the right approach?

Quach: I think I touched upon this a little bit in the presentation, where I talk about folks measuring for uptime. How do you measure responses if your service is down and users can't actually access it? You lose all of that information. That's why I recommended putting up a maintenance page or something, and when you know you're getting hits to it, you can add that as a data point.

Participant 7: If you have a service that's partially subject to a third-party dependency for a business-critical workflow, say e-commerce, where some of your customers use your API, and depending on which customer it is, part of that workflow might be fronting an API call to a third-party integration that they brought with them. Would you want failures specifically caused by that third party to be reflected in your service's SLOs, or do you try to compartmentalize? That is, you have an SLO for the third-party service that you can look at, and then separately you have an SLO for your service that you try not to let be polluted by third-party-caused outages or failures?

Quach: What it all boils down to is how the customer experiences your service. If they're accessing this third-party service that's maybe not as reliable as you'd like it to be, you do want to take that into account, so that you can accommodate it and improve your reliability despite the third-party messiness in the middle. Think about the customer's whole experience and see where that would fit in.

Participant 7: To some degree, it would depend on whether or not the customer even perceives that this fault is not with your service, but with theirs. Also, to some degree, you could use that as just a data point for perhaps advising customers to pick different third parties to work with, if you can actually have data to tell them that specific ones are much more unreliable than others.

Quach: Yes. Generally speaking, if they're a traditional customer using a service, they probably don't care where things are breaking. They're just unhappy at you: "Fix this, please." You want to make sure that you think about all those things. I'd like to say that you could do that, but people don't care.

Participant 8: You mentioned that less is more when it comes to metrics, or at least SLOs. I find it's really useful when you've had an incident to be able to see what the metrics were on everything, because then you can root cause and figure out what happened.

Quach: I don't think all cause-based monitoring is bad. In fact, you should have it for that exact reason: troubleshooting. What I tend to see, though, is that people will alert on every piece of cause-based monitoring, "Server's disk is full," and all these other things. It's, "I don't want to get woken up in the middle of the night to deal with that." If you find out there's an outage, you can see all these things and correlate them. Actually, what we recommend is that when you send your actual alerts, you include little footnotes in the alert that say, "These are the other things you could probably look at as causes." Don't make them the main alert that wakes you up in the middle of the night.

Participant 9: I've heard the argument that one of the things observability tooling, which lets you look at high-cardinality things like customer IDs, can help with is seeing information that is otherwise aggregated away. Say I have a 99% availability target, or three nines, or I'm looking at a 99th percentile, or something like that. What if that 1% or 0.1% are actually some very valuable users that just have something in common? They're failing because their database shard is really hot, or something like that. Do you have any techniques or patterns for combating that problem?

Quach: When you have an outage and your SLO is being affected, but you have some really valuable users also being affected?

Participant 9: It would be where your SLI is aggregating too much, where on the whole my service looks fine, but there's actually a class of customer being really impacted. I've thought that maybe I could break things down into tiers; maybe that is good. Do you look at whether any individual customer should have a certain experience?

Quach: That is an absolutely valid reason to split your customers up into different tiers. If you have high-availability customers that are paying you lots of money, you care about what they are experiencing more than the other folks who are using your service for free or something. That's absolutely a valid method.

 


Recorded at:

May 22, 2020
