
With Observability, Cloud Deployments Don’t Have to Be Scary

Summary

Martin Thwaites discusses how to have the confidence to deploy at will. This ability allows developers and the wider team to know when things go wrong, and remediate them quickly.

Bio

Martin Thwaites is an Observability Evangelist & Developer Advocate @Honeycomb.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Thwaites: My name is Martin Thwaites. I'm here to talk to you about observability and cloud deployments, how they can be scary, and more specifically how we can use observability to make them less scary. I'm an observability evangelist. I work for a company called Honeycomb, as a Developer Advocate. We make observability products that can be used in these scenarios. We're going to be talking in general around observability, and how we can make things better. Something I hear quite a lot is deploying every day is impossible. Generally, that comes from people who've not embraced observability and have not spent the effort.

Why Are Cloud Deployments Scary?

Let's talk a little bit about why. Why are cloud deployments actually quite scary? For some people, they may not be. For some people, you may have already embraced a lot of the things that make cloud deployments good and scalable. For some people, they can be quite daunting. Generally, this is people who come from a world where we have things like long times between deployments. This for some people may have been days, weeks, months, in between releases. That can be a scary time, if somebody's telling you that you need to be deploying every day, this could be quite a big step for you. In those scenarios, what you'd end up with is a code freeze. You'd end up with a few weeks or months, where you're not allowed to make any changes other than small bug fixes. While an offline QA happens, where we've got a QA team, looking at a different environment, trying to replicate what's happening in there. Once all that QA has been done, what would happen is you'd have this thing called a release day, one of the scariest times for developers where everybody gathers round, and we start to look at a deployment. One of the things that you would do is you would take your environment offline. You'd throw up a holding page, so nobody can access that environment while you put that latest and greatest code onto that machine. You'd potentially delete the code that's there, add new code in there. Run it up in a local environment to be able to say, yes, it's now working. You do everything you possibly could to try and ensure that it is going to work when it goes live. You're relying a lot on those previous QA methods that they got all of that right.

Then comes a potentially worse day, which is post-release day. Post-release day is that really stressful time, where what you're trying to do is work out, did it actually work? Are customers actually getting that new code? Do they get the right expectations? A lot of people are quite happy with that approach. It's worked for quite a long time. However, now we're in a world of cloud, our customer expectations have changed, unfortunately. When you throw up a holding page on a site today, people think you've done something wrong. People think that you've been hacked. They think that something has fundamentally gone wrong in your system, so we just can't do those things now. What we have to do is start to think about our deployments differently. This is why we're also getting the business telling us that we need to deploy faster. We need to innovate faster. We need to be deploying every day. Google can do it. Facebook can do it. Why can't we do it? Those customer expectations are driving the business expectations.

Why Don't Our Teams Feel Safe?

All of that is then making our teams feel not safe. Because, I can't deploy every day, I've got all of these problems that I need to solve. They're fearing the reprisals from the rest of the business, that they don't deploy something right. They question the motives of the team, if they get something wrong, why did you deploy? If it was failing, why did you deploy it? I didn't know. They question the ability of the team, about whether they're capable of being able to deliver on the business's agenda because they can't deploy. All of that leads to a fundamental lack of psychological safety in that team. That lack of psychological safety is a vicious cycle. Because the team doesn't feel safe, they won't deploy the thing. Deployment times end up being longer.

What's Different With The Cloud?

What's different? Why is the cloud something that is different from what we were used to? We'll start with hardware provisioning. I come from a world where to get a new server required 22,000 forms signed in triplicate, in order to justify that expense. That just doesn't happen in the cloud. Spinning up a new machine is a command line. That makes things a lot easier to do in the cloud than when we were doing things on-premise. We're not just talking about on-premise here. People who've moved to the cloud and tried to use on-premise methodologies, where we're using virtual machines, where we're spinning up one virtual machine and tearing down another, are still in this mode where we've got some problems. What we're talking about here is cloud native. We're talking about applications that understand they're running in the cloud. They are using native functionality in the cloud. They're using databases that scale with the cloud. They're using messaging functions. All of these things put us in a very different world than we used to be in. The other big thing that is different in the cloud is advancements in observability. That's what we're here to talk about, to see how we can make things better.

Observability

We're going to talk a little bit about observability. Observability comes from a scientific concept. Wikipedia's definition says that it's a measure of how well your internal states can be inferred from your external outputs. About six years ago, our co-founder and CTO defined it in the context of software. That's very close to the definition of what we were doing in science. It's understanding and debugging unknown unknowns. That's the enhancement to this. We talk a lot about unknown unknowns, which are the things that you don't know that you don't know, because you can't know. It's still about being able to answer those unknown unknowns when they come up, based on what your system gives you.

How does observability fit into what we're talking about? Observability is not going to stop your application from failing. It's not going to stop it from throwing errors. It's not going to stop it from failing in the middle of the night. None of this is going to stop that from happening. What it does allow you to do in a much better way is deliver working software, is being able to deliver software that does what it needs to be able to do, but also does it in a way that means that you know that it's doing those things. Production is the only real system we've got. All those QA environments that we're talking about, they don't matter. They are not the same system. You can get quite far with functionality but you're never going to be able to test all of the different permutations. You're never going to be able to analyze what happens when a user hits your system.

What observability is going to allow us to be able to do in this context, is it's about asking questions. You think about when you were going live, we're talking about this post-release day, we're talking about even release day, of things going on. What you want to be able to do is ask questions that you maybe didn't know that you needed to ask when you see something that's a little bit off. If we think back to what it is that you're going to be able to do as an engineer, what things might I want to know? I want to know what the impact of my change was. It might be on response times. It might be on latency. I might want to be able to slice and dice that by different things because it might not be that it affects all the users. It might be that the stats aren't quite telling me what I thought they might tell me. What I don't want to be able to have to do is add more data, add more metrics, add more logs to be able to get more data out. I want to be able to ask those questions that I didn't know I wanted to answer.

I want to know whether my code has been used. How many users are using that new code that I've just pushed out there? Because great, I've pushed out a new feature, the system hasn't gone down. If it's not being used by any users, I've not really tested it yet. I might want to know whether my feature is actually being used, and again, by how many users. I may want to know, how has the time, the latency, the duration of a certain bit of functionality, has that changed positively or negatively? I might want to know my dependencies. Does my dependency change based on the data that I put into it? We're working in the cloud, we go cross region. We go across availability zone. We have multiple instances. Is it one particular thing doing worse than the other things? One of the common things that we see is load balancers using the wrong affinity: all of the traffic going to one server, and another server just sitting, twiddling its thumbs. These are things that you may not have known, when you were writing the code, that you needed to know. If you can answer all of those questions really quickly, once things have been deployed, that is how you're going to gain this ability to know whether your code is right.
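As a rough illustration of asking these questions after the fact, here is a minimal Python sketch. It assumes each request has already been recorded as one "wide event" carrying many attributes. The field names (`host`, `region`, `feature_new_checkout`, `user_id`, `duration_ms`) and the data are hypothetical, and a real observability tool would run this kind of query for you.

```python
from collections import defaultdict
from statistics import median

# Hypothetical wide events: one record per request, carrying every
# attribute we might later want to slice by. All values are made up.
EVENTS = [
    {"host": "web-1", "region": "eu-west", "feature_new_checkout": True, "user_id": "u1", "duration_ms": 120},
    {"host": "web-1", "region": "eu-west", "feature_new_checkout": False, "user_id": "u2", "duration_ms": 80},
    {"host": "web-1", "region": "us-east", "feature_new_checkout": False, "user_id": "u3", "duration_ms": 90},
    {"host": "web-1", "region": "us-east", "feature_new_checkout": True, "user_id": "u1", "duration_ms": 400},
    {"host": "web-1", "region": "eu-west", "feature_new_checkout": False, "user_id": "u4", "duration_ms": 100},
    {"host": "web-2", "region": "eu-west", "feature_new_checkout": False, "user_id": "u5", "duration_ms": 70},
]


def slice_by(events, attribute):
    """Group events by any attribute and summarise each group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[attribute]].append(event)

    summary = {}
    for key, members in groups.items():
        summary[key] = {
            "requests": len(members),
            "users": len({m["user_id"] for m in members}),
            "median_ms": median(m["duration_ms"] for m in members),
        }
    return summary


if __name__ == "__main__":
    # Is one server taking most of the traffic (the load balancer affinity problem)?
    print(slice_by(EVENTS, "host"))
    # Is the new feature being used, by how many users, and is it slower?
    print(slice_by(EVENTS, "feature_new_checkout"))
```

The point of the sketch is that `slice_by` takes the attribute as an argument: no new metric or log line has to be added to the application to ask a new question, as long as the attribute was recorded on the event.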

DORA Metrics

I'd like to introduce the DORA metrics. DORA is an assessment that Google Cloud did, and have been doing on a yearly basis. What exactly are the metrics? The metrics are something that they use year in year out. They're based around four key metrics. We've got the change failure rate, how often do you deploy things that break stuff? How quickly can you recover when you do break things? Because inevitably, we're all going to break things sometime. How quickly can you recover from that? How long does it take you to get a deployment into your live environment? Also, how often do you deploy your changes? Those may seem like superficial metrics, and I get it. They are important. Why are they important? For a start, they're all based on cloud. These are based on users and engineers that work in the cloud, day in, day out. They're really key to people who are working in those paradigms. You're not going to be getting these metrics skewed by people who are also working with on-premise systems that can't do a lot of the things that we need to do when we're in the cloud. They're also important because they allow you to track your maturity around your releases, and more specifically your deployments. They also monitor application health, and team health. Team health, to me, is the most important thing that DORA allows us to do. It's not explicit as to the reason why DORA exists. To me, this is about an organization showing that this is something that they care about. It's showing that, I understand that things like deployment frequency are important to my developers and me.
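To make the four metrics concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record, its field names, and the sample data are all assumptions for illustration. In particular, time to restore is measured here from the failing deployment to the restore, which is only one possible definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it break something in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments, period_days):
    """Compute the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Assumption: recovery is measured from the failing deployment to the restore.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": total / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / total,
        "change_failure_rate": len(failures) / total,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }


BASE = datetime(2022, 12, 1, 9, 0)
SAMPLE = [
    Deployment(BASE, BASE + timedelta(hours=2)),
    Deployment(BASE + timedelta(days=1), BASE + timedelta(days=1, hours=4)),
    Deployment(
        BASE + timedelta(days=2),
        BASE + timedelta(days=2, hours=6),
        failed=True,
        restored_at=BASE + timedelta(days=2, hours=7),
    ),
    Deployment(BASE + timedelta(days=3), BASE + timedelta(days=3, hours=4)),
]

if __name__ == "__main__":
    for name, value in dora_summary(SAMPLE, period_days=4).items():
        print(f"{name}: {value}")
```

With the sample data, the sketch reports one deployment per day, a four-hour mean lead time, a 25% change failure rate, and a one-hour mean time to restore.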

Recovery

I want to focus on one specific metric that DORA provides us with, because that is where observability can really help us. It's also how you can stop things being as scary. That's recovery time, the time to recover your service. More specifically, it's the time that it's going to take you to fix the thing that's wrong in production. Observability is not going to make it easy for you to write that code. It will in certain areas, if you can do certain things. The big thing though it is going to do is reduce the time it's going to take you to discover that there is an error happening. Because if you invest in observability, you're going to be able to see things happening quicker. More specifically, you're going to be able to isolate where that thing is occurring quicker. You're going to be able to find where in your entire stack that particular error is happening at that point, which is a really interesting thing that observability gives you. We're not talking here about you knowing that a problem is going to happen, and you writing a metric, and you writing some log entries, and you putting that on a dashboard, and you monitoring that on a dashboard. That's not what we're referring to here. What we're referring to is the scenarios where you didn't know this was going to happen, but you have received some alert that something wasn't right. Observability and focusing on those abilities to know the things that you didn't know that you needed to know, which I get sounds complicated. Observability is about the idea of saying, I might need to know something about this in the future. I'm just going to dump all of that data in.

Tracing

What is it that you can do today that will help these things be a little less scary, with a focus on observability? The big thing is something called tracing. Tracing is a very key component in observability, because it's what allows us to see what's called causality. It's about being able to see that one thing caused another thing to happen. That is something that's obviously really important when we talk about finding where things are, and finding where these problems are. Specifically, when we talk about tracing, OpenTelemetry is the de facto standard now. It's a cross-platform piece of functionality that's very easy to implement. It doesn't cost anything, it's not a product that you buy. It's something that you can implement in your product, in your application now. On its own, it doesn't really do anything, but what you can do is you can add something called auto-instrumentation. A lot of the libraries that you'll use today, whether it's libraries around using SQL, using document databases, using Redis caches, or communicating with third parties over HTTP or gRPC, all of those generally are supported out of the box in most languages. You can just add those things in and start to see some really in-depth information about your system. On its own, that isn't going to send the data anywhere, you need somewhere to send it to. There are a lot of open source solutions that are out there that you can use. Obviously, Honeycomb provide a cloud based solution that makes it incredibly easy for you to get that data in and visualize where things are at.
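To show what "causality" means in a trace, here is a minimal, standard-library-only Python sketch. It is not the OpenTelemetry API: the `span` helper, the `FINISHED_SPANS` list (standing in for an exporter) and the span names are all hypothetical. The idea it illustrates is that every span records the ID of the span that caused it, plus whatever attributes you choose to attach.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Tracks the span that is currently active, so children can find their parent.
_current_span = ContextVar("current_span", default=None)

# Stands in for an exporter or back end in this sketch.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Record a unit of work, linked to whichever span was active when it started."""
    parent = _current_span.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": dict(attributes),
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def handle_request(user_id):
    """Hypothetical request handler: one root span with two child spans."""
    with span("GET /checkout", user_id=user_id) as root:
        with span("db.query", table="orders"):
            pass  # the database call would go here
        with span("http.call", peer="payments"):
            pass  # the call to a third party would go here
        # Extra context can be attached while the code runs.
        root["attributes"]["cart_size"] = 3
    return root


if __name__ == "__main__":
    handle_request("u1")
    for finished in FINISHED_SPANS:
        print(finished["name"], "parent:", finished["parent_id"])
```

In a real application the auto-instrumentation libraries would create the child spans for the database and HTTP calls, and an exporter would send them on to a back end rather than appending to a list.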

The other thing that you can do with OpenTelemetry is you can start to use that when you're developing locally. You can start to use this as a replacement or an augmentation to your current debugging lifecycle. You can start to see how an individual request, whether that's a message from a queue or an API request, tracks through your system. You'll start to see, what information might I want to know? What information is available to me when I'm running through this code? I'll add that to my spans and my tracing, and then I'll be able to see that when we go to production. All of those are things that you can do today, that don't really affect the functionality of what you're delivering. They're quick. They're easy to add. You don't even at the moment need to push that to a third party, you can just put that in and start using it locally. That's something that you can do today, inside your product.

Monitor the Metrics

The other thing that I would really encourage everybody to start doing is monitoring some of these DORA metrics, because we can't improve what we don't monitor. Some of the key ones that are not going to be hard for you to do, especially if you're in a world where you're deploying on an irregular basis. If you're deploying every couple of weeks, every couple of months, these are things that are really easy for you to put together. The amount of time that it takes from your commit to that being deployed into your production environment is something that you should probably be able to do quite easily. As soon as you can then see that, you can start to say, is this having an impact on our agility? Is this having an impact on our ability to deliver to our customers? The other thing that's really interesting to start monitoring is pipeline times. Linked to what we were saying about tracing, you can actually use tracing to start looking at your deployment pipelines and your build pipelines. To be able to look at, how do I improve the time of these build pipelines? Because that will then feed into your commit to deploy times. Again, a really easy thing for you to start monitoring, so that you can then start to use that to improve things going forward.
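As a small illustration of monitoring pipeline times, here is a Python sketch that treats each pipeline stage like a span with a start and end time. The stage names, timestamps and commit time are hypothetical. A real setup would read them from your CI system or from pipeline traces.

```python
from datetime import datetime, timedelta

# Hypothetical timings for one pipeline run: (stage name, start, end).
STAGES = [
    ("checkout", datetime(2022, 12, 1, 10, 0, 0), datetime(2022, 12, 1, 10, 0, 30)),
    ("build", datetime(2022, 12, 1, 10, 0, 30), datetime(2022, 12, 1, 10, 6, 30)),
    ("test", datetime(2022, 12, 1, 10, 6, 30), datetime(2022, 12, 1, 10, 18, 30)),
    ("deploy", datetime(2022, 12, 1, 10, 18, 30), datetime(2022, 12, 1, 10, 21, 30)),
]
COMMITTED_AT = datetime(2022, 12, 1, 9, 45, 0)


def stage_durations(stages):
    """How long each stage took."""
    return {name: end - start for name, start, end in stages}


def slowest_stage(stages):
    """The stage to look at first when trying to shorten the pipeline."""
    durations = stage_durations(stages)
    return max(durations, key=durations.get)


def pipeline_time(stages):
    """Wall-clock time from the first stage starting to the last stage ending."""
    return stages[-1][2] - stages[0][1]


def commit_to_deploy(committed_at, stages):
    """Time from the commit to the end of the final (deploy) stage."""
    return stages[-1][2] - committed_at


if __name__ == "__main__":
    print("slowest stage:", slowest_stage(STAGES))
    print("pipeline time:", pipeline_time(STAGES))
    print("commit to deploy:", commit_to_deploy(COMMITTED_AT, STAGES))
```

In this made-up run the test stage dominates the pipeline, and the pipeline itself accounts for most, but not all, of the commit-to-deploy time. The rest is time the commit spent waiting before the pipeline started.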

The other really interesting one, is time to customer. Obviously, the previous two metrics are included within there. This is that wider metric, of how long does it take from a customer or a stakeholder requesting something, for us to actually get it in the hands of that customer? To me, it's not specifically a DORA metric, but it is a really powerful metric when it comes to business stakeholders, because they care about this. If we can use that as our metric, then we can start to think about how we prioritize some of these other metrics. The obvious one that we do need to monitor is downtime. How long was our system down when any particular thing happens? That's really important. All of those metrics together are going to allow you the ability to go back to your business stakeholders, your product owners. The people who are giving you work and telling you what the priorities are, whether that's an engineering manager, or even just your CEO, depending on how big your business is. If you can get these metrics together, there's something that you can provide to the business that will allow you to focus on some of the things that we've talked about. That will allow you to then get down to your daily deployments, hourly deployments, per commit deployments. These things that I've talked about will allow you to get back, and they're all around this idea of investing in observability.

Questions and Answers

Losio: You started talking about, basically if you don't deploy daily, now I'm pushing a bit on one side, that is, if you don't deploy daily, you're doing it wrong, more or less. Is there still a case for not deploying daily for a heavy project where, maybe we're doing things not in an ideal world towards observability, towards continuous deployment, whatever? Things might still work without going into the full maintenance page scenario.

Thwaites: You're not wrong by not deploying every day. It's all about context. That's the true consultant answer, obviously, it depends. It is all about context. What deploying daily gives you, and it's not about daily, it's not even about weekly, it's not even about hourly, it's about deployment frequency, and reducing the staleness of the code that you write. Is there a scenario where you shouldn't be deploying daily? Yes. If you don't write any code for a few days, don't deploy anything. If you're writing code day in day out, and then you're waiting two weeks for that to be then deployed into production before you see customers using that particular bit of code, that's where we get risk. That's why deployments become scary. Is there an interim? Absolutely. This is all about reducing the amount of code that you're about to deploy and reducing the staleness of that code.

Losio: I'm not saying you definitely need old-style. As you talked a lot about going cloud native and leveraging cloud technology, I hope that we are not scaring too many new developers that are still maybe not on the cloud, or just in a hybrid situation. Often the very first step is a basic, or maybe sometimes not even that basic, lift and shift of the entire stack to the cloud. At that point, what would be the next step? Does any of the advice on DORA still apply? How can you do it in an incremental way maybe?

Thwaites: The key is focus in on that metric around, how long does it take you to deploy? That first metric about being able to understand if I make a change, even in my lift and shift environment, how long is it going to take me to fix that if it goes wrong? Really focus in on that, and focus in on, what small changes can you put in that will allow you to get from a fix that you've got, that one line fix, to getting that deployed? Absolutely, just run some more servers. That's fine. How do you get a change onto those servers, and don't neglect that, just because you want to get those servers live, because it will bite you. It will absolutely bite you when you've deployed that to 60 servers, and you've not thought about how you're going to get that change on, and how quickly you can get a new change on. You go, I forgot to put the GDPR compliant language on the bottom of the page, now I'm non-compliant. It's going to take me four days to deploy this thing. It might be small, or it might be large, but it will bite you.

Losio: When we talk about cloud provider, we would like to think about, how to keep your monitoring portable across products using open standards. As well, I might use a cloud provider. I might use AWS. I might use Azure. I may use something else. There are already tools there for me that I can use, often they're not really open source or they are derived from open source. How well does the tool you mentioned integrate with that? Do you see a problem with that? Do you see a vendor lock-in with that? Do you see a step not to take?

Thwaites: Avoiding vendor lock-in is the key concept behind OpenTelemetry. OpenTelemetry is built on this idea that you receive data in, you can do some stuff with it, and then you post it back out in a compliant way. They use something called the OpenTelemetry Protocol (OTLP), which is the mechanism of sending that data in a standardized format, so you can send it anywhere. All of the cloud providers, Azure, AWS, GCP, they're all building on top of OpenTelemetry now. AWS has their own distro that they've created specifically for Lambda, to allow you to be able to do this sort of thing. They're all moving towards this idea of standardizing the way we do telemetry, and leveraging all of the stuff that we get from the people writing instrumentation on top of that, so that you can just push it to whatever provider it is that you use. You can push it to one provider. You can actually push it to multiple providers, if you want to, and do a tradeoff. OpenTelemetry has three different types of signal: traces, logs, and metrics. You can actually use a different provider for each of those three types of signal if that's what you so choose. All the cloud providers are pushing towards that now as their mechanism to get data into their tools, whether it's X-Ray, or Application Insights, now Azure Monitor. Stackdriver is the GCP one. They're all moving to this idea that they will work on the OpenTelemetry Protocol, and they will receive that data in that format. It's the visualizations and what they do with that data once they've got it, which is what differentiates those backends.

Losio: You don't see that point that one is not really a vendor lock-in, because that's their own specific tool with their specific maybe visualization aspect, but you can still move from one to the other one in an easy way, or at least easy compared to other parts of your stack.

Thwaites: Absolutely. If you use something called the OpenTelemetry Collector, which is an interim gateway, you can actually move even quicker. Because you can have 100 services pointing to one gateway, change the settings in the gateway and send all the stuff to a different one. The vendor lock-in becomes how easy your tool is to use. It's not vendor lock-in because I have to write code changes. It's vendor lock-in because I don't want to lose the functionality that I've got in this vendor for analyzing my traces, for analyzing the performance of my system. To me, that's the best kind of vendor lock-in, because that means the vendor is doing their job so well, that you don't want to move away.

Losio: It's basically the case where your operational team may say, I want to stick to that one because I like that because I'm used to it, because it has functionality that the other one doesn't provide. The case that it is not that I cannot get out, it's more like a feature itself.

I have on-premise, and AWS. How do you really manage that? How do these two support observability across, what is, at the end of the day, quite a common scenario?

Thwaites: This, again, comes down to the idea that OpenTelemetry, yes, it's done by the Cloud Native Computing Foundation. This makes it sound like it's only for the cloud, it's really not. The idea of OpenTelemetry is it's a load of tools that are cross-framework that allow you to have consistent outputs. You can absolutely run that on-premise, and then you can send that data out via the OpenTelemetry Protocol (OTLP) to your providers from wherever it is, from all of those clouds. It doesn't care which cloud it comes from. There's also a load of key semantic attributes that allow those backends to be able to provide analysis that says, this one came from AWS, this one came from Azure, because those attributes are just added by default into there. OpenTelemetry as a standard can run anywhere you want. It runs really well in the cloud, because if you're running Kubernetes, for instance, then we can add collectors as pods, and we can have lots of other nice things that go with it. That doesn't mean that you have to do that. I've run this as a full lift and shift on EC2 instances, to be able to do it that way and run it on ECS containers, so that all of the data is coming through there. There's a million different ways for you to deploy either an OpenTelemetry Collector or be able to put it into your code. It's that versatile. It's that much about not vendor locking people in and saying, "I'm a vendor, I need to write an open source. I need to write an on-premise version of this thing." I can just go, the community are building these things, and somebody might have built it for Honeycomb, and we only accept OpenTelemetry data so it's available to everybody.

Losio: I'd like to actually shift a bit to one topic you mentioned, it was team health as well as metrics. Because some of the metrics you mentioned, I found really interesting, because they were not just really purely technical metrics, you were talking about observability. I loved the time to customer of a feature or things that you have as a metric. I was thinking as well, maybe because I'm dealing with operation, I love new metrics. I'm always a bit scared of that metric fatigue, that you add one more metric, because we had a problem. We had a bug, we fix it, and so we monitor that. Then after a while the team start to forget about that metric, because another metric becomes more important or is simply newer. Sometimes I have that thing like, let's get rid of all of that stuff from scratch. Is that more of a problem?

Thwaites: Interesting that metrics, as far as I'm concerned, are the devil. That's one of the reasons why, because we create metrics because a problem exists. I always use the analogy of a sign doesn't exist because somebody wanted a sign to exist. A sign exists because somebody did something wrong, and you want to tell somebody not to do it anymore. Don't feed the bears with your hand. I'm pretty sure that exists because somebody fed the bear with their hand and it didn't end well. We put in metrics for that reason, because we want to check the response times, because response times are important to us. Tracing is what takes that to a different level. Tracing is about this idea of high cardinality and high dimensionality. The idea of I put in my tracing, and I add all of the data onto it. I don't create a metric that says I'd like to track the response time by GET request. I don't say, I want to map the response time onto this group of users. What we do is we just dump all of that data out.

Then those metrics, we can ask the questions afterwards. We can say, could you tell me what this was? We don't get metric fatigue in that way. What you actually do is go, this is important to me now. I'll do it in my observability tool, that visualization tool. I will ask for those questions and those metrics. Then I can remove it if I don't need it. I'm not having to write code to remove a metric. I'm just putting it in my tool and then I'm removing that alert. That's where it also comes into something called SLOs or service level objectives, which is the idea of you dictating what is important. You're negotiating with your engineering teams and your product managers and your CEOs, whoever your business stakeholders are, about what is important. Then we use those as what gets people out there. We use those as the things that indicate when things are wrong. Those are the things we monitor. That gets us away from all of that alert fatigue, and that's key to the observability concepts and key to tracing above metrics and logs. Because we can ask all of those really interesting in-depth questions and provide things like error budgets, error budgets by a certain criteria.
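As a rough sketch of what an error budget looks like in numbers, the Python below derives one from an SLO target and then breaks it down by an arbitrary attribute. The function names, the `plan` attribute, the 300 ms threshold and the sample events are all assumptions for illustration. An observability tool would normally calculate this for you.

```python
from fractions import Fraction


def error_budget(slo_target_percent, total_events, bad_events):
    """How many bad events the SLO allows, and how much of that budget is used."""
    # Fraction avoids floating-point noise in expressions such as 1 - 0.999.
    target = Fraction(str(slo_target_percent)) / 100
    allowed_bad = (1 - target) * total_events
    return {
        "allowed_bad_events": float(allowed_bad),
        "remaining_events": float(allowed_bad - bad_events),
        "budget_consumed": float(bad_events / allowed_bad) if allowed_bad else None,
    }


def budget_by(events, attribute, slo_target_percent, threshold_ms):
    """Error budget per value of any attribute, treating slow requests as bad events."""
    counts = {}
    for event in events:
        total, bad = counts.get(event[attribute], (0, 0))
        is_bad = event["duration_ms"] > threshold_ms
        counts[event[attribute]] = (total + 1, bad + is_bad)
    return {
        key: error_budget(slo_target_percent, total, bad)
        for key, (total, bad) in counts.items()
    }


# Hypothetical events: ten requests per plan, one slow request on the free plan.
SAMPLE_EVENTS = (
    [{"plan": "free", "duration_ms": 100} for _ in range(9)]
    + [{"plan": "free", "duration_ms": 900}]
    + [{"plan": "pro", "duration_ms": 100} for _ in range(10)]
)

if __name__ == "__main__":
    # A 99.9% target over 100,000 requests allows 100 bad requests.
    print(error_budget(99.9, total_events=100_000, bad_events=25))
    print(budget_by(SAMPLE_EVENTS, "plan", slo_target_percent=90, threshold_ms=300))
```

Because the budget is computed from the recorded events rather than from a pre-declared metric, changing the threshold, the target, or the attribute you break it down by is a change to the question, not to the application code.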

Losio: You're saying that basically the tracing helps me to collect the data, the data is there. I still don't have the question in my mind. I still don't know that a problem exists until the problem is there. Then I can eventually ask that question and set my alarm eventually in the future.

Thwaites: That's that key concept that production is our only real system. We write a metric because the QA team did this and it was really important to them. There's 5 of them on the QA team, there's 10,000 users. I think they're going to find a lot more things than those five QA team members are. Nothing against the QA team. The QA team is highly likely amazing, but they're nowhere near as good as the 10,000 users operating my system. That's through experience, and learned experience over the many years of running production systems. Users are much better QA.

Losio: I found an interesting point in your presentation, about what it means to have time at the end, and what it means to deploy a new feature. The key point is that the system should be up and running, and the key metrics that were there before the new feature was added are still there. At the same time, you still don't know if the new feature works. It might as well be that no one is using your new feature yet, so it adds a level of complexity that is quite hard sometimes to measure. I was actually quite curious about the time to customer that you mentioned for, say, a new feature that we're going to add. How do you define that? Does time to customer start when you begin to implement it, when it goes into your backlog, or when it goes live?

Thwaites: Two points. What's important to you is not what's important to me. What is important for you to be able to change the process and put in front of business stakeholders is not the same for me as it is for you. It's about context. It's really about trying to think about the thing that you can measure, because if you can't measure it, it's pointless putting it as a metric. If you look at that intersection of what's important to you as an organization, maybe it's the CEO is always breathing down your neck, that a feature hasn't been deployed. In that case, it's not useful to get it from the customer. It's from when the CEO knows about a feature. The time from CEO knowing about the feature to it being deployed should be a day. It doesn't matter whether it's spent four months in a backlog, it's from when the CEO cares about, because that's the thing that the CEO is going to care about. He's not going to care about the fact that it was four months in the backlog. He might care that it is four months since a P1 VIP customer mentioned it. It might be since it made it onto an actual production backlog. Those are the things that are really important, really, very business centric. They're business context specific.

Losio: How do you see chaos engineering? As you mentioned that production is our only real environment, how do you intentionally break it? Is it part of the process, or is it another dimension entirely separate from the topic of observability?

Thwaites: Chaos engineering without observability is just chaos. Chaos engineering doesn't make observability better. Observability allows you to look at what you're doing with the chaos engineering and be able to say that it's working, or I know that it's gone wrong, because chaos engineering is replicating what could happen in production. Yes, you could do that in your own environments but it could happen in production, so why don't we try it in production? I wouldn't recommend doing that on a life and death system for the first time ever. Those maybe aren't the right things. There are many large scale organizations with business critical functionality that are doing it. Don't go into chaos engineering without observability. Don't go into chaos engineering without really thinking it through. Don't just spin up Chaos Monkey on your cluster, and just have it bringing down pods all over the place. Then get your CEO to say, why is this unknown?

If you write any code, and you don't know whether a customer's touched it, did you actually write any code? Because that's the only way that you know that it was written, is a customer touching it.

Recorded at:

Dec 04, 2022

Martin Thwaites