Reduce ‘Unknown Unknowns’ across Your CI/CD Pipeline

Summary

The panelists discuss monitoring and observability methods that DevOps and SRE teams can employ to balance change and uncertainty without the need to constantly reconfigure monitoring systems.

Bio

Richard Whitehead is Chief Evangelist @Moogsoft. Christine Yen is CEO/Cofounder @Honeycombio. Aaron Blythe is Organizer @DevOps Kansas City. Katie Poskaitis is a Staff Site Reliability Engineer. Paul Snider is Senior Director, Data Platform Engineering @Cerner. Michelle Brush is Engineering Manager, SWE-SRE @Google (Moderator).

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Brush: Welcome to the roundtable on reducing unknown unknowns in your continuous integration and delivery pipeline. I am Michelle Brush. I am a software engineering SRE manager at Google, working on Google Cloud Platform.

Blythe: My name is Aaron Blythe. I've had the opportunity to lead two different teams where we got to reimagine the code pipeline for getting into production inside of a company, and I've probably consulted with about a dozen more. The first was pre-Jenkins 2.0, and the most recent used the YAML-based pipelines that are contemporary now.

Yen: I'm Christine. My background is on the dev side of DevOps. Today, I'm the CEO and co-founder of Honeycomb. For anyone unfamiliar with Honeycomb, we do observability.

Poskaitis: I'm Katie Poskaitis. I'm an SRE at Google. I work on backup and recovery strategies in GCP.

Snider: I'm Paul Snider. I'm a software engineer. I lead the database platform organization at Cerner Corporation. We're a healthcare IT vendor. We support large hospital systems, rural hospitals, doctors' offices, all the way down to your medical record on your phone. For a little perspective on the company: about 3,000 babies a day are born in our software, and there are more doctors' offices running our software than there are Starbucks locations worldwide. I've run the gamut of jobs, from frontend software developer to now running our software operations on our data tier.

Whitehead: I'm Richard Whitehead. I'm Chief Evangelist at Moogsoft. We're on a mission to bring AI-based technologies to the practice of observability. I'm in marketing right now, but I do have an engineering background. I try to avoid writing software, because it keeps ending up in production, and that's too much responsibility for me.

Deploying To Production on Fridays

Brush: I wanted to kick things off by asking one of the more controversial questions that we find in DevOps, and I didn't know that Christine had dressed for the occasion. I will send it her way first. The question is, do you support deploying to production on Fridays?

Yen: Honeycomb absolutely supports deploying on Fridays. Of course, I have confidence in our engineering team and our tooling to be able to see quickly if behavior deviates from expectations. We build a tool for other folks to be able to have that same confidence. Deploying is one piece of the whole software delivery process. We think we should be making sure all the other pieces are as rock solid as the deploy piece, not just nerf one part of it to provide a false sense of security over the weekend.

Brush: As someone who does not deploy on Fridays, I definitely sympathize; I would love to have software systems that we trusted fully and completely, so we could say, "Weekends, no worries." I wonder if anyone else is in the position I find myself in, where we don't feel we're there yet on no-worries Friday deployments.

Snider: I think it depends on the spectrum of risk. I can say we've done specific Friday deployments, Saturday deployments, Sunday deployments in some of our most critical systems, when they could tolerate downtime if needed. Specifically, I can't interrupt healthcare: someone can't be in the middle of a surgery and have the system interrupted, so we try to avoid those. At certain risk levels, though, we do get into those deployments. There is a goal to get away from it, Christine, and I'm definitely interested in what you have to offer, but we do find ourselves there.

Journey to Confident Deploys

Brush: That leads me to some of the work that Aaron mentioned that he'd done around getting more confident in your ability to deploy. I'm curious if you can talk through some journeys that your previous employers or people you've worked with had to go through in order to get confident in their deployments.

Blythe: I don't deploy on Fridays currently, because I have a compressed schedule and I don't work on Fridays. The last three Thursdays, at 4:00, I've deployed. Part of the reason I have the confidence, and this is the hard part about software, is that my team built the most recent thing we shipped from scratch. It's a data pipeline to get data from on-prem up to the cloud. I know every line of that Python code, because I've either reviewed it or written it. For legacy systems, where someone else wrote the code and no one's touched it in a long time, I don't have the same level of confidence, because there are a lot of bits that I don't know. There are a lot of those unknown unknowns. That, for me, is usually the line: how much of it have I been able to soak into my being? If I can't soak all of it in, then I'm a little apprehensive, and I don't handle it with the same gloves I do when I've understood it all.

Best Practices to Make the Black Box Less of a Black Box

Brush: That's an interesting distinction about the way you might treat a legacy system that's more of a black box to you, is different from how you treat something that you've been more involved, and it's a recent development that's more fresh on your memory. A lot of this panel roundtable was meant to touch on observability. I'm curious if anyone has any best practices or approaches that they've seen for making that black box seem less like a black box, even when it's code that you inherited, or it was long ago written.

Poskaitis: The best way that I know to do this is to start by improving test coverage, first and foremost. A lot of times in legacy systems, it's hard to add new functionality with new tests and ensure that you're not breaking things. That's the thing you're always worried about: is the legacy system going to break because I didn't understand exactly what this class did, or something like that? Improving your test coverage, and refactoring things out so the code is more testable, can make a big difference there.
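
One way to make a legacy system more testable before you fully understand it is a characterization test: pin down what the code does today, so that refactoring for testability surfaces regressions immediately. The sketch below is purely illustrative; the legacy_pricing module and the expected values are hypothetical and would be captured from the code's current behavior, not from a spec.

```python
# Minimal characterization-test sketch (pytest). The legacy_pricing module and
# the expected values are hypothetical; the values record current behavior,
# not a specification, so any future change to that behavior is flagged.
import pytest

from legacy_pricing import calculate_discount  # hypothetical legacy module


@pytest.mark.parametrize("order_total, expected", [
    (0, 0.0),
    (100, 5.0),        # captured from what the code returns today
    (10_000, 500.0),
])
def test_calculate_discount_matches_current_behavior(order_total, expected):
    assert calculate_discount(order_total) == expected
```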

Whitehead: My point is related to test coverage in the same way: you instrument what you can, and you instrument it upfront. You can't go back and add instrumentation after something's gone wrong. Upfront instrumentation is the way to go.

Brush: The funny thing is that fits perfectly with work that I know Paul has been doing. Paul, do you want to share some of what you've been working on?

Snider: I think it's instrumentation from the top of the software stack to the bottom, and system instrumentation as well. I work on a lot of scalability challenges, and those can manifest in a couple of different ways, from a user saying they're having performance issues to your observability metrics showing something like a database CPU spike, and then trying to hone in, from a software lens, on what that problem is. It can be challenging at points to track down the root cause in some of these, when you're trying to tackle something as broad a category as scalability.

I can think of an example: we added 6,000 clinicians to a particular system and started seeing some database spikes from our monitoring applications. Looking at it just at the database tier, you see 100,000 SQL transactions, and you don't know where to start. It wasn't until we started tying the database tier to our services tier, the 8,000 services that we're managing, and back to the application tier, a couple hundred applications that our clinicians are using, that we were really able to generate a story: there were 9 services that chewed up 40% of the database. I could tie that to four product applications, and ultimately target four teams to go do work. It took telling the whole story, because I think sometimes you look just at your particular area and don't see the holistic view of the system. I'll echo Richard's comments there.
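
One rough way to make that kind of cross-tier story possible (this is not Cerner's actual tooling, and every name below is illustrative) is to have each service annotate its SQL with caller context, so database-side hotspots can be grouped by the service and application that issued them.

```python
# Hypothetical sketch: append caller context to each query as a trailing SQL
# comment, so the database's top-SQL reports can be grouped by service and
# application rather than by opaque SQL text alone.
def tag_query(sql: str, service: str, application: str) -> str:
    return f"{sql} /* service={service} app={application} */"


# Usage: a "9 services chewing up 40% of the database" story falls out of
# grouping the database's workload reports by these tags.
query = tag_query(
    "SELECT * FROM orders WHERE clinician_id = :id",
    service="order-entry-service",
    application="clinical-orders-ui",
)
print(query)
```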

Test Coverage

Yen: I'm so glad Katie started with test coverage, because I feel like there are so many parallels between pre-production testing (what do I think is going to happen? what actually happened? then we compare the two) and what we try to do in production. Any time something goes wrong, that "wrong" means it is different from what we expected. There's a real opportunity right now, for people and for tools, to begin thinking about production as an extension of that dev or test environment: an extension of thinking ahead of time, going into that mental exercise not just of what do I want to instrument, but of what do I think this instrumentation will show me. For so many years, production was thought of as this scary place where code goes and then weird things happen. It doesn't have to be that way.

That separation is exacerbated by what Paul said, too. When something is expressed in terms of database CPU, it's impossible to tie that back to the pre-production test cases you wrote, and impossible to tie it back to application logic, which means the developers in the room are less able to help. There are all these cool trends right now around capturing business logic in your instrumentation, dealing with high-cardinality fields, and recognizing that the things that matter to your business belong in your instrumentation, because that's how you'll know whether something matters. Tying things from the database to the services, yes, is necessary as we move toward a world where we're all responsible for what happens in production, not just the folks with the word Ops in their title.

Where to Start on Observability

Brush: Let's say you're just trying to get started on this and you don't even know where to begin. What would you recommend as a starting point for folks just going down this journey?

Poskaitis: On my current team, we're actually there right now, so I can talk about what we're thinking about doing. We're trying to launch a new internal service. It's currently in pre-prod, thinking about going to prod. We asked, how do we add observability to this system in a way that makes sense? The way we've been thinking about doing it is twofold. One, thinking about what our customers care about and what kinds of things they're going to want to see. Two, intentionally breaking it in pre-prod and seeing what we would do to debug it. Basically, anywhere we can look at a dashboard instead of a log file is a win. That's how we're approaching it very early on.

Snider: When I think of getting started, I also think of championing an idea. Maybe you're on the DevOps journey, or you need to build in observability, or you know there's a challenge you need to undertake. I think there are a couple of different methods there as an engineer who wants to influence change. I'm reading this book, "The Storyteller's Secret," by Carmine Gallo. He analyzes a bunch of TED talks. Part of what we're talking about today, observability, data, making data-driven decisions, is huge. The other thing he talks about is the emotional side of it, and the story. One thing we've tried to do is tie the data we're finding, the observability patterns we're seeing, into a story about the value you want to deliver. Katie, you were talking about how your clients, your customers, are going to feel. At Cerner, it's a little easier to tie the complete story together: I can say I'm affecting clinicians, and clinicians affect patients. A lot of times we're in situations where we want to go take care of a risk, or technical debt, or something that isn't feature functionality, and we need to champion those things. We try to tie them into a broader narrative of how you're going to impact the business, or maybe the patient.

I just had my third kid; he's only 5 weeks old. We were in a hospital running fetal monitoring software that my company supports. Let's say that data were inaccurate: just putting myself in that position, thinking that my clinician might make inappropriate choices, and the anxiety we would feel. I think if you can tie some of that to the emotional side, then, obviously, there's the ROI and the things we're going to find with data, but the data supports that broader story.

Brush: I really like the idea that, one, you have to bring it back to how you're making the customer experience better through this investment, or how you're making the business better through this investment. Also, what Katie said around having internal principles or goals that are more internally focused as well: let's see what could break, and then our guiding principle is that we should be able to figure it out from a dashboard. I think that's a really cool guiding principle to have, when you see your internal developers, your Ops folks, your SREs, everyone, as users as well.

Whitehead: I'm a big fan of context. I like what we're saying about bringing in the emotional aspects. It doesn't have to be emotional. Certainly, metadata is really important. It's very easy when you're focused on a very specific function within maybe a larger project just to provide information that is meaningful to you. You have to look at the broader picture and say, it's meaningful to me, but is it meaningful to somebody else who's looking at it from a distance? Providing some metadata that puts the information, the instrumentation that you're generating into context, I think is really valuable.

Blythe: I actually worked at the company Paul's talking about, many years ago. I love slinging code, by all means, but one thing we did was standardize our health check pages. This was a six-month process, because we were just in our first foray into microservices, and we wanted everyone to use the same type of health check, and for it to mean the same thing. We were losing a lot of time, because when you've got a big system made up of all these smaller systems and people say, "My system is down," you want to know if it's one of those core systems or if it's really yours, so you can instrument the heck out of the part you're responsible for. If it's someone else's system that's part of the bigger system that's down, then you're wasting time trying to figure out what's wrong with yours. We needed a communication mechanism, and that came through standardization. This wasn't six months of me and my team sitting at the keyboard writing code; this was six months of meeting every two weeks to arrive at a written standard. Then someone got to go write the code in each of the languages, like Java and Ruby, as an implementation of that standard. It was the first time I ever built a standard from scratch. If you're looking for places to start, sometimes someone has to stand up and be ready to take a beating on that, because people are going to have strong feelings about those types of things. It is a fun journey to go through.
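
Purely as an illustration (not the standard Aaron's teams wrote), a minimal standardized health check endpoint might look like the Flask sketch below, where the route, field names, and dependency checks are placeholders for whatever the teams agree on.

```python
# Minimal sketch of a standardized health check endpoint using Flask.
# The field names and dependency checks are placeholders, not a real standard.
from flask import Flask, jsonify

app = Flask(__name__)


def check_database() -> bool:
    # Placeholder: replace with a real connectivity check (e.g. SELECT 1).
    return True


def check_downstream_service() -> bool:
    # Placeholder: replace with a ping to the core system you depend on.
    return True


@app.route("/health")
def health():
    checks = {
        "database": check_database(),
        "downstream_service": check_downstream_service(),
    }
    healthy = all(checks.values())
    body = {
        "status": "UP" if healthy else "DOWN",
        "dependencies": {name: ("UP" if ok else "DOWN") for name, ok in checks.items()},
    }
    # 200 when everything we depend on is reachable, 503 otherwise.
    return jsonify(body), (200 if healthy else 503)


if __name__ == "__main__":
    app.run(port=8080)
```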

Yen: My advice is going to be very tactical, because I see a lot of folks want to do all the right things, then go try to boil the ocean, get discouraged, and run into walls. There are two pieces: one, start T-shaped with your instrumentation; two, expect to iterate forever. What does T-shaped mean? T-shaped means, find the quickest way to get a broad picture of your system. Maybe this is one service. If you're talking web services, maybe you wrap something like an API handler. Don't worry about capturing all the metadata upfront or asking all the questions about what would be interesting; capture what's easy, just to get a picture of what your system is doing broadly. Then find the darkest, most interesting corner of that service, something that's painful, or where things break and people don't really understand it, and go deep on that. Maybe this is one endpoint. Instrument that. This is where you add a bunch of metadata and build out a trace, whatever you need to begin seeing what that software is doing. Use that T shape to show yourself, and convince your teammates, "This is something that can benefit us if we start instrumenting more of these endpoints, if we have these conversations about what metadata we're going to want." Because while, absolutely, we should have conversations about what dashboards we will need, I think the reality of our industry is that you're never going to know what you will need in the future. You're never going to be able to predict all the ways things are going to go wrong. The more important piece is establishing that mindset of shortening the feedback loop: make it as easy as possible to add instrumentation, to add that additional piece of metadata, and show that you can get a little bit of value from a little bit of work, so that people are incentivized to keep doing it.
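
As a hedged sketch of that T shape (not Honeycomb's own libraries), here is one way it could look with the OpenTelemetry Python API: a thin wrapper gives every handler a broad, cheap span, and one painful endpoint gets deep instrumentation with high-cardinality metadata. The handler and attribute names are hypothetical, and an exporter is assumed to be configured elsewhere.

```python
# T-shaped instrumentation sketch with the OpenTelemetry Python API.
# Assumes an SDK/exporter is configured elsewhere; without one, the calls
# are no-ops, so the sketch still runs.
from opentelemetry import trace

tracer = trace.get_tracer("example-service")


# Broad: wrap every handler with a cheap span that captures what is easy.
def instrumented(route, handler):
    def wrapper(request):
        with tracer.start_as_current_span(route) as span:
            span.set_attribute("http.route", route)
            result = handler(request)
            span.set_attribute("http.status_code", result.get("status", 200))
            return result
    return wrapper


# Deep: the one dark, interesting endpoint gets rich, high-cardinality
# metadata and a child span so its behavior can be interrogated in production.
def export_report(request):
    with tracer.start_as_current_span("export_report") as span:
        span.set_attribute("user.id", request["user_id"])        # high cardinality
        span.set_attribute("report.row_count", len(request["rows"]))
        with tracer.start_as_current_span("render_report"):
            return {"status": 200, "rows_rendered": len(request["rows"])}


handler = instrumented("/export", export_report)
print(handler({"user_id": "u-123", "rows": [1, 2, 3]}))
```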

Gaining Confidence in Instrumentation Functioning, and Its Value

Brush: How do you recommend getting confidence that the instrumentation is, one, working, and two, providing value to your developers and engineers?

Poskaitis: For the working part, the way we do this on my particular team is basically through auditing. Sometimes you have graphs or dashboards that you don't look at very frequently. Looking at them occasionally and making sure they're still there is useful, to make sure you didn't accidentally delete a metric you didn't mean to, because you don't want to find that out in an outage. It doesn't take very long and it can be very helpful. Of course, also make sure that the systems you use for observability are as decoupled from your production infrastructure as possible. A really good example of this: I used to work for Twilio, which is basically APIs for doing phone calls and texts, things like that. They had to make sure their pager app didn't use Twilio under the hood to send the pages, so that they would always get the page. That's what we do to make sure it's working.
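
A small sketch of what that kind of audit could look like, assuming a Prometheus-compatible metrics backend rather than Google-internal tooling; the endpoint URL and metric names below are hypothetical.

```python
# Hypothetical audit script: verify that the metrics your dashboards and alerts
# depend on are still reporting data, so a silently dropped metric is caught
# before an outage rather than during one. Assumes a Prometheus query API.
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"   # hypothetical endpoint
EXPECTED_METRICS = [
    "http_requests_total",
    "backup_job_last_success_timestamp",
    "db_connection_pool_in_use",
]


def has_recent_data(metric_name: str) -> bool:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": metric_name},
        timeout=10,
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"]) > 0


if __name__ == "__main__":
    missing = [m for m in EXPECTED_METRICS if not has_recent_data(m)]
    if missing:
        print(f"Instrumentation audit failed; no data for: {missing}")
```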

Also, we have something called a wheel of misfortune. It's basically a role-playing game where you simulate an outage; sometimes you may actually cause it, if the outage is safe enough to do, or safe enough to do in a non-prod environment. Then we have other people on the team drive debugging it, talk about what they would do, what tools they would use, and what's missing, and come up with action items based on why it was painful. It's like learning from an outage before a real outage actually happens.

Managing Code Complexity in Instrumentation

Brush: She walked through Christine's proposal of how you iterate, which I think is amazing: you go broad, and then you go deep. Instrumentation is ultimately code, so at some level you've produced a lot of code that you're maintaining, and you've got a lot of data that you're now keeping. I would think that Richard or Christine would have some good ideas on how to manage all that complexity as you're going through it.

Yen: I like to tell folks to think about instrumentation like documentation or tests: these are all things you do to improve comprehension of your code, and they evolve as the logic evolves. If you can get into the mindset that this is not a huge lift we do once and that then just sits there as a weight we carry around, but instead something that evolves and that we can change along with us, I think that reduces the mental burden. More tactically, this is why namespaces are a good thing, and conventions around how to talk about metadata, and why it's good to have the conversations other panelists described about what matters to us. Can we standardize on, is it User_ID, is it user.id? These are very nitty-gritty conversations. They all tie back to the commitment that we're going to evolve this. Maintaining instrumentation is going to be everyone's job, because in theory, we all benefit.
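
As an illustration of settling those nitty-gritty conventions in code (the field names and the event helper below are hypothetical, not Honeycomb's API): one shared module owns the agreed-upon names, so "User_ID versus user.id" is decided once rather than per team.

```python
# Hypothetical shared module that owns a team's telemetry field names.
# Deciding "user.id vs User_ID" once, here, keeps instrumentation queryable
# across services as it evolves.
USER_ID = "user.id"            # agreed convention: dotted, lowercase
TEAM_ID = "team.id"
BUILD_SHA = "deploy.build_sha"

_ALLOWED = {USER_ID, TEAM_ID, BUILD_SHA}


def event(fields: dict) -> dict:
    """Return a structured event, rejecting keys that drift from the convention."""
    unknown = set(fields) - _ALLOWED
    if unknown:
        raise ValueError(f"non-standard telemetry fields: {sorted(unknown)}")
    return dict(fields)


# Usage
print(event({USER_ID: "u-123", BUILD_SHA: "abc123"}))
```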

I feel like the clearest way of knowing whether instrumentation is beneficial is seeing folks use the tool, seeing folks turn to this tool, not just when something's on fire in production, or when you're preparing for something to be on fire in production. Any time someone is about to write new code or validate a hypothesis, they can look at production first, and hopefully that reinforces that instrumentation helps me ship better code.

Confidence in Instrumentation

Blythe: If I can build on that, and on what Paul was saying earlier about the emotional side of things, not just the technical. What we do is use the agile events to build confidence. Some dev thing happens, and we want to answer the question of why it happened. Then we'll use the demo at the end and make sure that someone used our tool, whatever that tool may be, the APM product or whatever, to answer a question. We want to show other people. For some teams I've been on, that actually draws in other parts of the company. They're like, "You used that tool to answer that question?" Using that mentality, that this is part of your overall system, a lot of times draws a lot more eyes to it.

Snider: Yes, Aaron, I'll echo that. I think that's where you start seeing value in instrumentation: when you see it used across the entire software development lifecycle. I'm using observability metrics to prioritize what my team is doing in my backlog. I'm using observability metrics to help with, like Christine said, proving a hypothesis. I'm using the data that I have to influence what my design is going to be. Now I'm going to deploy it, and I'm going to use that information to validate whether that hypothesis is right. Now I'm running the software, and I need to continue to look at that. If it's integrated throughout your entire process, that's when you start to see value.

Then the other thing I'll add is that it's no different, Michelle, like you mentioned, from the code that you manage. It has a lifecycle too. At some point you're going to find that some instrumentation you stood up yourself isn't warranted anymore, because there's open source, or another method that's exponentially better than what you built, and you have to look at it through the same lens: is it worth our time to continue to support this? Can we move to some other platform? It's an evolution.

Whitehead: I'd also want to echo Christine's comments about a common language across the entire platform. I think it's good to do a bit of normalization. It seems like an impossible task, and I think most people flinch when you talk about standardization, because that always sounds like a really hard job. This is 2021; we have technologies at our disposal, like natural language processing, that can accommodate slight variations. We're a multinational company, and we have people writing log messages in varying degrees of English. I'm English, I don't even spell correctly: I use a 'u' where you shouldn't. Natural language processing is very powerful for accommodating that.

Using some of those techniques and technologies really takes some of the burden off normalization.

Fundamental System Surprises

Brush: As you've been going through these practices, is there any fundamental surprise that you discovered when you were looking at the data or the metrics that made you realize that the system you thought you had was not the system you actually had?

Snider: I might spin this a little differently. I think this happens all the time. I'm glad there are no observability metrics on my failures, because I have quite a few of them. We just saw it yesterday with Microsoft and the global outage with 365. You think you know how the system's functioning; I'll take that approach, where I think I know how the system is functioning, my code makes it to deployment, and it doesn't go as I planned. I could probably wear a lot of failures on my belt in this space. A specific example: I can think of times in the software cycle we mentioned where I deprecate an API, and I have instrumentation, and I understand when this thing is called. I think I'm making a passive, non-disruptive change, and it gets out into the field, and everything looks great. Then it's daylight saving time two months later, and there's this random service that calls this API that I thought I had covered, and now it's failing. That's the constant challenge we run into, and we learn from those failures. You think you know how to predict the system, and it can take radical turns on you. This is why we stay additive. The thing I learned from those failures is: how do you react quickly? How resilient is what you're building, and how do you do that?

Yen: A different shape of story. Having built and run huge multi-tenant platforms for the last decade or so, sometimes the code is doing something I didn't expect, but more often a user is doing something I didn't expect. Someone is using the system in a way they weren't supposed to, or in a way that was supposed to be an edge case or a niche case. I love these. It's so fun. Type two fun, not type one fun: the type of fun you look back on and tell stories about. You're like, "Yes, we can of course deprecate this endpoint," similar to Paul, "no one uses it, we didn't advertise it anywhere." Then you actually look at the traffic, and a quarter of your API calls hit this or use this feature. Why? What are they using it for? This is where that sense of production as the place to check assumptions matters. It's the closest link to our users. It's the only place my code is really running. That's what makes it so important to be willing and able, and for it to be easy, to check these assumptions before someone writes code, so we're not optimizing for a use case or expecting a certain workload on an endpoint that is no longer the case.

Whitehead: Having the data really helps. You can make assumptions about what's going on, but having the actual data can shed a completely different light on it. I have one memory, an ancient memory, of talking to a customer about the network protocols they were running in their infrastructure. They were absolutely adamant that everything was a specific protocol. Then you just put a network analyzer on the network to show them what was going on. It turned out that 95% of the network traffic was a protocol I'd never even heard of. It completely challenged their assumptions about how their network was being used. Having that visibility, having the data available, is absolutely critical, because you might think an application is being used a certain way simply because of its forward-facing features, while what's actually happening behind the scenes in the infrastructure can be completely different. Yes, the data is critical.

Poskaitis: One thing that can make observability particularly hard, in my experience, is systems that are very good at defense in depth. If your system can tolerate one thing going wrong, two things going wrong, when that third thing goes wrong, it's bad. We had an example of this back in 2016 where GCE VMs in all regions were unable to talk to the outside world for 50 minutes or so. If you read the incident report, what it goes through is like, there was a bad config push, but there was supposed to be defense in depth such that if there was a bad config push, it would revert it, but that didn't work. Also, canary analysis didn't work. A bunch of different things just all failed. Presumably, it's possible that they were failing before too, and just the observability system never saw it because it never had impact. I'm wondering if anyone has ideas for how to deal with that a bit better.

Whitehead: That certainly sounds like an unknown unknown to me, particularly when you've got the onion layers of [inaudible 00:34:32]. When you're dealing with unknown unknowns, they are, by definition, impossible to predict. That's the whole thing: you can't predict them. I've always touted that if something is predictable, you design around it; you shouldn't wait for it to break and say, I knew that was going to happen. With unknown unknowns, the key, I think, is identifying very quickly that the issue has happened. The sooner you identify the issue, the sooner you can start the process of fixing it. If something unknown is going to happen, it's going to fail, and the only recourse you have is to mitigate quickly and fix the problem. Be aware of it; don't hang around waiting an hour for a user to tell you something's broken. You need to know it is broken and start the process of fixing it very rapidly. Rapid response.

Blythe: We started with Friday deployments. One thing that we don't do during working hours at my current place is the big failover event, the once-a-year failover event where we try to break everything, like cutting off connectivity between the colo and the main data center, or whatever. We don't do that during working hours. I think those events are super important; that's the biggest one. During that weekend, we do a lot of other things, like shutting things off. I don't know if this would hit the case you brought up, but I think one of the most important things is to inject chaos in some way when it's safe to do so, and to try to make it safe to do so more often than it is currently.

Backup and Recovery

Brush: I think Richard's point about making detection very fast is that it allows you to safely do those types of things. If you're doing it on purpose, you have some warning, but you still have to know what went wrong and what you need to fix very quickly. Those types of fault injection and disaster practice events do need a level of observability into the system and what's going wrong: how can you capture it and fix it? I know Katie has some experience here. Katie, do you want to talk a little bit about what you've been doing recently with recovery?

Poskaitis: To reiterate, I'm working on data backup and recovery in GCP. What we've found is that there are two categories of unknown behavior during a recovery process. The first is, do your backups actually work? Again, not a thing you want to find out when you need them. The second is ensuring that your recovery procedures themselves work. These are special because they are operations that are really rarely, if ever, done in production, so you have to test them really rigorously to keep them up to date. One example: you're able to get your backup out of archival storage, or wherever you have it, and you've validated that it looks okay, it has some data in it, and it's not obviously garbled. Now you need to get that data back into your production system, and how do you do that safely? There's really no substitute for the kind of testing described earlier: you have to try it and see what happens. We can automate as much of it as we can ahead of time, but ultimately, if you never try it end to end, you'll never know if it works.
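
Purely as an illustration of that end-to-end discipline, a scheduled restore drill might look like the sketch below; the backupctl and dbverify commands, paths, and instance names are hypothetical placeholders, not GCP tooling.

```python
# Hypothetical restore-drill sketch: fetch the latest backup, restore it into an
# isolated scratch instance, and run basic validation. The backupctl/dbverify
# CLIs, paths, and instance names are placeholders for real tooling.
import subprocess


def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)


def restore_drill(backup_uri: str, scratch_instance: str) -> None:
    # 1. Pull the most recent backup out of archival storage.
    run(["backupctl", "fetch", backup_uri, "--dest", "/tmp/restore-drill"])
    # 2. Restore into a scratch instance that cannot touch production.
    run(["backupctl", "restore", "/tmp/restore-drill", "--target", scratch_instance])
    # 3. Validate: the data is present, readable, and recent enough.
    run(["dbverify", "--instance", scratch_instance, "--max-staleness", "24h"])


if __name__ == "__main__":
    restore_drill("gs://example-backups/orders/latest", "restore-drill-scratch")
```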

Whitehead: I'm enormously fond of acronyms. I understand Google has a wonderful one for that, which is DiRT. Is that correct? It's Disaster Recovery Testing. I'm a big fan of that.

Yen: To riff off of what Richard said, I agree that identification is one of the most important pieces, but I challenge us all to go one step further. After you've identified that something has gone wrong, how do you refine the search space, zoom in, and keep identifying? At this point, especially at Google scale, it's never going to be one layer of problems; it's always detangling a giant spaghetti ball of causes and effects and interdependencies. I also think this is one of the reasons our philosophy as a vendor is that we want to enable the people, because ultimately there's going to be context that your engineers have in their heads about the defense in depth, like which pieces were built in a way that might be camouflaging another failure. At some point, it's unreasonably difficult to express all of that in instrumentation. It will ultimately rely on whether the tooling is flexible enough for the humans to figure things out and ask questions of this part of the system and then that part of the system. Maybe it's a good thing there's no silver bullet for this, or else we'd all be doing something else with our time.

Whitehead: Christine, you're absolutely right. That's another area where context is key to the fault resolution process. I used to work in technical support, and when somebody called and said something had happened, the first question you'd ask is: what did you do? What's the most recent change you made? Because that usually yielded the root cause. It's a little more sophisticated these days, but sometimes it can be as basic as that when you're dealing with unknown unknowns. Time is quite often a great way of bringing things together, if you can identify things that happened within a relatively small time window. It's not an exact science; you're not using a deterministic process to decide that this happened, which caused this. What you're saying is, these things happened at the same time, and we are presenting them to you as possible causality. That's a great technique when trying to identify what's going wrong: looking at a time window. We talk a lot about instrumenting everything, which could potentially create a vast and expensive data lake, whereas what you're really looking at is a narrow time window. It's unlikely that something that happened a year ago is causing the problem you're having today. It's possible, but unlikely. If you focus on a very narrow time window, it's amazing what information can be yielded within that narrow time band.
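
A toy sketch of that time-window heuristic, with entirely made-up change events: given when an incident started, surface only the changes that landed in a narrow window before it as possible (not proven) causes.

```python
# Toy sketch of time-window correlation: recent changes near the incident start
# are surfaced as possible causes, not deterministic root causes. The events
# here are made up for illustration.
from datetime import datetime, timedelta

CHANGE_EVENTS = [
    {"time": datetime(2021, 8, 3, 14, 1), "what": "config push: edge-router"},
    {"time": datetime(2021, 8, 3, 14, 3), "what": "deploy: billing-svc v2.14"},
    {"time": datetime(2021, 8, 2, 9, 0),  "what": "certificate rotation"},
]


def candidate_causes(incident_start: datetime, window_minutes: int = 15):
    window_start = incident_start - timedelta(minutes=window_minutes)
    return [e for e in CHANGE_EVENTS if window_start <= e["time"] <= incident_start]


# Only the two changes within 15 minutes of the incident are returned.
print(candidate_causes(datetime(2021, 8, 3, 14, 5)))
```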

Fundamental System Surprises

Blythe: Back to your question about what you're most surprised about: for me, any time I've added an APM tool or a log aggregator or something, it's the amount of sheer "no way" moments. That's usually step one, because if I'm narrowing things down to a time window and there are four errors in there, I'm like, "It's one of these four errors." Then I talk to someone on the team, and they're like, "No, that error has just always been there. It doesn't mean anything." If you look at the log over a bigger time window, it's there 1,000 times or whatever. I'm like, "Update the code. Let's take that out." The same thing with my CI/CD pipeline: if I've got that one job that just always fails once an hour or whatever, that's noise that's keeping my time to resolution longer, because I have to go ask that question every time. I love the shortened time windows of all these tools these days, but then there's also an onus on your team to get rid of the noise, to get that out of there.

Brush: It's almost like you're treating the noise as technical debt that has to be routinely paid down, over and over again. If you're not doing that, then it's going to impact you during an outage or some investigation.

What Next?

Brush: We started out, after the whole Friday deploys discussion, with: if you're just getting started, what should you start with? Then, on the other bookend: if you feel you're already doing most of the things we talked about, what do you see as next? I know Richard mentioned natural language processing, and maybe that is the big next step for a lot of folks. Are there any other things that you're seeing as, this is coming for teams, and they need to be thinking about it in this space?

Yen: I think that with this trend of putting developers on call, of sharing the responsibility for production, there's a real blurring of the lines between the people who used to think about production and use these tools, and the people who should be, or who are trying to. I hope this means that what will be coming is an army of developers recognizing that if they had the right instrumentation, they could answer whatever questions they want about production. I hope it means there's a new era of experimentation, using feature flags and observability tools together to test out really tiny bits of code on tiny segments of traffic, and using that to drive better code. I hope there are really cool applications of observability, this whole toolkit of visualizations and instrumentation, around chaos engineering. All of these new tools and toys and techniques that we are developing as an industry: I hope people find exciting ways to pair them and combine them to supercharge how we think about production and the entire process of writing software.

Snider: Obviously, there's just tons of really cool tech coming out; AIOps is something we're dabbling in. Back to Christine, she referenced human beings and their nature. With the DevOps movement, obviously we're blurring the lines between our software engineers and our Ops folks. I'm seeing it even further down the line as well, in our relationship with our product folks and ultimately the value we're delivering. Then Richard referenced being in support, maybe being that first line of support. When these events happen, we referenced failures at multiple steps, what we'd call the Swiss cheese model, where there are holes in these various processes, and there are human beings along the way who have seen that. I think the other thing to prepare yourself for is the interactions and the blurred lines between areas of responsibility, and the silos that have traditionally existed between what we do on the software and Ops side and the holistic business.

Recorded at: Sep 07, 2021
