Yes, I Test in Production (And So Do You)

Summary

Charity Majors talks about testing in production, the tools and principles of canarying software and gaining confidence in a build, and also about the missing link that makes all these things possible: instrumentation and observability for complex systems.

Bio

Charity Majors is a co-founder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging of complex systems. She has worked at companies like Facebook, Parse, and Linden Lab, as a systems engineer and engineering manager, but always seems to end up responsible for the databases too.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

[Note: please be advised this transcript contains strong language]

Transcript

I'm Charity [Majors]. I'm going to talk about testing in production, and it started out as a joke. Did I plug it into the wrong hole? Sorry. I'm not an engineer anymore. Good? I can hit play if he sees it. Yes, I tried that hole, too. Let's try plugging this part. I'll tell you a little bit about myself. I am an infrastructure engineer. My niche as an engineer has always been, I really like to be the first infrastructure engineer who comes in when there's a product that some software engineers have been working on and it kind of sort of works. They're like, "We think this is going to be huge." And I'm like, "Oh, grasshoppers. Let's play." I really like the part where beautiful computing theory meets messy reality and everybody is just on fire and I'm like, "Hahaha. This is where I live. This is where I like to live."

I really like to help companies grow up. And after you build the first version of the infrastructure, you hire a team to replace you and you manage the team. And by the time [inaudible 00:01:35] that, I'm pretty bored and ready to move on to the next scary little startup. That's what I was doing a couple of years ago. I worked for Parse as their first backend engineer. We grew up. We were acquired by Facebook. And I wouldn't normally tell you the story of how my company started but I have these slides so I'm just going to do that.

Well, around the time that we were acquired by Facebook, I think we were running like 60,000 mobile apps on Parse. I was coming to the horrified realization that we had built a system that was basically undebuggable. Some of the best engineers in the world, some of the best I've ever worked with. And I tried everything out there. But as a platform, part of the reason it's interesting and hard is that you're taking your users' chaos and putting it on your systems. We had a very flexible platform. Mobile engineers all over the world could write their own queries and just upload them and I had to make them work on MongoDB next to other developers who were uploading random-ass crap and making it work next to each other. Co-tenancy problems were incredible.

The same for JavaScript. “Write a mobile app, upload your JavaScript. It's magic. I'll just make it work.” Never do this. We were down all the time and I was trying all these tools. The tool that finally helped us dig ourselves out of this pit … Disney would come and be like, "Parse is down." I'll be like, "Parse is not down. Behold a wall of dashboards in there, all green. It's not down."

Got it? I would send an engineer, go figure out why Disney is slow. And it would take them sometimes hours, sometimes days, because of the sheer range of possible things that could've happened. It could be something they did, something we did, something that somebody next to them did. Could be transient, maybe the person next to them stopped running bad queries or something. I don't know. So, we tried this one tool finally at Facebook called Scuba. Any Facebook people here? Hi, I recognize you. And it lets you slice and dice on arbitrary dimensions [inaudible 00:03:49] in real-time. Our time to debug these problems dropped from hours or days to seconds, really - I'm going to hit play just in case it shows up.

Anyway, I was so enthralled by that experience that I decided to start a company that did that because the more I thought about [inaudible 00:04:12] it wasn't just a platform problem. It's a question of the pure, possible range of outcomes. If that becomes nearly infinite, you can spend nearly infinite time debugging it. I really wouldn't start this with a pitch. I got access to the slides. I'll just tell you what's on them. So that was the intro slide. Then there's my little quote that says, "The only good diff is a red diff," which that's how you know I'm from ops.

Testing in Production Has Gotten a Bad Rap

So first slide says, testing in production has gotten a bad rap. It's a cautionary tale. It's a punchline. It's a thing that people say very smugly like, "Oh, I would never test in prod." It rubs me the wrong way, because they do test in prod, they're just not admitting it. We'll get to that. I blame this guy. Now, picture – this is the slide you're picturing already. "I don't always test my code, but when I do I test it in production." (Bullshit. Sorry, I promised I wouldn't swear at this conference.) Bullcrap. I blame this guy because we love that meme so much. I love this meme, too. It's caused us to embrace this idea that it's one or the other. If you admit that you are experimenting or testing in production, you couldn't possibly have done due diligence before it hit production.

Now I have a couple of cute memes that you're just not going to get to see. I'm going to rewind one, too, because I want you to see these. It was backwards. There it is how we should be. I always test my code and then I test it in production, too, because here's the thing. Now I'm jumping to the end, but it is just as negligent in my eyes to ship to prod and not validate that what you thought you deployed is what you actually deployed, as it is to not test it at all. In fact, if I had to choose one or the other, I would choose the second one, which is the opposite of how everyone is building things today. So you could say that that's me being unnecessarily contrary but that's kind of what I do.

So we're caught up. Sweet. Test. To take measure to check the quality, performance, and reliability. Nothing about that definition says before it's shipped. Now, cautionary tale. I don't want this to be used as an excuse for anybody who just is feeling lazy. This is about testing better. This is not about half-assing it. You can construct a lot of hypothetical straw men here, but we all have decently good judgment. This is not what I'm talking about. So this is just one aspect of the bigger story, which is that our idea - I just realized I don't have any speaker's notes - our idea of how the software development life cycle looks is kind of old or overdue for an upgrade. We think of it like a binary switch. We think of it like, "Oh, it's in prod now."

But it's more like a continuum. The deploys, which is more like turning on an oven and starting to bake it - the oven metaphor doesn't completely work because the longer your code is in production, the more environments it's exposed to, the more types of query traffic, the more everything happens, the more confidence you can have in it until you change the version. But for lines of code, typically, the longer they've been running, the more loath we are to change them without a really good reason, because we trust them more. This continuum starts with nothing. And we start writing code. We start deploying to our dev environments. We maybe promote them to other environments. Some of these are going to fail. Some of these are going to not fail. Some of these are going to be rolled back. Now we start cherry-picking. Maybe we've got many versions in prod at one particular time. There's an outage. Okay, okay.

Now this is what production looks like in my mind. I don't know about yours. It's a little messier in reality and the testing starts at the beginning and doesn't end until that line is pulled from prod. And I love how the kids are now trying to toss chaos engineering on top of this. I don't know what you think this is now but chaos to me is pretty much shit.

Why Now?

Anyway, why are we all talking about this now? Why are we all talking about the need to shake up our view of what deploys are, staging, prod and all this stuff? Well, if we take a step back to the big picture- I forgot to warn you that I'm a liberal arts dropout. I'm a music major. So when I try to draw graphs to prove things scientifically, this is what you get. But you still know it's true because I have a graph. It's going up. Let's look at this.

There on the left is a LAMP Stack, the humble work horse of all the systems that we know and love. By the way, side note, if you can solve your problems with a LAMP Stack, please do. Don't go looking for these problems. Don't do microservices because they're cool. If you can run it with one service, god bless. The middle graph there is the architectural diagram of Parse, a couple of years ago in 2015. That middle blob is a few hundred MongoDB replica sets with code queries written from developers all over the world just running together, a few dozen services or so.

Point being, there is the national electrical grid. When you're building systems and thinking about how they're going to fail, that's what you should have in your mind. That's your model. Not the LAMP Stack but this. This is a distributed system. It's going to fail in ways that are hyper-local, like a tree falling over in Cleveland on Main Street by some other street. Okay, are you going to write a unit test for that? I mean, could you have predicted it? No. No, you can't. And you can't debug it from up here. You have to get hyper-local. You have to write a certain type of telemetry for that and a certain type of recovery procedures.

Other problems, you can only see when you zoom way out and look for big systemic patterns, like if every bolt that was manufactured in 1982 is rusting 10 times as fast but only the ones that came from a particular factory and were deployed between January and March. This is not an unreasonable problem to have. These kinds of problems happen all the time. This is how we need to be thinking about our big distributed systems in software. A lot of it is just about changing our mental model from one where you build a system, you look at it, you could predict mostly how it's going to fail. You start using it. You write some monitoring checks for those thresholds.

Over the next few months, you bake it in and you get exposed to 10%, 20% of errors that you didn't predict. And so you create runbooks for those. You write monitoring checks for those. Then pretty much it only mostly breaks when there's a serious infrastructure fail or you ship a piece of code that's bad and then you roll back the code, you debug it and you ship it again. This has been a lot of ops for as long as I have been alive, but it's all kind of falling apart. I also don't have the slide preview. I don't know what the next slide's going to be.

Well, this parallel is a shift from monitoring to observability. Monitoring being the idea that you can look at systems, predict how they're going to fail. And observability being the idea that you just have to release it, go with god. Just, "I'm never going to predict everything that's going to break. It's a waste of time to try. I should build resiliency." Because this is the key lesson of distributed systems. It's never actually up. It may, if you're lucky, asymptotically approach being up over time, but if your tools are telling you that everything is working, your tools are not good enough, or they're lying to you, or your granularity is off.

We Are All Distributed Systems Engineers Now

We're all increasingly distributed systems engineers, and we need different tools in our toolset for dealing with these different types of systems. Distributed systems are particularly hostile to the idea of staging because distributed systems have this infinitely long list of things that almost never happen or are basically impossible, except you have 15 impossible things happening a day. And the more time that you sink into your staging environment trying to clone it, trying to make it exactly like production, trying to make sure you can predict and test all the failures, well, it's a black hole for your time. Raise your hand if you've never wasted a day on trying to clone staging or chasing something that behaves differently between your laptop, staging, and prod? Did someone actually raise their hand? Really? Are you an engineer? Just throw away test environments? You've evolved beyond the rest of us, I suppose.

I say this sarcastically, I think it's a reasonably good best practice to use your laptops in production. The thing is, it's not free. It's not like you're getting zero value out of your staging environments or you wouldn't have them. You're getting some value out of them. You do need to go through the process of exposing it to a more production-like environment between your laptop and users seeing your stuff. So the reason that I press this argument so hard, and I use the inflammatory statement to start, is that we're almost all of us chronically under-investing in the tooling that makes production safe. Almost across the board. Leave the Facebooks and Googles aside, they've mostly learned this lesson the hard way many times over. The rest of us seem to think we can just grab Capistrano from GitHub and use that to deploy our software and we're good. I'm not even going to say anything about Capistrano. That's a laugh line for anyone who's ever used Capistrano.

I think every mid-sized company that I've ever seen - I'm not going to pick on startups. When you're that young and you're trying to survive, it's fine. But, once you have customers who you care about, the process of shipping your code, the code that does the work of shipping your code is the most important code that you have. It's the stuff that gets run the most, has the most impact on production, up-time scalability. It has the most impact, way more impact than a random line of code in your application code. And yet we leave it to the intern.

So I'm advocating with this rant, I mean, if you were to boil it down to two sentences, it would be, take a fat chunk of that time that you're currently sending off to staging, and reinvest it in the tooling around your production systems, in making it safer. Also, get your designers involved. Your designers know how people are going to see things. Your eyes don't work anymore when it comes to your tools. Get your designer to go through it with you so that the things that are going to be tricky are hard to do and the things that are the right path are easy to do. I'm doing this kind of ad hoc because I literally don't remember if that's where the slides are. I can't see what's coming up next.

Sorry. It's frustrating. Actually, one second. I'm going to change that. It's fine. Just stand up and yell something if something isn't making sense. I don't like that one there though. But it's a fact of life that only production is production. Even if you could guarantee that staging was identical to prod in every way, no drift, nothing, you can still do a typo. Has anyone ever done this? “ENV = producktion”? Well, that broke it, too. You're not going to catch that by doing “ENV = dev”.
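As a rough illustration of the typo problem, here is a minimal sketch of a guard that a deploy tool might run before touching anything. The names `KNOWN_ENVS` and `resolve_env` are hypothetical and are not from the talk; the idea is only that an unrecognized environment name should stop the deploy instead of being silently accepted.

```python
# Hypothetical names: KNOWN_ENVS and resolve_env are illustrative only.
KNOWN_ENVS = {"dev", "staging", "production"}


def resolve_env(raw_value):
    """Return the normalized environment name, or raise if it is not recognized."""
    name = raw_value.strip().lower()
    if name not in KNOWN_ENVS:
        raise ValueError(
            f"unknown environment {raw_value!r}; expected one of {sorted(KNOWN_ENVS)}"
        )
    return name


if __name__ == "__main__":
    for candidate in ("production", "producktion"):
        try:
            print(candidate, "->", resolve_env(candidate))
        except ValueError as err:
            print("refusing to deploy:", err)
```

A check like this only catches names outside the allowlist. It does nothing for a valid but wrong value, which is part of why the deploy itself still has to be verified in production.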

Testing Environment?

Every deploy is actually a unique exercise of the combination of your deployment tools, the artifact and the infrastructure as it exists in that moment. Deploy scripts are production scripts. Get them code reviewed. People always put their junior engineers on this kind of utility tooling, which is backwards to me. Put your senior engineers on it, if for no other reason than to signal that it matters, that you care about it, that you'll invest your real time in it. This is a little bit of a pet peeve of mine.

Staging is not production. I skipped this whole bit ahead of the slides. I apologize. Why do people sink so much time into staging when they can't even tell if their own production environment is healthy or not? By the way, if you have staging environments, monitor them. Do all of your monitoring and observability stuff, because if you can't actually see what's happening and do the same thing at prod, it's basically useless. It astonishes me how many people are literally sailing blind when it comes to any of their environments. And they're just like, "Well, our monitoring checks pass." I'm just like, "Well, you have two or three - do you know how many versions of your code you're running in prod? Can you compare them one over the other?"

So you can catch the majority of bugs with 20% of the effort, and I don't just say that with my music major dropout math. [Inaudible 00:18:24] gave a talk on it and it was excellent. So computer scientists say so, too. There are all these diminishing returns that we're investing into staging to make it look just like prod. That's really not necessary. Math said so. The reason is that you're not going to catch very large chunks of your bugs unless your code is talking to real data. I mean, this is the one that people realize first: "I need to add an index to this MySQL table." "How long is it going to take?" "I'll just run it in staging."

That's not going to work. You can't really mirror your production users to staging usually because of security, traffic, scale, concurrency, blah, blah. At the end, it's the last one, that even if you get all those others right, even if you're the best engineer in the world and you get all that shit right, even if you do capture/replay, which is the gold standard. If you can somehow sniff all the requests that are coming into a system for 24 hours and replay them on identical hardware - I've built that system three times now. Still it's not going to tell you what's going to happen tomorrow. Still it's not going to tell you about that tree falling over on Main Street.

By the way, the closer you get to your data, the more paranoid you should be. All three times that I wrote that piece of software, it was for load testing database performance and concurrency and all that stuff. So the reasons that staging is not equal to prod and the many reasons that you can't actually try and realistically make it look like prod, are the ones that I just listed because I wasn't sure where the slide was. But it's the security of the user data, the uncertainty of the user patterns, the sheer cost of duplicating all that hardware. Oh my god. Nobody is going to spin up a second copy of Facebook to test Facebook.

The observability effects are even bigger. I think every request that came into Facebook in their front door generated 200 to 500 debug events behind it, which is great for engineers who have to understand it. You cannot try to incentivize people to capture less detail. That is the worst. So what do you do? Well, you sample because nobody is going to pay you for something that's 200 to 500 times the size of production for your observability stack either.
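To make the sampling idea concrete, here is a minimal sketch of uniform event sampling that records the sample rate on each kept event, so counts can still be estimated afterwards. The function names and the event shape are assumptions for illustration, not how Facebook or any particular tool implements it.

```python
import random


def sample_events(events, sample_rate, rng=random):
    """Keep roughly 1 in `sample_rate` events, tagging each kept event with its weight."""
    kept = []
    for event in events:
        # randrange(n) returns 0 with probability 1/n.
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_total(sampled_events):
    """Estimate the original event count by summing the weights of the kept events."""
    return sum(event["sample_rate"] for event in sampled_events)


if __name__ == "__main__":
    rng = random.Random(42)
    events = [{"request_id": i} for i in range(10_000)]
    kept = sample_events(events, sample_rate=100, rng=rng)
    print("kept:", len(kept), "estimated total:", estimated_total(kept))
```

Storing the rate with each event is the design choice that matters here: every event keeps its full detail, and only the number of stored events shrinks.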

Testing before or in Production

So let's look at that continuum a little bit more and talk about some of the places on that continuum that various types of testing belong. Test before prod. Does it work? Does it run? All your unit tests, all your integration tests, all of these are things that you have learned either by predicting them with your big human brain or, by being exposed to failure. You've learned ways that it's going to fail. So we write checks for these things, and that's great, because this is the most efficient way to make sure that you make the same mistake as little as possible. Because, otherwise, we do tend to make the same mistake over and over. Well, this time, your test will catch it.

There's nothing more frustrating than having something break and you spend a really long time trying to figure out what it was. And then four months elapses, which is exactly the amount of time it takes for everyone to completely forget. And then it happens again. After tracking it down, we're like, "Did nobody write a test for this?" Hate that.

Test after prod, in prod. Well, everything else, honestly. Everything that you can't predict. Everything that requires a larger, more sophisticated environment. Everything that requires the apparatus of whether you're doing lots of different regions or data centers or whatever. Okay, so if your bosses don't like the word "test," you can also say experiment. A lot of this falls into the category of experiments. Blue-green deploys. Canaries, behavioral testing. A lot of this stuff. Load tests. So I know of at least a couple of companies that actually run a 20% load test at all times in production. Well, if they start to tip over, they have 20% excess capacity just sitting there waiting for them. I think that's kind of brilliant.

Data problems. God, no one can predict all the ways that your data is going to be unclean or edge cases there. Most of your tooling around deploys has to be tested in prod. Testing and then staging is dead simple. You have to test it in prod because you have to test it against the software that it's designed to work on. There are a lot of other things like for security purposes that you just can't transfer off-prem. I don't know.

The point is that once you've built up all this tooling to make it safe and easy and robust and reliable, you start to use it, which is good. It has a lot of ancillary benefits. Too often, software engineers don't get the opportunity to touch production often enough so they never reach even intermediate level mastery. They're always avoidant, terrified. And, by the way, I'm not talking about SSHing into the machines. No, no. In fact, I recommend removing the SSH access from everyone if you could. It's terrible. It's an anti-pattern. But doing things like viewing the world through the eyes of your instrumentation and your observability - I use Honeycomb to look at a problem to say, "Is this even worth building?" I'm going to decide what I'm going to build based on real stuff that I can see by asking questions about my stack through my instrumentation instead of just making something up.

We just shipped something for compression in our storage engine. First thing we did was deploy some details to let us calculate how much we're going to save, how much the mean user would save, what the distribution was and whether it was worth building. I can't imagine you just blindly deciding to build it just, "I hope it ends up being worth three months of time." In interrogating your systems, you should have muscle memory. Did I just ship what I think I shipped? Is it doing what I thought it would? Does anything else look weird?

Interacting with production. A lot of people will bring up the objection that, "Oh, compliance." (Bullshit. Sorry. I promised Wes. I need to tape something right in front of me. I practiced for days beforehand. What was I saying? Whatever.) Let's move on. Chaos engineering. You're testing disaster recovery. You're testing your ability to selectively break things on purpose as the kids say. Beta programs, internal users, testing production data, lowering the risk of deployments.

Again, this correlates really nicely to known unknowns and unknown unknowns. And if there's a moral of the story, it's that testing in production is part and parcel with embracing the unknowns, the unknown unknowns, and the mental shift to a world where things are going to break all the time and that's fine. It's better. We prefer to have a bunch of little breaks all the time. We prefer to break somewhat regularly in ways that users don't notice, of course, but they give our team practice that help us all level up, that help us stay sharp. I'm not saying you have to fail all the time.

There are some risks of doing more on production or exposing some of this stuff. There's nothing that isn't a trade-off. The app might die if you do something bad. You might saturate a resource. These are things that can happen. They're eminently preventable though. In fact, the first time or two you're going to run into some real dumb low-hanging fruit. But after you've been doing this for a while, your production systems will be far more resilient than they were before you started, because you're stressing the weak points in a controlled environment when it's the middle of the day, everyone's awake, people are at work, they know what they just did. It's a lot easier to find it.

To Build or Use

Here's some things you can use to systematically make it much different. Feature flags. Feature flags are your friends. They're a gift from god or [inaudible 00:27:48], sent to you personally to give you more control over your production systems. It's fine if things break. It's not fine if users notice, or if they happen to users' data. If you've shipped something behind a feature flag and, "Oh, no. It's twice as slow for the users using the feature flag," which are all of the developers internally, that is amazing. That is breaking, in a way, but that is a win. That is a huge win.
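A feature flag check of the kind described above might look roughly like this. It is a sketch under stated assumptions: the flag table, the flag name, and the `is_enabled` helper are all hypothetical, and real flag services offer far more than this.

```python
import hashlib

# Hypothetical flag table: flag name -> rollout rule. Names are illustrative.
FLAGS = {
    "new_storage_compression": {"internal_only": True, "percent": 0},
}


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name, user_id, is_internal=False):
    """Internal users always get the flag; others only within the rollout percent."""
    rule = FLAGS.get(flag_name)
    if rule is None:
        return False  # unknown flags stay off
    if is_internal:
        return True
    if rule["internal_only"]:
        return False
    return bucket(flag_name, user_id) < rule["percent"]
```

With `internal_only` set, a slow or broken code path is seen only by internal developers. Raising `percent` later widens the audience without another deploy, and hashing the user ID keeps each user's experience stable between requests.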

High-cardinality tooling. This is my company so, obviously, I'm plumping for it, but the ability to just legitimately run several versions in production so that you can - "I want to run this version here. I'm not sure, it might have a memory leak so I'm just going to run it on a couple of nodes for a couple of days, keep an eye on it." The ability to break down by high-cardinality dimension and just look at them side by side, it's phenomenal. The ability to explore, to have ad hoc exploratory interactive, just looking where the crumbs take you, it's amazing.

Canaries. Not enough people use canaries when they're shipping. By canary, I mean, depending on how paranoid you are about the change - and there's a pretty big range there if you want to be really paranoid - you might ship to one node for some length of time then ship to 10, then ship to whatever. Facebook always ships to Brazil first. I never understood. That was half of Brazil gets it and they wait a while and slowly roll out to the rest of the world. What do you have against Brazil? I don't know. What? Somebody said something super snarky and I missed it. I'm sad about that.
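The "one node, then 10, then whatever" pattern can be sketched as a staged rollout loop. This is an illustration only: `deploy` and `is_healthy` are hypothetical callables supplied by the caller, and a real rollout would let each stage bake for some time before checking health.

```python
def canary_rollout(nodes, deploy, is_healthy, stages=(1, 10)):
    """Deploy to growing groups of nodes, stopping at the first unhealthy stage.

    `deploy` and `is_healthy` are caller-supplied callables (hypothetical here).
    `stages` lists cumulative node counts to reach before going to the full fleet.
    """
    done = 0
    for target in list(stages) + [len(nodes)]:
        target = min(target, len(nodes))
        for node in nodes[done:target]:
            deploy(node)
        done = target
        # A real system would wait here and watch metrics before deciding.
        if not all(is_healthy(node) for node in nodes[:done]):
            return {"status": "halted", "deployed": nodes[:done]}
        if done == len(nodes):
            break
    return {"status": "complete", "deployed": nodes[:done]}


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(20)]
    result = canary_rollout(fleet, deploy=print, is_healthy=lambda node: True)
    print(result["status"])
```

The halted result reports which nodes already carry the new build, which is the list a rollback would need.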

If you can, if you have internal users, always ship to them first. God, that's the easiest answer in the universe. Inflict it internally first and then eventually ... Often, we’d be doing Facebook work and you just go down and we'd be like, "Bad build I guess." Think of how bad it would be if those got shipped to the world. It's great. Experiment on your own.

Shadow systems. I've built a couple of these, too. GoTurbine. Oh, they just died. God bless them. Linkerd is another. But they let you fork traffic. The one we did at Parse for the Go rewrite, literally we would take a request in, split it. You could run it in two modes. That one, it would return the result from the old node to the user and store a diff of the old and the new results in a file. After a day or so, you just go look, "What's different? This type needs to get fixed."

You would also run a chain. You could have one shadow fork off to another that would read from a copy of the database. So you could test reads and writes that way by returning one to the user and diffing them. Super cool. It was janky though, janky as crap, janky as cookies. There we go. These are much better. I think that Istio has some really cool stuff around this. Anyway, anything you could do to just play act that you're serving the traffic, so that the users don't notice but the code paths that are erroneous actually get run. That's really important. There's no substitute for the code getting run in a production environment and you getting the result.
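The shadow mode described here (serve the old result, record a diff against the new one) can be sketched in a few lines. This is not the Parse implementation. The handler and log names are hypothetical, the new path is called synchronously for simplicity, and it assumes the new handler is safe to run, for example read-only or pointed at a copy of the data.

```python
def handle_with_shadow(request, old_handler, new_handler, diff_log):
    """Serve the old handler's result; run the new one on the side and record diffs."""
    old_result = old_handler(request)
    try:
        new_result = new_handler(request)
    except Exception as err:  # the shadow path must never break the user's request
        diff_log.append({"request": request, "old": old_result, "error": repr(err)})
        return old_result
    if new_result != old_result:
        diff_log.append({"request": request, "old": old_result, "new": new_result})
    return old_result


if __name__ == "__main__":
    diffs = []
    old = lambda req: {"count": int(req["n"])}
    new = lambda req: {"count": req["n"]}  # deliberately returns a different type
    print(handle_with_shadow({"n": "3"}, old, new, diffs))
    print(diffs)
```

The user always gets the old answer. The diff log is what you read a day later to find the type mismatches and other differences that need fixing.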

For databases, obviously, gold standard is capture/replay. And one thing I will say, you hear me say how many times I've built these things. Please don't build your own. Please just stop. I think we've all built and discarded so many versions of these over the years. There are now companies out there who are trying to make a go of it by building these as a service. We will all benefit from investing in these companies that are doing developer tools for the distributed systems era. Please don't write your own.

Be Less Afraid

There's more. People don't invest enough when it comes to just tools that you can run to protect the 95%, 99% of users who are doing things fine from the occasional bad actor. We do a bunch of stuff around partial rewrites, partial throttles, velvet ropes to kind of cordon things off. If latency to something is really high and that's causing the pools to all fill up with requests that are in flight to it, then just blacklist it. It's so much better to take a small section of the infrastructure offline than it is for everyone to suffer because some interdependent resource is starving out everyone else.

Create your own test data with a very clear naming convention and try to distribute it across all of your shards. We've done these great things to distributed systems (by great, I mean eye roll), but with shared pools, we all used to be experiencing the same thing. So if your infrastructure had, say, 99.5% availability, half a percent of the time everyone's getting an error, pretty evenly distributed. Well, nowadays it's just as likely, if not more likely, that for most people it's 99.9%, but for everybody whose last name starts with LH, the shard that they're on is 100% down.

There are all these ways that we have broken up users and data stores and everything, and there can be very small, percentage-wise, sections of your system that are 100% offline and it's a way worse experience for your users. They're pissed. But your time series aggregation, I'm not going to show that. Again, high-cardinality tooling.
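The arithmetic behind this is easy to check. The sketch below uses made-up numbers (200 shards with 1,000 requests each, one shard fully down) and a hypothetical `availability_by_shard` helper. It shows how an aggregate of 99.5% can hide a shard that is at 0%.

```python
from collections import defaultdict


def availability_by_shard(requests):
    """Return (overall availability, {shard: availability}) from (shard, ok) pairs.

    Assumes `requests` is non-empty.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for shard, ok in requests:
        totals[shard] += 1
        successes[shard] += 1 if ok else 0
    overall = sum(successes.values()) / sum(totals.values())
    per_shard = {shard: successes[shard] / totals[shard] for shard in totals}
    return overall, per_shard


if __name__ == "__main__":
    # Made-up traffic: 200 shards, 1,000 requests each, shard 0 is completely down.
    traffic = [(shard, shard != 0) for shard in range(200) for _ in range(1000)]
    overall, per_shard = availability_by_shard(traffic)
    print(f"overall: {overall:.3%}")
    print(f"worst shard: {min(per_shard.values()):.3%}")
```

Breaking the same data down by shard, or by any other high-cardinality field such as user ID, is what exposes the small group having a 100% bad experience.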

Failure is Not Rare

Anyway, as I was saying, failure is not rare. It happens all the time. An experience that we have all the time at Honeycomb is people start using us and they freak out. They're like, "There's so many errors." This one company that I will not name, it took them six months to get through an evaluation. I'm like, "What is taking so long?" They're like, "We keep finding really terrible things that we have to fix." I'm like, "It's been that way forever. What's the rush?" They're like, "Oh, we got to drop our evaluation. Go fix this thing." And I'm like, "Dude, it's been that way for like how long?" I don't know.

You know what? It's terrifying especially if it looks like your SLAs are going down. How many of you have had the experience of hearing that it's a good best practice to do end-to-end checks? To stress your important code paths, all the things that are making money. So you put a couple up and they're flaky. You're just like, "Oh, god. The checks must be wrong. Better disable them while we figure out what's wrong with them." I'd give you 99 to 1 odds that the problem is not with the check. It's with your infrastructure. This is the actual experience that your users are having. It's just that those could be often very hard to reproduce so we're just like, "Oh, they didn't turn their Wi-Fi on."

Failure is obscenely common, and the more that you and your engineers get your hands deep in production, the more you start to experience the flakiness or the weirdnesses, the better your intuition is going to be, the more appetite you will have for learning and using these tools and the better off everyone will be. Bad things happen when software engineers spend too much time in artificial environments. That's a different talk, I guess.

Failure is when, not if. Never say, "If it fails." Just don't. It will, yes. When it fails, what will happen? Does everyone know what these things look like? I firmly believe that no one should be promoted to senior software engineer unless they are proficient in production, unless they know how to own their software. By ownership, I mean they know how to write it, deploy it and debug it, not in an artificial sandbox but debug it in production through the lens of their instrumentation. To me, that is one of the non-negotiable marks of a senior software engineer. If we don't factor that into the promotion and leveling, then our baby engineers are not going to grow up thinking that it's important. We’ve got all these code wizards who think that their code lives in an IDE. Stupid.

Everyone should know how to get back to a known good state. You don't have to love it. You don't have to be an expert. Although, I find that people love it more than they think they would because it's real. It has impact. They've waited because they're scared but when you give them tools, and if they're browsing production through the lens of their instrumentation, they can't take it down. They can't. If you're using a SaaS like mine - this is not meant as a product placement, I swear - but if you're using a SaaS provider, which you should because you don't want your system and your observability system to both go down at the same time [inaudible 00:37:30], they can't do anything bad to it. And they actually feel safe.

And because of how we're wired as humans, we enjoy the feeling of understanding, of mastery. We get a dopamine hit when we figure something out, when we understand it. There's a lot to play with there. This is better. Not having tools is bad and you should fill that and you should invest in those tools. I guess this means we're over. Memes all night. Cool, thanks. We have 10 minutes for questions.

Questions & Answers

Participant 1: You're my new favorite person. My wife and kids will be disappointed. So how do you suggest I go about convincing people that testing in prod is good and we do need tooling for it?

Majors: Where do you work? Or what kind of company?

Participant 1: I work for a financial and investment type of company.

Majors: So pretty far over the conservative side.

Participant 1: They're pretty far.

Majors: Yes. Well, I mean, are your software engineers on call?

Participant 1: No.

Majors: Well, you have a lot to do, don't you? First of all, don't call it testing in production. Call it hardening production. Call it building guardrails. Use the language of safety, because the fact is, I'm pretty tongue in cheek about a lot of this stuff, but it is necessary for resiliency. I don't know how good I will be at addressing a very specific political situation without knowing a lot more about the situation, but for the conservative ones, don't call it testing in prod. Recruit some allies. Look for some small wins. Look for just ways to show actual impact. Look for places of pain and start there. Do deploys fail very often? You can fix that, right? Do you have guardrails? Bring the designers in and redo the deploy. I don't know. Just look for something that's painful, try to fix it and then point to these policies as ways that you did that to earn technical credibility. That would be my guess.

Participant 2: Thank you for the talk. I got a lot out of it. I guess my question is, you kind of glossed over this but it stuck out to me, why is SSH-ing in an anti-pattern? I do that all the time.

Majors: Yes, I know. Me too.

Participant 2: And I will not tell you my employer.

Majors: It's fine. It's fine. I make big statements. It doesn't mean that I'm unreasonable usually. It's an anti-pattern because it means that your tooling wasn't good enough to give you the answer. At scale, it just doesn't work. You can imagine that it works pretty well if you have a thousand machines. Once you have 100,000, you're just like - and when you are SSH-ing, you're looking at the world according to one node of many, and what you really care about is the story of the whole, and breakdowns by things that are not just host.

Sometimes an individual host might not actually experience the problem, for reasons that are whatever. And so it's just a much better practice to get used to interpreting what's going on in production through your instrumentation, through your tooling. And SSH-ing is the last resort, where you're not sure if your tooling is right or not, to check yourself.

Participant 3: Great talk, by the way. So if you're in a financial industry, a highly-regulated industry, what are your suggestions there to get to the point where you can be testing in production? Because it feels like there's a lot. You mentioned security for staging. There's even more in the production environment.

Majors: For sure. So, qualifiers: I'm not, as you might have guessed, an enterprise software person. What I'm going on is the fact that my friend Adam Jacob from Chef and others have done a lot of talking about how a lot of the things that people say "You can't do because of compliance," when it comes to these things, it's just that they don't want to or they're scared to change things, and it's absolutely possible to do it. But I never remember the nitty-gritty because my brain doesn't really work that way. The things you can do to prepare - do you have any specific examples?

Participant 3: Just access to protected data.

Majors: It's access to operational data, not protected data. There should be no PII or PHI or any sort of protected data in your operational data at all. If there is, that's a bug. You should not be crossing these streams because your engineers need access to data, information about how the system is being used, like full access. You've robbed them of that if you stuff in things like social security numbers. That's my philosophy.
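One way to keep those streams from crossing is to strip protected fields before an event ever leaves the service. The sketch below is illustrative only and was not shown in the talk: the field names, the deny-list, and the `scrub` helper are all hypothetical, and a real system would likely need a reviewed allow-list rather than a short deny-list.

```python
import json

# Hypothetical deny-list of field names that must never reach operational data.
PROTECTED_FIELDS = {"ssn", "date_of_birth", "diagnosis"}


def scrub(event):
    """Return a copy of the event with protected fields dropped."""
    return {key: value for key, value in event.items()
            if key not in PROTECTED_FIELDS}


# A made-up raw event that mixes operational fields with protected data.
raw_event = {
    "request_id": "req-123",
    "user_id": 42,
    "endpoint": "/transfer",
    "duration_ms": 87,
    "status": 200,
    "ssn": "000-00-0000",
}

# Only the scrubbed copy is handed to the observability pipeline.
operational_event = scrub(raw_event)
print(json.dumps(operational_event))
```

The operational fields that engineers need for debugging (endpoint, duration, status, an opaque user ID) survive, so the scrubbed events can be shared with full access.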

Recorded at:

Mar 19, 2019

Charity Majors