
InfoQ Live Roundtable: Production Readiness: Building Resilient Systems


Summary

The panelists discuss observability, security, the software supply chain, CI/CD, chaos engineering, deployment techniques, canaries, and blue-green deployments all in the pursuit of production resiliency.

Bio

Wes Reisz, moderator. Adam Zimman, VP of Platform, LaunchDarkly. Holly Cummins, Senior Technical Staff Member, IBM Garage. Anastasiia Voitova, Head of Customer Solutions, Security Software Engineer, Cossack Labs. Haley Tucker, Senior Software Engineer, Resilience Team, Netflix. Charity Majors, Co-Founder/CTO, Honeycomb.io.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Reisz: We've got five panelists here. Their roles range from CTOs to VPs, engineers, chaos engineers, and security folks, covering a wide gamut. What they all share in common is that they're all practitioners, all engineers dealing with real issues in software, and thought leaders out there.

What Production Readiness Means

I wanted to talk about, just for a minute or two, what production readiness means to me. When you think about production readiness, what comes to your mind? Does it encompass things like test, stage, test in production? Does it talk about monitoring, logging, configuration management, deployment, DR, SLOs? Do you think of a checklist? What comes to mind when you start thinking about production readiness?

For me, I landed ultimately on the Google SRE book, which in Chapter 32 talks about the evolving SRE engagement model. It says the typical initial setup for SRE engagement is a Production Readiness Review, a PRR process that identifies the reliability needs of a service based on its specific details. Through PRRs, SREs are able to apply what they've learned and experienced to ensure a reliable service operating in production. It talks there about the lifecycle: design, build, implement, launch, operate, decommission. Then it comes to this; it says SRE seeks production responsibility for important services for which it can make concrete contributions to reliability. SRE is concerned with several aspects of a service, which are collectively referred to as production. These include system architecture, inter-service dependencies, instrumentation, metrics, monitoring, emergency response, capacity planning, change management, performance: pretty much everything that can go wrong. Black swans happen in production. How are we truly ready for production deployment?

Background and What Software Is Through the Lens of Production Readiness

I thought this would be a great place to start. What I'm going to do is ask each person to go through on the panel and introduce themselves. Talk a little bit about the lens that they see software through. Then talk about their definition through that lens of production readiness.

Majors: I'm Charity. I'm the co-founder and CTO of honeycomb.io. My background has been as an ops engineer. I like to be the first ops engineer that joins a company when it's just a bunch of startups, and they have something they think might grow up into a real product someday. I like to join them and help them do that. When I think of production readiness, I really think of confidence. All of these things that you mentioned, absolutely add up. There's such a thing as false confidence. People who don't have the experience to know yet whether or not they should be confident of something. Shipping software should not be terrifying. It shouldn't be scary. It shouldn't be something that gets built up. It should be boring. Success is the ability to move quickly and with confidence, not because you don't think that anything will break, but because you know what to do when it does.

Cummins: I have quite a different perspective from Charity, and probably from a lot of the folks on the panel, because I think I might be the outlier. When I think about DevOps, there are two sides, even though there should be one. On one side there's ops, and they keep it running. Then there's dev, who are the people who just keep breaking things. I'm a developer, a breaker of things rather than a fixer of things. Part of my journey has been: how do I break things a bit less? How do I make people hate me a bit less? What should I be doing? I work for IBM, in a team called the IBM Garage. We're a services organization. We do a lot of lean startup. Production is something that we think about a lot, because we're at that really early stage of the scale that Charity talked about, where we're not doing things at really big scale. We want to be lean. Where is the line between "I'm going to production with my very lean thing" and "I have just thrown a whole bunch of garbage into production; that was a really bad idea"? How do we minimize the overhead that goes with having a huge scalable thing, while still actually having something that stays up and meets the needs that our users have?

Zimman: I'm Adam Zimman. I'm the VP of Platform at LaunchDarkly. For me, production readiness is all the things. Ultimately, the thing that I think of with production readiness is that it comes down to value to the user. The reason I say that is that I know at least a few other folks on this panel can claim that they have brought down production. Ultimately, when it comes to having that type of perspective, you realize that the number one thing that you're looking for when you think about changing anything in production is: how do you actually have some level of ability to ensure that the user continues to get value from your service or application? If there's going to be some type of misstep, how do you then restore that value as quickly as possible? That, to me, is the thing that I try to think of in production readiness. It doesn't necessarily mean that everything works. It doesn't necessarily mean that everything works the way you wanted it to. It just means: how do you make sure that the perception from the user's perspective is that they are continuing to receive value?

Tucker: I'm Haley Tucker. I'm a Senior Software Engineer at Netflix. I've had a couple of roles there. I started off in the services organization for playback, responsible for a lot of the things that happen prior to users getting the actual playback experience. Then a couple years ago, I moved to the resilience engineering team with a focus in chaos engineering initially, but since then, we've expanded to load testing, sticky canaries for seeing how changes can affect end users. All sorts of production experimentation, to support our service organization. As far as how I view production readiness, I'll echo some of the things that other folks have said around confidence building.

I think there's a huge number of activities that we go through when we're building software, anywhere from testing all the way up through chaos experiments. Each of those is just adding a bit of confidence in an area that you haven't confirmed is working as expected. One of the things that I always try to tell new people when they're joining the company is that if you break production, great, that just means we're missing a test. Then we can go and add that, and the next person won't make that same mistake. I think as long as you're constantly following that pattern, then you just get better at pushing things into production with more confidence.

Voitova: My name is Anastasiia. I'm from security. Right now I'm working at Cossack Labs as Head of Customer Solutions. Cossack Labs is a data security company. We do security, cryptography, all this complicated stuff. Usually, it's me who helps our customers integrate our software into their systems. What I usually do is take a look at their systems, most of which are not secure, and help make them more secure. From my perspective, production readiness is complicated. As with security, we can't say that some system is secure, because there is no such thing as 100% security. We can say that some system is production ready, but it's never finished. It's a process. Where are we in this process? Do we have confidence in our system? Are we sure that our system is designed and built in a way that it can still work, even if attacks happen? Attacks and incidents will happen, eventually. If our system can still work when components of the system are not working, I believe that is the definition of production readiness from my perspective.

Getting To the Eventual Goal of Testing In Production

Reisz: Charity, when I wrote this abstract, I started off with one of the tweets that you've had pinned for a while, that says if you aren't testing in prod, you aren't testing in reality. We don't start by getting there and doing this in production, though. Where do we start? How do we start on this process to actually get to the eventual goal of testing in production?

Majors: TDD is the most significant software innovation of my lifetime. It's incredibly effective. You start by testing the little components, and then you work up to the bigger components. It's been so effective. It's predictable because it strips out everything that has anything to do with reality, anything that is variable or concurrent; everything that is interesting is gone. It's just testing the logic, to the best of our abilities. Like Anastasiia was just saying, there's no such thing as a perfectly production ready system. There's no such thing as a perfectly tested system. Our role is less about "you must do this, you must do that," and more about assessing where a team or a system is, then figuring out the next step for them. Where I think the industry is, is we have come to rely so heavily on testing that we have forgotten that it doesn't stop there. It's time to start forklifting all this stuff into production. I can see that there are certainly teams out there who are just going to be like, "Test in production. This means we don't have to do other tests." It is like, "No, grasshopper. You must do that first before you earn the right to start doing all of this chaos stuff."

Haley, I just wanted to say thank you for that. I feel like so often we guardians of production actually instill a lot of fear in people. They're really terrified to make their changes or roll their things out. We need to not do that. We need to welcome people in. At Linden Lab, we'd throw a ceremony for every engineer who had joined the team, the first time they brought down production. Not in a mean way, but just like: you've graduated, now you are doing real work. Now you know that your work matters, because code that is not in production is dead code. It does not matter until users are using it.

Tools and Techniques Used To Limit the Blast Radius

Reisz: Everybody talked a lot about confidence as you were answering these questions. Haley, when you talk about putting something in production and breaking in production, that's scary for a new developer, but there's guardrails. When you put things out into production, there's tools, there's techniques that we use to limit that blast radius. Can you talk a bit about some of the ones that you do at Netflix?

Tucker: This has been evolving over years. There was a point in time where you could absolutely go in and update a little JSON blob, and all of a sudden fail 100% of lots of things and do bad things, which was definitely terrifying. Over the years, we've tried to take a much more experimental approach. One of the things that we've leaned on heavily is a canary strategy, where we can take a baseline set of users and an experiment set of users, monitor their key performance indicator for us, which is, are they actually able to get playback? We put them into the experience, we start failing something or injecting latency into that experiment group. We monitor their SPS or playback metric, and if those deviate too far, we will automatically shut down the experiment. That's one set of monitoring. We've also added on a bunch of other random alerts that can fire and shut stuff down. We've just been building guardrails over the years, so that for the most part now, the users that have used the tools feel really comfortable and are just constantly out there running experiments, for all sorts of different things, which is great to see. Because they know that if something goes really sideways, they're not going to take down production. It's been really great to see people build this into their development workflows.
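A minimal sketch, in Python, of the guardrail pattern Tucker describes. The metric source, threshold, and shutdown hook here are hypothetical stand-ins, not Netflix's actual tooling: the idea is simply to compare the key metric between the baseline and experiment cohorts and shut the experiment down automatically if the experiment group deviates too far.

```python
import random
import time

DEVIATION_THRESHOLD = 0.05   # assumed: abort if the experiment cohort's SPS drops >5% vs. baseline
CHECK_INTERVAL_SECS = 30

def get_sps(cohort: str) -> float:
    """Stand-in for reading the playback (starts-per-second) metric for a cohort
    from a real metrics backend; here we just simulate values."""
    return random.uniform(95.0, 100.0) if cohort == "baseline" else random.uniform(85.0, 100.0)

def stop_experiment(reason: str) -> None:
    """Stand-in for whatever tears down the failure-injection experiment."""
    print(f"Shutting down experiment: {reason}")

def run_guardrail(duration_secs: int = 600) -> None:
    """Poll the KPI for both cohorts; abort as soon as the experiment cohort degrades."""
    deadline = time.time() + duration_secs
    while time.time() < deadline:
        baseline, experiment = get_sps("baseline"), get_sps("experiment")
        if baseline > 0 and (baseline - experiment) / baseline > DEVIATION_THRESHOLD:
            stop_experiment(f"experiment SPS {experiment:.1f} deviated from baseline {baseline:.1f}")
            return
        time.sleep(CHECK_INTERVAL_SECS)
    stop_experiment("experiment window elapsed without a guardrail breach")

if __name__ == "__main__":
    run_guardrail(duration_secs=120)
```

In practice the same shape extends to the additional alerts Tucker mentions: any of them can call the same shutdown hook.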

Reisz: Adam, you make a tool to help with this, but you're also a VP. When people start talking about this, what's in your mind? There's a lot of risk here. What are your thoughts?

Zimman: I think that this was part of what I found so compelling and a big part of why I joined LaunchDarkly, was that I had seen how feature flagging could be so impactful, and frankly, an imperative for being able to do any type of cloud delivered service or cloud run service. There's such a need for that ability to reduce risk, and to be able to ship in a boring fashion. I spent over a decade at VMware, and I remember having launch parties for a release, that were literally the day of, and they turned the campus into a carnival fair. It was because it was such a big deal. I also remember being on the engineering management side, and having code branch freezes that lasted two to three months to be able to deal with merge conflicts. It was a mess.

Then I got to spend a little over a year at GitHub, and saw that they were shipping upwards of 100 times a day. The way that they were doing it was by eliminating the risk of adding new code. The other aspect that I saw that was completely different, something I learned standing up one of VMware's initial global cloud infrastructures, is that it starts before you actually ship code. That's in your architecture step. I think Charity mentioned this when she was talking in her intro, and I know she's talked about this before. It's this idea of planning for failure, and having this expectation that failure is going to happen. What are your failure modes? Are you actually architecting in a way that you are putting real thought into: if this service goes away, what happens to the rest of my offering? Or, if this component goes away, what happens to the other nodes in a cluster?

Majors: When.

Zimman: That's the thing. I think a big part of this production readiness, it has so much to do with that, planning for failure, and being able to have this expectation that I'm going to look at the metrics that actually matter. Not just this, "I'm going to monitor my CPU load, because that's an indicator of all the things." It is like, "No, not at all." It's this new perspective way of looking at things that I think has always been the case for anybody that's dealing with a services based delivery to users. It's become so much more apparent as we've moved to cloud native platforms.

Cummins: One of the things that's so bad about that really heavy release process you were talking about, as well as being totally inhumane for the people on the ground, who just collapse after every release because they're so burnt out, is that in terms of risk, if there's something wrong with that release, you can't cope, because you can't go through that whole process again. Then what you end up doing, if something was wrong with your testing and you missed it, and you shipped out something bad, or something changes, is you go, "We can't go through this cycle. Instead, we'll just push something out and manually deploy it, and we'll go on to the servers and change the string, and what could possibly go wrong?" Then you end up in this cycle where you have to bypass your own processes in order to be able to cope with the problems that your own processes created. Then you just get worse.

Zimman: My personal favorite term that we've had: the zero-day patch. We already know that the stuff we're going to ship is broken. We're going to ship it anyway because that uses our general process, but then we're going to have a patch that uses our accelerated process on top of that flow.

Security Risk of Deploying to Production

Reisz: Anastasiia, let's talk about security. All this is great, getting things out into production, but doesn't this open us up to risk? From your security hat, when you're looking at this, what are your comments?

Voitova: The usual answer to any security question is, it depends. Basically, for any architecture question, it depends. From a security perspective, I would say that it depends on the risk profile of the company, of the product, of your organization, and on the risk appetite. For some systems, for some companies, it may be totally fine, because they don't deal with sensitive data. They don't deal with a crucial part of the application. For others, it can be close to the opposite: they can't afford to go into production without being heavily security tested, covered in security tools, covered by all the certifications in the process. The truth is, it's usually somewhere in the middle. I would suggest starting to build production ready, resilient systems by understanding the risk profile, by understanding the risks and threats, the more critical functionality and the less critical functionality, and just being able to prioritize. For some features, it can be fine to just push to production and see what happens. For other features, sorry, let's think first. Let's design first. Let's implement security controls first. Let's verify and test them, and only then push to production and see what happens.

Zimman: I think the other way that we've started to look at things is that pushing to production shouldn't mean pushing to all users. The subtle distinction that I would make is that I personally believe you should be able to push any change you want to production at any point in time. The key there is, do you have the infrastructure and the system in place to be able to do that safely? If you don't, I think that you are actually doing yourself a disservice, because that means that your only control point is the decision of your engineer, or the decision of your operator.

That's just it. Anastasiia is absolutely correct that you want to have security in place by the time you release to 100%, or even by the time you release to your first end user that is not an internal individual. I think that being fearful of putting code in production probably means that you have some work to do on your pipeline.
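A minimal sketch, in Python, of what "pushing to production shouldn't mean pushing to all users" can look like. The flag store, names, and percentages here are hypothetical illustrations rather than any vendor's actual SDK: the code is deployed everywhere, but the new path is released only to an allow-list and a small, deterministic slice of users, and can be switched off instantly.

```python
import hashlib

# Hypothetical in-memory flag configuration; a real system would read this from a flag service.
FLAGS = {
    "new-playback-path": {
        "enabled": True,
        "rollout_percent": 5,              # released to 5% of users
        "allow_users": {"internal-qa"},    # plus explicitly targeted internal users
    },
}

def flag_is_on(flag_key: str, user_id: str) -> bool:
    """Decide whether this user sees the new code path, independently of the deploy."""
    flag = FLAGS.get(flag_key)
    if not flag or not flag["enabled"]:
        return False
    if user_id in flag["allow_users"]:
        return True
    # Deterministic bucketing: the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def handle_request(user_id: str) -> str:
    if flag_is_on("new-playback-path", user_id):
        return "new code path"   # shipped to production, released to a small slice of users
    return "old code path"       # restoring value is one config change: set enabled to False

print(handle_request("internal-qa"), handle_request("some-other-user"))
```

Flipping `enabled` to False is the "restore value as quickly as possible" lever Zimman describes, with no redeploy involved.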

Majors: Yes, 1,000%. One of my favorite quotes was from the Intercom team, who said that shipping is your company's heartbeat. You need to be shipping regularly. The faster you can, the better. That does not mean that you're recklessly putting everything in front of your users as fast as possible. Decoupling those is one of the most powerful, fundamental things that people can do, both for their team's sake and for their users' sake.

Voitova: The speed really depends on the type of application, on the type of your service.

Majors: Of course, it always depends. Just structurally, for separating those streams, both cognitively and functionally, I think it's really important.

Zimman: I think that the extreme example that I saw was at GitHub, where the first pull request would actually be just an empty feature flag, and it would get merged back into main. All it did was basically say, "I'm going to be doing some stuff, and I want to make absolutely certain that it's protected code." That was the first thing that happened. Then the actual feature development would take place. Doing it that way gave them the flexibility to say, "I can put anything I want into main, and it will only get picked up at a regular interval and pushed to production. It doesn't matter."

Voitova: I just want to add an example from the other side of the scale, because I believe GitHub is one end of the scale. From my experience, for example, we do data protection for telemetry data that comes from power plants. Real critical infrastructure, country-wide: all these power plants, electricity.

Majors: I would argue, that makes it more important, not less important.

Voitova: I'm just saying that for some systems you can't deploy so many times, because you have a lot of testing going on.

Cummins: I think the mechanisms you use to protect it have to be different. There's a whole bunch of mechanisms that will protect things from normal users, where you say, "I'm just doing a canary deploy." Your malicious actors aren't going to have the same restrictions on them, in the sense of, "I went to the website and it behaved normally, so I assumed there were no ports open somewhere else." You have to have a slightly different approach, even though the principle may be the same.

Majors: I feel like the instinct to clamp down and really inspect the side where you're letting stuff out is something we've just found, experientially, to be very flawed. Even for power plants, yes, things will be different, but still: not assuming that everything that makes it out the door will be perfect, and being able to catch it quickly. Isn't that the whole thing with security teams, that we assume things will happen that aren't right? What's important is that they're tracked, that we can find them and detect them, and recover from them.

The instinct that we have as humans, whenever something is scary or hard, is to clamp down, and slow down, and add stuff. I feel like we've just found that that's a bad instinct. I think it should be like sharks or bicycles, where when you slow down, you start to wobble and lose your balance and fall off. What we've learned is that treating it like a heartbeat, doing ordinary things frequently and practicing them, and treating failure as ordinary and frequent, is even more important in these very sensitive systems. You're absolutely right, it's different.

Using Chaos Engineering to Increase Confidence for Potential Security Bugs

Reisz: Haley, you do a bunch of work in chaos engineering, resilience engineering, specifically. When you hear something like this conversation that we're having about security, how can chaos engineering be used to increase confidence for a potential security bug like this?

Tucker: There are a few practices that come to mind initially, in particular the concept of a game day, where you have one team that's attacking the system and another team that's trying to detect it. That would be an interesting approach.

Majors: We also call those interns.

Tucker: I would imagine you could also build bots and things that could attack your system on an interval and make sure that your detection mechanisms are catching those and shutting them down. Those are a couple of things that come to mind.

Majors: From an observability perspective, it's having the ability to work with high cardinality and high dimensionality. Stuff that will allow you to go from up high, the aggregated view, down to exactly what is different about these requests compared to all the other requests at baseline. Give engineers tools like that to swiftly explore, and reward the curiosity. I think so many of our tools shut that down. They make it impossible. It's like, "That's a tar pit. Don't go there." Giving them those tools can be really important. Because often we see these things, but it's too expensive cognitively, or for the amount of time we know it will take to invest in it, we're just like, it can't be done. When you have the tools, and it's quick and easy, if it's 5 seconds, that can be really great.
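One hedged illustration of what high-cardinality, high-dimensionality instrumentation can look like in practice (the event shape and field names here are hypothetical, not any particular vendor's format): emit one wide, structured event per unit of work, with enough fields that you can start from an aggregate and drill down to exactly which requests are different.

```python
import json
import time
import uuid

def emit_event(event: dict) -> None:
    """Stand-in for sending a structured event to an observability backend."""
    print(json.dumps(event))

def handle_checkout(user_id: str, cart_id: str, region: str) -> None:
    # High-cardinality fields (user_id, cart_id, build_sha) are the point: they let you
    # slice an aggregate spike down to the exact requests that behave differently.
    event = {
        "name": "checkout",
        "trace_id": str(uuid.uuid4()),
        "user_id": user_id,
        "cart_id": cart_id,
        "region": region,
        "build_sha": "abc123",   # hypothetical deployment metadata
    }
    start = time.monotonic()
    try:
        # the actual work of the handler would go here
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit_event(event)

handle_checkout("user-42", "cart-981", "eu-west-1")
```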

Security and Observability

Reisz: I remember I was introducing Laura Bell for a keynote at QCon London and I was talking about this friction that constantly exists between security and developers, at least from my experience. The first thing she said, of course, there's friction, because if there's a security problem, the security folks are the ones that get fired. With what Haley and Charity just talked about with observability, Anastasiia, does that give you confidence? Are you ok with what we're talking about from a security perspective for us to be doing this?

Voitova: It depends on the patterns. Observability is a process and, if we're talking about logs, for example, also a tool. That may be tricky. We want our developers to put a lot of things in the logs. We want to make sure that we have all this analytics to understand what's going on. From another perspective, all these logs are a source of information. Very often developers put data they shouldn't be putting in those logs. Logs become part of the sensitive data, sensitive assets in our system. Who will protect the logs? The cycle can go on. I would say that, of course, observability is important, but the other aspect is to protect the logs, to make sure that we don't put anything risky in them, and to make sure that we have reliable audit logs, so we can say that something went wrong here and find that exact moment.

Majors: It's unfortunate that audit logs and telemetry logs share the same term, because they're so different in every way, in all of the assumptions and all of the structure. One should be immutable, append-only, small, structured. The other should be free-form, throw everything in, sampled like crazy.

Voitova: It's a pity, but that's what we have. Here, talking about observability, I would suggest taking a look at NIST SP 800-92, which is about how to build efficient and secure logging in our systems. I really like it. It's full of useful tips on putting things in logs and then protecting the logs.
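As a small, hedged sketch of the "don't put risky data in the logs" point (the field list and helper are hypothetical, and a real system would also handle nested payloads and structured log pipelines): scrub known-sensitive fields before anything reaches the telemetry log.

```python
import copy
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

# Hypothetical list of fields we never want to see in telemetry logs.
SENSITIVE_FIELDS = {"password", "card_number", "ssn", "auth_token"}

def scrub(payload: dict) -> dict:
    """Return a copy of the payload with sensitive fields masked before logging."""
    clean = copy.deepcopy(payload)
    for key in clean:
        if key in SENSITIVE_FIELDS:
            clean[key] = "***REDACTED***"
    return clean

request = {"user": "alice", "card_number": "4111111111111111", "amount": 42}
log.info("processing payment: %s", scrub(request))   # the card number never hits the log
```

Audit logs, as Majors notes above, are a separate stream with different rules (immutable, append-only, structured) and would not go through this telemetry path.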

Majors: I'm in the death to all logs camp. Logs should die in a fire with extreme prejudice.

Logs and Security Risks

Reisz: I was actually going to ask that, Charity. What is your response when security talks about risk and things like logs? It's death to all logs?

Majors: It depends. It does. It depends on your risk profile. It depends on who you are. It depends on your level of maturity and development. There is no one size fits all. This is where I feel like there are so many big companies, of course, not small companies, C-levels out there, who are trying to get to a stage where they don't have to rely on their employees to understand their systems. To be perfectly blunt: at the end of the day, somebody's got to understand this stuff. Somebody's got to make informed choices. There is no recipe book that you can take from company to company. These are extremely complex socio-technical systems, and every one is a snowflake. The only way that you can increase your confidence that you're doing the right thing is by digging in and really understanding the use cases and making the right decisions.

Of course, if you're a 5-day old startup, there's a recipe book that you can get started with, that'll give you some sane defaults. By day 10, you have a snowflake on your hands. I feel like the tendency to want to go, "AIOps will figure this out for us," or, "The computers will say this." They won't. At least not right now. Maybe in 20 years. This is why you're seeing this increasing emphasis on the squishy sides of it. The aspects of, is this system humane to run, like Holly was just saying? At the end of the day, it's all people. It's all about designing systems that take the humans into account, so that we're working for each other and not against our proclivities.

Cummins: That maybe is one place where AIOps can actually do something, not in terms of replacing people, but in terms of saving you from spending your time doing this really boring, tedious [inaudible 00:33:08] through the logs.

Majors: Also, stop trashing the logs. Create instrumentation with intent. We should be thinking about our instrumentation the way we think about our comments. We've been trying for years, and we've finally, I think, given up as an industry on trying to auto-generate comments, because you can't do it. You can't auto-generate original intent. Instrumentation is really an extension of commenting to include your production systems. It's not just, what do I intend to do? It's, what is actually happening? It's incredibly powerful. It cannot be done without actual knowledge of what you're trying to do. You can't write code, ship it, and then go, "Is it doing what I want it to? Is there anything else that's weird?" unless you first know what you are trying to do.

The Future of Manual Testing, and Automation

Reisz: Test-wise, do you think manual testing is gradually going away, and it all needs to be automated? What are your thoughts?

Cummins: Yes. I'm gradually shifting my position on this. There's two parts to that. One is, should the developers be doing all the testing? I think there definitely is a place for people who are not developers to be doing testing, because testing is a discipline. It is a skill discipline. Testers have an evil streak in them that developers do not. They can find problems that we can't, because we bake our assumptions into our tests, of course. Then our tests validate our assumptions, and the first time around it still fails, but then we eventually manage. Then the tester comes along and they do this, and they do this, and we're like, "I never thought of that."

Reisz: Why didn't you click that?

Cummins: Yes, exactly. Then the second part of it is about how you execute your tests. I am, incorrectly, in the camp that says you just want to have more tests, and you never delete a test. You just keep building up the tests, because tests are good. I know it's wrong. There are two reasons why it's wrong. One, you have to maintain the things, and they become a burden, because they are testing old behavior. The second thing is, in terms of your DevOps flow, they start taking a really long time. People commit less, people push less, because the tests are a pain. Even though it pains me in every way, you do have to kill some of your tests. When you kill some of your tests, you want to keep the ones that are high value, that are finding problems, and you want to get rid of the low value ones. Then when someone finds a problem, our first instinct is to say, "You found a problem, let's automate that into a test." Maybe we do want to say, actually, is this a one-off problem that we will never see ever again? If so, maybe we don't need to keep that test.

Majors: A risky question, but sometimes true. The default answer is, yes, manual testing shouldn't happen, except for visual stuff. I really think that pretty much the remaining use case for staging, for most people, is so you can actually see it with your human eye: does this look pleasing and correct to me? I don't think there's really a replacement for that yet.

Tucker: One of the areas where we usually tell people to do some manual testing is when they're starting to get into chaos, because we do allow people to inject failure or latency just for their one user. It's a really great tool for them to be able to go in and say, "I'm going to do this. Then I'm going to go to netflix.com, and I'm going to play. I'm going to see stuff fail, and be able to trace it through." Because that actually connects in their brain: this is what's happening. I think as a training tool, also, it can be really useful to do manual testing to experience what's going on.
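A minimal sketch of the single-user fault injection Tucker describes (the registry, names, and modes here are hypothetical illustrations, not Netflix's actual chaos tooling): the injected latency or failure applies only to the one targeted user, so an engineer can watch it happen end to end without touching anyone else's traffic.

```python
import time

# Hypothetical experiment registry: inject a fault only for the listed user.
INJECTIONS = {
    "recommendations": {"user_id": "my-test-account", "mode": "latency", "latency_ms": 800},
}

def maybe_inject(call_name: str, user_id: str) -> None:
    """Apply the configured fault if this request belongs to the targeted user."""
    exp = INJECTIONS.get(call_name)
    if not exp or exp["user_id"] != user_id:
        return
    if exp["mode"] == "latency":
        time.sleep(exp["latency_ms"] / 1000)
    elif exp["mode"] == "failure":
        raise RuntimeError(f"injected failure for {call_name}")

def get_recommendations(user_id: str) -> list:
    maybe_inject("recommendations", user_id)   # only "my-test-account" feels this
    return ["title-a", "title-b"]               # callers should degrade gracefully if this raises

print(get_recommendations("my-test-account"))   # slow: latency injected for the test account
print(get_recommendations("everyone-else"))     # unaffected
```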

Cummins: If you don't know what you're looking for, then you have to do it. Like we say in TDD, one of the great things about TDD is it makes you think about what you're looking for, but then sometimes you just really haven't got a clue. Then you've got to do it manually.

Zimman: I think the other aspect is, when I think about the need for manual testing in perpetuity, I'm thinking about it more from a perspective of, again, to Holly's point, not necessarily the developer, but the product manager, or someone that is in some type of QA or QE type function, who is actually looking at this from the perspective of the end user. Thinking about it from not just, did this code work the way I wanted it to? Is the workflow that I'm looking to actually ship to the end user working the way that they think it should? It's the functions as designed bug that you're trying to prevent.

Reisz: If you work in a resilience team, how do you manage accountability for production readiness with teams that actually build the product?

Tucker: Everybody at Netflix is responsible for what they push. I would say, part of that comes in to just team cultures, and if you push something into production and it breaks, are people rallying around and helping you fix that?

You're expected to follow through and make fixes or changes. We try to not blame people. That's not the point of it. It's just, get everybody in a room. Understand what happened. Now let's figure out what's the best way to proceed, and how can we make this better for the next time?

We don't throw things over a wall. Any time I'm working with a team that's getting ready to push something into production, I usually ask a lot of questions like: how will you know if this is working or not? Do you have the right alerts in place? Do you have the right logging? Because I think you can do a lot of that work up front, which then makes any incidents that happen a lot easier to deal with and recover from.

Back to the earlier topic of feature flags. That's another thing that I like to encourage people to do. Because if I can push something, and it's behind a flag, and things break, then I could just turn off that flag and everybody can move forward with their day and then I can work independently to fix that going forward. I think there's a lot of tools and tricks that you can use to enable people to feel like they're responsible without making them feel like the weight of the world is on their shoulders.

Feature Flags

Reisz: I was going to pitch that back to you, Adam, to talk about feature flags. Anything you want to extend on?

Zimman: I think that we've covered a bit of it here. The way that I talk about feature flags with folks who don't have experience with them is to think about it in the context of creating a control point in your deployed code. You're giving yourself the flexibility to be able to impart a change on your production system or service, without the requirement of a redeploy or a re-initialization of a service that takes time. Frankly, even in the most efficient of systems, that is the point in time at which you are taking on the most risk. I like Charity's point that somebody has to know how some things work at some point.

Thinking back to the time that I spent at VMware, we used to talk about this in the context of: someone needs to care about physical hardware at some point in time. The reality is that, even if you're running in AWS, it's a physical box somewhere. I think this is that chain of control that you need to be aware of: how many turtles do you have to count down to be able to get to the point where you can actually do some type of root cause analysis, or be able to impart the change that's going to, again, get back to that notion of users still receiving value? I think that at the place we're at now, where applications are so interconnected across hundreds or thousands of microservices, having that control point as close to the end user as possible, in the form of a feature flag, gives you the ability to impart change so much faster than anything else.

Majors: Have you seen the site, whoownsmyreliability.com? The answer is you. Who owns your reliability? The answer is you. This is, I think, where with serverless and all of this stuff we're moving development further and further up the stack, and all of these things are absolutely good. I never want to drive to the colo again to flick the power switch on a MySQL server. When it comes to who owns your reliability, that's always got to be you. The thing is, I think that humans innately reach for ownership, if you don't punish them for it. We want autonomy. Autonomy, mastery, and meaning are what we crave out of our work. People want to be responsible for their stuff, as long as you don't make it miserable for them. Developers are usually stoked to be on call for their code and help support it, if they're not getting woken up constantly. I feel a reasonable baseline is once or twice a year. You can get woken up for your code once or twice a year. Beyond that, it's a management problem. We got into this business because we love creating things, and we love seeing them be used. This is just the extension of ensuring that that can be done.

Takeaways

Reisz: What is your one thing you'd like people to take away?

Cummins: To echo what we said before, release often, and then get the quality earlier in the cycle to give you the confidence to release often.

Tucker: Learn to love failure. That's where you're going to learn the most. If you do that, it'll cause you to set yourself up for success.

Voitova: Think first about security, about reliability, about incident recovery, and build second.

Majors: Production is for everyone. Production isn't someone else's problem. Production isn't a thing to do later. Production is a thing to be embraced, a thing to be loved. It is a thing to lean into, to learn to find the joy in, because it will make you a better engineer.

Zimman: I really want to echo Charity's point of production is for everyone. I think that the thing that I find most compelling about businesses now, is that that's on both sides. There's the users that are consuming it. It's also thinking about it more broadly in the context of the teams that are providing those services at your own company. How do you actually get your product management, or your marketing, or your sales organizations to start thinking about the value of production, and being able to impart change on it for the parts of the business that they're closest to?

Reisz: I think production is for everyone, wins.

 


 

Recorded at:

Dec 03, 2020
