

Building Production-Ready Applications



Michael Kehoe explores how to deploy a microservice to production. He talks about best practices for designing, deploying, monitoring & documenting applications.


Michael Kehoe is a Staff SRE at LinkedIn working on Incident Response, Disaster Recovery, Visibility Engineering & Reliability Principles. He specializes in maintaining large system infrastructure as demonstrated by his work at LinkedIn (applications, automation & infrastructure) and at The University of Queensland (networks).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


My name is Michael Kehoe. And today, we're going to be talking about how to build production-ready microservices. Before I get started, I just want to show this slide. Hopefully, today, I'll be showing some really interesting, entertaining, informative content which you might want to take photos of. I've tried to make it easy for everyone, and you only need to take a photo of this slide, which goes to my SlideShare, where I'll put up the slides later on this afternoon. They'll also be on the QCon website. I still see a couple of phones up, so I'll give everyone another five or six seconds, because some stragglers are at the 90th percentile latency. I think we're good. We're past the 99th.

Today, at a high level, firstly, I want to just do an introduction of myself, what I do, and a little bit about the problem statement, and then pivot into the tenets of readiness. We'll spend the bulk of the talk today looking at the different tenets of readiness, what makes up a production-ready application. We'll then go and talk about creating measurable guidelines: taking these tenets of readiness and turning them into something that we can actually measure and grade ourselves on. Then we'll talk about actually measuring it: going from the tenets of readiness to the measurable guidelines to actually measuring it, and how to do that. And then we'll summarize with some key learnings and, of course, our Q&A at the end. Let's get into it.

I'm Michael Kehoe. I'm a staff site reliability engineer at LinkedIn. I have somewhat of a funny accent. I'm Australian, but I've lived here for a while. Usually, Americans aren't really sure what I sound like. I was in London last week. They definitely knew what I was. I wasn't one of them. And I'm on the Production-SRE Team at LinkedIn. You can find me online @matrixtek, my blog, michael-kehoe, my LinkedIn.

On my Production-SRE Team at LinkedIn, we have four areas that we work in. We sort of oversee the broader operation of the site and manage best practices. We do disaster recovery planning and automation. We do incident response process and automation as well as responding to the major incidents themselves. Visibility engineering, which is a really interesting area, is where we're taking that operational data, how our site is operating, our availability, our incident statistics, and making sense of that data for our executives so they can go and make informed decisions. And then reliability principles, which is personally my favorite, is where we're talking about defining best practice and then automating it. As this presentation goes on, you'll definitely notice the reliability principles theme in here.

What is a Production-Ready Service?

What is a production-ready service? This is a topic that has sort of come to fruition in the industry in the past couple of years. Susan Fowler wrote this great book which got published about this time last year, called "Production-Ready Microservices," which goes into a much deeper dive than I could possibly do today, about how you take the concept of an application and run it through a set of tenets that will make it ready by the time it starts to serve traffic.

Going back to the original question, what is a production-ready service? Susan defines it as, "A production-ready application service is one that can be trusted to serve production traffic. We trust it to behave reasonably, we trust it to perform reliably, and we trust it to get the job done and do it well with very little downtime." What we're trying to have is an application that's in production that we can depend on, that we know how it's going to behave, and that we know that it's going to be reliable, and we're not going to have a large amount of downtime.

How do we get to this problem statement? Personally, I see it in three ways. Firstly, in the last 10 to 15 years, we've gone from these huge monolithic applications into these sprawling microservices applications. You know, two or three monoliths before, and now maybe 20 or 30 microservices. Secondly, we have continuous integration. Now, we're not doing yearly releases, quarterly releases, monthly releases. We're doing daily releases. Any developer can go and commit and deploy code really fast. At LinkedIn, our front-end API service is deployed three times a day into production.

With this, more applications are being deployed faster. We have the same-sized Ops teams supporting a large number of applications. We end up with this disparate knowledge of how our applications should be designed, how they should be implemented, and how they should be operated. With these factors put together, we need to define how we want to put our applications together and define a set of production-ready guidelines for them. Let's look at the tenets of readiness.

Tenets of Readiness

Tenets of readiness. We have eight main tenets of readiness, which I'm going to compress into five main talking points here. Firstly, stability and reliability, scalability and performance, fault tolerance and disaster recovery, monitoring, and documentation. All of these are really, really key cornerstones of building a production-ready application. All of them are equally as important as each other. Let's look at stability and reliability.

Firstly, having a stable development cycle. As I said a little bit before, we now have continuous integration, which is awesome. We can go and push code. We can go and run tests in a reliable way. We have centralized code repos, and we have really cool code review systems. We're actually now starting to see tools and platforms come along that help do static analysis to ensure that we've got high-quality code being committed. What we really want here is reproducible builds. If I'm trying to build something, deploy something, I want to make sure that I can go in, build it and deploy it the same way every single time.

If you go look on the Linux kernel mailing list, you'll see people who can't reproduce things. And that is a really large problem. It sucks a lot of time. In our environments, we want to make sure that we have a very structured development cycle. We also need to ensure that we have unit and integration testing. With our microservices architecture, we don't have application A depending on application B, which then depends on application C. We have application A depending on applications B to Z. All of those dependencies, or all of those clients, need to know that all those APIs that they're calling are well tested and reliable. So we need both unit testing and, importantly, integration testing to make sure that we are developing and testing our applications in a very stable manner.

Equally, we need stable deployments. What does that mean? Very simply: simple, repeatable deploys. Ideally, they shouldn't take a large amount of time. The longer a deploy goes, or the more unstable the manner in which the application starts up, the less reliable it's probably going to be. Thankfully, over the last couple of years, we have also started to see a trend of people doing Canaries, where they try their new code on one instance or one application amongst a cluster to make sure it behaves in an expected fashion. Or Dark Canaries, where they go and deploy that same one instance. That application takes traffic, but it doesn't actually serve a response back. But you still get a signal of whether the new code that the application is serving is working as expected, without harming the user experience.
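As a rough illustration of the canary idea, here is a minimal Python sketch that compares a canary instance's error rate with the rest of the cluster. The `Stats` type, the `max_ratio` of 1.5, the minimum of 100 requests, and the 0.1% floor are all assumptions made up for this example; they are not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(canary: Stats, baseline: Stats,
                      max_ratio: float = 1.5, min_requests: int = 100) -> bool:
    """Return True when the canary's error rate is acceptable versus baseline.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if canary.requests < min_requests:
        return False  # not enough traffic yet to trust the signal
    # Small absolute floor so a zero-error baseline does not reject
    # a canary for a single stray error.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return canary.error_rate <= allowed


if __name__ == "__main__":
    cluster = Stats(requests=100_000, errors=150)
    print(canary_is_healthy(Stats(requests=1_000, errors=2), cluster))
    print(canary_is_healthy(Stats(requests=1_000, errors=30), cluster))
```

A dark canary could feed the same comparison; the only difference is that its responses are discarded instead of being returned to the user.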

Equally, staging environments. There is a little bit of debate in the industry at the moment about what the value of staging environments is. And that's all well and good. Personally, I think they are a good thing. But if it makes sense for your company to have a staging environment, you really should be investing in it and more or less treating it like production. If your staging environment doesn't look anything like your production environment, if you're not running the same tests in production as you would in staging, the value that you get from your staging environment is going to rapidly decrease.

Let's go and look at reliability. There are three tenets here, or three sub-points, rather. Firstly is dependency management. In this case, I'm not talking about code dependencies. I'm actually talking about application dependencies. A lot of unreliability in microservice applications actually comes from changes in inbound traffic or changes in behavior from downstream applications. Knowing your environment and understanding how your environment works is not an easy task. But it is really important to know that if my downstream database degrades by 20 milliseconds at the 95th percentile, that's going to really significantly impact the performance of my application. And so what we see at LinkedIn, and we're a 1,000-plus microservice platform, is that any small change in our microservice ecosystem can have a much larger effect the closer we get towards the client.

Secondly is onboarding and deprecation procedures. So every single application should have a very well-documented manner for how to start using the APIs or the services that your application provides. Equally, especially with GDPR, access control is very important, as well as clients understanding best practices. I've definitely seen in my experience where a client application will go and ask for 5,000 entities or something from another application. And that really doesn't scale when you start looking at the garbage collection performance in Java. All of these things can be very well avoided by making sure that you have well-documented best practices, and you can tell the clients how to behave in the best way possible. So not only does your application work well, but they get the best possible performance out of your application.

Equally, deprecation. This is definitely something that gets forgotten. I'm sure after a number of years, we all just really want to kill off that application and that code base, and I have definitely seen it over my time. But there still needs to be a structured way that we do this. Do we need to clean up access control lists? Do we need to clean up firewall rules? Do we need to make sure that we go and deprecate client libraries or other pieces of code? We need to know how to do this. Otherwise, if you go and look in your firewall rules in five years' time, your firewall is going to really struggle to hold all the rules for all these applications that don't even exist anymore.

Thirdly, routing and discovery, which is something that really interests me. Basically, is there a standard way to discover where your application is or how to get to your application? We've definitely seen the rise of service meshes over the past couple of years, and other service discovery mechanisms like ZooKeeper, etcd, or Consul. These are all super important pieces of infrastructure to make sure that your application is, number one, discoverable, and secondly, are we actually health checking it properly?

There's been a great blog post earlier in the year by Cindy Sridharan, I think her name is, who did a really good post on having health checks for your applications. There's a large number of ways to do this, but what it really comes down to is will that health check give you a good signal of whether your application can be trusted to serve traffic? And, of course, you need your load balancer to go and respect these in a very quickly converging way. If you are using some sort of DNS load balancing and you have a TTL of 300 seconds, you're going to have a lot of pain potentially.
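To make the "good signal" point concrete, here is a minimal Python sketch of a health check that runs a set of probes and maps the result to an HTTP-style status code. The probe names and the choice of 200 versus 503 are assumptions for illustration; which dependencies belong in a health check depends on the service.

```python
from typing import Callable, Dict, Tuple


def health_check(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each probe and return (status_code, per-probe detail).

    200 means every probe passed; 503 tells the load balancer to stop
    routing traffic to this instance.
    """
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "failing"
        except Exception as exc:  # a crashing probe counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


if __name__ == "__main__":
    # Hypothetical probes; real ones would ping a database, check a
    # thread pool, confirm configuration has loaded, and so on.
    probes = {
        "config_loaded": lambda: True,
        "database_reachable": lambda: False,
    }
    print(health_check(probes))
```

The check is only useful if whatever routes traffic polls it often enough; with DNS-based balancing, a long TTL delays the reaction no matter how accurate the probes are.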

Equally, we've now also started to see there's a rise of resiliency engineering or chaos engineering, which we'll talk about shortly. The use of circuit breakers or degraders and load balancers. These are super useful tools to make the client experience of your application much better, but they need to be well-tested and understood, because in certain scenarios, it can go really badly.
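For readers who have not used one, a circuit breaker can be sketched in a few lines of Python. This is a simplified illustration under assumed settings (three consecutive failures to open, a 30-second reset timeout, a single trial call when half-open); production libraries handle concurrency, metrics, and fallbacks that are left out here.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The failure mode to test for is the one mentioned above: a threshold that is too sensitive can open the circuit during a brief blip and turn a small degradation into a full outage for that call path.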

Let's move on to the next set of tenets, scalability and performance. So, firstly, scalability. Understanding your growth scales is really important, and there are two ways that you do this. Firstly, how does your service scale with the business, whether it be goals or metrics? This is the qualitative growth scale. Secondly, how does your application scale as it gets more traffic? This is the quantitative growth scale. This is really tricky to do. For your front-end applications, this might be nice and easy. You can add more container instances. When you get towards the backends, this becomes much harder. And, as we will talk about in a second, dependency scaling becomes really difficult, especially if you have a large set of microservices with a very dense call graph.

Resource awareness is something that's very much underrated. People have definitely been caught by this when they moved into the cloud and saw how many extra instances they had running of things they didn't know existed. It's about understanding what your resource usage is and making sure that you're measuring this correctly. At LinkedIn, we run a very Java-heavy stack, and we are very aware of the different resources it uses, especially when it comes to containers.

It's great putting something in a container. But if you're not measuring the performance that you're getting, or the throughput you're getting, more specifically, and how you're utilizing the resources in your container, you can very quickly have a really bad time. In containers, the 50th percentile CPU usage is useless. It's absolutely worthless. You need to be looking at the 95th percentile CPU usage time, or looking at how much the container is actually throttling your application or your process.
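A small Python sketch shows why the median hides the problem. The sample values are invented, the percentile uses the simple nearest-rank method, and the throttling helper assumes counters shaped like the `nr_periods` and `nr_throttled` fields that Linux cgroups expose in `cpu.stat`.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of the samples (pct is an integer 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throttled_fraction(nr_periods: int, nr_throttled: int) -> float:
    """Fraction of CPU scheduling periods in which the container was throttled."""
    return nr_throttled / nr_periods if nr_periods else 0.0


if __name__ == "__main__":
    # Invented CPU usage samples (percent of the container's limit):
    # mostly idle, with two bursts that hit the limit.
    cpu = [20] * 18 + [95, 100]
    print("p50:", percentile(cpu, 50))
    print("p95:", percentile(cpu, 95))
    print("throttled:", throttled_fraction(nr_periods=1000, nr_throttled=250))
```

In this made-up data the median looks comfortable while the 95th percentile sits at the limit, which is exactly the situation where a container starts throttling the process.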

Knowing what your bottlenecks are is really important. If you're a back-end service calling a database, that database is going to be a bottleneck. You're only going to be able to go so quickly, but you need to measure that. Knowing about it is great, but you need to actually understand the hard limits that your application has. And then, know how to go and scale up.

We have horizontal or vertical scaling. Vertically scaling, we can go and make our instances or machines [inaudible 00:18:11]. Horizontally scaling, we're going to add more instances. Each of them has its pros and cons, and I'm not going to pontificate on which one I think is the better way to attack the problem. But you need to know how your application behaves, and how you're going to scale it, well before it gets to production. You definitely see this for various launches of Black Friday events. Generally, companies get caught off guard by this, and it takes a much larger effort to scale than it should.

Then dependency scaling. If your service is going to take more traffic, say 15% more traffic, what needs to happen for your downstream applications to serve that traffic for you to take your requests? You need to, number one, understand your downstream scale. But you also need to have some dialogue with them to let them know that, "Hey, we've got this thing coming, you may need to scale up in the near future."
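The arithmetic behind that conversation with your downstreams can be sketched in Python. The service names, the 2,000 QPS starting point, and the calls-per-request figures below are hypothetical; the point is only that growth is multiplied by fan-out.

```python
from typing import Dict


def extra_downstream_qps(current_qps: float, growth: float,
                         calls_per_request: Dict[str, float]) -> Dict[str, float]:
    """Additional QPS each downstream would see for a fractional traffic growth.

    calls_per_request maps a downstream name to the average number of calls
    this service makes to it per inbound request (its fan-out).
    """
    extra_inbound = current_qps * growth
    return {name: extra_inbound * calls for name, calls in calls_per_request.items()}


if __name__ == "__main__":
    # Hypothetical service: 2,000 QPS today, expecting 15% more traffic.
    fanout = {"profile-db": 3.0, "search": 0.5}
    for name, qps in extra_downstream_qps(2000, 0.15, fanout).items():
        print(f"{name}: about {qps:.0f} extra QPS")
```

With a dense call graph the same multiplication repeats at every hop, which is why a modest increase at the front end can be a large one for a shared backend.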

Let's look at performance. Performance evaluation is extremely, extremely, extremely important. Especially as we're in a land where, basically, developers can touch anything in production, every single time that you are deploying a new piece of code to production, I strongly believe that you should be doing some sort of performance analysis on it. Going back to our LinkedIn API example that we deploy three times a day, there could be anywhere between 5 and 50 code commits by 50 different engineers in that deploy. Who am I to go and trust that there is no performance degradation in that? So it is really important that you are doing some sort of red line tests or benchmarking of your application as soon as it's deployed to make sure that you're not adding performance bottlenecks to your application.

This needs to be measured and reported over time. It's very easy to say, "Oh, there was only a 1% degradation in throughput between release one and release two." But if you go and compare release one and release 10 and you're down 10%, that adds up very quickly. Traffic management is also really important. Having quality of service for your application is important. Sometimes you might need to shed load, some less critical load, to make sure that you can take a more critical load, and make sure that you know how this behaves.
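The compounding effect is easy to check with a short Python sketch. It flags a release when throughput drops too much either against the previous release or against the first release in the series; the 2% and 5% limits are arbitrary assumptions for the example.

```python
from typing import List, Sequence, Tuple


def regressions(throughputs: Sequence[float], step_limit: float = 0.02,
                baseline_limit: float = 0.05) -> List[Tuple[int, float]]:
    """Return (release_number, total_drop) for releases that regress too far.

    throughputs[0] is release 1 and acts as the baseline. A release is
    flagged if it drops more than step_limit versus the previous release
    or more than baseline_limit versus the baseline.
    """
    baseline = throughputs[0]
    flagged = []
    for i in range(1, len(throughputs)):
        step_drop = 1 - throughputs[i] / throughputs[i - 1]
        total_drop = 1 - throughputs[i] / baseline
        if step_drop > step_limit or total_drop > baseline_limit:
            flagged.append((i + 1, round(total_drop, 4)))
    return flagged


if __name__ == "__main__":
    # Ten releases, each 1% slower than the one before it.
    series = [1000 * 0.99 ** i for i in range(10)]
    print(regressions(series))
```

No single step in that series exceeds the per-release limit, yet by release 10 the cumulative drop is about 8.6%, so only the comparison against the baseline catches it.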

Secondly is being able to scale for traffic bursts, which happen, and they might not necessarily be planned. It only takes a tweet now to have much larger traffic than you expect. Or, if you need to go and fail out of a region or out of a data center, you need to be able to have a way to scale that application to take that extra traffic during that failover process. Capacity planning, this really ties all these concepts together. What this really comes down to is, do you have the right numbers in front of you to know how much resources you're going to need in the future to serve traffic on your application?
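As a worked example of "having the right numbers in front of you", here is a minimal Python sketch of an instance-count estimate. The formula and every input (peak QPS, per-instance capacity from a red line test, target utilization, failover factor) are assumptions for illustration; real capacity planning also accounts for growth forecasts and non-CPU limits.

```python
import math


def instances_needed(peak_qps: float, qps_per_instance: float,
                     target_utilization: float = 0.5,
                     failover_factor: float = 1.0) -> int:
    """Estimate how many instances are needed to serve peak traffic.

    qps_per_instance:   measured capacity of one instance (e.g. from a red line test)
    target_utilization: fraction of that capacity to use in steady state
    failover_factor:    multiplier for traffic absorbed when another region fails out
    """
    usable_qps = qps_per_instance * target_utilization
    return math.ceil(peak_qps * failover_factor / usable_qps)


if __name__ == "__main__":
    # Hypothetical service: 12,000 QPS at peak, 500 QPS per instance,
    # and it must absorb 50% extra traffic during a regional failover.
    print(instances_needed(12_000, 500, target_utilization=0.5, failover_factor=1.5))
```

The failover factor is what ties this back to traffic bursts: if a data center fails out, the remaining instances need that headroom already provisioned or a way to get it quickly.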

Let's go and look now at fault tolerance and disaster recovery. Avoiding single points of failure. It's 2018, we have load balancers, we have containers, we have these systems that fail over. This really shouldn't be a problem. And unfortunately, it continuously is. You should not have your application or your platform affected by hardware failures, rack failures, network failures. You should, at the very most, have maybe a small degradation while your infrastructure reconverges, but this really shouldn't be a problem. My key message here is go and do the right thing early to avoid problems down the road. It is very difficult at times to go and retrofit an application to avoid single points of failure. If you invest in this early on, you're going to have a much better time.

Resiliency engineering or chaos engineering, this is deliberately breaking your service to find weak points, and to make sure that things fail more gracefully. If you go back to the original DevOps principles that were created in like 2005, one of those key principles is embrace failure. After all this time, or very quickly even after I joined LinkedIn, I learned, as a very naive 21-year-old, that outages happen. Things break, and you've got to be prepared for it. So you need to constantly test for breakages of your infrastructure.

At LinkedIn, we actually have the ability to go and find critical downstreams that would cause errors to users every time we do a deploy. I can go and see, in a new release, whether we added something critical to the application: if that API call fails, would that create a negative impact for the user? We can do this actually automatically, which is really cool. This is something that's really, really important. You will actually be surprised what you can find. You don't have to start big. All it needs to be is filling up a disk on a machine. That's a really small step to start resilience engineering, and seeing how your application behaves, and how you can make it degrade more gracefully.

Disaster recovery. Again, something my team works on. Understanding what ways your service can break. Again, going back to the resiliency engineering thought. But it's really important to understand what the impact of your service breaking is. It's very well and good for my middle-tier application to go and break, and we say, "Hey, it's broken." But I also need to understand what impact is that going to have on my clients? You know, what is the end user going to see?

If you're not sure about this right now, you should be starting to do some resilience engineering to go and find out. You need to have a plan to respond to this, and there's no one golden way to do this. You might have to do a full data center failover, maybe you can do per-service failover, maybe you can do A/B deploy failover, like a green-blue deploy failover. But you need to have a plan for this, and it needs to be well-documented.

For disaster recovery, for larger scale outages, what is your plan? Which really feeds into incident management, which is, what is the process to manage and respond to the outage? It's all very well and good having this down on paper. You really should be practicing a response to an incident where you can go and ensure that all of the things that you've written down on paper actually work in practice.

I heard a story from Russ Miles a number of months ago, who's a large proponent of resiliency engineering. He does consulting in different companies with resiliency engineering. In this company that he was consulting in, they had all their disaster recovery plans in an unlocked filing cabinet somewhere in the office. All of their plans were there. So he went and planned a game day with them to go and test their processes. But what did he do before that? He locked the cabinets and took the key. Very quickly, their disaster recovery plan was in all sorts of shambles. It does really make you think outside the box about how you need to respond for your applications, and about making sure you have backup processes. Game days are a really good way to go and do this.

Moving on to monitoring. Dashboards and alerts are critical for your application. This should cover both your service and the amount of resources that you're using. We also need to know how our infrastructure is behaving as well. If you're running a container instance somewhere, especially in your private cloud, you need to know that the underlying resources that you are using are also healthy, or try to have a signal of that, because you can't guarantee that your cloud provider or someone else will.

Dashboards are for high-level system health. Is the application behaving generally the way I would expect? They are not for regression validation. I've heard a number of times that one sort of sub-resource of an API call stops working in a release. And someone says, "Did we have an alert for that?" While the intent's good, it's the wrong question. The question should be, "Do we have a unit test and an integration test for that?" It's very easy to fall into a trap of having reactionary-based monitoring, where for every outage, you go and add a new alert. Please don't do it. It's a massive anti-pattern.

Any alerts that you add should have very well-documented procedures, and these alerts must be actionable. It's great getting an alert for something like, "Disk is 50% full." Generally, that sort of alert is useless. I don't need to be paged at 3 a.m. about that. I want to be paged when it's at 90% full, and it needs to be fixed within the next hour. If you don't do this, you're going to create a lot of fatigue for yourself. I was at LISA Conference last week, and I heard a number of stories from teams where they got paged every five minutes during a 12-hour on-call shift. This is completely unsustainable, and you're going to burn out your engineers. So be thoughtful and mindful when you're creating these alerts.
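One way to make a disk alert actionable is to page on urgency instead of on a fixed low threshold. The Python sketch below pages at 90% full or when the disk is projected to fill within an hour, which mirrors the example above; the 75% ticket threshold and the linear growth projection are assumptions added for illustration.

```python
def disk_alert(used_pct: float, growth_pct_per_hour: float,
               page_at: float = 90.0, page_within_hours: float = 1.0,
               ticket_at: float = 75.0) -> str:
    """Classify a disk usage reading as 'page', 'ticket', or 'none'.

    'page' wakes someone up; 'ticket' is work for business hours.
    The growth rate is assumed to be a simple linear trend.
    """
    if used_pct >= page_at:
        return "page"
    if growth_pct_per_hour > 0:
        hours_to_full = (100.0 - used_pct) / growth_pct_per_hour
        if hours_to_full <= page_within_hours:
            return "page"
    if used_pct >= ticket_at:
        return "ticket"
    return "none"


if __name__ == "__main__":
    print(disk_alert(used_pct=50, growth_pct_per_hour=0.1))   # half full, slow growth
    print(disk_alert(used_pct=80, growth_pct_per_hour=25))    # filling fast
    print(disk_alert(used_pct=92, growth_pct_per_hour=0))     # already past 90%
```

Each outcome should map to a documented procedure, so that whoever receives the page knows what to do with it.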

Logging is a much underrated aspect of software development. Definitely, I write code occasionally now, and it's very well and good to say, "Oh, I've got this great log statement." Then when I'm trying to go and debug my application, I go, "That's useless." You don't know how your logging really works until you need to use it. I do, again, recommend going and doing resiliency testing and trying to hit those cases where you get those error messages or those warning messages, and see how useful your logging is. You should be trying to put this into some sort of central location so you can go and aggregate that data quickly if needed.
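As a sketch of what a more useful log statement might look like, the Python example below uses the standard `logging` module with a small JSON formatter so that each line carries structured context and is easy to aggregate centrally. The `context` key and the field names inside it are conventions invented for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # 'context' is a convention of this sketch, passed via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("profile-service")  # hypothetical service name
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # Compare "call failed" with a line that says what failed, where, and how.
    log.warning(
        "downstream call failed",
        extra={"context": {"downstream": "profile-db", "timeout_ms": 250, "attempt": 2}},
    )
```

Triggering the failure path on purpose, for example during resiliency testing, is the quickest way to find out whether a line like this actually contains what you need.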

Documentation. This is probably the hardest tenet to execute on, because it's the least exciting. I don't know how to make it sound more exciting. Personally, I really like clean documentation, and I like documentation where I don't have to do a whole lot of work to maintain it. If you're going to use something like Confluence wiki by Atlassian, they've got macros where you can go and pull in data from JSON APIs and then go and render that into nice tables. So, for my service documents at LinkedIn, about 80% of that information is actually just being pulled from different APIs. They're rendered into this nice wiki page, which I print out in case my wiki goes down. I don't actually have to do a huge amount of work to go and ensure that it's up to date.

You should have a central place for all of your documentation. Again, a wiki or Read the Docs site somewhere that basically anyone in your engineering organization or your product organization can go and get to, so they don't have to go and search across three different servers to find the information that they want. Documentation should be reviewed by the engineers, the SREs, or various other partners regularly, and everyone should be contributing to this. There really shouldn't be one person writing all the documentation, because you're probably going to end up with incomplete documentation that's seen from one unconsciously biased person, possibly the engineer or the SRE. Everyone should be working together on this, and you should be reviewing it regularly.

When I was supporting profile services at LinkedIn, we actually had a quarterly goal to go and recertify all of that documentation. And that's not only the SREs, but the software engineering people responsible for those applications. Everyone was responsible for signing off on these on a quarterly basis to make sure that they were up-to-date.

What should your documentation include? This is a very short list of the things that I think are most useful. Key information about your application. What ports does it listen on? What are the hostnames that it's available by? Names of clusters. A description of what the service actually does. How does it fit into the ecosystem of the product that it supports? Diagramming how the application works and how it fits into the ecosystem is equally important. It's very easy to write a long essay about the ecosystem itself, but as they say, a picture paints a thousand words. It's really important to have that and make sure it's editable by other people as well.

More than anything, I really think a detailed description of the APIs that the service exposes is really important. Having query parameters, best practices, and expected response codes is all super useful. If any of you work with payment merchants and you use a provider like PayPal, they have almost 1,000 different response codes. That is a lot. But they're all very well-documented, and with that information, you can go and code something that's fairly resilient.

Having information about your on-calls. Who supports the service? Is it the SRE? Is it the developer? How can I reach them? This is really important. It should be tying into some sort of on-call API service that you have. And finally, onboarding information. How do you use the service correctly and validate that you're using the service correctly?

Creating Measurable Guidelines

Now that we've spoken about the tenets of readiness, let's talk about creating measurable guidelines. Not all of the things I've spoken about actually turn into something that I can easily grade. You may need to look at outcomes of specific guidance to go and create your measurable guidelines. If we go back to our stability example of readiness, so stable development, stable deploy cycles. Let's go and look at creating some measurable guidelines here. For the development cycle, we can go and look at, what's my unit test coverage percentage? Has this code base been built recently? If I need to go and make an emergency bug fix to this application, can I be sure that I can go and do this with confidence?

Equally, are our library dependencies up to date? It's very easy, if you're depending on a lot of different libraries, for them to fall out of date. Those new dependencies might have better ways to do things or change how things work underneath you. You should be ensuring that you are doing some sort of dependency management here and checking that you are using the best practice library for something. If we go and look at the deployment process, has the application been deployed recently? This sort of builds off the library management: I should have the best libraries or code dependencies in my application. It's great having that committed, but it also needs to be deployed.

What is the successful deployment percentage? At LinkedIn, we have a measure called developer velocity. And part of that measurement is the number of successful deploys versus the number of rollbacks. If you go and deploy something, and then you've got a number of critical bugs in there and you need to roll that back, we're going to count that against the stability of the service. Again, is there some sort of staging environment or Canary process for your application? They're all fairly reasonable guidelines. Let's go and look at measuring these.
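The stability guidelines above translate fairly directly into pass/fail checks. The Python sketch below is one hypothetical way to express them; the record layout and every threshold (70% coverage, 30 days, 90% deploy success) are assumptions for illustration and are not LinkedIn's actual criteria.

```python
from datetime import date
from typing import Dict


def stability_checks(svc: dict, today: date, max_age_days: int = 30,
                     min_coverage: float = 70.0,
                     min_deploy_success: float = 90.0) -> Dict[str, bool]:
    """Turn the stability guidelines into named pass/fail checks.

    svc["deploys"] is a list of booleans, True meaning the deploy was rolled back.
    """
    deploys = svc["deploys"]
    succeeded = sum(1 for rolled_back in deploys if not rolled_back)
    success_pct = 100.0 * succeeded / len(deploys) if deploys else 0.0
    return {
        "unit_test_coverage": svc["coverage_pct"] >= min_coverage,
        "built_recently": (today - svc["last_build"]).days <= max_age_days,
        "deployed_recently": (today - svc["last_deploy"]).days <= max_age_days,
        "deploy_success": success_pct >= min_deploy_success,
        "has_canary_or_staging": svc["has_canary"] or svc["has_staging"],
    }


if __name__ == "__main__":
    service = {  # hypothetical service record
        "coverage_pct": 82.0,
        "last_build": date(2018, 11, 1),
        "last_deploy": date(2018, 9, 1),
        "deploys": [False, False, False, True],
        "has_canary": True,
        "has_staging": False,
    }
    for name, passed in stability_checks(service, today=date(2018, 11, 10)).items():
        print(f"{name}: {'pass' if passed else 'FAIL'}")
```

Keeping each guideline as a named boolean makes it straightforward to report which specific checks a service is failing.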

Measuring Readiness

Why do we want to go and measure this? Having measurable guidelines helps ensure two key outcomes. Firstly, standardization. We can ensure that our microservice applications are more or less built in the same way. Secondly, we have some level of quality assurance. Can we be sure that the service is trustworthy?

How do we do this? Well, there are two ways. You can go and do it manually, which I think the industry is more or less doing, where we have manual checklists and ensure that when we deploy that application the first time, we've hit most of our checklists. Or you can do it automatically. And we've definitely found at LinkedIn that we have to do it automatically. What we define as our best practice can change up to a couple of times a week. We now have over 1,000 microservices. We can't manually go and check these every single week. We have to do it automatically.

We've built this tool called Service Score Card, which helps measure our production readiness. Service Score Card automatically goes and finds the services that are deployed in production but also in staging. We can catch these things before they hit production, and we can go and give a breakdown of the scores per team. Unfortunately, I've had to blur out some of this information. When the screenshot was taken, I was a part of two different teams. One was doing slightly better than the other, and we get an aggregate score for each team. We can then break it down into the team view, where I can see all of the applications that that team owns and the relative scores.

For my team here, all of my services are internal to the company. It's not critical that they're all at 100%. But I can very quickly see, how does my application stack up to what we consider production-ready at LinkedIn? I can then go and deep dive. And, again, unfortunately, I've had to obfuscate the data, but I can go and deep dive on what checks are passing and what checks are failing.

At LinkedIn, we go and do horizontal initiatives where we spend a certain amount of our engineering time each quarter to go and ensure that applications are meeting some sort of best practice. This often requires a large amount of engineering effort. Service Score Card is a really great way for us to go and check how many applications are meeting the best practice that we've defined. In the near future, we're actually going to be providing notifications to teams that fall out of our production-readiness lists, where they start to have checks that are failing for production-readiness.

Key Learnings

Key learnings. The first is to create. Go and create a set of guidelines for what it means for your service to be production-ready. What I have presented today may not 100% fit your company, which is fine. So you need to go and internally look at what makes an application at your company production-ready and write that down. Secondly, you need to go and automate these checks. Doing it manually is pretty difficult once you start to have maybe more than half a dozen services. So start looking at ways to automate that. I've definitely seen teams within LinkedIn where, as a part of their integration testing, they're actually running various production-readiness checks.
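One way to fold a readiness check into a test suite is sketched below in Python. The required fields, the metadata layout, and the function names are hypothetical; they stand in for whatever guidelines your own company writes down, and the talk does not describe the checks LinkedIn teams actually run.

```python
# Hypothetical guideline: every service must declare these metadata fields.
REQUIRED_FIELDS = ("owner", "oncall_rotation", "runbook_url")


def readiness_failures(metadata: dict) -> list[str]:
    """Return the required fields that are missing or blank in the metadata."""
    return [field for field in REQUIRED_FIELDS
            if not str(metadata.get(field) or "").strip()]


def test_service_is_production_ready():
    # In a real suite this would be loaded from the service's own
    # descriptor; it is hard-coded here to keep the sketch self-contained.
    metadata = {
        "owner": "team-a",
        "oncall_rotation": "team-a-primary",
        "runbook_url": "https://example.com/runbooks/my-service",
    }
    assert readiness_failures(metadata) == []
```

A test runner such as pytest would pick up `test_service_is_production_ready` automatically, so a service that drops a required field fails its build rather than waiting for a manual checklist review.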

And then finally, evangelize this. Setting these guidelines and measuring them is great. But you need to set the expectation with your engineering, with your product teams, with your SRE teams, that we have these guidelines, and they matter, and they need to be met. You need to put them in place, evangelize, set the expectation, and really make it a part of company culture.

That concludes my time for today. We've got about five minutes for questions. Thank you very much, QCon.

Questions & Answers

Participant 1: You mentioned that you don't automate everything that comes to your head in terms of ensuring that you're meeting your criteria for production-ready, right? What kind of guidelines do you use to tell what you should, or not, automate?

Kehoe: So the question is how we decided what checks should be automated in our production-readiness? For us, I don't know if I have a really good answer. Basically, anything that we know we can measure in some sort of automated fashion, we've already done it. The Service Score Card application I showed you is more or less community-driven. There is one team that sort of owns that, but anyone can go and contribute to that. Any team that identifies the need to set an expectation that, "Hey, this library really needs to be used," or, "You need to have this metadata somewhere about your service", they've gone and done that. I don't really have any more I can say about that, unfortunately.

Participant 2: How do you alleviate the cost-of-carry concerns when you try to build things scalable and correctly from the beginning?

Kehoe: For LinkedIn, we've got a very heavy focus on craftsmanship. It's sort of expected; our promotion process and our review cycles are all very much based on this concept of craftsmanship. So if you're going and committing code with no unit tests, not even using a staging environment, it's very easy for your management to go and see that and mark you accordingly. So we're very much empowered to go and do the right thing from the start, and the Service Score Card is a really good way of sort of validating that the effort has been put in to make the application production-ready, in line with how, as a company, we think an application should be built.

It's more of a culture question, I think. For us, it's not a concern. I was working on another team earlier this year, and their Service Score Card was not that great. I said, "All right, this is important. We need to fix it." I'm an SRE, I needed developers to go and make some changes to their application. I didn't get any pushback. The only pushback I sort of got was like, "Hey, make sure it's in the next sprint, and we'll prioritize it," and that was it. It was all pretty simple.

Participant 3: Thanks for a great talk. My question is about, did you have a situation when some teams started to compete for such scores and forgot about some important stuff, and how you handled this?

Kehoe: We haven't reached that point. I think, as engineers, and also our management (and I've spoken to both groups fairly extensively about this), we understand that it's sort of one signal. So, for our stream processing jobs, it's pretty easy to get a score of 100%. All you have to do is have your library dependencies basically up-to-date. There are teams that got 100%, which looks good, and it's one signal for management. It's not as if, at performance review time, information like this is the only piece of data people are looking at. It's just one signal for management to look at, I think.

As a competitive person, and as the most senior SRE in the company, I do want my teams to be closer to the top or at the top. It is not necessarily easy to always have 100%, and I don't think we've really got the issue where teams are competing or subverting the system, because it goes back to our engineering culture. There are definitely ways for me to go and subvert those checks.

But if I go and do that, and I have a score of 100% but three outages a week, someone's going to start asking questions at some point. We haven't run into that problem. I think what it comes down to is that a production-readiness score is just one signal among a number of other things. I was talking about the visibility engineering that my team does; we're looking at the availability of your servers, the production-readiness score of your servers, the time to detect your incidents, the time to resolve your incidents. They are all different signals that paint a larger story. And again, that's probably not the be-all and end-all of painting a picture about how reliable something is.




Recorded at:

Mar 17, 2019