At QCon San Francisco, Michael Kehoe presented "Building Production-Ready Applications". Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of "production-readiness", and argued that all engineers working across the organisation should continually focus on these: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.
Kehoe, staff site reliability engineer on the production SRE team at LinkedIn, began the talk by referencing the book "Production Ready Microservices" by Susan Fowler and an associated quote:
A production-ready application or service is one that can be trusted to serve production traffic…
… We trust it to behave reasonably, we trust it to perform reliably, we trust it to get the job done and to do its job well with very little downtime.
Changes in software architecture and the software delivery lifecycle over the past decade have made it harder to assert production-readiness, but have also provided new opportunities. For example, monolithic applications have been decomposed into multiple microservices, which allows rapid iteration and individual deployment and scaling, but also means that more services need to be verified and operated. New techniques like continuous integration, combined with the emergence of new ways of working such as DevOps and the SRE movement, mean that engineers can automate more and follow established best practices for working together across disciplines.
For the tenet of stability and reliability Kehoe began by arguing for the need for stable development and deployment cycles. In this context stability is all about "having a consistent pre-production experience". Within development this should focus on well-established testing practices, code reviews, and continuous integration. For the deployment practice, engineers should ensure that builds are repeatable, a staging environment (if required) is functioning "like production", and that a canary release process is available.
The value of a staging environment as a concept may be debatable, but if you do have one, you need to treat it like production in order to get value out of it
Unreliability in microservice-based systems usually comes from either changes in inbound traffic or changes in behaviour from downstream services. Engineers should therefore understand production traffic routing, load balancing, and service discovery and associated health checks. Dependency management is key (both from a service and code perspective), and attention should be focused on the engineer onboarding process and the sharing of established best practices. An often overlooked but very important topic is the service deprecation procedure.
The discussion of the next tenet, scalability and performance, began with a focus on understanding "growth scales", i.e. how each service scales with business goals and key performance indicators (KPIs). Engineers should be resource aware, knowing what bottlenecks exist within the system and what the elastic scaling options are. Constant performance evaluation is required -- ideally testing this should be part of the CI process -- and so is understanding the production traffic management and capacity planning that is in place.
In regard to the tenet of fault tolerance and disaster recovery, engineers should be aware of, and avoid, single points of failure within design and operation. Understanding concepts from the disciplines of resilience engineering and chaos engineering is also highly beneficial in the dynamic and ephemeral world of cloud computing. It is vital to ensure that a team knows how to react to failure and manage incidents -- for example, by creating disaster plans and running game days -- and to constantly design and run experiments that test assumptions about failure modes.
The next tenet of readiness discussed was monitoring. Kehoe argued that dashboards and alerting should be curated at the service level, for resource allocation, and for infrastructure. All alerts should require human action, and ideally present pre-documented remediation procedures and links to associated runbooks. Logging is also an underrated aspect of software development; engineers should write log statements to assist debugging, and the value of these statements should be verified during chaos experiments and game days.
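The alerting guidance lends itself to a simple automated check. The following Python sketch is illustrative only and was not shown in the talk: the alert fields (`name`, `action`, `runbook_url`), the function name, and the example URL are all hypothetical assumptions about how alert definitions might be stored.

```python
from typing import Dict, List


def alert_problems(alert: Dict[str, str]) -> List[str]:
    """Return reasons why a (hypothetical) alert definition is not actionable."""
    problems = []
    # Every alert should describe the human action it requires.
    if not alert.get("action"):
        problems.append("no human action described")
    # Every alert should link to a runbook with remediation steps.
    if not alert.get("runbook_url"):
        problems.append("no runbook link")
    return problems


# Illustrative alert definitions; names and URL are made up for this example.
alerts = [
    {
        "name": "HighErrorRate",
        "action": "Roll back the latest deployment",
        "runbook_url": "https://runbooks.example.com/high-error-rate",
    },
    {"name": "DiskFillingUp"},
]

for alert in alerts:
    print(alert["name"], alert_problems(alert))
```

A check like this could run whenever alert definitions change, so that an alert without a documented action or runbook is flagged before it pages anyone.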
The final tenet to be explored was documentation. There should be one centralised landing page for documentation for each service, and documentation should be regularly reviewed (at least every 3-6 months) by service engineers, SREs, and related stakeholders. Service documentation should include key information such as ports and hostnames, a description, an architecture diagram, an API description, and oncall and onboarding information.
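The documentation requirements can also be expressed as a mechanical check. The sketch below is a minimal illustration, not a tool from the talk: the section keys, the `last_reviewed` field, and the choice of roughly six months (183 days) as the review limit are assumptions based on the list and cadence described above.

```python
from datetime import date, timedelta
from typing import Dict, List

# Sections the talk says service documentation should include (key names are assumed).
REQUIRED_SECTIONS = [
    "description",
    "ports_and_hostnames",
    "architecture_diagram",
    "api_description",
    "oncall",
    "onboarding",
]

# Upper end of the suggested 3-6 month review cadence (assumed to be about 183 days).
MAX_REVIEW_AGE = timedelta(days=183)


def doc_gaps(doc: Dict[str, object], today: date) -> List[str]:
    """List missing sections and flag an overdue review for one service's docs."""
    gaps = [f"missing section: {name}" for name in REQUIRED_SECTIONS if not doc.get(name)]
    last_reviewed = doc.get("last_reviewed")
    if last_reviewed is None or today - last_reviewed > MAX_REVIEW_AGE:
        gaps.append("review overdue")
    return gaps


# Illustrative documentation record with one missing section and a stale review date.
example_doc = {
    "description": "Handles profile lookups",
    "ports_and_hostnames": "profile.example.internal:8443",
    "architecture_diagram": "diagrams/profile.png",
    "api_description": "docs/api.md",
    "onboarding": "docs/onboarding.md",
    "last_reviewed": date(2023, 1, 1),
}

print(doc_gaps(example_doc, today=date(2024, 3, 1)))
```

Passing `today` explicitly keeps the check deterministic, which makes it easier to test than one that reads the system clock.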
Wrapping up the exploration of the readiness tenets, Kehoe moved on to discussing how to implement measurable guidelines to help implementation and verify assertions. The challenge is that the general concepts of readiness often cannot be directly translated into "something that is true or false", and therefore cannot be directly scored or measured. One alternative is to focus instead on the outcomes of following specific readiness guidelines.
There is value in creating manual checklists for measuring guidelines and associated outcomes -- especially when a team is just starting to explore the tenets of readiness -- but the real value is delivered through the automated generation of readiness scorecards. Kehoe demonstrated several LinkedIn internal automated service checking frameworks and the associated dashboards that are used to drive readiness implementation and verification.
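To make the idea of an automated scorecard concrete, the following Python sketch scores a service against one true-or-false guideline per tenet. It is an assumption-laden illustration and does not represent LinkedIn's internal frameworks: the `Check` type, the metadata field names (`canary_enabled`, `perf_tests_in_ci`, and so on), and the specific guidelines chosen are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical service metadata; field names are assumptions for this sketch.
Service = Dict[str, object]


@dataclass(frozen=True)
class Check:
    tenet: str                          # readiness tenet the guideline belongs to
    guideline: str                      # outcome that can be judged true or false
    passes: Callable[[Service], bool]   # how the outcome is verified


CHECKS: List[Check] = [
    Check("stability and reliability", "canary release process is available",
          lambda s: bool(s.get("canary_enabled"))),
    Check("scalability and performance", "performance tests run in CI",
          lambda s: bool(s.get("perf_tests_in_ci"))),
    Check("fault tolerance and disaster recovery", "disaster plan exists",
          lambda s: bool(s.get("disaster_plan_url"))),
    Check("monitoring", "service-level dashboard exists",
          lambda s: bool(s.get("dashboard_url"))),
    Check("documentation", "central documentation landing page exists",
          lambda s: bool(s.get("docs_url"))),
]


def scorecard(service: Service) -> Dict[str, object]:
    """Evaluate every check and report the pass count plus the failed guidelines."""
    failed = [f"{c.tenet}: {c.guideline}" for c in CHECKS if not c.passes(service)]
    return {"passed": len(CHECKS) - len(failed), "total": len(CHECKS), "failed": failed}


# Illustrative service record that is missing a disaster plan.
example_service = {
    "canary_enabled": True,
    "perf_tests_in_ci": True,
    "dashboard_url": "https://dashboards.example.com/profile",
    "docs_url": "https://docs.example.com/profile",
}

print(scorecard(example_service))
```

Keeping each guideline as a separate boolean check means the list of failures doubles as a to-do list for the owning team, rather than only producing a single number.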
The key learnings of the talk were summarised as: create a set of guidelines for what it means for a service to be "ready"; automate the checking and scoring of these guidelines; and set the expectation between product, engineering, and SRE teams that these guidelines have to be met as part of a "definition of done".
A PDF of the slides from Kehoe's QCon SF talk, "Building Production-Ready Applications", can be downloaded from the conference website. The full video recording of the presentation will be made available on InfoQ over the coming months.