Many incidents happen during or right after the release, argues Charity Majors, CEO at Honeycomb. She believes that stronger ownership of the deployment process by developers will ensure it is executed regularly and reduce risk. Fear of the release pipeline will lead developers to ship less, which in turn increases the risk and impact of delivering new code. She argues for strong investment in the tooling, high observability during and after release, and small, frequent releases as a way of minimizing impact caused by shipping new code.
According to Majors, if releasing is risky or highly impactful, then you need to make fixing that a priority. In the short term, she recommends tracking how often deployments fail and then running postmortems on those failures. This will reveal the areas where investment would make your pipeline more resilient. You can also perform your deployments in the morning, during business hours, when everyone is available and fresh.
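As a rough illustration of tracking failure frequency, the sketch below computes a failure rate over a list of deploy records and counts failures by pipeline stage, which could help pick candidates for a postmortem. The `Deploy` record, its fields, and the stage names are assumptions made for this example; they do not come from Majors or from any particular deployment tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Deploy:
    """One deployment attempt (hypothetical record shape)."""
    build_id: str
    succeeded: bool
    failed_stage: str = ""  # empty when the deploy succeeded


def failure_rate(deploys):
    """Return the fraction of deploys that failed, or 0.0 for no deploys."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if not d.succeeded)
    return failed / len(deploys)


def failures_by_stage(deploys):
    """Count failed deploys per pipeline stage."""
    return Counter(d.failed_stage for d in deploys if not d.succeeded)


# Illustrative data only.
history = [
    Deploy("build-101", True),
    Deploy("build-102", False, failed_stage="migrate"),
    Deploy("build-103", True),
    Deploy("build-104", True),
]

print(failure_rate(history))       # 0.25
print(failures_by_stage(history))  # Counter({'migrate': 1})
```

Grouping by stage is one possible design choice: it points the postmortem at the part of the pipeline that fails most often rather than at individual releases.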
To begin to drive long-term improvements, the project needs an owner. As Majors notes, "if it doesn't have an owner, it will never improve." Companies can signal the importance of this project by making one of their best engineers the owner of these improvements. Majors puts forward that deploy software is often "a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns' earnest attempts". In her view, deploy software is the most important code you have and should be treated as such. This resonates with Alek Sharma, technical writer at CircleCI, who believes that DevOps is based on the argument that "code isn't really serving its purpose until it's been delivered".
Majors then suggests that the developer who merges the code must also be the one who deploys the code. To ensure this is possible, she states that all developers must be software owners who:
1. Write code
2. Can deploy and roll back their own code
3. Are able to debug their own issues in prod (via instrumentation, not ssh)
Software ownership, according to Majors, is the natural end state of DevOps. It will help to shorten feedback loops and ensure that the developer with the most context of the change is ready to assist if the deployment goes poorly. This leads to better software and a better experience for customers. She goes on to state that a developer should not be promoted to a senior position if they do not know how to deploy and debug their own code in production.
However, Serhat Can, tech evangelist at Atlassian, suggests that if the release requires human approval, then the on-call team should perform the release. They can then follow up with the developers to have them perform their post-release checks. Developers should ensure it is easy for the on-call team to see the latest issues that could arise from the release.
The goal of these improvements is not to eliminate failure. As Majors notes, "distributed systems are never 'up'; they exist in a constant state of partially degraded service. Accept failure, design for resiliency, protect and shrink the critical path." Instead, the focus should be on shipping small changes frequently and exercising the process often enough that failures become non-events, since they are routine and non-impactful. From here, you can begin to use the failures as learning opportunities.
Majors then introduces the concept of Observability Driven Development. Christine Yen, CPO at Honeycomb, describes this as "using production data to inform what, when, and how you develop, feature flag, and release system functionality". She recommends incorporating the nouns that are used during development, such as build IDs, feature flags, or customer IDs, directly into your observability tooling. This will facilitate connecting the data to the change that introduced the failure. As Majors notes, "the quality of code is not knowable before it hits production". Outfitting your code with the appropriate instrumentation before it ships will therefore assist with debugging the changes in production.
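One way to picture attaching development nouns to telemetry is a structured event that always carries the build ID, the active feature flags, and the customer ID. The following is a minimal sketch using only the Python standard library. The function names, field names, and the `BUILD_ID` value are hypothetical, and this is not Honeycomb's API.

```python
import json
import time

# Hypothetical build identifier; in practice it might be injected at build time.
BUILD_ID = "build-1234"


def make_event(name, customer_id, feature_flags, **fields):
    """Build a structured event tagged with build, customer, and flag context."""
    event = {
        "timestamp": time.time(),
        "name": name,
        "build_id": BUILD_ID,
        "customer_id": customer_id,
        # Record only the flags that are switched on for this request.
        "feature_flags": sorted(flag for flag, on in feature_flags.items() if on),
    }
    event.update(fields)
    return event


def to_json(event):
    """Serialize the event as one JSON line, ready to hand to a log shipper."""
    return json.dumps(event, sort_keys=True)


event = make_event(
    "checkout",
    customer_id="cust-42",
    feature_flags={"new_cart": True, "dark_mode": False},
    duration_ms=12,
)
print(to_json(event))
```

With fields like these on every event, a spike in errors can be filtered by `build_id` or by a specific feature flag to narrow down which change introduced it.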
While it may be true that most incidents happen just after a release, Majors concludes that a strong culture of ownership can help reduce the frequency and impact of these incidents. Having developers who can debug their own code in production will improve ownership over their releases. According to Majors, greater ownership over the release process will encourage developers to ship more frequently, resulting in smaller, higher-quality releases.