Blameless Post-Mortems and On-Call Gamification at 1st DevOpsDays Portugal (Day 2)

Ten years after the first DevOpsDays conference in Ghent, the evolution of DevOps, and of the organizations trying to adopt it, was at the forefront of the first DevOpsDays conference in Portugal. On the second day, a mix of local and international speakers covered topics such as learning from incidents without blame, gamifying on-call, modern continuous delivery, and more.

Pranjal Deo's keynote on blameless post-mortems set the tone for the day. Deo shared her experiences learning from incidents without blame, both professionally at Google and in her personal life (for instance, when facing signs of burnout). A key takeaway was that missing the chance to learn from a serious incident makes it even more costly. To avoid that, Deo recommends defining measurable criteria in advance that trigger a post-mortem, such as a certain number of users affected, an amount of revenue lost, or even a canary release showing a significant regression. Besides running post-mortems consistently, other critical factors for a culture of transparency include keeping blameful speech out, celebrating the discovery of vulnerabilities, establishing psychological safety (so that people speak up), and taking a continuous improvement approach to work (where failures are seen as learning opportunities). Finally, Deo warned that having all of the above but failing to provide adequate time to complete follow-up actions could derail the entire approach.
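
As an illustration of pre-defined, measurable trigger criteria, the following minimal Python sketch checks an incident against fixed thresholds. The threshold values, field names, and function names are hypothetical and were not part of the keynote; each team would pick its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Incident:
    users_affected: int
    revenue_lost: float           # in the team's currency of choice
    canary_regression_pct: float  # regression of the canary vs. the baseline, in percent


# Hypothetical thresholds, agreed on before any incident happens.
USERS_THRESHOLD = 1000
REVENUE_THRESHOLD = 10_000.0
CANARY_REGRESSION_THRESHOLD = 5.0


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return the names of all criteria the incident meets or exceeds."""
    reasons = []
    if incident.users_affected >= USERS_THRESHOLD:
        reasons.append("users_affected")
    if incident.revenue_lost >= REVENUE_THRESHOLD:
        reasons.append("revenue_lost")
    if incident.canary_regression_pct >= CANARY_REGRESSION_THRESHOLD:
        reasons.append("canary_regression")
    return reasons


def requires_postmortem(incident: Incident) -> bool:
    """A post-mortem is triggered as soon as any single criterion is met."""
    return bool(postmortem_reasons(incident))
```

Because the criteria are fixed up front, the decision to run a post-mortem does not depend on who caused the incident or on how anyone feels about it afterwards.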

Pedro Torres talked about the challenges of scaling on-call at Talkdesk. In the early days everyone, even the CEO, was on-call for all systems, and some people nearly burnt out. Today, following Google's SRE model, on-call engineers are taken off sprint work so they can increase resiliency and reduce toil around the systems they work on. On-call work is now compensated with a flat fee plus some time off (versus no compensation early on), and performance reviews are not linked to on-call participation. Playbooks for on-call procedures are put to the test in weekly fire drills, where incidents that resulted in post-mortems are reproduced to check whether the playbook gives an engineer sufficient guidance to fix them. The most recent ingredient was gamification (with "prizes" such as personalized mugs, stickers and notebooks); MTTR improved by 12% compared to pre-gamification levels, which had already improved substantially since the early days.
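
To make the MTTR figure concrete, the sketch below shows one way such a metric and its relative improvement could be computed. It assumes MTTR stands for mean time to recovery, measured from detection to resolution; the function names are hypothetical and the talk did not describe how Talkdesk calculates the number.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mttr(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery over (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


def improvement_pct(before: timedelta, after: timedelta) -> float:
    """Relative reduction of MTTR in percent (positive means faster recovery)."""
    return (before - after) / before * 100
```

With these definitions, an MTTR that drops from 50 to 44 minutes (illustrative numbers, not Talkdesk's) corresponds to a 12% improvement.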

Ken Mugrage's presentation on modern Continuous Delivery (CD) started with a recap of the history of CD from Thoughtworks' point of view (resulting in the books "Continuous Delivery", "Building Microservices", and "Infrastructure as Code", each (co-)authored by a current or former Thoughtworker). Mugrage then mentioned how container images have become the new, ubiquitous artifact in modern pipelines, and how Kubernetes is becoming the de facto platform for managing environments, whether for production, testing, or staging. Other critical pieces of modern software delivery, according to Mugrage, include feature toggles, trunk-based development, and support for multiple deployment strategies, from canary releases to blue-green deployments and rolling updates, chosen according to the application context and the purpose of the changes being deployed. Finally, releasing database changes separately from application code changes (but via the same pipeline), using dynamic environments (also called ephemeral environments), vulnerability checking, and secrets management are also important for a modern Continuous Delivery approach.
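
Feature toggles and canary-style rollouts are often combined by enabling a feature for a fixed percentage of users. The following is a minimal sketch of that idea, not something shown in Mugrage's talk; the function names and the hashing scheme are assumptions chosen for illustration.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the feature for roughly `rollout_percent` percent of users.

    The bucket is derived from a hash, so a given user gets the same answer
    on every request, and raising the percentage only ever adds users.
    """
    return rollout_bucket(feature, user_id) < rollout_percent
```

A rollout would then start with a small percentage, for example `is_enabled("new-checkout", user_id, 5)`, and raise the value as confidence grows; setting it to 0 switches the feature off without a new deployment.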

Speaking of deployment strategies, Pierre Vincent stressed that there is a difference between real downtime and perceived downtime during application (and database) updates. Vincent explained how Poppulo decided to adopt a zero downtime strategy once the employee communications software they created was being used worldwide and maintenance windows became impractical for customers. Vincent said that even legacy systems with traditional databases can benefit from patterns like expand/contract to reduce downtime, but that real zero downtime is nearly impossible to achieve. According to Vincent, zero downtime is a matter of user perception: it doesn't mean that every service needs to be up during a migration, but rather that end users don't notice the migration.
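
The expand/contract pattern can be sketched with an in-memory SQLite database: first add the new column alongside the old one (expand), then backfill it while old application versions keep working, and only remove the old column once nothing reads it (contract). The table, column, and function names below are hypothetical, and this is not Poppulo's actual migration. The contract step rebuilds the table only because older SQLite versions lack `DROP COLUMN`; on most databases it would be a single column drop.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Step 1: add the new column; existing readers and writers are unaffected."""
    conn.execute("ALTER TABLE users ADD COLUMN email_address TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    """Step 2: copy data from the old column into the new one."""
    conn.execute(
        "UPDATE users SET email_address = email WHERE email_address IS NULL"
    )


def contract(conn: sqlite3.Connection) -> None:
    """Step 3: drop the old column once no application version uses it."""
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, email_address TEXT)"
    )
    conn.execute(
        "INSERT INTO users_new (id, email_address) "
        "SELECT id, email_address FROM users"
    )
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def columns(conn: sqlite3.Connection) -> list:
    """Return the column names of the users table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


def demo() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    expand(conn)
    # An old application version can still write to the old column only.
    conn.execute("INSERT INTO users (email) VALUES ('b@example.com')")
    backfill(conn)
    contract(conn)
    conn.commit()
    return conn
```

In a real rollout each step would ship as a separate deployment, with the application writing to both columns between the expand and contract steps, so that no single change requires taking the system offline.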

Other talks on this day included one by João Vale and André Ferreira on the challenges of adopting Kubernetes in production, and a tour by Rhys Evans of the Financial Times' microservices "legacy" and how they used a graph database (Neo4j with GraphQL) to create a shared model of the services and their corresponding ownership.

Videos for most talks will be published on the conference's YouTube channel. According to the organizers, the next edition of the conference will take place in 2020 in the city of Oporto.
