The Way to No-Hotfix Deployment

Mar 11, 2016 16 min read

My team and I dreaded re-deployment! Imagine that, instead of working on a feature, you have to re-deploy the system: versioning it, asking the scrum master to write the change request, packaging the system, and going to the IT operations team to ask for their help with deployment. At the end of the day the burndown chart looks bad and stakeholders become unhappy.

The unfortunate thing about being a software engineer is that software is intangible. By its very nature, it's hard to see what holes it has. And when those holes are uncovered in a production environment, patching them with a hotfix may cause an exhausting re-deployment. This article offers ready-to-use tips on how you can reduce the need for hotfix deployments.

But Bugs are Everywhere

During my career as a Software Engineer, I have witnessed many kinds of error, or what engineers call "bugs." They could be anywhere: in the server, in the client, in the browser, in the networking equipment!

Not only are bugs everywhere, they are costly too. It is expensive to find them, to emulate their occurrences, and to fix them—a process called debugging. NASA once experienced one of the most costly bugs ever found, vaporizing $125 million into the thin atmosphere of Mars.

But "people make errors," said Tom Gavin, the JPL administrator of the NASA project. Unfortunately, the very nature of software engineering makes it hard to detect potential bugs early during development.

Several measures can be taken to reduce the likelihood of unwanted bugs creeping into our software. This will minimize the risk of hotfix deployment. This article discusses those measures in four categories:

Development time
Pre-deployment time
Deployment time
Post-deployment time

Development Time

It all begins with a clear Story

A good story includes a good definition of done, and a good story breakdown. A good story will help the team to:

Reduce ambiguity.
Understand the purpose of the story.
Make the development process faster.
Support better quality of delivery.

A good story begins with a good title, usually in the format of:

As a [role] I want [activity] so [goal/value]

Next, it has a good definition of done. The definition of done tells the engineering team about each testable scenario that has to be accommodated for the story to be considered deliverable. The definition of done is written by the product owner, and will be checked by her.

The engineering team owns the task breakdown. The list comprises the technical steps needed to build the story. Bear in mind that each item should be achievable within a day at most; if not, break it down further. This makes it easier for the product owner to track progress, and also gives the team some sense of achievement.

Bad stories have some common characteristics:

Unclear; what is the purpose of this story?
Lacking a wireframe example. Even a simple one: just the expected output.
Ambiguous definition of done. Not sure what to expect.

For instance, a story describes a change in system A but fails to mention that the change will affect system B, which is maintained by the same team. The result: a bug! But how do we prevent this if system B is maintained by a different team? We explore this further below.

Do-not-rush Culture

This is the second most important thing after having a good story definition. Let me begin by telling you a story of how, by not rushing, we avoided hotfixes.

Once upon a time, there was a sense of urgency whenever we were about to deploy. That is, we felt pressured to include as many stories as possible on deployment day. This resulted in inattention to detail, and we often needed to re-deploy. Once, we had to deploy three times for hotfixes, which was embarrassing.

But when we began to work to a steady rhythm, we found we rarely needed a hotfix deployment, except for the occasional security issue. We take it calmly: if a story cannot be completed and fully tested in time, we don’t include it in the deployment; we wait for the next cycle. Because we apply this rule religiously, we now rarely have to re-deploy.

It’s better to deploy once successfully than to deploy in a rush and fail many times.

The key is to know how to assign a priority. How?

First things first. Don't let customers, the merchant solution or customer service teams, or anyone else tell you that "this feature has to be in production very, very soon." Always find a way to say no. Quite likely, when the feature does ship, it will not be used until some months later.

Some features, fixes, and bugs do demand a rapid response; the key is to prioritize! We normally have four levels of priority: P0, P1, P2 and P3.

P1 is assigned whenever there is an urgent story that needs to be deployed soon. For instance, if prospective users cannot register after a code change on the sign-up page, consider the bug a P1.

Higher than P1 is P0. P0 is the very, very alarming one. It renders the system completely useless. It could make many customers angry. It could bankrupt your company. If it requires you to jump out of bed, do so.

One example of a P0 bug is a deployment after which the system's memory consumption grows exponentially, dragging performance down until users can no longer reliably access the system. This needs an “all hands on deck” response: fix the problem quickly (without panicking) and deploy a hotfix as soon as humanly possible.

P2 is for the current sprint. P2 stories are those that were thought of well before the sprint commenced and were planned for inclusion in it.

Lower than P2 is P3, which can wait to be planned into a future sprint. How do you know if a story is P3? Well, if there is some other means to achieve the outcome without having the requested story deployed, then the story can wait. Make P3 the default: always try to assign P3 to requests for unplanned features or bug fixes.
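
The four levels could be modelled in code, for example in a small ticketing script. The Python sketch below is only an illustration under stated assumptions: the `Priority` enum, the `triage` function and its three boolean flags are hypothetical names invented here, not part of any tool mentioned in this article.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels from the article; a lower value is more urgent."""
    P0 = 0  # system unusable: all hands on deck, hotfix as soon as possible
    P1 = 1  # urgent story or bug that needs to be deployed soon
    P2 = 2  # planned before the sprint started, part of the current sprint
    P3 = 3  # default: can wait for a future sprint


def triage(system_unusable: bool,
           blocks_key_flow: bool,
           planned_this_sprint: bool) -> Priority:
    """Pick a priority, falling back to P3 for unplanned requests.

    The three flags are hypothetical simplifications of the questions
    a team would ask when a request comes in.
    """
    if system_unusable:
        return Priority.P0
    if blocks_key_flow:
        return Priority.P1
    if planned_this_sprint:
        return Priority.P2
    return Priority.P3  # "Make P3 the default"
```

A sign-up page that rejects every new user would be triaged as `triage(False, True, False)`, which yields `Priority.P1`, while an unplanned feature request with a workaround falls through to `Priority.P3`.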

Following these guidelines, I have a much better state of mind and the team writes less buggy code.

Good coding practice

If you have a team, do pair programming at least once in a while. Pair programming allows knowledge exchange. A new hire pairs up with a more experienced coder on the team. By doing so the new hire learns about the system and can soon begin implementing features on their own. Some companies go further and make pair programming the norm, providing just one computer for every two coders to enforce pairing.

Write good documentation and good commit messages. There are many more techniques that lead to better code: don’t hard-code values, use clearly named variables to make code easier to read, and so on. Find a good reference for coding practices and make sure the whole team understands why and how to use them.
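
As a small illustration of the last two points, here is a before-and-after sketch in Python. The discount rule, the function names and the constants are all hypothetical and exist only to show the difference; amounts are kept as integer cents to avoid floating-point rounding.

```python
# Before: hard-coded "magic" values and names that explain nothing.
def calc(a, d):
    return a - a * 10 // 100 if d > 30 else a


# After: the same rule with named constants and descriptive names.
LOYALTY_DISCOUNT_PERCENT = 10   # hypothetical business rule
LOYALTY_THRESHOLD_DAYS = 30     # hypothetical business rule


def price_after_loyalty_discount(amount_cents: int, days_as_customer: int) -> int:
    """Return the price in cents, discounted for long-standing customers."""
    if days_as_customer > LOYALTY_THRESHOLD_DAYS:
        discount_cents = amount_cents * LOYALTY_DISCOUNT_PERCENT // 100
        return amount_cents - discount_cents
    return amount_cents
```

Both functions compute the same result, but only the second one tells a reviewer what the numbers mean and gives the team a single place to change them.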

Take the physical aspects into account as well. Allow people to work in the way that works best for them: if someone feels they do better work at night, let them work evenings. Take care of the whole person: a good physical environment, a safe working culture and healthy habits, such as getting a good night’s sleep, result in less stressed, more fulfilled people who make better products.

Quality Control

The most important aspect of quality control is testing. There are many kinds of testing: unit, integration, black box, white box, and user acceptance testing (UAT). Testing shouldn’t be negotiable. Sadly, it often is.

Ignoring tests means accumulating technical debt. As time goes by the development process will deteriorate, as developers have to worry about their changes breaking other parts of the code. Even with passing tests we still find bugs, so imagine how many we would ship without them! As Dijkstra said:

Testing never proves the absence of faults, it only shows their presence.

Hopefully we are all aware of what testing is for and of how incomplete software development is without it. It is also good to measure code coverage, which lets you examine which parts of the code are tested and which are not. In its simplest, line-based form, code coverage is the number of lines executed by the tests divided by the total number of lines of code, expressed as a percentage.

A good rule of thumb is 85% or higher coverage; meaning, your tests should cover at the very minimum 85% of your code.

Of course, 85% code coverage does not guarantee a bug-free system, and I do not claim that code coverage is a measure of how few bugs you have. Don’t miss the point: 99.99% code coverage doesn’t mean a 99.99% less buggy system.
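
To make the 85% rule of thumb concrete, here is a minimal Python sketch of a coverage gate. It assumes the simplest line-based definition of coverage (lines executed by tests divided by total lines); the function names and the `MIN_COVERAGE_PERCENT` constant are hypothetical, and a real project would normally read these numbers from its coverage tool rather than compute them by hand.

```python
MIN_COVERAGE_PERCENT = 85.0  # the rule of thumb from the article


def coverage_percent(covered_lines: int, total_lines: int) -> float:
    """Line coverage as a percentage: covered / total * 100."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    if not 0 <= covered_lines <= total_lines:
        raise ValueError("covered_lines must be between 0 and total_lines")
    return 100.0 * covered_lines / total_lines


def meets_threshold(covered_lines: int, total_lines: int,
                    minimum: float = MIN_COVERAGE_PERCENT) -> bool:
    """Return True when coverage is at or above the agreed minimum."""
    return coverage_percent(covered_lines, total_lines) >= minimum
```

A build script could call `meets_threshold` and refuse to package the release when it returns `False`. Passing the gate only says the lines were executed by some test, not that the behaviour is correct.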

Just as important as code coverage is continuous integration (CI): integrate a continuous build tool such as Travis CI, Drone, or Codeship into your workflow. That way, any change to a branch triggers the tests, and when the tests pass, a package is built for you. CI strives to maximise your efficiency.

As the product owner, please do not compromise on testing! By testing very rigorously, we were able to catch a P0 bug when testing our package in the production environment! That was after running the tests two or three times on both the staging and production environments.

Synchronize work

Synchronizing your work is very important for all the stakeholders: product owner, developers, scrum master, and other teams. One way to achieve this is a daily standup meeting, where at a fixed time of day everyone says what they did yesterday and what they plan to do today.

If you have remote team members, it may be a good idea to hold the standup meeting twice daily, once in the morning and once in the evening. By doing so, you learn what your peers in other parts of the world are planning to do, and their progress is known by evening.

Related to this is a cross-team synchronization meeting, which involves members of related teams. This bigger meeting could be held biweekly rather than daily. An example is a meeting between the API team and the administration portal team that makes use of the API.

The standup meeting should also be used to raise any crucial discussion; it is very, very important to inform the members of every relevant team if something is going to affect them. For instance, deprecating a column in a database that is shared by multiple applications requires the deprecator to discuss the issue with the other teams involved.

Hunt the older bugs

Your systems may already be buggy and have accumulated technical debt. Perhaps it is slowing development down, or customers are complaining. Consider dedicating a whole sprint, or a few weeks, to identifying the parts that make up the mission-critical features, and then learning how to debug them and remove the bugs for good.

After you have identified the features that must not be buggy, jot them down on a to-do list and assign each item a priority. Use the sprint to work through the list one piece at a time. If your team has enough developers, it may be good to divide and conquer: some do the chores (the debugging), and the others work on normal stories.

We used to get calls from the finance department about “bugs” in a feature that were actually caused by their own mistakes, such as wrongly clicking the cancel button instead of the done button after money had been transferred to the merchant’s account. To undo such a click, we needed to perform multiple stressful manual changes to the data, which ultimately sapped our energy. However, after dedicating two good sprints to debugging, we have never had to do that nasty manual work again! Nice! We’ve forgotten how stressful that was. We became happier, for sure.

Autonomy

One last thing here. Forcing a developer to work on a feature is probably the surest way to introduce a bug. A developer should be free to choose any story that interests them, whether it is a feature, research, or a bug fix.

A tool like Trello, Pivotal Tracker, Asana or even a good old whiteboard can be used to list all of the features to be developed in the current sprint, and then let anyone assign themselves to a story. As long as the story aligns with the company's current goals within the team’s context, there should be no need for a manager to assign work: let people self-assign.

PS. If you are not interested in working on any story at all, then, that's another ‘story’ for you, and HR will be (un)happy to meet and to see you.

But suppose that, given this autonomy, nobody wants to develop a certain feature; then the product owner should have a good discussion with the team to solve the problem. If still no one wants it, have people pair up for the story. I have never experienced a situation where no one wanted to pick up a story in the name of autonomy.

Pre-Deployment Time

Pull Requests and Code Review

After you feel that you have completed all the tasks for your story, you then submit a merge request or pull request. Pull requests are even more crucial to a company with multiple, interconnected teams.

Once the pull request is in place, the status of your story becomes “In Code Review.” You’re asking your teammates to review your code. When everything looks good, it gets merged; after that, if the story is testable, the product owner will manually test it on the staging environment.

Code review shouldn’t be part of the definition of done, and neither should testing. Code review and testing should be second nature to software engineers. No story can be deployed without code review and testing! So writing them into the definition of done is redundant.

Reviewers should not only be concerned with minor mistakes such as misnamed variables or missing spaces. Reviewers should focus on logic flaws, how to improve the current code, performance, maintainability, security aspects, and so on.

If you did pair programming for the story, then code review should be done by those who did not pair with you. But bear in mind that when you do pair, logic flaws in your code are especially embarrassing: two brains still produced buggy code.

Deployment Time

Deploy small, deploy often

Defragment your team’s deployment schedule. Rather than deploying a panoply of features biweekly, consider deploying a smaller chunk each week, or even more frequently if you can.

Deploying a big chunk of stories at once will make it more time-consuming to do user-acceptance testing, and users tend to feel like “ok, ok this feature is good” without thoroughly testing it for themselves.

Also, when there is some bug in that fat deployment package, it is very hard to make an educated guess regarding the root cause.

Test on staging. Test on public.

It is arguably better and safer not to deploy straight into production. You should first perform UAT on a staging environment. If you find a buggy feature while testing on staging, exclude it from the deployment package.

After all is fine, do a dry run on a production machine. A dry run lets new features run in the production environment without affecting a lot of users. The network engineer usually channels the system to a select machine. If the dry run fails, don’t deploy the package!

When you are confident that everything is working as expected, ask the network engineer to deploy it to all machines, to make it public.
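
The dry-run-then-go-public sequence can be sketched as a small function. This is only an illustration of the control flow, not the author's actual tooling: `staged_rollout`, `deploy` and `dry_run_passes` are hypothetical names, and the two callables stand in for whatever your operations team really uses to deploy to a machine and to check it.

```python
from typing import Callable, List, Sequence


def staged_rollout(machines: Sequence[str],
                   deploy: Callable[[str], None],
                   dry_run_passes: Callable[[str], bool]) -> List[str]:
    """Deploy to one machine first; go public only if the dry run passes.

    Returns the list of machines that received the package.
    """
    if not machines:
        return []

    first, rest = machines[0], machines[1:]
    deploy(first)                 # dry run on a single production machine
    deployed = [first]

    if not dry_run_passes(first):
        return deployed           # failed dry run: stop, do not go public

    for machine in rest:          # confident: deploy to all machines
        deploy(machine)
        deployed.append(machine)
    return deployed
```

In a real setup a failed dry run would also trigger a rollback of the first machine; that step is left out here to keep the sketch short.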

After it is deployed on all of the machines, invite an end user for yet another round of testing. Yes, that is perhaps not doable in a lot of cases, and that’s okay. But if your users are internal to your company, then do a final round of testing with one of them.

After following all those rules we rarely find bugs in the production environment, but there do exist rare kinds of bugs that only occur in a production environment; we have witnessed that. So yes, last-minute testing is necessary.

Post-Deployment Time

Use a bug-tracking and monitoring system

Don't let your customer be the first to tell you that there is a bug in your software. Nowadays we can easily implement our own bug-tracking system, for example by sending an email to the developers whenever an exception is raised.
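
A minimal sketch of such a home-grown notifier, using only Python's standard library, might look like the following. The addresses, the SMTP host and the function names are assumptions made for illustration; a production setup would also want rate limiting so that one recurring exception does not flood the inbox.

```python
import smtplib
import sys
import traceback
from email.message import EmailMessage

# Hypothetical values; replace them with your own.
DEVELOPERS = ["dev-team@example.com"]
SENDER = "app-errors@example.com"
SMTP_HOST = "localhost"


def build_error_email(exc_type, exc_value, exc_tb) -> EmailMessage:
    """Turn an exception into an email that carries the full traceback."""
    # Header values must not contain line breaks, so keep the first line only.
    first_line = (str(exc_value).splitlines() or [""])[0]
    message = EmailMessage()
    message["Subject"] = f"[production] {exc_type.__name__}: {first_line}"
    message["From"] = SENDER
    message["To"] = ", ".join(DEVELOPERS)
    message.set_content(
        "".join(traceback.format_exception(exc_type, exc_value, exc_tb)))
    return message


def notify_developers(exc_type, exc_value, exc_tb) -> None:
    """Exception hook: email the developers, then report as usual."""
    try:
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(build_error_email(exc_type, exc_value, exc_tb))
    except OSError:
        pass  # the notifier must never make the original failure worse
    sys.__excepthook__(exc_type, exc_value, exc_tb)


def install() -> None:
    """Route every uncaught exception through notify_developers."""
    sys.excepthook = notify_developers
```

Calling `install()` once at application start-up is enough for uncaught exceptions in the main thread; web frameworks usually offer their own hook for errors raised inside request handlers.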

Modern tools like AppSignal, Raygun, and New Relic are readily available, usually at a cost. A lot of good open source tools are also available, although you may need to wire them up yourselves: Munin, Telegraf, InfluxDB, and Grafana.

By using those tools, we get to know that a bug is present without the user needing to fill in a report form first.

Retrospective

By this time, most if not all (or possibly none) of the stories are in production. It is a good time now to have a quick retrospective. Discuss with your team, including the product owner, scrum master, and any other related parties, what has gone well and what may have gone wrong with the latest deployment, how to minimize errors in the future, and how to improve the team's way of working.

It can be useful to implement a system whereby developers can submit retro points for discussion while they are still sprinting. That way, during the retro, people don’t waste time thinking up retro points and can instead focus on discussing the submitted ones.

Sweet Candies

As a product owner, how about giving your team a treat: take them to a cafe or a restaurant, or give them a day off if, for example, there is not a single hotfix after three deployments in a row. It will help motivate them to write less buggy code. There is a mantra for this: happy tummy, happy engineers.

Stay classy and have an open culture

Bugs happen, that’s it, and bug discovery might go this way:

The business guy is calling you "hey yo, I found a bug."
"No, that's impossible. That cannot possibly happen on earth!"
That shouldn't happen.
Why does it happen?
Oh, I see.
How does it pass the test?
Sorry, I did not write that feature!

It is important to have an open culture at your company and your team.

What we can do is try to minimize the chance of errors, not remove them altogether. If you feel that other things could lessen the likelihood of a bug creeping into a system, and therefore help avoid hotfixes, please discuss them in the comments section.

I hope you can be like us: we have forgotten when we last had to do a hotfix deployment. In essence: fewer bug calls, and better sleep at night.

About the Author

Adam Pahlevi takes pleasure in producing readable and high performing code. Currently he works as a Software Engineer in an Indonesian/Japanese payment gateway company: Veritrans. He has published books and articles, and has been featured as a speaker at workshops and conferences. He speaks Ruby
