Agile is at a crossroad: Scale or fail?

Risk management is the hottest topic in IT. Processes for effective risk management and investment decision making will allow Agile techniques to scale beyond projects to the enterprise. Without them, Agile will be confined to the ghetto of development.

IT Risk Management is a very underappreciated subject despite the many references that can be found. In years to come we will look back on what we now consider to be “best practice” as quaint at best. The rise of Agile has brought risk management to the forefront of study as an area that needs to be improved. With the practice-based focus of Agile, huge breakthroughs will occur in IT Risk Management over the coming decade.

When considering risk management on IT projects, the only view that matters is the Business Investor’s[1] view. All other perspectives are subservient to that of the investor. The project team, IT management and users are only formed into a team in order to deliver a return for the investor. Nothing else matters from an IT perspective.

Without a proper understanding of this fact some forms of risk management may create issues. Minimizing risk for a part of the team most likely results in shifting the risk to another part of the team, also known as local optimization. An example of this is “change control” which protects the delivery team from the investor changing his mind but at the same time increases risk to the investor as he can no longer easily and quickly redirect his investment.

Types of risk

From the business investor’s perspective, there are three categories of risk[2].

  1. Delivery Risk. The risk that the software is NOT delivered on time, on budget, and to the required quality.
  2. Business Value Risk. The risk that the project does not deliver the value expected.
  3. Existing Business Model Risk. The risk that the project actually damages the existing organisation.

Agile and Lean Software Development Techniques address the first category of risk. Feature Injection, Lean Start Up techniques and Business Value are starting to address the second category of risk. Real Options[3] can be used to manage all three types of risk.

Most established IT Risk Management literature focuses on delivery risk. The change control example above shows that delivery risk is intertwined with the other two. All three must be considered in a holistic manner. While addressing each risk individually brings real benefits to a company, the true benefits are in a holistic strategy towards risk. An IT Risk Management System is necessary rather than a single tool that is considered in isolation. For the system to work, a number of groups will need to participate and share their understanding of each of their risks. It is important that the IT Risk Manager implements a system with feedback loops and supportive tools.

This article focuses first on the most important, yet most underestimated, risk category: the risk of damaging your existing business model. The other types of risk will be covered in subsequent posts.

Existing Business Model Risk

Existing business model risk is the risk that delivering on a commitment may cause damage to the existing business model. One of the most famous examples is Hoover’s free flight promotion[4]. Hoover announced a special offer of “Buy a washing machine and get a free flight to the USA” (from the UK). This promotion turned out to be so successful that it nearly destroyed the company. This article does not discuss business commitments of that kind. It focuses on similar risks that arise from commitments made in IT systems and that end up damaging the organization. In the most tragic documented cases, such failures have resulted in deaths, as at the London Ambulance Service in 1992[5].

One of the first things we learn from Real Options is to identify when we are making a commitment. If you’re about to enter into a commitment, consider whether that commitment can be reversed and, more importantly, how long it takes to reverse the commitment or fix it. In addition, consider how long the organisation can survive in a broken state. Given that it is impossible to know in advance how long it takes to fix an unexpected problem, the only consideration is how much time it takes to reverse the commitment.

Risk Assessment Questions

  1. Can the commitment be reversed?
  2. How long does it take to reverse the commitment?
  3. How long can the organization survive in a broken state?

If the time it takes to reverse out the commitment is less than the time the organisation can survive without the system, then you do not have a problem. If the time to reverse the commitment is greater than the time available, then you need to create some options.
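The three questions and the rule above can be written down as a small check. This is only a sketch: the `Commitment` class and the `needs_options` function are hypothetical names, and treating "exactly equal" as a problem is a cautious assumption, since the rule only covers "less than" and "greater than".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Commitment:
    """Hypothetical record of the answers to the three questions."""
    name: str
    # None means the commitment cannot be reversed at all (question 1).
    time_to_reverse: Optional[timedelta]
    # How long the organisation can survive in a broken state (question 3).
    survival_time: timedelta


def needs_options(commitment: Commitment) -> bool:
    """Return True when options must be created before committing."""
    if commitment.time_to_reverse is None:
        return True  # irreversible: there is no way back yet
    # Assumption: a tie is treated as a problem, to err on the safe side.
    return commitment.time_to_reverse >= commitment.survival_time


schema_change = Commitment(
    name="database schema change",
    time_to_reverse=timedelta(hours=4),
    survival_time=timedelta(hours=1),
)
print(needs_options(schema_change))  # True: create some options first
```

The value of writing it down is less the arithmetic than the fact that somebody has to supply both numbers before the release.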

Commitments versus reversible decisions

When implementing change into a system, consider whether the change is a commitment or whether it is a reversible decision. Let’s clarify this with some examples:

  1. Consider you are driving around an unfamiliar area. If you make a wrong turn:
    • a) It’s a reversible decision when you are on a two-way street. You can always turn around and retrace your steps.
    • b) It’s a commitment when you are going down a one-way street, as you can no longer retrace your steps.
  2. Consider you are lost in the woods and you come across a twelve foot drop. You can get down but there is no way back up. Players of video games like Tomb Raider will be familiar with this situation.
    • a) It’s a reversible decision when you can set up a rope/vine/branch to help you climb back up[6].
    • b) It’s a commitment when you jump down and have not prepared a way of going back up.
  3. Consider changing the structure of the database so that the old code will not work against it.
    • a) It’s a reversible decision if you have backed up your tables / database which can be restored in the event of a problem.
    • b) It’s a commitment if the structure upgrades are permanent and you are unable to create proper back-ups to restore.
  4. Consider you are about to pull your front door closed.
    • a) It’s a reversible decision when you have your keys with you.
    • b) It’s a commitment when you forgot your keys.

So the simple rule is: is this action reversible? If it is, no problem. If it is a commitment, make sure you have an effective path backwards. If your deployment means pointing a URL at one executable instead of another and the rollback is to point back to the original, your rollback is trivial and does not require much effort. If your deployment is more than that, you should plan the rollback carefully. It is when we plan the rollback that we identify those one way streets and twelve foot drops.
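The trivial case, where a deployment is nothing more than repointing a URL, might look like the sketch below. The `Router` class is a hypothetical stand-in for whatever actually maps the URL to an executable (a load balancer, a symlink, a DNS entry); it is not a real library.

```python
from typing import Optional


class Router:
    """Hypothetical stand-in for a URL that points at one executable."""

    def __init__(self, live: str) -> None:
        self.live = live
        self.previous: Optional[str] = None

    def deploy(self, new_target: str) -> None:
        # Remember where we came from: this is the option to go back.
        self.previous = self.live
        self.live = new_target

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.live, self.previous = self.previous, None


router = Router(live="app-v1")
router.deploy("app-v2")
router.rollback()
print(router.live)  # app-v1
```

As soon as a release also changes data, as in the database example, the rollback is no longer a single assignment and needs planning of its own.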

Rollback strategies

We create options to find our way back or scale the drop. Creating the option does not mean we have to use it. The cost of creating the option is the premium we pay for the ability to reverse our decision in case it is needed. Our primary concern when rolling back is time; cost is secondary. Risk Management is about time. Risk Management means creating (Real) options to help us do stuff that we do not intend to do but may HAVE to do. The time available to reverse your steps, to roll back, dictates your option strategy.

Consider the upgrade to the database. The time available to rollback before disaster happens may be one of the following:

  • A few milliseconds. You need more than one system running concurrently if you are going to avoid that missile hitting the French Embassy in Tripoli by mistake. In reality, you should never be in this situation.
  • A few seconds. The only way to achieve this would be to create a duplicate database that has new transactions replicated into it and then switch back to it.
  • A few minutes. Similar to a few seconds but you may be able to reapply the new transactions.
  • A few hours. As above, plus you could take a back up and restore the original database.
  • A few days. As above, plus you probably have enough time to fix the problem rather than rollback.

The roll-back strategy is contextual.
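One way to read the list above is as a lookup from the time available to the options that are still open. The sketch below does that. The function name and the exact cut-offs (one second, one minute, one hour, one day) are assumptions made for illustration, since the list only says "a few".

```python
from datetime import timedelta
from typing import List


def rollback_options(window: timedelta) -> List[str]:
    """Return the rollback options that fit in the time available.

    The thresholds are illustrative assumptions, not fixed rules.
    """
    if window < timedelta(seconds=1):
        # Milliseconds: no time to switch anything.
        return ["run more than one system concurrently"]
    options = ["switch back to a replicated duplicate database"]
    if window >= timedelta(minutes=1):
        options.append("reapply the new transactions")
    if window >= timedelta(hours=1):
        options.append("restore the original database from a backup")
    if window >= timedelta(days=1):
        options.append("fix the problem instead of rolling back")
    return options


for option in rollback_options(timedelta(hours=3)):
    print(option)
```

The longer the window, the longer the list, which is another way of saying that time is what buys you options.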

A couple of key points: Always prepare a roll back plan even though you will probably not need it. Prepare the roll back plan as part of the release planning exercise. At each stage of the release, ask “Is this a commitment? Is this a twelve foot drop?” If it is, then make sure you have the option to reverse the commitment. The time to ask is not after you have jumped down the twelve foot drop. The time is not even when you arrive at the drop because you probably won’t consider all the factors.

The time to think about roll back plans is before you start the release. At that time all is calm and you are not on the critical path for the release. Have you ever noticed that you always lock yourself out of your home when it is least convenient? That’s because we make more errors when we are time pressured and are forced to make decisions. The time to make decisions is when we are not time pressured. Make sure you prepare the rollback plan when you prepare the release plan. That way you will not need to make any decisions when implementing the rollback.

Timing a rollback

So now the question becomes: “how do we know when to roll back?” One of the biggest risks is that software is in a production environment and no one is aware of a serious issue. To mitigate this risk, the system is tested as soon as it is released to production to prove that it works correctly. The appropriate tests should be run. If the tests take too long, it is better to run a small subset of the tests (also known as a smoke test) to check at least that the core functionality is working correctly. If the smoke test fails, the early warning provides more time to fix the problem or roll back the commitment.
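A release step that runs a smoke test straight after deployment, and rolls back on the first failure, could be sketched as follows. The function name and the idea of passing `deploy`, `rollback` and the checks in as callables are assumptions for illustration. In practice the checks would call the real production system.

```python
from typing import Callable, Iterable


def release_with_smoke_test(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    smoke_checks: Iterable[Callable[[], bool]],
) -> bool:
    """Deploy, run the smoke checks, and roll back on the first failure."""
    deploy()
    for check in smoke_checks:
        try:
            passed = check()
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            rollback()
            return False
    return True


log = []
succeeded = release_with_smoke_test(
    deploy=lambda: log.append("deploy"),
    rollback=lambda: log.append("rollback"),
    smoke_checks=[lambda: True, lambda: False],  # second check fails
)
print(succeeded, log)  # False ['deploy', 'rollback']
```

Note that the decision to roll back is made in advance, when the checks are written, not under pressure during the release.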

A more sophisticated strategy that many leading edge companies now adopt is to phase the production roll out of changes. Release the change to a part of the production environment affecting a small sub-set of “customers” and only commit to a full roll out if it is successful. That rollout may be gradual, especially if performance is a key consideration and the overall performance is not known. The major web sites are past masters at this approach. They have even extended it to assess the preference of “customers”. [7]
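A common way to pick the small sub-set of “customers” is to hash a stable customer identifier into one of a hundred buckets and release the change to the lowest buckets first. The sketch below assumes that approach; the function name and the use of SHA-256 are illustrative choices, not something this article prescribes.

```python
import hashlib


def in_rollout(customer_id: str, percent: int) -> bool:
    """Return True if this customer should see the new release.

    The same customer always lands in the same bucket, so raising
    `percent` only ever adds customers and never removes them.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent


customers = [f"customer-{n}" for n in range(1000)]
exposed = sum(in_rollout(c, 5) for c in customers)
print(f"{exposed} of {len(customers)} customers see the change")
```

Rolling back is then a matter of setting the percentage back to zero, which keeps the decision reversible for as long as the rollout is partial.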

Preventing Damaging IT Systems

There are a couple of ways you can damage[8] your existing systems. The first is that an undetected bug or an unforeseen circumstance requires a rollback. The second is that the release was messed up and the software released to production is not the same as the software released to the test environment. In order to minimise the risk of releasing the wrong software to a production environment, the following policies should be strictly adhered to.

  1. Use the same process to release the software to testing and production environments. Ideally this process should be automated to reduce the risk of human error.
  2. Release the same software to test and production environments.
  3. Keep differences necessary for the different environments (environment variables, configuration) to a minimum in a minimum number of places.
  4. Focus with careful scrutiny on any differences between the testing and production environments as they are the riskiest aspects of the system.
  5. Keep the testing environment as close as possible to the production environment. Some technologies allow testing to be performed on the production environment with the test and production software coinciding... [9]
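Policies 2 to 4 lend themselves to automated checks. The sketch below, using hypothetical function names and made-up configuration keys, compares a checksum of the build released to each environment and lists every configuration difference so that it can be scrutinised.

```python
import hashlib
from typing import Dict, Tuple

MISSING = "<missing>"  # marker for a key that one environment lacks


def same_artifact(test_build: bytes, prod_build: bytes) -> bool:
    """True when both environments received byte-identical software."""
    test_digest = hashlib.sha256(test_build).hexdigest()
    prod_digest = hashlib.sha256(prod_build).hexdigest()
    return test_digest == prod_digest


def config_differences(
    test_cfg: Dict[str, str], prod_cfg: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {key: (test value, production value)} for every difference."""
    differences = {}
    for key in sorted(set(test_cfg) | set(prod_cfg)):
        test_value = test_cfg.get(key, MISSING)
        prod_value = prod_cfg.get(key, MISSING)
        if test_value != prod_value:
            differences[key] = (test_value, prod_value)
    return differences


# Illustrative values only.
test_cfg = {"DB_HOST": "test-db", "PORT": "8080"}
prod_cfg = {"DB_HOST": "prod-db", "PORT": "8080", "CACHE": "on"}
for key, (test_value, prod_value) in config_differences(test_cfg, prod_cfg).items():
    print(f"{key}: test={test_value} production={prod_value}")
```

The shorter that printed list, the fewer places there are for a release to differ between testing and production.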

Be aware that the risk associated with a release into production is certainly not constant. The risk to the business changes depending on the current context. As such, the business investor and other stakeholders must always be involved in the choice of when a release takes place.

Release timing

When the timing is critical for the investor and other stakeholders, failure is more expensive to them and the time available to reverse the change is significantly reduced. A non-exhaustive list of times when the risk of a failure for the investor is increased:

  • At month end/year end for financial businesses.
  • Before Christmas/Eid for a retail or holiday company.
  • Before or during peak times.
  • At a time when key staff are unavailable.
  • When other significant change is going on in the organisation.
  • When the organisation is under increased scrutiny of some kind.
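Some of these conditions can be checked mechanically before a release date is proposed. The sketch below covers three of them. The function name, the three-day month-end window and the example peak period are all assumptions for illustration, and the output is input to the business decision, not a replacement for it.

```python
import calendar
from datetime import date
from typing import List, Tuple


def release_risks(
    day: date,
    peak_periods: List[Tuple[date, date]],
    key_staff_available: bool,
    month_end_days: int = 3,
) -> List[str]:
    """List the known timing risks that apply to a proposed release day."""
    risks = []
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    if day.day > days_in_month - month_end_days:
        risks.append("month end")
    if any(start <= day <= end for start, end in peak_periods):
        risks.append("peak period")
    if not key_staff_available:
        risks.append("key staff unavailable")
    return risks


# Hypothetical peak period for a retailer.
peak = [(date(2024, 12, 1), date(2024, 12, 24))]
print(release_risks(date(2024, 12, 10), peak, key_staff_available=True))
```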

Releasing is an involved business decision

To mitigate this risk of releasing at the wrong time, a process should be established to ensure that all affected stakeholders are notified and “sign-off” on the commitment. Ideally this should be an automated process with all affected parties identified in advance. It is useful to audit this function to ensure that all affected parties are engaged and that the sign-off process has not degraded. Obviously the sign-off requirements will be more extensive for a complex enterprise environment where there are many inter-related systems. A simple independent system may only require a nod from the business investor. Effective communication is obviously crucial, and any breakdown of communication between individuals or groups creates a risk to the organisation.

The decision to release software is a business decision that weighs the benefit of the release against the risk of the release. The time to decide who has sign-off authority is at the very start of the development process when the business investment is made. The sign-off representatives need to understand the breadth of their responsibility and as a result the number of individuals that they need to engage. Sign off should ALWAYS be based on agreed conditions. It should never rely on gut feel or be a result of peer pressure. Therefore sign off should be based on a role rather than an individual. To ensure effective communication of this responsibility, agree on the roles and assign them at the time that the business investor commits to the investment. The business investor needs to ensure all affected businesses are represented.
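Sign-off by role and against agreed conditions can be made explicit in a few lines. In this sketch the role names and conditions are invented examples, and `release_approved` is a hypothetical function: the release goes ahead only when every required role has signed off and every agreed condition is met.

```python
from typing import Dict, Set


def release_approved(
    required_roles: Set[str],
    signed_off: Set[str],
    conditions: Dict[str, bool],
) -> bool:
    """Approve only on roles and agreed conditions, never on gut feel."""
    all_roles_signed = required_roles <= signed_off  # subset test
    # Note: all() is True for an empty dict, so agree on conditions first.
    all_conditions_met = all(conditions.values())
    return all_roles_signed and all_conditions_met


# Illustrative roles and conditions, agreed when the investment was made.
required = {"business investor", "operations", "finance"}
signed = {"business investor", "operations"}
conditions = {"smoke test passed": True, "rollback plan rehearsed": True}
print(release_approved(required, signed, conditions))  # False: finance missing
```

Because the check names roles rather than people, it keeps working when individuals change jobs or are on holiday.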

Unplanned changes and risks

Not all changes to the production environment are planned. Many years ago I remember my father telling me that a travel company close to where I had grown up had had a fire at its computer centre. The company did not have a back-up site and as a result six months later they had gone bankrupt. This was thirty years ago, before computers were ubiquitous and before “Business Continuity Planning” and “Disaster Recovery” became part of the standard infrastructure of organisations.

Disasters can come in many forms. The questions you need to ask are the same two questions that you need to consider for rolling back a bad release. “How long will it take me to activate my back up plan?” and “how long can my business survive without my systems?” As with the roll back, the answer will dictate the disaster recovery solution needed.

Most organisations do not trust the power supply and have a UPS system (back-up generators and batteries). In London, where there is continual building work and a none-too-well-mapped power supply, we are subject to power outages caused by builders with power drills and mechanical diggers. In the North East of the United States, they are all too familiar with the value of a reliable power grid and the cost of not investing in one. In those instances, it is necessary to have a UPS system that can survive until the main power source is back. The dependency on power means that many data centres are now located where power grids overlap. When it comes to power, it would seem that “real options” are important.

With risk, the issue is always time. I remember following the aftermath of September 11. My company’s offices were located near to ground zero. An infrastructure team needed to go into the offices and fill up the generator which only had four days of fuel. No one imagined the power would be out for that long.

Disaster Recovery and Business Continuity are expensive, which is why it is necessary to consider the criticality of systems in terms of time to recovery. Normally, the faster the recovery, the higher the cost. Once again, the response time versus cost decision is a business investment decision.

Summary

In summary, the risk of damaging the existing business is one of the key categories of risk to be mitigated. The recurring theme is that the organisation needs to be able to react within the time available before the business is damaged irreparably. Before making any commitments, the key question should be “Do I have the option to turn this commitment into an option?”

A key skill for an IT Risk Manager is to be able to identify commitments, and then facilitate the creation of the options that are needed to reverse them. An IT Risk Manager turns commitments into options using real options. Before you climb down a twelve foot drop, create the option to climb up again.

About the Authors

Chris Matts is a business analyst and project manager who builds trading and risk management systems for investment banks. His approach to IT risk management is based on what he has learnt from investment banking risk management. Chris tweets at @papachrismatts. He blogs here and here.

 

 

Olav Maassen works at Xebia as an agile consultant. He is interested in new developments and new ideas that can help others improve themselves. He knows how to inspire people to optimize their capacities and skills so that they can get the best results for both themselves and the company. Olav tweets at @olavmaassen.

 


[1] The business investor is the “individual” who is paying for the software.

[2] “Everything you know about IT Risk Management is wrong” by Steve Freeman & Chris Matts. Agile Times Vol5 - The article gives famous examples of organisations that have suffered from all three types of risk.

[3] Real Options are a simple set of rules based on Financial Option Mathematics and Applied Psychology. The rules are: “Options have value”, “Options Expire” and “Never commit early unless you know why”. The Real Options mentioned here should not be confused with the Real Option approach that directly and incorrectly applies the Black-Scholes-Merton option valuation framework outside of financial markets.

[4] Hoover free flights promotion

[5] Case study: The London Ambulance Service Despatching System

[6] You’ll be surprised by the options that you have in such a situation... but only if you take the time and open your filters of perception. When people feel stressed, they have a tendency to propose solutions before all decision alternatives have been considered (page 6). Your filters of perception will narrow and you will only be able to see the thing you seek. To see what you seek, you need to focus on the goal and not the path.

[7] A-B testing will be covered in the upcoming Business Risk article.

[8] It does not have to be intentional. Incompetence and bad luck are often more dangerous to a system than evil intent.

[9] This creates its own risk that the testing interferes with production environment.
