12 Places to Intervene - Rethink FinOps Using a Systems Thinking Lens

Key Takeaways

  • FinOps, aka Cloud Financial Management, is a socio-technical system that needs a Systems Thinking approach to address the problem holistically with lasting impact.
  • Your interventions (i.e., FinOps initiatives) to address the cost overrun problems need to be prioritized based on their effectiveness.
  • Numeric targets that are set to control stock-flows are the least effective intervention in affecting the behavior of the system.
  • The flow of information, such as cloud bills with team-level split-up, plays an influential role in improving the behavior of teams involved.
  • The goal is a much higher intervention than the stock-flows, feedback loops, or self-organization because a wrong goal will lead to a very different outcome even when other interventions are in place.

Systems Thinking provides ways to extract simple but useful models of complex things, but I think only a few people think that way. Could be nature or nurture, but we don’t try to teach very often.

- Adrian Cockcroft’s tweet on 11 Sep 2018.

A recent study on FinOps by McKinsey reveals that 69% of the organizations prioritize tactical initiatives over higher-impact strategic initiatives.

One reason could be that most of those organizations are either unaware of the strategic initiatives or unaware of their importance.

So, as a follow-up to my article on Medium (which explores cloud cost management problems from a socio-technical perspective), I will analyze here, through a Systems Thinking lens, how to prioritize the efforts to address those problems.

Why Systems Thinking for Cloud Financial Management?

Any organization, or a part of it, can be considered a socio-technical system because any organization employs people with a certain skill set, who work to achieve set goals, follow laid down processes, use particular technology, operate on a foundational infrastructure, and share certain cultural norms.
Thus, Cloud Financial Management is a socio-technical system that needs to be analyzed from a social and technical angle via Systems Thinking. But before diving in, let me introduce some terminology of Systems Thinking.

Cloud Financial Management via Systems Thinking lens

  1. Stock - accumulation of information or material over a period of time. Here, it is the cost of the cloud.
  2. Flow - stocks change over time because of the actions of a flow. While an inflow adds to the stock, an outflow subtracts from it. Here, it is the usage and optimization of cloud resources.
  3. Feedback Loop - whenever stock changes, the flow changes over time. While the Balancing Feedback Loop stabilizes the stock level, the Reinforcing Feedback Loop strengthens the change.
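To make the stock-and-flow vocabulary concrete, here is a minimal sketch in Python. It assumes the stock is a monthly cloud cost figure, that new usage is the inflow, and that optimization is the outflow. The function name and all the numbers are hypothetical and only illustrate the bookkeeping.

```python
def simulate_stock(initial_stock, inflows, outflows):
    """Return the stock level at the end of each period.

    Each period, the stock grows by that period's inflow
    and shrinks by that period's outflow.
    """
    levels = []
    stock = initial_stock
    for inflow, outflow in zip(inflows, outflows):
        stock = stock + inflow - outflow
        levels.append(stock)
    return levels


# Hypothetical figures: cost added by new usage (inflow)
# and cost removed by optimization work (outflow), per month.
monthly_cost = simulate_stock(
    initial_stock=1000,
    inflows=[200, 200, 200],
    outflows=[50, 100, 300],
)
print(monthly_cost)  # [1150, 1250, 1150]
```

The stock only falls in the third month, when the outflow finally exceeds the inflow.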

Let’s agree that "Cloud Financial Management" is a socio-technical system. But how do we bring in a change?

When we analyze a problem, we often try to look for certain points or places where we can focus our efforts to gain maximum leverage on the system to achieve our desired outcome. For example, if you are in a situation to lift a motorbike that’s fallen down, you don’t pull up from every part that you can hold on to. Instead, you find the best part and a way to hold (based on the bike design and your strengths) to pull up with the least effort and damage.

Similarly, for complex socio-technical systems, Donella Meadows, in her book Thinking in Systems, proposes 12 places where you can "intervene" to achieve maximum impact. These are known as leverage points or points of intervention.

So, what is an intervention with respect to a system?

A system intervention is a deliberate effort to change or improve a system’s behavior, processes, or outcomes. It involves identifying problems and implementing changes to improve the overall functioning of the system.

I will explore each of the 12 interventions and elaborate on how each intervention correlates with cloud financial management along with corresponding measures to optimize the spend. 

The System Iceberg - seen and unseen interventions

Donella Meadows introduces the 12 leverage points in increasing order of leverage (from lowest to highest). We can classify these 12 points into four groups, namely, Parameters, Feedbacks, Social Design, and Mental Models, based on their area of impact.

While Parameters classification is concerned with relatively tangible parts of the system, Feedbacks are about the system’s internal dynamics. And Social Design relates to the social structure within the system. The last one, Mental Models, refers to the values, goals, and mindset of the people involved.

12 Leverage points (from least to highest)

Parameters

12. Constants and Parameters

Most organizations operate on the assumption that setting targets is sufficient. But the reality is that setting targets is the least effective place to intervene. And in some situations, setting targets can even become counterproductive. For example, the number of resources, the duration of usage, and the price of resources are a few of the parameters that determine the cost of the cloud system. Setting targets on those parameters is a leverage point.

Optimization measures in this category are enforcing

  1. 100% cost tagging of resources.
  2. 0% of idle/zombie resources.
  3. 0% test environments during non-business hours.
  4. 100% cost budgets defined.
  5. 80% utilization for all resources.
  6. 100% reserved instances for production.
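A parameter target such as "100% cost tagging" is only useful if it can be measured. The sketch below shows one way to compute tagging coverage. It assumes a hypothetical in-memory inventory where each resource is a dictionary with a "tags" mapping. A real inventory would come from your cloud provider or FinOps tool, and the tag names here are made up.

```python
def tagging_coverage(resources, required_tags):
    """Return the percentage of resources that carry every required tag."""
    if not resources:
        return 100.0  # nothing to tag, so the target is trivially met
    fully_tagged = sum(
        1
        for resource in resources
        if all(tag in resource.get("tags", {}) for tag in required_tags)
    )
    return 100.0 * fully_tagged / len(resources)


# Hypothetical inventory; the tag names are assumptions.
inventory = [
    {"id": "vm-1", "tags": {"team": "checkout", "cost-center": "cc-10"}},
    {"id": "vm-2", "tags": {"team": "search", "cost-center": "cc-20"}},
    {"id": "db-1", "tags": {"team": "checkout", "cost-center": "cc-10"}},
    {"id": "vm-3", "tags": {"team": "search"}},  # missing cost-center
]
print(tagging_coverage(inventory, ["team", "cost-center"]))  # 75.0
```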

Parameters usually are not an efficient leverage point that can bring change in the system’s behavior unless they contribute to pushing some other higher leverage points like goals. We’ll cover more of that in the goals section.

11. Buffers

Buffers are all about increasing the capacity of the system to handle changes in the in/out flows.

For example, improving the skill level and providing dedicated time or additional staff can increase the buffer capacity of the team.

Optimization measures that can increase the capacity of the buffers are

  1. Introducing additional staff to work on cost optimization initiatives.
  2. Introducing incentives to teams that optimize cost.
  3. Allotting dedicated time (like optimize Fridays) for the team to work on cost optimization.
  4. Inculcating forecasting skills.
  5. Introducing a FinOps expert to consult product teams.

Most buffers take time to create and aren’t easy to change, so this leverage point sits low on the list.

10. Stock-and-Flow Structures

The structure of the stock-and-flow in a system plays a vital role in how the system operates. This leverage point refers to building or modifying the structure of a system (like infrastructure, products, or processes) to lower the effort of acting on the problem.
For example, the structure of teams that manage the cloud, the structure of the cloud, and the structure of accountability among the teams come under this category.

Measures that fall under this leverage point are

  1. Establishing account strategy (aligning to business) to set accountability and ownership.
  2. Introducing a FinOps tool to access cost information, idle/under-utilized resources information, automation, etc.
  3. Performing Capacity/Cost planning and assessment.
  4. Performing Cost Forecasts based on historical data insights.

The structure of stock-and-flows is rarely quick or simple to change, so this leverage point ends up near the bottom of the list.

Feedbacks

Feedback loops - the basic operating units of a system - Donella Meadows

9. Delays

Delays refer to the length of time it takes to correct the system relative to the rate at which the system changes. They play a vital role in the behavior of the system: delays make the actual state oscillate around the target that you want to achieve.

For example, if the cost overrun information is available to your team only at the end of the month, then optimization actions will happen only in a monthly cycle. Similarly, the response to the feedback information should also be timely.
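The cost of a delay can be estimated with simple arithmetic. The following sketch is a simplification with hypothetical numbers. It assumes a constant daily overrun, a review at the end of every fixed interval, and that an overrun starting on a review day is only caught at the next review.

```python
def overspend_until_next_review(daily_overrun, review_interval_days, overrun_start_day):
    """Return the overspend accumulated before the next review catches it.

    Reviews are assumed to happen on days that are multiples of
    review_interval_days (day 30, 60, ... for a monthly review).
    """
    days_into_interval = overrun_start_day % review_interval_days
    days_until_review = review_interval_days - days_into_interval
    return daily_overrun * days_until_review


# Hypothetical overrun of 100 per day that starts on day 3.
print(overspend_until_next_review(100, 30, 3))  # 2700 with a monthly review
print(overspend_until_next_review(100, 1, 3))   # 100 with a daily review
```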

Measures that will reduce delays in the system are

  1. Define cost thresholds and alerts so that the respective teams are alerted instantly when there is a breach.
  2. Set up resource utilization dashboards.
  3. Set up cost-tracking dashboards.
  4. Employ automation to reduce the response time to act on cost alerts.

If a system’s delays can be changed, changing them can have a large impact.

8. Balancing Feedback Loops

Balancing feedback loops are the ones that keep the system state within safe boundaries based on the difference between the actual and desired level of stock. Strengthening the balancing feedback loops is primarily improving the system’s self-correcting abilities.

Any balancing feedback loop consists of a goal, an observer to check on deviations from the goal, and a response act.
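The three parts of a balancing loop map naturally onto a few lines of code. This is a minimal sketch with hypothetical numbers, not a real cost-control mechanism. The budget is the goal, the gap calculation is the observer, and the response act removes a fixed fraction of the overrun on every pass. The correction_strength parameter is an assumption made for illustration.

```python
def balancing_step(actual_spend, budget, correction_strength=0.5):
    """Run one pass of a balancing loop and return the corrected spend."""
    gap = actual_spend - budget  # observer: deviation from the goal
    if gap <= 0:
        return actual_spend  # within budget, so no response is needed
    # response act: remove a fraction of the overrun
    return actual_spend - correction_strength * gap


spend = 1600.0
for _ in range(3):
    spend = balancing_step(spend, budget=1000.0)
print(spend)  # 1075.0
```

Each pass halves the remaining overrun (600, 300, 150, 75), so the spend converges toward the budget instead of overshooting it.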

Here are some measures that can strengthen the system’s self-correcting abilities:

  1. Automate housekeeping of orphaned or untagged resources.
  2. Automate cost anomaly detection and its response act.
  3. Set up Cost Budget alerts.
  4. Automate enforcing tagging policy.
  5. Automate rate optimization (reserved, savings plans, spot instance).
  6. Automate spend forecasting.
  7. Automate right-sizing recommendations, instance purchase recommendations.
  8. Define scheduling policies and auto-scaling policies.
  9. Automate cloud spend dashboard and spend reporting.

A FinOps tool can enable most of the automation mentioned above.

7. Reinforcing Feedback Loops

Reinforcing feedback loops are the ones that keep the system growing or collapsing. The more it works, the more it grows to work even more. There are two kinds of reinforcing loops - Vicious cycles, and Virtuous cycles.

In Cloud, a Vicious cycle occurs during a DDoS (Distributed Denial-of-Service) attack, where the attack traffic triggers auto-provisioning of resources which invites more attack traffic that in turn restarts the cycle as more resources are provisioned. This exponential resource usage ends up in a huge bill. Even a performance testing script mistake can cause such a situation.

Steps that can limit the growth of these reinforcing loops are

  1. Set up quota limits (the maximum number) for resources that can be provisioned.
  2. Identify and protect resources from DDoS attacks.
  3. Set up reasonable limits in auto-scaling policies.
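The effect of a quota or an auto-scaling limit on a reinforcing loop can be shown with a toy simulation. It assumes, purely for illustration, that the instance count multiplies by a fixed factor in every period. Real auto-scalers behave differently, and the function and its parameters are hypothetical.

```python
def scale_out(instances, growth_factor, periods, max_instances=None):
    """Return the instance count after each period of runaway scaling.

    When max_instances is set, it acts as a quota that caps the growth.
    """
    history = []
    for _ in range(periods):
        instances = instances * growth_factor
        if max_instances is not None:
            instances = min(instances, max_instances)
        history.append(instances)
    return history


print(scale_out(2, 2, 5))                    # [4, 8, 16, 32, 64]
print(scale_out(2, 2, 5, max_instances=10))  # [4, 8, 10, 10, 10]
```

Without a limit, the count (and the bill) keeps doubling. With a limit, the loop stops growing once it reaches the cap.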

Implementing effective cloud financial management can trigger a Virtuous cycle where cost savings lead to increased investment in cloud services, which increases their overall value, leading to more motivation to optimize spending and realize even more cost savings.

Any system that has uninterrupted growth will self-destruct. A better way to stabilize the system is to weaken the reinforcing loops.

Social Design

6. Information Flows

Delivering feedback information to the right set of people can produce outcomes that are very different from those of adjusting parameters or strengthening/weakening an existing feedback loop. This is about cascading feedback information to those who can act on it immediately and appropriately.

Here are some of the measures you can take to improve the structure of the information flow:

  1. To start with, split the cost at the individual team level and send the cost report to respective teams at regular intervals. This will increase the cost visibility and create accountability among the teams.
  2. Later, enable access to the cloud spend dashboard to individual members of the product teams.
  3. Set up an operating rhythm between finance, applications/operations, and business teams to improve collaboration and establish expectations.
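The first measure above, a team-level split of the bill, can be sketched as a simple aggregation. The example assumes hypothetical billing line items that carry a "team" attribute (for instance, derived from cost tags). Items without a team are grouped under an "unallocated" label so that they stay visible.

```python
from collections import defaultdict


def cost_by_team(line_items, untagged_label="unallocated"):
    """Sum the cost of billing line items per owning team."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("team") or untagged_label
        totals[team] += item["cost"]
    return dict(totals)


# Hypothetical line items; the field names are assumptions.
bill = [
    {"team": "checkout", "cost": 120.0},
    {"team": "search", "cost": 80.0},
    {"team": "checkout", "cost": 30.0},
    {"cost": 10.0},  # no team tag
]
print(cost_by_team(bill))
# {'checkout': 150.0, 'search': 80.0, 'unallocated': 10.0}
```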

Since this intervention creates accountability by providing the missing feedback information, Donella observes that this intervention is always popular with the masses rather than with the powerful.

5. Rules

Rules form a high leverage point. Standards, guidelines, and policies can be referred to as rules here. For example, the Architecture Board must sign off on the design and architecture before coding starts.

A few of the measures that you can employ to make use of this leverage point are

  1. Define a clear account and tagging strategy.
  2. Define a hosting policy for choosing between many hosting models like IaaS, PaaS and SaaS.
  3. Establish a FinOps CoE to frame standards and guidelines.
  4. Strategize to define cost budgets for each application.
  5. Define the procedure for showbacks or chargebacks.

4. Self-Organization

Self-organization is a powerful intervention that enables the system to evolve. It amounts to changing any aspect of the system that has less leverage than this, such as rules, physical structure, information flow, etc.

A few of the measures you can take are

  1. Form a FinOps CoE where product team members contribute to the standards and guidelines.
  2. Empower product teams to decide on their optimization initiatives with consultation (rather than audit) from FinOps CoE.
  3. Empower product teams to define the cost budget for their product.

Framing those rules that help develop and maintain self-organization within the system is a powerful intervention. And the power to frame those rules should lie with the team.

Mental Models

3. Goals

Goals are much higher interventions than the stock-flows, feedback loops, or even self-organization because a wrong goal will lead to very different outcomes even when other aforesaid interventions are in place.

So, aligning the system with higher goals will lead to better outcomes.

One of the measures that can improve the purpose of the cost management system is to pursue improving the value of cloud consumption rather than controlling cloud spend.

Based on this higher goal, a change that can be introduced in the hosting strategy is choosing the hosting model based on the role played by the workload in the organization’s value chain. For example, a core business process application that is customer facing and is a differentiator shall be hosted on IaaS (even if it incurs more cost). But an application that belongs to a supportive business process in the value chain shall be hosted on PaaS/SaaS (incurring less cost).

Unit economics is a way to measure the value of cloud spend. For example, in a logistics organization, the cloud cost for processing one parcel can show the real cost of the parcel processing even if the number of parcels increases or decreases. Unit cost metrics like per-parcel cost can be leveraged as a north star for optimization initiatives towards improving the value of cloud consumption.
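The per-parcel metric is a simple ratio, and a small sketch shows why it is more informative than the raw bill. All figures below are hypothetical.

```python
def unit_cost(cloud_cost, units_processed):
    """Return the cloud cost per processed unit (for example, per parcel)."""
    if units_processed <= 0:
        raise ValueError("units_processed must be positive")
    return cloud_cost / units_processed


# Hypothetical months: the total bill rises, but so does the parcel volume.
january = unit_cost(cloud_cost=50_000, units_processed=1_000_000)
february = unit_cost(cloud_cost=60_000, units_processed=1_500_000)
print(january, february)  # 0.05 0.04
```

In this made-up case the bill grew by 20%, yet the cost per parcel fell. A spend-control goal would flag February as a problem, whereas a value-oriented goal would count it as an improvement.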

2. Paradigms - The mindset

Everything, from goals and rules to delays and parameters of a system, emerges out of a mindset, aka paradigm about that system. Thus, this leverage point, a shift in mindset, can initiate a change in the whole system that lies underneath.

Paradigm change is often viewed as hard to accomplish, but all it takes for mindset change is a moment of realization.

Cloud spend management needs a shift away from the mindset that prevailed while managing on-premises infrastructure.

Here are the measures you can take to bring in the shift in mindset

  1. Introduce a Cloud/FinOps expert/team (who has the new mindset) in a position of high visibility and power.
  2. Leaders should talk about the new ways of infrastructure management where cost is a fitness function.

And some of the misconceptions (mindset) that should be dealt with are

  1. The way of hosting/designing workloads is the same in Cloud as on-premises.
  2. Cloud is always cheaper than on-premises infrastructure.

1. Transcending Paradigms

Paradigms aren’t constant. So, you need to look beyond your current paradigm. In this case, Cloud is your current paradigm, but it’s not final.

You need to think beyond cloud spend optimization and treat it as infrastructure spend optimization. If you start on that note, you will explore other available options, like what Dropbox did back in 2016 (its present state): Dropbox shifted the majority of its workloads from the public cloud to co-location facilities, saving nearly $75M over two years, which contributed to gross margins increasing from 33% to 67%. For Dropbox, the public cloud was cheaper early on but costlier later on as the company grew.

Conclusion

One point to note is that as the effectiveness of the leverage points increases, so does the resistance to change, meaning, an initiative at a higher leverage point will face higher resistance from the system.

In this article, I have talked about where you can intervene to change the course of your Cloud Financial Management system and the effectiveness of each intervention point. But how to intervene is up to you because each system is unique and constantly evolving. All the best!
