
Talking about Sizing and Forecasting in Scrum

Aug 04, 2022 30 min read

Key Takeaways

Avoid story points, counting non-valuable product backlog items, counting unDone work as Done, use of averages
Consider historical reference items but beware of accidental complication
Try probabilistic forecasting based on counting valuable product backlog items to Done
Try #NoEstimates and “rolling wave forecasts” of valuable product backlog items to Done
For complex work, promote managing expectations about uncertainty over managing expectations about dates

The Scrum world has changed a lot in the last few years. Nowhere is that more true than in estimating; Scrum folks have struggled with the practice since the framework's origin. Other folks also struggle with it, but this article focuses on Scrum. Many don't see the value in estimating. The 2020 Scrum Guide replaced the word estimate with size or sizing. People often forget that estimates are, after all, just estimates.

Community opinion on the whole area of sizing and forecasting is fraught. My intent is not to add fuel to the fire but to add some thoughts to further the discussion. In my view, the Scrum world needs an article like this. It is a theoretical piece assuming the reader has a reasonable amount of Scrum knowledge, informed by practice. I hope you find it helpful.

If your forecasts are routinely correct, you're a freak of nature. Forecasting is rarely perfect due to the following:

Waiting time due to dependencies is a huge factor in how long work takes and is affected by many unpredictable events.
Even in straightforward work environments, people overestimate how efficiently their day will go.
Often, people doing complex work in the pursuit of speed leave work behind them that is untidy and potentially embarrassing (accidental complication). Even worse, people compound the untidiness of the work in follow-up work, which takes longer than intended whether or not the untidiness is removed. Some call it "broken window syndrome."
Complex work involves many unknown variables.
Lack of focus, e.g., "have you got 10 seconds" or "throwing in the kitchen sink" due to prolonged release cycles or over-planning (too much work) around the flow bottleneck.
Changing priorities, and the lack of updated sizing based on the changed priorities, e.g., after a problematic forecasting session where it's clear that the odds are against us, another priority is thrown into the mix.

It's often assumed that forecasts are requests to tell with some confidence, "when will it be done?" Often there is another question behind the question, such as:

How can I transfer worry to someone else?
What progress is being made?
What risks remain?
When will we get some return on this investment?
What trade-offs can we tolerate regarding which work can discover/deliver the potential value, e.g., the 80:20 rule?
What trade-offs can we tolerate in terms of reducing some or all of effectiveness, efficiency, and predictability, e.g., running some experiments?
What progress trade-offs can we tolerate in terms of required "dead work" to avoid execution bias, such as laboratory setup?
How much investment will go into acquiring skills, e.g., education or apprenticeship?

A meta-question of "what does winning the game mean?" is well worth considering. Is the team being given a game it can win? And if the team can win, what are the odds? Probabilistic forecasts, e.g., Monte Carlo simulations, can help.

Despite the hazards, sometimes teams still feel the need to forecast because people fear that stakeholders will make up arbitrarily fixed undoable dates in a vacuum. Sometimes teams want to attain a ballpark date range to get ahead of stakeholder expectations. Interestingly, most of us can accept a weather forecast that gets updated regularly based on the latest information, even if we know it's still inaccurate.

Monte Carlo simulations

Troy Magennis does an excellent job of explaining Monte Carlo simulations; simply put, they allow people to model a future based on data and assumptions. Probabilistic forecasts can tell you how much work you can get done, not whether the work at hand fits that forecast. Although probabilistic forecasts are an illusion for complex work, one can redo them regularly as new information comes to light, making them somewhat more reliable over time.

Monte Carlo "When" forecasts play out scenarios using 500 to 1 million random number generations for throughput (the delivery rate of valuable work items per time period, let's say per Sprint) within min/max limits per time period (real data or 90% confident guesstimates), combined with random number generations for the number of items within a range. Each forecast returns a range of dates with associated probabilities. For example, in the results shown below, after 10k simulations, a backlog of 50-75 items would be delivered on 22nd November or sooner at the 85th percentile.
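To make the mechanics concrete, here is a minimal Python sketch of a "When" forecast. It is an illustration under assumptions, not how TWiG or ActionableAgile work: the limits (3 to 7 items per Sprint, 50 to 75 items to deliver) are made up, draws within the limits are uniform, and the names `forecast_when` and `percentile` are hypothetical. It reports a number of Sprints; turning that into a date means counting Sprints forward from a start date.

```python
import random

def forecast_when(tp_min, tp_max, items_min, items_max, runs=10_000, seed=1):
    """Return the sorted number of Sprints each simulated run needed."""
    if tp_min < 0 or tp_max < 1:
        raise ValueError("throughput limits must allow at least 1 item per Sprint")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(runs):
        remaining = rng.randint(items_min, items_max)  # backlog size for this run
        sprints = 0
        while remaining > 0:
            remaining -= rng.randint(tp_min, tp_max)  # items Done in one simulated Sprint
            sprints += 1
        results.append(sprints)
    return sorted(results)

def percentile(sorted_values, pct):
    """Smallest Sprint count that at least pct% of the runs met or beat."""
    index = (len(sorted_values) * pct + 99) // 100 - 1  # ceil(n * pct / 100) - 1
    return sorted_values[max(index, 0)]

# Made-up inputs: 3 to 7 items Done per Sprint, 50 to 75 items to deliver.
outcomes = forecast_when(tp_min=3, tp_max=7, items_min=50, items_max=75)
for pct in (50, 85, 95):
    print(f"{pct}% of runs finished within {percentile(outcomes, pct)} Sprints")
```

Reading the 85% line rather than the 50% line is the whole point: the answer is a range with odds attached, not a single date.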

Screenshots of TWiG and ActionableAgile

Monte Carlo probabilistic "How Many" forecasts use random number generations for throughput within limits per time period to forecast how many items would be Done by given dates; the output is a range of dates, each with a number of items done by that date and a probability percentage. For example, below is the number of items that, at the 85th percentile, would be delivered by a given date or sooner.
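A "How Many" forecast flips the question: fix the time, simulate the count. The sketch below is again only an illustration under assumptions: uniform draws between made-up limits of 3 and 7 items per Sprint, a made-up horizon of six Sprints, and hypothetical names `forecast_how_many` and `items_at_confidence`.

```python
import random

def forecast_how_many(tp_min, tp_max, sprints, runs=10_000, seed=1):
    """Return the sorted total of items Done over `sprints` Sprints, one total per run."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return sorted(
        sum(rng.randint(tp_min, tp_max) for _ in range(sprints))
        for _ in range(runs)
    )

def items_at_confidence(sorted_totals, pct):
    """Item count that at least pct% of the simulated runs reached or exceeded."""
    index = len(sorted_totals) * (100 - pct) // 100
    return sorted_totals[min(index, len(sorted_totals) - 1)]

# Made-up inputs: 3 to 7 items Done per Sprint, looking six Sprints ahead.
totals = forecast_how_many(tp_min=3, tp_max=7, sprints=6)
for pct in (50, 85, 95):
    print(f"{pct}% of runs delivered {items_at_confidence(totals, pct)} items or more")
```

Note that higher confidence means a smaller number of items: the more certain you want to be, the less you can promise.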

Screenshots of TWiG and ActionableAgile

In either case, using the wrong limits results in a low-quality forecast. For example, let's say we used a seven-sided die of 1, 2, 3, 4, 5, 6, 7 when throughput has only ever been, and is only likely to be, 3, 4, 5, 6, or 7. We've allowed the possibility of 1 or 2 when that should likely be outside the limits. That said, 1, 2, 8, or 9 are still possible, just not probable. Prateek Singh goes into a lot more detail about how random the number generation would be using various models. Also, if throughput is erratic or mostly zero per time period, the quality of a Monte Carlo simulation reduces considerably. Forecasting, at its essence, is about risk management. It answers the question: how much risk is contained in our current plans? Lower quality forecasts also mean inadequate risk management.
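The effect of the die example can be seen by running the same simulation with both sets of limits. This is a rough sketch assuming a uniform die and a made-up backlog of 60 items; `sprints_to_finish` and `p85` are hypothetical helper names.

```python
import random

def sprints_to_finish(items, low, high, rng):
    """Count simulated Sprints until `items` are Done, rolling a uniform die each Sprint."""
    sprints = 0
    while items > 0:
        items -= rng.randint(low, high)  # low must be at least 1 here, so the loop ends
        sprints += 1
    return sprints

def p85(items, low, high, runs=10_000, seed=1):
    """Sprint count that 85% of simulated runs met or beat."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = sorted(sprints_to_finish(items, low, high, rng) for _ in range(runs))
    return results[(runs * 85 + 99) // 100 - 1]

too_wide = p85(60, 1, 7)   # die allows 1 and 2, which the team has never delivered
observed = p85(60, 3, 7)   # die matches what has actually been seen
print(f"Limits 1-7: 85% of runs finish within {too_wide} Sprints")
print(f"Limits 3-7: 85% of runs finish within {observed} Sprints")
```

Same team, same backlog, different answer; the only thing that changed was the limits fed into the model.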

Communities are not aligned on this approach. One project is only executed once. While probabilities may help inform decisions, the problem is that they don't make the decision any easier. Estimation is often used as a proxy for a decision (do we do this project or not?). In that regard, does probabilistic forecasting offer any answer, or does it only offer more questions? They are important questions, but the reasons for using estimates differ from probabilistic forecasting. I have seen many probabilistic forecasts based on guesstimates and a lack of history, yet they were not far off in the end. If you'd like to dig more into that, check out when I spoke with Troy Magennis about it.

Let's examine in more detail the options for sizing and forecasting to help inform the lesser evils in a Scrum context.

Sizing Techniques

When using Scrum, the Scrum Team self-manages, and the Developers carry out sizing because they are the people closest to the work.

The most popular sizing techniques are either based on data or educated guesses. When dealing with complexity, know that these techniques are almost always inaccurate.

I find that most people are poor at estimating. Still, people perceive value in the conversation because assumptions can surface while we come to a common understanding of the work and its challenges. Understanding 'the work item' and 'Estimating the work item' are separate activities that should not be conflated. My view is that estimates for individual product backlog items don't provide a solid foundation for forecasting the time it will take to complete the work.

The most popular sizing techniques using estimation for forecasting are as follows:

Relative estimation
- Time reference - Comparing current work items to the time it took to complete historical reference items.
- Assigning numeric values - Examples include using story points based on the Fibonacci sequence and often carried out collaboratively with playing cards (planning poker).
- T-shirt sizing - Assigning s, s/m, m, m/l, xl, xxl, xxxl, xxxxl to Product Backlog Items instead of a numeric value, sometimes simplified by leaving out every second size without losing scale: S=1, M=3, L=8, XXL=20 (rounded up), XXXXL=60 (rounded up), ?=100 (rounded up)
- Wall estimation - Assigning numeric values by collaboratively placing and moving cards on a wall, also referred to as magic estimation or silent estimation.
Flow metrics or counting items
- For a goal or time period - Identify/count (as best you can) a number or range of Product Backlog items to get to Done.
Right-sizing - Identify small enough items for intake — often using a flow-based approach — it's less about whether an item is small or medium. The practice involves figuring out if the team can complete a Product Backlog Item according to the Definition of Done comfortably within one Sprint.
#NoEstimates - one interpretation being …
- Strive for an even distribution of "ballpark" item sizes throughout a backlog.
- Count running tested stories or running tested features to demonstrate progress in output terms.
- Focus on simplifying the what for the why - a focus on desired outcomes.
- Right-sizing - Identify small enough items for intake.
- Slicing into 24-hour timeboxing of items encourages the creation of experiments that validate assumptions/hypotheses towards a goal, discover to deliver.
- Use rolling-wave forecasts to communicate uncertainty.

Some would argue that counting Product Backlog Items without throughput data is also an estimation, and I agree with that view. Counting based on the throughput of done items is forecasting.

Of the above sizing approaches, only wall estimation considers relative potential value directly. It uses value points for estimating value (not effort). Value is not absolute; it is relative. Customer value gets realized after a release to the customer, and it might be a lot higher/lower than we expect.

All of the above sizing options are devalued by:

Not having caveats associated with the start date, e.g., nine weeks from the date we start.
Not recognizing the amount of work in progress and the progress (or not) of that work.
The severity of impediments.
Not ordering items higher up the Product Backlog according to delivery risk.
A sub-optimal approach to handling dependencies.
Confusing outputs with outcomes; a customer/end-user outcome is a change in customer/end-user behavior.
Not engaging in discovery activities when the risk of not harvesting potential value is high, compounded by assuming that every item moves from discovery to delivery.
Delusions of accuracy and pursuing more accuracy.

Suboptimal trends include

Size per skill - typically caused by focus on resource efficiency over flow efficiency
Size inflation - typically caused by pressure for more "velocity." No prizes for guessing which size a team might use for a borderline case, out of higher or lower; they're likely to pick higher. In extreme cases, I refer to this as (sizing) bingo.
Not taking quality seriously, e.g., lack of conformity with the definition of done - typically caused by pressure for more "velocity." Make no mistake - quality via the Definition of Done commitment is a big deal.
Not taking the customer seriously, e.g., delivering outputs without measuring if they made a difference - typically caused by an excess of "thinking inside the building" or distance from the customer.
Size normalization across teams - typically caused by the assumption that people are replaceable machine parts. In my experience, given many situations with the same product backlog, the same reference items for comparison, and the same scale, teams have never come up with the same sizes.
Counting complete but fake product backlog items, items that don't deliver value, as throughput - typically caused by the use of work breakdown structures that lose sight of value
Not focusing, not finishing. Focus is a big deal in Scrum primarily via the Product Goal in the Product Backlog, the Sprint Goal in the Sprint Backlog, and the Scrum values. And how about focusing on finishing what we started, even if it means splitting or canceling work?
Delusions of predictability for work that is not comparable with work from the past
Lack of discovery to find the items we maybe should not build, some say 70%. If we run low-cost experiments, we might fall upon better ideas. For example, insofar as I can tell, when Bose discovered noise-canceling headphones, they were looking for something else: higher sound fidelity. Lack of discovery is typically caused by leaders committing to work on behalf of teams; the view becomes "there is no point in finding better ways, we have a deadline."

This article is not designed to help people sustain such trends but to shed light on better ways. If you want help getting better at estimation with story points, you're looking at the wrong article.

The main sizing options this article draws attention to are estimation, right-sizing, and #NoEstimates. None of those options is perfect. I've included some other options to consider, but regardless of the approach, the level of exactness is low.

About estimation:
- It comes in different flavors.
- If you estimate, the best thing that can happen is the estimates are correct. How long things take has little to do with the processing time - see an article by Troy Magennis on this topic.
- Estimates are prone to the "flaw of averages" (Sam Savage). Is 50:50 an excellent way to set expectations?
- The average of independent blind assessments can be near enough to the truth (Dave Snowden on Xagility podcast). Estimates are rarely blind, though.
- If you don't estimate at all, you don't waste time; hopefully, you will discover/deliver outcomes sooner.
About right-sizing:
- How much time could you save caring more about whether the team can complete an item within the Sprint and less about making that item infinitely smaller?
- Think of the reduced cognitive load on the Product Owner resulting from fewer PBIs.
- Counting the number of (valuable right-sized) PBIs delivered to Done per Sprint is valuable for Sprint Planning and forecasting goals.
- If throughput is sporadic or irregular, we have more significant problems than forecasting; we have a "plumbing problem."
- Using average throughput also pursues the "flaw of averages." Monte Carlo probabilistic forecasting is preferable.
About #NoEstimates, one interpretation is:
- Counting the number of (valuable right-sized) PBIs delivered to Done per Sprint is valuable for Sprint Planning and forecasting goals.
- If throughput is sporadic or irregular, we have more significant problems than forecasting; we have a "plumbing problem."
- "Rolling Wave Forecast" based on throughput with variance limits is preferable.
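The "flaw of averages" point in the lists above can be checked with a few lines of Python. This is a rough sketch on made-up data: the `history` values and the backlog of 60 items are assumptions, throughput is resampled from past Sprints (a different choice from drawing between min/max limits), and `share_finished_within` is a hypothetical helper.

```python
import math
import random

history = [3, 5, 4, 6, 7, 4, 5, 3]  # made-up items Done per Sprint
backlog = 60                         # made-up number of items to deliver

# The "average" answer: divide the backlog by mean throughput.
average = sum(history) / len(history)
average_based_sprints = math.ceil(backlog / average)

def share_finished_within(sprints, runs=10_000, seed=1):
    """Share of simulated runs that get the whole backlog Done within `sprints` Sprints."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(runs):
        done = sum(rng.choice(history) for _ in range(sprints))  # resample past Sprints
        if done >= backlog:
            hits += 1
    return hits / runs

share = share_finished_within(average_based_sprints)
print(f"Average-based answer: {average_based_sprints} Sprints")
print(f"Share of simulated runs that finish in time: {share:.0%}")
```

With data like this, the average-based answer tends to land not far from a coin flip, which is why a percentile from a probabilistic forecast is a better way to set expectations.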

Whether one uses estimation, right-sizing, or #NoEstimates, sometimes it's difficult to break down Product Backlog Items (PBIs) so that they're still valuable (and not just subtasks). There is a variety of guidance on how to split Product Backlog items. Example mapping is a powerful approach; the Scrum Team can pick just one example for this Sprint and get feedback. When evidence that we can harvest value is lacking, even if the potential value is high, it’s sensible to experiment to validate assumptions through hypotheses; #NoEstimates and Scrum with UX support this approach. Splitting can result in discovery or discover-to-deliver. For example, customer interviews and UX research might be triggered by splitting.

Potential upsides of some sizing approaches

Historical time reference	Assign numeric values, e.g., story points	T-shirt size to be Done	Wall estimation	Number/range of product backlog items to be Done (for a goal)	Right-sizing	#NoEstimates - one interpretation
Does not require that much Scrum education apart from Definition of Done and Product Backlog Refinement.	Developers value discussions on relative size, as the perception is it leads to a better understanding of the work.	Developers value discussions on relative size, as the perception is it leads to a better understanding of the work.	Developers value discussions on relative size, as the perception is it leads to a better understanding of the work.	Uses continual forecasting as a basis for continual negotiation of the budget and/or the number of items based on empirical value delivery	Simple. Developers assess if a Product Backlog Item fits comfortably within a Sprint or not, and if not, break it down at some point.	The point is that estimation is used to inform decisions. If the base information (estimation) is so wrong that it causes catastrophic failure in the decision-making process, then the technique must be replaced.
Usually includes waiting time.	Developers feel a better sense that they have discovered relative complexity/risk.	Developers feel a better sense that they have discovered relative complexity/risk.	Developers feel a better sense that they have discovered relative complexity/risk.	Developers can “ballpark” this with a range.	Less “analysis paralysis.” For scaling, a general rule in LeSS is 4 Product Backlog items per team per Sprint for the next 2-3 Sprints.	Focus on discovery/delivery, and split items as necessary.
Bas Vodde in the LeSS community advances that this method is less harmful than story points to Done (see point 40 in this article)	People say the conversation is important (although, perhaps there is a better trigger for such a conversation).	T-shirt sizes can be converted to story points if there is a T-shirt size to a numerical value scale.	Can use T-shirt sizes / numerical values.	Useful for sizing a chunk of the backlog, e.g., the Product Goal, the Sprint Goal.	Can possibly use for probabilistic forecasting for a selection of Product Backlog.	Small batch is still a goal so less likely to take on “elephants.”
Useful for “slice of cake” teams (can deliver value without depending on other teams, all layers of the cake are represented on the team).	A way to limit work in progress in a Sprint but there might be a better way.	People say the conversation is important (although, perhaps there is a better conversation).	Quick when using in combination with T-shirt sizes and T-shirt size to numerical value scale (can size most backlogs in under 1 hour).	Can use across multiple teams.	Product Backlog items are likely to be more valuable due to less item splitting.	Forecasting using data assumes/accepts uncertainty and imperfect information
Speaks in the customer’s language (i.e., time), or does it? (timely value?).	Time-consuming if using planning poker - see this article for more:	Requires very little detail for Developers to take a view of the relative size.	See this article by Mitch Lacey (Owner, Mitch Lacey & Associates, Inc.) that considers potential value and effort.	Can be used for probabilistic forecasting.	When used with probabilistic forecasting, speaks in the customer’s language about when it might be done (or does it?), as long as we state the caveat: “We’ll have a better forecast next week/Sprint/month.”	Uses continual forecasting as a basis for continual negotiation of the budget and/or the number of items based on empirical value delivery
				When used with probabilistic forecasting, speaks in the customer’s language (or does it?) about when it might be done, as long as we state the caveat: “We’ll have a better forecast next week/Sprint/month.”	Forecast using data assumes/accepts uncertainty and imperfect information.	Categorization - comparing the scale of current investment with similar previous investments helps to win over “the dressing room;” similar in terms of three to five characteristics, e.g., business domain, technology, item types, team, process, and client type, being careful not to overfit (deterministic rather than probabilistic mindset).
				Forecast using data assumes/accepts uncertainty and imperfect information.	Uses continual forecasting as a basis for continual negotiation of the budget and/or the number of items based on empirical value delivery	Throughput is based on running tested stories- no ambiguity.
						Independent stories slice the investment vertically ( you get a “slice of cake” rather than a “layer of cake”)
						Attempts to have a mixture of item sizes throughout the running tested stories can be trusted more as a metric of progress (outputs at least).
						Moves the focus to value; get creative and simplify the what to deliver value, then deliver regularly as Scrum expects.
						Encourages slicing of Product Backlog items into experiments that validate assumptions / hypotheses

Potential downsides of some sizing approaches

Historical time reference	Assign numeric values, e.g., “story points”	T-shirt size to be Done	Wall estimation	Number/range of product backlog items to be Done (for a goal)	Right-sizing	#NoEstimates - one interpretation
Requires a reference item of that size (and perhaps type) from the past. But a reference item might still be next to useless due to accidental complication, changed circumstances, etc.	Even the person who claims to have created story points, Ron Jeffries, regrets story points.	Only for the Scrum Team, not even for other Scrum Teams on the same product (but maybe it could be?).	Opens the possibility for size normalization across multiple teams on the same product - comparing teams is almost always a bad idea.	Prone to use of “epics” as containers for Product Backlog items rather than PBIs themselves.	The number of items in the Product Backlog is less useful for forecasting purposes as many could be “elephant-sized.”	Psychologically difficult for some stakeholders - people prefer to be wrong than uncertain
Prone to abuse by people with a “utilization” mindset. See.	Only for the Scrum Team, not even for other Scrum Teams on the same product.	Prone to size inflation or normalization across teams. Some people like to cap the sizes, unknowingly hiding effort.	Often one and done: It should be revisited frequently.	Counterintuitive: even if the usefulness of throughput is demonstrated, people are strongly biased towards relative sizing	Misunderstood that all PBIs need to be of equal size – if we’re not making widgets, if this is not manufacturing – we’re probably doing complex knowledge work and we expect items to be different from one another.	Most teams are “layer of cake” teams, and hence don’t produce “Running Tested Stories” unless they use coping strategies such as Nexus (although Nexus does not share Product Backlog Items across teams). But this is not so much a downside of #NoEstimates to be fair. In #NoEstimates INVEST stories tend to get sliced.
Would not be used for probabilistic forecasting, as it’s not output-based.	Prone to inflation or normalization across teams. Some people like to cap the sizes, thus hiding effort.	Prone to abuse by people with a “utilization” mindset. See.		Misunderstood that all PBIs need to be of equal size – for knowledge work we expect items to be different from one another.	Disconnect in lean/agile community whether PBI split rate is a useful input. See.
	Prone to abuse by people with a “utilization” mindset. See.	When converted to story points, could be used for probabilistic forecasting, but should you? In any case, if the Product Backlog is not fully sized, what limits would you use for backlog size (min/max)?		People push back because, in their minds, PBI types seem like mixing apples with oranges.	Probabilistic forecasts will be less useful if the team does not deliver any PBIs most days.
	Could be used for probabilistic forecasting, but should you? In any case, if the Product Backlog is not fully sized, what limits would you use for backlog size (min/max)? See.			Probabilistic forecasts will be less useful if the team does not deliver any PBIs most days
	Fibonacci is the trend – that might not be exponential enough.			Often misunderstand that we should use rolling/moving averages for long-term forecasting - heads or tails anyone?

Other considerations for sizing approaches

	Historical time reference	Assign numeric values, e.g., “story points”	T-shirt size to be Done	Wall estimation	Number/range of product backlog items to be Done (for a goal)	Right-sizing	#NoEstimates - one interpretation
Typically combined with	T-shirt size	T-shirt size, three-point method	Story points	Assign numerical values, T-shirt size	Monte Carlo probabilistic forecasting	#NoEstimates, number (or range of numbers) of PBIs to Done, monitoring of WIP, cycle time and work item aging	Number (or range of numbers) of product backlog items to Done, monitoring of WIP, cycle time and work item aging, reference class sizing, story mapping, impact mapping
Usefulness in complexity	Low	Low	Low to Medium	Low to Medium	Medium	Medium	Medium
Devalued by	Capping the size.	Converting to time for people utilization purposes, story points per skill, capping the size, and time-consuming approaches such as planning poker.	Converting to time for people utilization purposes, T-shirt size per skill, capping the size, and time-consuming approaches such as planning poker.	Capping the size.	Lack of focus on PBI aging, the delusion of accuracy in complexity. Also, even in a Scrum context, focusing on throughput while ignoring impeded/forgotten PBIs is a fool’s errand.	Stuffing items into a Sprint, definition of Done de-emphasized in the wrong hands, sub-optimal approach to handling dependencies.	Lack of discipline with product backlog item size being suitable for a Done increment within a Sprint. If communicating a forecast for a range of items using an interval/span of dates, stakeholders might just take the most optimistic end of the interval/span. Splitting of “stories” could happen to the extent that stories are not valuable in and of themselves. In both Scrum and Kanban, items should be valuable. The danger is the team delivers activity instead of outputs that trigger outcomes.
Worst thing that could happen (if done badly)	Delusion that what happened in the past dictates the future.	Story points per skill, adding too much contingency, activity measurement as opposed to output measurement for work that is not done.	T-shirt size per skill, adding too much contingency, activity measurement as opposed to output measurement for work that is not done.	Developers wall estimating in groups without cross-checking.	The delusion that items have to be the same size, the definition of ready complementary practice used as a gate.	The delusion that items have to be the same size, Size is not considered at all, and “elephants” get taken in the Sprint, definition of ready used as a gate.	Size is not considered at all and “elephants” get taken into the sprint

Forecasting

When forecasting effort, the most popular methods tend to be based on:

Past performance of the number/range of Product Backlog Items (PBIs) to Done, based on averages or a range – past performance does not predict the future, but it is acceptable for a short period such as a Sprint.

Screenshots of TWiG and ActionableAgile - This team is delivering stuff, but they deliver nothing to Done for many days.

Past performance of the number/range of numerical values or PBIs to Done
- Past performance does not predict the future, but it is usually acceptable for a short period, such as a Sprint.
- Beware of using arithmetic on numeric values, as complexity rises exponentially with Product Backlog Item size, and putting an "elephant" in a Sprint is problematic.
- Beware of the number of PBIs already in progress.
- Sometimes based on a burnup/burndown chart using "exact"/relative sizing of remaining work over time.

Burnup/burndown charts are based on averages and often misinterpreted.

Probabilistic forecast for when a number (or range) of items will be Done - with varying percentage probabilities for a range of dates; it can be used for short-, medium- and long-term forecasts.
- Consider using a parametric scale or random sampling to model based on previous work. For example, you might think: this investment feels like two other similar past investments, but we'll need to add 15 items.
- Consider also a number or range of items for pulling everything together as work doesn't always knit tightly and easily in the end – but isn't that risk well-managed through iterative, incremental, and continuous delivery anyway?

Screenshots of TWiG and ActionableAgile

Probabilistic forecast for how many PBIs can be completed by a specific time - with a percentage probability, a date, and a number (or range of numbers) of items by that date; it can be used for short-, medium- and long-term forecasts, but it's more useful for output orientation than goal orientation.

Screenshots of TWiG and ActionableAgile

A rolling wave forecast demonstrates the art of the possible and uncertainty at the same time

A statement something like, "We're using an empirical approach operating one Sprint at a time; the Sprint Goal is not even a guarantee; the real answer is we don't know, but let's start and learn quickly." You might not even use Now?, Next ??, Later ???.

The most offensive part of forecasting is that it often does not consider how many PBIs we're working on simultaneously. Alas, all too often I have also seen teams of teams bringing in more items than their usual capacity.

As mentioned, teams can use different approaches to size the effort to deliver a Sprint/Product Goal. The forecast will almost certainly be wrong, so they should convey uncertainty, reminding stakeholders that they're using an empirical approach, moving one Sprint at a time, and refreshing forecasts frequently. The Product Backlog is (hopefully) a living and breathing emerging artifact.
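Refreshing a forecast Sprint by Sprint can be sketched in a few lines of Python. This is one possible reading, not a prescribed method: all numbers are made up, throughput is resampled from past Sprints, and `sprints_needed` is a hypothetical helper.

```python
import random

def sprints_needed(remaining, history, runs=5_000, seed=1):
    """Return the (50th, 85th) percentile of Sprints needed to finish `remaining` items."""
    if max(history) < 1:
        raise ValueError("need at least one past Sprint with items Done")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(runs):
        left, sprints = remaining, 0
        while left > 0:
            left -= rng.choice(history)  # resample a past Sprint's throughput
            sprints += 1
        results.append(sprints)
    results.sort()
    p50 = results[(runs * 50 + 99) // 100 - 1]
    p85 = results[(runs * 85 + 99) // 100 - 1]
    return p50, p85

history = [3, 5, 4, 6]   # made-up items Done in past Sprints
remaining = 40           # made-up items left for the goal
forecasts = [(remaining, *sprints_needed(remaining, history))]

for done in [2, 6, 5]:   # made-up results of the next three Sprints
    history.append(done)  # learn from what actually happened
    remaining -= done
    forecasts.append((remaining, *sprints_needed(remaining, history)))

for left, likely, cautious in forecasts:
    print(f"{left} items left: about {likely} to {cautious} more Sprints")
```

Each pass uses the latest evidence, much like a weather forecast that is updated as conditions change, and each answer stays a range rather than a date.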

Some would say it's better not to provide a forecast at all. They would say, "Discover and deliver capabilities—review outcomes with the customers and end-users. Learn what can be learned. Act on what we have discovered. Don't manage expectations."

Takeaways

Avoid story points. If you must use story points, use them as a temporary crutch. The story points approach is a popular complementary approach for teams, but it is not part of Scrum. I am not a fan. When people use story points, I usually observe destructive behaviors. I see story point inflation, story point bingo, story point normalization, and story point velocity used as a proxy for performance measurement. For example, senior leaders look for more velocity leading team members to split items that are not Done to claim some kind of story point velocity. Teams are not wholly in control of how much work they can deliver. So, if a team is under pressure for more velocity, the chances of a medium item being defined as large or extra-large increase.
Avoid average throughput. The throughput approach is better, and while it's not perfect, it's a more accurate basis for forecasting when used with probabilistic forecasting and counting valuable items within throughput. Throughput lacks credibility without value or right-sizing. Though not part of Scrum and only optional in the Kanban Guide, for non-software, consider separate throughput for separate product backlog item types, e.g., packaging vs. communications. Using averages to forecast time/points/throughput is not a great way to manage one's career. "Boss, we have a 50/50 probability of on-time delivery; is that ok?"
Consider historical reference items. Historical reference items include "waiting time" and depend on remembering how long that took but don't consider accidental complication. Accidental complication is the untidiness of the work we leave behind and its knock-on effect on the quality of estimates; it explains why reference items still don't work well for time estimation.
Try throughput of valuable items as a basis for probabilistic forecasting. For non-software, consider separate throughput for separate product backlog item types, e.g., packaging vs. communications. It's still "smoke and mirrors" for complex work, so you'd best add the caveat that you'll provide a better forecast next week and the week after that once you learn more. Beware of probabilistic forecasting based on fake throughput (throughput of fake items - items lacking value), sporadic throughput, and particularly when most time periods selected have zero throughput—probabilistic forecasts based on erratic throughput lower the quality of forecasts considerably.
Try #NoEstimates - rolling wave forecasts communicate uncertainty well. To avoid breaking Scrum or Kanban(for example, using the Kanban Guide for Scrum Teams), ensure that all (product backlog) items are valuable.
When asked when the team will have something done, try responding by saying you don't know. Explain that you're using an evidence-based empirical approach and operate one Sprint/week/month at a time. Many items require discovery/spikes; you don't know what feedback you'll get, you don't know what you'll learn, and you don't know what you'll have to respond to. Beware that in many cultures, vacuums tend to get filled, and if one does not provide a forecast, others eagerly await to offer one, probably with an undoable deadline.
Favor leading by example with agility. Understand the problem space. Discover and deliver capabilities—review outcomes with the customers and end-users. Learn what can be learned. Act on what we have discovered—set expectations of uncertainty and bets, not dates. Don't feed the desire to know something we don't know. As a well-known agilist said to me, forecasting makes us feel we are doing something useful; that feeling or unmet need is what needs to be addressed.

We get lost during sizing and forecasting, and we often forget value. Measure, together with the people affected, whether the work made a difference, and then adjust.
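To make the takeaway about probabilistic forecasting from throughput a little more concrete, here is a minimal Monte Carlo sketch in Python. It is only an illustration under stated assumptions: the function names, the sample throughput history, and the backlog size are all hypothetical, and the simple resampling shown here is one of several ways to build such a forecast.

```python
import math
import random


def simulate_sprints_needed(throughput_history, backlog_size,
                            trials=10_000, seed=0, max_sprints=1_000):
    """Resample past per-Sprint throughput until the backlog is Done.

    throughput_history: counts of valuable items Done per Sprint (hypothetical data).
    Returns a sorted list with the number of Sprints each trial needed.
    """
    if not any(t > 0 for t in throughput_history):
        raise ValueError("history contains no throughput to sample from")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(trials):
        remaining, sprints = backlog_size, 0
        # max_sprints is a safety cap for histories dominated by zeros
        while remaining > 0 and sprints < max_sprints:
            remaining -= rng.choice(throughput_history)
            sprints += 1
        results.append(sprints)
    return sorted(results)


def percentile(sorted_results, pct):
    """Smallest number of Sprints that covers at least pct percent of trials."""
    index = max(0, math.ceil(pct / 100 * len(sorted_results)) - 1)
    return sorted_results[index]


# Hypothetical history: valuable items Done in each of the last eight Sprints.
history = [3, 5, 2, 6, 4, 0, 5, 3]
backlog = 30  # hypothetical count of right-sized, valuable items remaining

outcomes = simulate_sprints_needed(history, backlog)
average_based = math.ceil(backlog / (sum(history) / len(history)))
share_within_average = sum(1 for s in outcomes if s <= average_based) / len(outcomes)

print("Average-based forecast (Sprints):", average_based)
print("Share of trials finishing within it:", round(share_within_average, 2))
for pct in (50, 85, 95):
    print(f"{pct}% of trials finished within", percentile(outcomes, pct), "Sprints")
```

Reporting a range of percentiles instead of a single number keeps the conversation on uncertainty, and re-running the simulation each Sprint with fresh throughput data fits the "better forecast next time" caveat.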

Acknowledgments

Many thanks to members of the Design, UX, Product Management, #NoEstimates, Scrum, and Kanban communities for either inspiring ideas for this article or reviewing it in its entirety or in part. Reviews of the article added perspectives that were acted upon; even so, reviewers might disagree with the article. I want to offer special thanks to Glaudia Califano of Red Tangerine and Christian Neverdal for reviewing in detail a much earlier iteration of this article; I am in their debt. I am also in debt to many other unnamed reviewers who put in a lot of effort. I especially want to thank Arpad Piskolti for inspiring me with his uncertainty-embracing catchphrase “and I’ll give you a better forecast next time (week/Sprint/month).”

Appendix

Flow metrics

Flow metrics are not just about throughput. Following the Kanban Guide for Scrum Teams, one can also look at WIP, Cycle Time, and Work Item Age; in that context, looking at throughput alone would be myopic. Even in a Scrum context, focusing on throughput but ignoring impeded/forgotten product backlog items is a fool’s errand.

The Kanban Guide for Scrum Teams introduces four metrics to Scrum, namely:

Work in Progress (WIP): The number of work items started but not finished. The team can use the WIP metric to provide transparency about their progress towards reducing their WIP and improving their flow. Note that there is a difference between the WIP metric and the policies a Scrum Team uses to limit WIP.
Cycle Time: In a Kanban Guide for Scrum Teams context, the elapsed calendar time between when a work item starts and when a work item finishes (rounded up).
Work Item Age: The elapsed calendar time between when a work item started and the current time; this applies only to items still in progress.
Throughput: The number of product backlog items finished per unit of time.

See How can Scrum with Kanban help people solve complex problems?
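As a rough illustration of the four metrics listed above, the following Python sketch computes them from a small list of work items. The `WorkItem` class, the helper names, and the sample dates are hypothetical. The sketch also assumes that "rounded up" means counting both the start day and the end day, so an item started and finished on the same day has a Cycle Time of one day; teams may define this differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkItem:
    name: str
    started: Optional[date] = None   # None means not started yet
    finished: Optional[date] = None  # None means not finished yet


def elapsed_days(start: date, end: date) -> int:
    # Assumption: count both the start and end calendar days ("rounded up").
    return (end - start).days + 1


def work_in_progress(items) -> int:
    """Number of items started but not finished."""
    return sum(1 for i in items if i.started and not i.finished)


def cycle_time(item: WorkItem) -> Optional[int]:
    """Elapsed calendar days from start to finish; None if not finished."""
    if item.started and item.finished:
        return elapsed_days(item.started, item.finished)
    return None


def work_item_age(item: WorkItem, today: date) -> Optional[int]:
    """Elapsed calendar days since start; only for items still in progress."""
    if item.started and not item.finished:
        return elapsed_days(item.started, today)
    return None


def throughput(items, period_start: date, period_end: date) -> int:
    """Number of items finished within the period (inclusive)."""
    return sum(1 for i in items
               if i.finished and period_start <= i.finished <= period_end)


# Hypothetical product backlog items for one week.
items = [
    WorkItem("A", date(2022, 8, 1), date(2022, 8, 3)),
    WorkItem("B", date(2022, 8, 2)),
    WorkItem("C", date(2022, 8, 4), date(2022, 8, 4)),
    WorkItem("D"),
]
today = date(2022, 8, 5)

print("WIP:", work_in_progress(items))
print("Cycle Times:", {i.name: cycle_time(i) for i in items if i.finished})
print("Work Item Age:", {i.name: work_item_age(i, today) for i in items
                         if i.started and not i.finished})
print("Throughput:", throughput(items, date(2022, 8, 1), date(2022, 8, 7)))
```

Looking at Work Item Age alongside throughput makes impeded or forgotten items visible, which throughput alone would hide.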

Some techniques to inform/challenge forecasts

Guidance on Sizing - other approaches

Upsides of other sizing approaches

Time estimation - estimation of work in hours, person-days, perfect days	Cost estimation	Three Point Method
Does not require Scrum education	Does not require Scrum education	Developers value discussions on relative size, as the perception is it leads to a better understanding of the work
Speaks in the customer’s language (time), or does it?	Speaks in the customer’s language (money cost), or does it? (timely value?)	Developers feel a better sense that they have discovered relative complexity/ risk
	Estimated usually in sprints (useful for a product goal) or time converted to cost	Optimism, pessimism, realism - everyone gets a voice
		Creates a range rather than a number, which is great for communicating uncertainty, and getting stakeholders used to uncertainty

Potential downsides of other sizing approaches

Time estimation - estimation of work in hours, person-days, perfect days	Cost estimation	Three Point Method
Prone to abuse by people with a “utilization” mindset (keeping people busy).	Less useful for sizing individual product backlog items that would go into a sprint. More useful for a rough order of magnitude for, say, an entire product goal.	Might be used to expedite Planning Poker, losing the value of the conversations that help people understand each other's opinions
Would not be used for probabilistic forecasting, as it’s not output-based.		When converted to story points, could be used for probabilistic forecasting min/max, but should you?
When was the last time you had a perfect day?

Other considerations of other sizing approaches

	Time estimation - estimation of work in hours, person-days, perfect days	Cost estimation	Three Point Method
Typically combined with	T-shirt size	T-shirt size, Time estimation, Historical time reference	Assign numerical values, T-shirt size
Usefulness in complexity	low	low	low to medium
Devalued by	Capping the size	Capping the size, the use of function points, use of Gantt charts	Capping the size
Worst thing that could happen (if done badly)	Delusion that everyone works full days, full months to maximum efficiency, sandbagging	Delusion that what happened in the past dictates the future	Easily gamed

About the Author

John Coleman
