
The Estimation Game - Techniques for Informed Guessing


A big problem with being numbers-driven is that you often don’t have the numbers. This is especially true when contemplating designs for systems that don’t yet exist. The more cheaply and quickly you can make reasonable estimates of those numbers, the more designs you can consider, and the better your final design will be. Fast and cheap answers let you explore a larger space of possibilities, kind of like how you can now research the specs and reviews of different cars before you step onto the lot.

This kind of exercise is often called a “Fermi estimation”, after a legendary incident in which Enrico Fermi dropped bits of paper into the shockwave of a nuclear test to estimate its yield. There are other old chestnuts like the (now well-retired) interview question about how many left-handed piano tuners collect Social Security.

The thing left out of puzzle stories is how to actually solve them. There are lots of examples and books, and even some sites that let you play and get feedback, but formal instruction is surprisingly thin. Students seem to be expected to pick it up via some form of osmosis. Estimation is a game, but it’s a game of skill. Games of skill have rules for play and strategies to win. In this article we’ll try to tease out what some of those strategies are.

Case 1: How many computers does Google own?

Well, what is the biggest thing Google does? Probably web search. Ok, so how much computer power is involved in a web search? How many searches are done per day? You can make a guess at each, multiply them together, and from there you’ll fail to account for lots of important things. What about crawling the web in the first place, or the processing involved in building the search index? What about YouTube? What about other products like maps and email? What about other countries?

You can keep going in this vein, but the more unanswered questions you stack up, the more sources of error you’ll have. Step back a bit. Addition and multiplication are poor tools for estimation unless you have a small number of concrete cases. Instead, try subtraction and division. The trick is to find a simple, large quantity that completely contains the quantity you want to estimate, and then carve away at it.

Technique 1: Use fundamental units, then subtract and divide.

What do all computers require? Electricity. So if you knew how much electricity Google uses, you’d have completely captured the quantity you’re estimating, plus some other stuff. So then subtract. Does Google use electricity for things other than computers? Sure, for office lighting and such. But the energy used for people is likely tiny compared to the energy used for computers, so let’s ignore it. What about the energy used for the care, feeding, and cooling of their computers, versus the computers themselves? A reasonable guess is 20%.

Technique 2: Define a small number of cases. Ideally, one.

How much electricity does a “Google computer” consume? They are famous for making their own hardware. We can expect they are more efficient than the kind of stuff you can buy in the store. Also, we can expect a high degree of standardization. It’s unlikely that the electricity used by their hungriest computer is very different from their leanest. No battalions of three-ton water-cooled mainframes next to piles of itty-bitty Raspberry Pi’s. So we’ll assume that an “average” Google computer is representative.

Technique 3: Make equivalences to concrete things.

How much electricity does a “typical” computer use? My laptop’s charger is rated at 85W. At the opposite end of the range, the power supply in my old gaming computer is rated at 400W. It’s reasonable to guess that a powerful but efficient datacenter server, sans graphics card and the other stuff consumer PCs come with, drinks 200W of power. That’s a nice round number: 5,000 of these 200W computers would consume exactly 1 megawatt (MW). So the formula becomes:

(MW * 0.8) * 5000 = Number of Google Computers

Technique 4: Consider real-world effects appropriate to the scale you’re working at.

We’re trying to estimate the size of a world-spanning computer cluster used by a large fraction of humanity. At world-wide scale, world-wide effects kick in. For example, people sleep. That means that the demand for “public utility” resources like power, water, telecoms, and yes, even Google, will tend to exhibit a diurnal demand curve: a peak in the afternoon local time, and a trough in the wee hours of the morning.

Furthermore, geography being what it is, the majority of the human race is concentrated in only a few time zones. When the sun is over the Pacific Ocean, most of the world is sleeping. So this diurnal cycle tends to apply across the planet.

This means we can’t use average MW. We have to estimate the peak MW draw. The ratio between peak and average demand for consumer services, across a surprisingly wide range of industries, is about 2:1. This phenomenon has all kinds of implications for infrastructure planning. You have to build your dam for the expected peak of the river, not the average. Otherwise, your dam will overflow with every rainstorm.

((Avg MW * 2) * 0.8) * 5000 = Number of Google Computers
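
To keep the arithmetic honest, here is that chain of guesses as a quick Python sketch. The 200W per server, the 20% overhead, and the 2:1 peak ratio are all assumptions from above, not published figures:

WATTS_PER_SERVER = 200        # guessed draw of an "average" Google server
SERVERS_PER_MW = 1_000_000 / WATTS_PER_SERVER    # = 5,000 servers per megawatt
COMPUTE_FRACTION = 0.8        # guess: 20% of the power goes to cooling and other overhead
PEAK_TO_AVG = 2.0             # guess: peak demand is about twice the average

def estimate_servers(avg_mw):
    # peak megawatts, times the fraction feeding computers, times servers per MW
    return avg_mw * PEAK_TO_AVG * COMPUTE_FRACTION * SERVERS_PER_MW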

This technique of including real-world effects is harder to practice, because it requires specific domain knowledge and an awareness of scale. But it’s often very important to have knowledge about various layers of infrastructure at your fingertips, if you are going to design things that fit into that infrastructure. When Randall Munroe tackled the same question of counting Google’s computers, he made the mistake of taking just the average power consumption, and lowballed his number by a large factor.

All that’s left now is to find that average MW number. You used to have to trawl local newspaper archives for capacity requests to the local power utilities, and then make guesses about actual present consumption. On top of that, Google is fond of using fictitious names to obscure their operations. Today it’s all on their website for PR purposes. For 2013, Google says they consumed 3.7 trillion watt-hours of power. That averages out to 422 MW of power consumption over the year:

3.7 trillion watt-hours per year / 365 days per year / 24 hours per day = 422,374,429 watts

Assuming that peak-hour consumption is twice that, their peak draw is around 800 MW. Now let’s plug it in.

800 * 0.8 * 5000 = 3,200,000

3.2 million servers. Tell that to a Googler and see if they blink.
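
Here is the same arithmetic as a Python sketch, rounding the peak to 800 MW as above:

total_wh_2013 = 3.7e12                     # watt-hours Google reported for 2013
avg_watts = total_wh_2013 / (365 * 24)     # about 422 million watts, on average
peak_mw = round(avg_watts * 2 / 1e6, -2)   # about 800 MW, assuming the 2:1 peak ratio
servers = peak_mw * 0.8 * 5000             # = 3,200,000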

Technique 5: Make a list, check it twice.

A good estimation uses as few inferential leaps or assumptions as possible. A different info-tease claims that Google’s “continuous” power consumption in 2010 was 260 MW. The word “continuous” is ambiguous. Perhaps Google uses servers off-hours to do other kinds of work, like indexing the web or processing video uploads. It’s possible they have no peak, but that would be pretty remarkable. However, I think it simply means “average”. Checking the original version of their PR page in the Internet Archive shows a claim of 2.26 TWh total power consumption over 2010, which averages out to 257 MW. QED.

But we don’t really know, so let’s list the assumptions:

  1. People power usage is negligible compared to computers.
  2. The overhead of cooling is 20%.
  3. An average Google computer draws 200W.
  4. Peak power draw is twice the average power draw (a 2:1 ratio).

Any error in any of these four assumptions will affect the overall number. If you want to refine the estimate, you can research each one in isolation. Munroe, for example, estimates server draw at 215W. It’s also possible that our 20% cooling overhead is too high. On the other hand, errors in the sub-estimates tend to cancel each other out. You just have to hope you’ve made an even number of mistakes.
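
One cheap way to check a list like this is to wiggle each assumption and watch how far the answer moves. A rough Python sketch (the 215W figure is Munroe’s; the other variations are arbitrary):

def estimate_servers(avg_mw, watts_per_server=200, compute_fraction=0.8, peak_ratio=2.0):
    return avg_mw * peak_ratio * compute_fraction * (1_000_000 / watts_per_server)

baseline = estimate_servers(422)                        # about 3.4 million (3.2 with the peak rounded to 800 MW)
heavier  = estimate_servers(422, watts_per_server=215)  # about 3.1 million
no_peak  = estimate_servers(422, peak_ratio=1.0)        # about 1.7 million, the average-only mistake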

Technique 6: Avoid anchoring effects

This is a hard one to talk about, because if I had said anything about it earlier, it would have triggered the very phenomenon I wanted to warn you of. It’s very easy to bias people’s estimates of a number simply by exposing them to other numbers. When you first read the question “How many computers does Google own?” you probably had a vague feeling of some large quantity, but nothing specific. If instead I had asked, “How many computers does Google run, to the nearest thousand?”, it’s very likely you would have started thinking in terms of thousands, not millions.

The only advice I have here is to avoid, for as long as possible, thinking of a “reasonable” number for your estimate. (While you’re at it, never think about pink elephants.) If you can’t help thinking of a number, write it down, along with all the assumptions you can imagine went into it. At least then you’ll be able to revisit and revise.

Technique 7: Use bounding and cross-checking

It also helps to find rough upper and lower bounds on your target number. This gives you a gut-check for the range of the number, and also gives you multiple ways to arrive at the same answer. This is somewhat in conflict with #6. I generally don’t look for bounds until I’ve already made one full chain of reasoning.

What you’re looking for is convergence, some kind of agreement between different angles of attack. For example, computers cost money, so follow the money. It appears that Google spent about $7 billion on “infrastructure” in 2013. We’ll assume half of that money ($3.5 billion) went to computers and the rest went to buildings, land, fiberoptics, rocketships, robot armies, etc. We’ll also assume that each computer costs $2,000, so about 1.8 million computers were bought that year. Then assume they keep their computers running for 3 years. That suggests they don’t have more than 5 or 6 million servers.
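
The same cross-check as a sketch, with the 50% split, the $2,000 price, and the 3-year lifetime all being guesses:

capex_2013 = 7e9                             # reported 2013 "infrastructure" spend
spent_on_computers = capex_2013 / 2          # guess: half of it buys computers
bought_per_year = spent_on_computers / 2000  # about 1.75 million at $2,000 each
upper_bound = bought_per_year * 3            # about 5.25 million over a 3-year lifetime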

What about the lower bound? 3.2 million sounds like a lot. Are we in the ballpark? Well, Microsoft’s server fleet is probably smaller than Google’s, and they are upfront about their number: one million.

Technique 8: Don’t forget time and growth

All of the numbers in our main chain of reasoning come from 2013. How many servers does Google have now? Luckily, we have two data points for power draw, 2010 and 2013. The growth over those three years was 3.7 / 2.26, or about 64%. Your first guess might be that this works out to about 21% per year, but growth rates are trickier than that. When you say “it grows 21% per year for three years” you really mean:

1.21 ** 3 = 1.77

...which is too high. The real number is closer to 18%:

1.18 ** 3 = 1.64

To project from 2013 to now (mid-2015), you raise the annual growth factor to the 1.5th power:

3.7 TWh * (1.18 ** 1.5) = 4.7 TWh

3.2 million computers * (1.18 ** 1.5) = 4.1 million computers
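
The growth arithmetic as a Python sketch, using the two reported power figures:

growth_3yr = 3.7 / 2.26            # about 1.64, i.e. roughly 64% growth from 2010 to 2013
annual = growth_3yr ** (1 / 3)     # about 1.18, i.e. roughly 18% per year
mid_2015 = annual ** 1.5           # 1.5 years past 2013
power_now = 3.7 * mid_2015         # about 4.7 TWh
servers_now = 3.2e6 * mid_2015     # about 4.1 million computers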

But all of that assumes constant annual growth. Looking again at Google’s infrastructure spending, they spent about $7 billion in 2013 and nearly $11 billion in 2014. That’s a big bump. Assuming that computers bought in 2012 or earlier have mostly been retired, and that 2015 looks like 2014 (counting only its first half, hence the 5.5), we can plug the numbers into our cross-check formula:

($7 + $11 + $5.5) billion / 2 / $2,000 ≈ 6 million computers
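
Or, as a sketch in the same style as the earlier cross-check (the 50% split and the $2,000 price are still guesses):

spend = (7 + 11 + 5.5) * 1e9   # 2013 + 2014 + first half of 2015, in dollars
servers = spend / 2 / 2000     # about 5.9 million; call it 6 million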

Technique 9: Don’t add false precision.

This second estimate has fuzzier assumptions in the subtraction and division phases. Maybe Google builds its servers for less than $2,000. Maybe they spend more or less than half of that money on computers. It assumes a lot about both the growth and decay rates, and it projects away from concrete numbers. That’s why I didn’t bother with a decimal point. In fact, I feel an urge to fuzz the range to 6 million plus or minus 1 million.

The 3.2 million number is also suspect. The power consumption figure we used was totaled over the entire year, and presumably they ended 2013 running more servers than when they started, but there’s not a lot to go on when guessing the slope of that curve. For that hidden assumption alone we should drop the 0.2 and fuzz the range.

Unless you have a really short chain of assumptions, the best you can hope to get out of these kinds of estimations is an order of magnitude and one significant digit. Two digits would require serious confidence. That’s the standard you should hold yourself and others to. If someone says “My spreadsheet predicts 44,815 widget sales in month 6!”, get in the habit of mentally editing that to “40-50 thousand”. You’ll be surprised less often when reality casts its vote.

Case 2: Look at a map!

Estimation isn’t just a parlor game, and you never know what domain knowledge will turn out to be useful. I was once responsible for a stream of performance data collected from a very large number of servers scattered around the world. The stream suddenly developed a weird bug: about 2% of the data was leaking off into the aether. Worse, it was important experimental data; the control data, funnily enough, was coming in fine. Other than that there wasn’t much of a pattern.

After lots of tracing and debugging at both ends of the pipeline, I found a little gem. Buried in a few layers of code was something like:

try {
...
start_experiment($stat_server, $data);
...
} catch (Exception $e) {
// todo
}

This system was typical of what you get when you let scientists build large software systems: solid theory, reasonable methodology, but horror-show coding. Stuff like the snippet above wasn’t unusual. Errors were being swallowed, sure, but they would have been swallowed for years. I repeated the Programmer’s Lament: “But nothing’s changed!” I couldn’t see why intermittent data loss should suddenly start happening out of the blue.

I looked at the implementation of start_experiment() and saw that it used a funky homegrown HTTP client function, which in turn used a funky homegrown network connection function with an optional third argument, $timeout. It was never passed. However, it reminded me of something I’d noticed subconsciously but hadn’t really thought about: the only datacenter with a large amount of data leakage was the new one, way up in northern Europe.

The light came on. How much time does it take to send a packet? After some hurried hunting I found the other gem:

$default_conn_timeout = 100; //msec

Our leaky datacenter was about 8,000 kilometers away, as the penguin flies.

However, fiber-optic cables between the US and Europe don’t travel that route. (Also, there are no penguins at the North Pole.) For complex reasons, a packet from San Francisco to Stockholm travels across the continental US, exits at New York, dives under the Atlantic, comes ashore in southern England, pops out the other side to Belgium, and generally rattles around Europe before kicking up north to its final destination. Eyeballing a cable map the networking team had in their office, the real route looked to be a bit less than 10,000 kilometers one way, or 20,000 round-trip.

The speed of light through fiberoptic cable is roughly 200,000 km per second. The maximum distance a packet can travel round-trip in 0.1 seconds was just about the distance we were trying to span. The internet is impressive, but few things in the world can reliably operate at the edge of physical reality. I changed the timeout to 500 milliseconds, accompanied it with a 500-word explanation, and the problem was fixed.
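
For the record, the arithmetic that lit the light bulb, as a quick sketch (the 10,000 km route is the eyeballed figure from the cable map):

fiber_km_per_s = 200_000                          # speed of light in fiber, roughly 2/3 of c
round_trip_km = 2 * 10_000                        # eyeballed one-way route, out and back
round_trip_s = round_trip_km / fiber_km_per_s     # = 0.1 s, exactly the 100 ms default timeout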

While bad error-handling compounded this bug and made it harder to find, the root problem lay in a miscalculation of the time it takes to send data over long distances in the real world. The final irony is that the bug's author had a PhD in laser physics.

And that’s why you should practice estimation.

About the Author

Carlos Bueno works at the database company MemSQL. Previously he was a performance engineer at Facebook, Yahoo, and several startups. Carlos is the author of "Lauren Ipsum," a popular children's novel about computer science, and "Mature Optimization," Facebook's manual on performance profiling.
