Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Articles Shaping Big Data Through Constraints Analysis

Shaping Big Data Through Constraints Analysis


Computing system design is a game of constraints. A well-designed system is like a custom-built carrying case: on the outside there are convenient clasps and handles, and the inside is moulded into a negative image of the thing it contains. To design an elegant container you focus on the factors that matter most: size, weight, balance, and movement. Those factors and how they interact produce the operative constraints on possible designs.

In this article I'll describe a method for analyzing constraints on the shape and flow of any data system, and then put it through its paces with real-world examples. The constraints-based approach is similar to Goldratt’s theories in the field of logistics. In my experience, the most successful systems designers and capacity planners tend to think about computing systems in terms of concrete constraints, even if they may not say so out loud.

A constraints exercise is more useful than, say, extrapolating from a benchmark or arguing about the definition of “concurrent user.” It reveals probable bottlenecks and their dependencies before any code is written. That knowledge is essential for making rational choices about proposed designs. With practice you can even make shrewd guesses about someone else’s system, like your competitor’s, simply by observing how it operates.

The trick is to establish the size and heft of the data, and then focus on how it flows. Computers really do only two things: read data in and write data out. Performance is a function of how much data must move, and where, to accomplish a task. That’s not a facile slogan; it’s a consequence of the fundamental theorem of computing. Every computer is equivalent to a Turing Machine, and all a Turing Machine does is move symbols around a tape. Its throughput is bounded by how fast it can move symbols. This consequence holds true from the micron-sized guts of the CPU on up to world-spanning distributed databases. Luckily, the math is straightforward.

The ability to quickly discard bad solutions, without actually building them, is a lot of what good systems programming is. Some of that is instinct, some experience, but a lot of it is algebra.

— Keith Adams

Here are the eight factors I find most useful for characterizing a system:

  1. Working set size: This is the set of data a system needs to address during normal operation. A complex system will have many distinct working sets, but one or two of them usually dominate. In stream-like applications such as email or a news feed, the working set can be much smaller than the total set. People rarely access messages more than a few weeks old; they might as well be considered a different system. More often, life is not that simple and the line between “hot” and “cold” data is not clear. It’s most useful to think in probability bands: over a given period of time, what is the probability of various pieces of data being used? How many bands are there? What is the distribution?
    For the initial analysis you can focus on the rough size of the working set, as opposed to the detailed characteristics. However, those details often come back to bite you later.
  2. Average transaction size: This can be thought of as the working set of a single transaction performed by the system. How much data does the system have to touch in order to serve a transaction? Downloading a photo and running a web search involve similar-sized answers sent to the client. However, the amounts of data touched in the background are very different. Note that I’m using the word “transaction” to mean a distinct piece of work. This idea equally applies to big analytical jobs.
  3. Request rate: How many transactions are expected per hour / minute / second? Is there a peak hour, or is demand steady? In a search engine you may have 5 to 10 queries per user over a period of minutes. An online ebook reader might see constant but low volumes of traffic. A game may require multiple transactions per second per user. In short, the expected throughput. The combination of throughput and transaction size governs most of the total data flow of the system.
  4. Update rate: This is a measure of how often data is added, deleted, and edited. An email system has a high add rate, a low deletion rate, and an almost-zero edit rate. An ad auction use case has ridiculously high rates for all three. A useful way to gauge how much to worry about the update rate is to compare it to the read throughput.
    The growth rate of the data also ties into the working set size or retention policy. A 0.1% growth rate implies a three-year retention (365 * 3 is about 1,000), and vice-versa. A 1% rate implies 100 days.
  5. Consistency: How quickly does an update have to spread through the system? For a keyword advertising bid, a few minutes might be acceptable. Stock trading systems have to reconcile in milliseconds. A comments system is generally expected to show new comments within a second or two, with frantic work backstage to provide the illusion of immediacy to the commenter. Consistency is a critical factor if the update rate is a significant portion of the request rate. It is also critical if propagating updates is especially important to the business, e.g. account signups or price and inventory changes.
  6. Locality: What portion of the working set does one request need access to? How is that portion defined? What is the overlap between requests? On one extreme you have search engines: a user might want to query bits from anywhere in your system. In an email application, the user is guaranteed to only access their inbox, a tiny well-defined slice of the whole. In another instance, you may have a deduplicated storage for email attachments, leaving you prey to hot spots.
  7. Computation: What math do you need to run on the data? Can it be pre-computed and cached? Are you doing intersections of large arrays? Are you bringing the computation to the data, or the other way around? Why?
  8. Latency: How quickly are transactions supposed to return success or failure? Users seem to be ok with a flight search or a credit card transaction taking several seconds. A web search has to return within a few hundred milliseconds. An API that outside systems depend on should return in 100 milliseconds or less. It’s also important to think about the variance. It’s arguably worse to answer 90% of queries in 0.1 seconds and the rest in 2 seconds, rather than all requests in 0.2 seconds.

Finding the bottlenecks

Pick a system you want to analyze. Fill in the constraints above. Sketch out a solution that can satisfy them all. Now ask the final question: what is the dominant operation? In other words, where is the fundamental bottleneck? What part can’t be hurried?

The answer may sound mundane when you say it out loud, but identifying the true bottleneck of a system clarifies things immensely. Incidental bottlenecks tend to stack up. Fixing one activates another, but they don’t threaten to upend the basic premises of your design. Fundamental bottlenecks are harder to fix, and harder to see, because they are caused by some fact of nature or a strong assumption the system is built around. An infinitely-fast website is still subject to the speed of light. Rockets are limited by the need to lift the weight of their own fuel. Keep the thought experiment going until you understand what the most fundamental bottleneck is, and why.

For example, let’s say you own a pizza shop and want to make more money. If there are long lines to order, you can double the number of registers. If the pizzas arrive late, you can work on developing a better rhythm, or get vehicles for the delivery people. You might even try raising the oven temperature a bit. But fundamentally, a pizza shop's bottleneck is the size of its oven. Even if you get everything else right, you won’t be able to move more pizzas per day without expanding your oven’s capacity. Or by buying a second one. And then you have to figure out where to put it.

If you can’t clearly see a fundamental bottleneck, change a constraint and see what shifts in response. What happens if you had to reduce the latency requirement by 10X? Halved the number of computers? What tricks could you get away with if you relax the constraint on consistency? It’s common to take the initial constraints as true and unmoving, but they rarely are. Creativity in the questions has more leverage than creativity in the answers.

If you want to understand something, take it to the extremes or examine its opposites.

— Col. John Boyd

Movie Streaming Use Case

Let’s apply our method to something more complex than pizza: a video streaming service. Imagine a smaller-scale Hulu or Netflix. It will have a library of 100,000 videos, on average 60 minutes long, at a bitrate of 500kbps. At the peak hour there will be about 1,000,000 people watching.

  1. Working set size: 100K videos * 60 minutes * 60 seconds * 500kb per second / 8 bits per byte works out to a bit over 20TB for the total set of data. The most sensitive number in the equation is the bit rate. Changing the rate can shrink or balloon the total size.
  2. The probability bands of the working set depend on the popularity distribution of videos, which is (probably) long-tail. Let’s assume that the top 100 videos account for 10% of all plays, the top 1,000 do 20%, and the top 10,000 account for 35%. The bottom third doesn’t get enough views to move the needle much at all. It’s tempting to ignore the long tail completely. However, as we’ll see later, that would be a mistake.
  3. Average request size: 3600 seconds * 64KBps, or a couple of hundred MB. In practice, smooth streaming depends on being able to shove data down slightly faster than it is consumed, and is probably done in smaller chunks. To simplify things, we’ll ignore the large and complex issue of regulating throughput down to the client.
  4. Request rate: Fairly low, compared to a system with a smaller request size. Concurrent requests will be high. You’d expect users to have short bursts of browsing and long periods of streaming. At peak the system will be pumping out half a terabit of data per second.
  5. Update rate: Nearly nil compared to the request rate, as the videos are professionally produced. If this were YouTube it would be a different story entirely.
  6. Consistency: As this is largely a read-only system, it can be ignored.
  7. Locality: Any user can view any movie, though of course only one movie at a time. You will have the opposite problem of many users accessing the same movie in different places.
  8. Computation: This depends on whether the videos are being encoded on the fly or not. Assuming not, the main work is shoveling bits onto the pipe.
  9. Latency: This is an interesting one. The worst case is channel surfing or skipping around. In real-world movie services you may have noticed that switching streams or skipping around within one video takes a second or two. Anything longer would be unacceptable.

Now let’s sketch a system that can serve these constraints. 20TB doesn’t seem like a lot of data to hold. 500gbps feels like a lot of data to serve. A modern (as of 2015) 16-core file server can comfortably push 7gbps of useful data out to the network, so we’ll need at least 50 of them. Throw 128GB of RAM into them. 2TB of storage each is easily enough to hold the working set.

So we can hold it, but can we move it? Let’s look again at that popularity curve. The top 100 videos account for 10% of the demand. You’ll want to have a copy on each server. In fact you’ll probably want to do that for the top 10,000 videos, which is a third of the bandwidth but only 10% of the storage. With sufficient RAM and a little luck, the popular stuff will be served almost entirely from memory.

That leaves the other 90,000 videos. The 30,000 of them on the right-hand side of the long tail don’t get watched enough to matter. But in the middle of the tail are the 60,000 that make up the remaining 2/3 of your traffic. How do you serve them while keeping to the latency constraint? Your RAM is already taken up by the popular stuff, so it’ll come down to how many concurrent 500kbps streams your storage layer can deliver. The specific design would only come out after running tests, but you may very well end up with several hundred disk drives working in parallel.

From this we conclude that the fundamental bottleneck of a video service is random seek times, which in this design means how fast those little metal arms can move back and forth. The controlling constraint is the popularity curve of what videos people watch, and that’s something you can’t change.

There are other designs which might get around the problem. For instance, 1TB of RAM in each server instead of 2TB of disk. The SSDs of today or tomorrow just might be able to pull it off too. The numbers here may be off, and will certainly change as time goes on. The point is discover which concrete details matter most, and focus the discussion on them.

Face Recognition Use Case

Sometimes, a particular operation dominates a system so heavily that almost nothing else matters. An interesting case of that is face recognition. Not face detection, which is just finding a face in an image, but determining who is in a given photo by comparing against a library of other photos.

Face recognition starts by preprocessing photos of known people in order to generate a kind of abstract description of each person’s essential features, known by the wonderfully weird name “eigenface”. An eigenface is a fixed-size chunk of data, usually less than 100KB, that can be produced by any number of clever algorithms. The main benefit of eigenface generation is that a similarity score between any two of them is relatively cheap to calculate. The higher the score, the more affinity between the two face profiles and the more likely it is that the person is who you think it is.

Suppose you have a library of 50,000,000 eigenfaces corresponding to individual people. Maybe it’s from a national passport or driver’s license database, or the world’s largest yearbook. There is a stream of query photos coming in, hundreds per second, live from passport control at the border. The top 10 highest-scoring matches must be found for each photo in 1 second or less. Go.

  1. Working set size: 50 million * 100KB is 5TB. The division of “hot” and “cold” data is not immediately clear. There may not be any.
  2. Average request size: The data coming in is a photo, which is reduced to a fixed-size 100KB eigenface. The amount data touched during a request is potentially very high.
  3. Request rate: 300 per second.
  4. Update rate: Unknown, but probably infrequent and batched due to the preprocessing involved.
  5. Consistency: Ignore for now, given the low update rate.
  6. Locality: Naively, we’d have to scan through all 50 million profiles in the library and rank the matches. We’ll need to find a way to elide that work as much as possible.
  7. Computation: The algorithms for actual face detection are fascinating to read about, but not important for this exercise. Assume that a full comparison between two eigenfaces requires 10 milliseconds of CPU time.
  8. Latency: 1 second or less, hard requirement.

The only way to hit the latency requirement is to drastically reduce the number of full eigenface comparisons we perform in a given search. But just as a gut check, is it within the realm of possibility to throw enough compute at the problem? Probably not. 50 million comparisons * 10 msec * 300 would require roughly 5,000 CPU-years of computing power per second. I’ve seen some large clusters, but not that large.

So we need clever ways to reduce the work. A quick search for [eigenface indexing] shows active research in this area, but a look at the abstracts suggests that the speedup is linear, not logarithmic. Google does have a generic image similarity search so it’s maybe not impossible, but the technology is not necessarily available to us. Drop that line of investigation for now and try something else. Perhaps we can eliminate large swathes of candidates based on eye color, age, or gender. No sense matching a 30-year-old woman against photos of male children.

Assume that through heuristics and cheap tricks we can shrink the number of candidates to 1,000 for a given query photo. Throw the equivalent of 10 CPUs at the problem to deliver 10,000 msec of compute in the allotted 1 second of latency, add a couple more for overhead, and you’re done. 300 * 12 means we need 3,600 CPUs to handle the request rate. That’s about 250 servers, or a half-dozen racks. Three metric tons of computer power is a lot by any standard but this is an important project.

So we can compute it, but can we hold the data? 5TB is quite a lot of RAM. On the other hand we have all these servers to play with. So distribute the library of eigenfaces around the cluster RAM in something like Memcached.

Ok, now set it in motion. Oops. It doesn’t work.

We can store it, but can we move it? We forgot to update the average request size. If a single request touches 1,000 eigenfaces, and each one is 100KB, and they are randomly distributed, that’s 100MB of data to move from wherever the data lives in Memcached on one server to the CPU on another server doing the comparison. Take that times 300 requests per second, and we’re staring down the barrel of 240gbps of network consumption. That’s uncomfortably close to 1gbps per server, and the CPUs are already busy comparing faces.

We need to bring the computation to the data, not the data to the computation. Before we start the comparisons, we know exactly which of the 1,000 eigenfaces we’re comparing to. Memcached hashes keys in a straightforward manner, so it’s possible to know which server holds which eigenface. We could route specific requests to specific servers. Each server does 4 or 5 comparisons against their local data, returns their similarity scores, and the full list can be assembled and sorted easily. The only network traffic would be the 100KB eigenfaces of the query photos, times 250 machines, times 300. 60gbps is high but doable. With more clever arrangement of the data you can avoid a full fanout.

If it isn’t obvious by now, the fundamental bottleneck is scanning faces. It’s such a heavy problem that we barely have a plausible solution, and we haven’t even considered whether the recognition will be accurate enough to be useful. This analysis of face recognition is almost certainly wrong in important details, but given the comments here from people who should know, it’s in the ballpark. I deliberately picked a use-case I don’t know much about to demonstrate how constraints analysis can help even if (especially if) you’re not an expert.

Imagine an expert comes along and tells you, “Why not use GPUs? They can compare faces ten times faster!” you can answer, “That’s interesting. How does that help the network bandwidth problem?” (It actually might help, if it reduces the number of machines.) Or if someone says, “I have a way to index eigenfaces!” you can reply, “That’s interesting, but only if it helps me perform less than 1,000 full comparisons. By the way, you and the GPU fellow should talk.” You’ll also know to pay close attention to someone who says, “I can accurately represent a person’s face much more compactly than 100KB.”

That is the real value of a constraints-based approach. A better understanding of the underlying physics of your problem shows you where there’s room to be creative. And, like the eigenface trick itself, it allows you to render complex pictures into a standard form that can be accepted or rejected quickly.

Science isn't about WHY. It's about WHY NOT.

— Cave Johnson


There are several skills worth practicing for this kind of work. Being able to do quick estimations with incomplete data is essential. For instance, how many servers does Google own? (Follow the electricity consumption.) Another good skill is being able to dump a flawed thread of reasoning and flush your mental cache. Jeff Dean has many great talks online about “numbers you should know” and how to think about the design of large systems.

Running a constraints exercise on an existing system, and then looking up the answer, is a great way to hone your skills. In this article I avoided examples with high update rates or consistency constraints. But take an hour or so and work through what it probably took to run AdWords circa 2004, YouTube in 2005, or Flickr in 2006. There are great write ups and talks available, written by the people who built them. But don’t peek until you’ve got an independent answer.

About the Author

Carlos Bueno works at the database company MemSQL. Previously he was a performance engineer at Facebook, Yahoo, and several startups. Carlos is the author of "Lauren Ipsum," a popular children's novel about computer science, and "Mature Optimization," Facebook's manual on performance profiling.

Rate this Article