
Pragmatic Performance - Tales from the Trenches



Ramesh Subramonian shares techniques used to improve the performance of an existing system.


Ramesh Subramonian is a Principal Engineer at Target in the High-Performance Computing Group. He likes to build high-performance systems with a focus on efficient usage of computing resources.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Subramonian: Welcome to the mechanical sympathy track at QCon. I'd like to share our experiences in optimizing production systems. Like most jobs, but especially software engineering, 80% of the time is grunt work, but 20% of the time we get to step out and see how we're doing with respect to other things. These are those experiences that I would like to share. My name is Ramesh Subramonian. I'm a Principal Engineer at Target. I work for the High Performance Computing Group located in Sunnyvale. Our group's charter is to drive compute efficiency throughout the company, both in the products we build and support and as advisors to the company at large. The reason I chose this title, pragmatic performance, is to underscore the constraints under which we often find ourselves: rarely enough time or resources, limited ability to perturb a system in production, and the general reluctance of IT departments to adopt bleeding edge software or hardware.

Key Learnings

What's the one thing I'd like to leave you with at the end? It's that, despite the seemingly bleak picture I just painted, there's often hope buried in the specifics of your particular problem. General solutions are great because they are general: they work in all situations. That generality, though, comes at a cost. What I'm going to do is encourage you to embrace your inner snowflake, find out what's really your problem, and solve just that. As one who has been responsible for more bugs than I'd care to admit, I'm highly biased towards picking solutions with low complexity, both in terms of implementation and maintenance.

Two Simple Examples

I'm going to illustrate this point with two simple examples that are as old as the hills. The first is evaluating a Boolean expression. As an example, A and B, or not C and D, where A, B, C, D, are expressions that evaluate to true or false. The other is the problem of determining whether and where a string s1 is contained in a string s2. Solutions to this problem have existed for over 50 years in C's standard library, and pretty much every language provides some form of solution to this problem.


Let me set the business context that motivates optimizing the performance of Boolean expression evaluation. Our group is responsible for the development and support of Thalamus, a real-time monitoring and alerting system deployed at Target. As its popularity has grown, so has its footprint and its usage. Anything we can do to speed it up has a disproportionate impact. Thalamus touches over a petabyte a week of data traffic. This is truly a firehose of data to process, and requires us to be extremely conscious of extracting every ounce of performance.

How Boolean Expressions Get Used in Thalamus

How do Boolean expressions get used in Thalamus? Let's say that you are a user of Thalamus. The first thing you do is to write a filter. A filter is a way of expressing whether a packet is of interest. The good news is that, in most cases, most packets are not of interest. That's because you are looking for deviations from the norm. This is also why optimizing filters becomes important: the filter is the gatekeeper to subsequent computations. How do you go about writing a filter? One way is to write it in Lua. Lua is an interpreted language designed to be embedded, so it fits perfectly well into our C++ framework. The other is to use a domain-specific language, in our case one we call TQL. For today, think of TQL as the equivalent of writing a select statement in SQL.

TQL Filter as a Binary Tree

The details and the syntax of TQL are not important. What is important is that we have a compiler that turns a TQL filter into a tree. Leaf expressions, A, B, and C in this example, perform computations based on the values of keys in the JSON packet, and return true or false. Examples are regex matches, substring matches, membership in sets, numerical comparisons, range checks, and so on. The tree is generated at compile time. The key decision at runtime is: do we start down the left branch of the root or the right branch?
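To make the shape of such a compiled filter concrete, here is a minimal sketch in Python. This is not Thalamus's actual implementation; the class names and the leaf predicates are invented for illustration. Leaves run cheap checks on the decoded packet, and internal nodes combine them with short-circuiting AND/OR.

```python
import re

class Leaf:
    """A leaf expression: a predicate on the packet returning true/false."""
    def __init__(self, predicate):
        self.predicate = predicate          # packet -> bool

    def evaluate(self, packet):
        return self.predicate(packet)

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, packet):
        # Short-circuits: the right child is skipped when the left is false.
        return self.left.evaluate(packet) and self.right.evaluate(packet)

class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, packet):
        # Short-circuits: the right child is skipped when the left is true.
        return self.left.evaluate(packet) or self.right.evaluate(packet)

# (A and B) or C, with hypothetical leaf predicates:
A = Leaf(lambda p: re.search(r"ERROR|FATAL", p.get("msg", "")) is not None)
B = Leaf(lambda p: p.get("status", 0) >= 500)
C = Leaf(lambda p: p.get("app") in {"checkout", "payments"})
tree = Or(And(A, B), C)

print(tree.evaluate({"msg": "FATAL: disk", "status": 503}))   # True
print(tree.evaluate({"msg": "ok", "app": "search"}))          # False
```

The tree structure is fixed at compile time; all that is left open at runtime is the order in which children are visited.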

Navigating the Tree

When it comes to making decisions like this, we can look for inspiration from a variety of sources, ranging from Yogi Berra to the Matrix. For the artists among us, we could take the lead from Robert Frost. Seriously, though, we need a disciplined decision-making process. Before we answer the question of what order the tree should be evaluated in, let's see why order matters at all. It matters because of short-circuiting. In other words, if you had to evaluate the expression A and B, and you evaluated B, and it turned out to be false, you know that you don't need to evaluate A. What if I told you that A is false 99 times out of 100, whereas B is false only 50 times out of 100? You might be tempted to pick A first. What if I also told you that A costs 1000 times as much as B? A could be a complex regex evaluation; B could be a simple integer equality comparison. Would you still pick A first?
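We can put numbers on this. With short-circuiting, the order that evaluates first only pays for the second operand when the first comes out true. Using the figures from the example above (A false 99% of the time but 1000 units to evaluate; B false 50% of the time at 1 unit):

```python
# Expected cost of evaluating "A and B" in each order.
cost_A, p_true_A = 1000.0, 0.01   # expensive regex, rarely true
cost_B, p_true_B = 1.0,    0.50   # cheap comparison, true half the time

# With short-circuiting on AND, the second operand is evaluated
# only when the first one turns out to be true.
a_first = cost_A + p_true_A * cost_B   # 1000.01
b_first = cost_B + p_true_B * cost_A   # 501.0

print(a_first, b_first)
```

Despite A being the more selective filter, its cost dominates: evaluating B first is roughly half the price.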

Intuition behind the Solution

The intuition behind the solution is as follows. In the tree that we just saw, we maintain statistics for the costs and the outcomes, the outcomes being true or false, of each node. Assume the root node is an AND; we have two choices: we could go left or right. If we go left, we definitely pay the cost of evaluating the left child. If the left child evaluates to true, we pay the additional cost of the right child. Similarly, if we went right first, we are definitely on the hook for the cost of the right child, and we may end up having to pay the cost of the left child if the right child evaluated to true. The strategy is to pick the order with the minimum average cost. The proof that this strategy is in fact optimal has some interesting twists that are beyond the scope of this talk. You can see the cost of going left versus the cost of going right.
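The rule above can be applied bottom-up over the whole tree. The sketch below is an illustration, not the production code: field names are hypothetical, and it assumes the children's outcomes are independent, which real traffic may violate. AND nodes short-circuit on false, OR nodes on true; each internal node keeps whichever order has the lower expected cost.

```python
def plan(node):
    """Return (expected_cost, p_true) and annotate each internal
    node with its cheaper evaluation order."""
    if node["op"] == "leaf":
        return node["avg_cost"], node["p_true"]

    c_l, p_l = plan(node["left"])
    c_r, p_r = plan(node["right"])

    if node["op"] == "and":
        # AND short-circuits on false: the second child runs only
        # when the first child is true.
        left_first  = c_l + p_l * c_r
        right_first = c_r + p_r * c_l
        p_true = p_l * p_r                  # independence assumption
    else:  # "or" short-circuits on true
        left_first  = c_l + (1 - p_l) * c_r
        right_first = c_r + (1 - p_r) * c_l
        p_true = 1 - (1 - p_l) * (1 - p_r)  # independence assumption

    node["order"] = "left_first" if left_first <= right_first else "right_first"
    return min(left_first, right_first), p_true

A = {"op": "leaf", "avg_cost": 1000.0, "p_true": 0.01}
B = {"op": "leaf", "avg_cost": 1.0,    "p_true": 0.50}
root = {"op": "and", "left": A, "right": B}
cost, _ = plan(root)
print(cost, root["order"])   # 501.0 right_first
```

As the talk notes, proving this greedy rule optimal has subtleties; the sketch only shows the mechanics of computing the two averages and picking the smaller.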

Simulation Results

How does this perform in practice? We ran a simulation where we generated the structure of the tree, the outcomes and the costs at random. The x-axis shows the progress of the simulation. The y-axis shows the total cost of the tree evaluation. In the beginning, we focused on learning the statistics of the distribution without taking advantage of it. After that, we started exploiting what we had learned, and relatively soon, the algorithm converged on to the optimal solution.
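A toy version of such a simulation is easy to write. The one below is invented for illustration (the real simulation randomized tree structure as well): it repeatedly evaluates "A and B", spends an initial exploration phase learning each operand's average cost and truth rate under a random order, then exploits the minimum-expected-cost order.

```python
import random

random.seed(42)
TRUE_COST = {"A": 1000.0, "B": 1.0}   # ground truth, hidden from the learner
P_TRUE    = {"A": 0.01,   "B": 0.50}
stats = {k: {"n": 0, "true": 0, "cost": 0.0} for k in "AB"}

def evaluate(name):
    """Evaluate one operand, updating its running statistics."""
    outcome = random.random() < P_TRUE[name]
    s = stats[name]
    s["n"] += 1
    s["true"] += outcome
    s["cost"] += TRUE_COST[name]
    return outcome, TRUE_COST[name]

def expected_cost(order):
    first, second = order
    f, s = stats[first], stats[second]
    return f["cost"] / f["n"] + (f["true"] / f["n"]) * (s["cost"] / s["n"])

ORDERS = [("A", "B"), ("B", "A")]
total = 0.0
for i in range(10_000):
    # First 500 trials: explore with a random order to learn the stats;
    # afterwards: exploit the learned minimum-expected-cost order.
    order = random.choice(ORDERS) if i < 500 else min(ORDERS, key=expected_cost)
    outcome, cost = evaluate(order[0])
    if outcome:                        # AND short-circuits on false
        _, c2 = evaluate(order[1])
        cost += c2
    total += cost

print(total / 10_000)   # well below the ~1000 an always-A-first order averages
```

After the exploration phase the average cost per evaluation settles near the optimal B-first figure, mirroring the convergence the talk describes.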


In summary, optimizing Boolean expression evaluation is important because filters are intensively invoked in our real-time monitoring and alerting system. The context in this case was that the costs and outcomes of the individual terms of the Boolean expression could be learned. Retrofitting this knowledge into the existing system was not onerous. We did need to maintain a few additional counters with each node and a protocol to age out old information. The reward for our pains was a 2x improvement. This is non-trivial compared to the minimal amount of disruption that we had introduced.
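One common way to age out old information (not necessarily the scheme Thalamus uses) is an exponentially weighted moving average: a single stored value per counter, where recent observations dominate and stale statistics fade away.

```python
class EwmaCounter:
    """Exponentially weighted moving average: one float per counter,
    with old observations decaying geometrically."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha        # weight given to the newest observation
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = float(x)
        else:
            self.value = (1 - self.alpha) * self.value + self.alpha * x
        return self.value

avg_cost = EwmaCounter(alpha=0.1)
for c in [100, 100, 100, 10, 10, 10]:
    avg_cost.update(c)
print(round(avg_cost.value, 1))   # drifts from 100 toward 10
```

The same structure works for truth rates: feed it 1 for true and 0 for false, and the value tracks a recency-weighted probability.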

Needle in a Haystack: The Business Context

Let's look at another example, also motivated by Thalamus, the monitoring and alerting system. Here, the underlying packet distribution is effected using Kafka. In an ideal world, each Kafka topic would be a logical unit, for example, the logs from a specific business application. This allows easy horizontal scaling: each application, or topic, would get its own virtual server, whose underlying hardware resources could be scaled independently of the others. However, our IT department, in their infinite wisdom, merged several applications into a single topic, threatening to overwhelm even a well-provisioned server. I don't mean to cast any doubt on the wisdom of our IT department, but this reflects the fact that in large enterprises, one is often optimizing for several things. In this case, they just didn't happen to be optimizing for us.

Outline of Solution: Take Advantage of Context

What we ended up having to do was to build a disaggregator. We needed to undo what the IT department had done; we needed to break the raging torrent into manageable rivulets. This is a rough schematic of what things look like. What the disaggregator needs to do is to route packets with minimal delay. This would have been an impossible task if it weren't for some valuable context that we could leverage. Each packet contains a key-value pair as shown. In this case, if we knew the value of the key application ID was Joe's application, we would route it to the Joe topic, and we'd be done. However, we simply don't have time to JSON decode the packet. What if we could predict the location of the key, so we knew roughly where it exists in the packet? We could grab the value and be off and running. This is where the function strstr came to the rescue. The challenge, given the size of the packets, was whether we could do better than a naïve use of strstr.

In this example, the haystack is a sentence, "the quick brown fox jumped over the lazy dog". The needle is the word fox. Normally, we would have started our search at position zero. Assume that we had three experts who offered the following hints. Expert one tells us to start searching at position 16. Expert two tells us to start searching at position 10. Expert three tells us to start searching at position 19. How does this affect scan length and the cost of the search? The answer is as follows. In the first case, we had no hint at all, so we would start at offset zero and our cost is 16. In the second case, we are told exactly where the word fox starts, so we pay virtually zero cost. If we listen to expert two, we would start at position 10. Not exactly where we'd like to be, but definitely better than starting from zero. Expert three is an interesting case, because here we actually get hammered. We are told to start looking from position 19, which means we would scan until the end of the string and find nothing there. Then we'd start at position zero again, and then we would find fox.
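The four cases above can be reproduced with a small sketch. Cost here is simply the number of characters scanned; on an undershoot we pay the distance from the hint to the needle, and on an overshoot we scan to the end of the string, wrap around, and scan again from position zero.

```python
def hinted_scan_cost(haystack, needle, hint):
    """Characters scanned when the search starts at `hint` instead of 0."""
    pos = haystack.find(needle, hint)
    if pos >= 0:
        return pos - hint                    # undershoot (or exact hit)
    # Overshot: scanned from hint to the end, then restarted at zero.
    pos = haystack.find(needle)
    return (len(haystack) - hint) + pos

s = "the quick brown fox jumped over the lazy dog"   # "fox" starts at 16
for hint in (0, 16, 10, 19):
    print(hint, hinted_scan_cost(s, "fox", hint))
# hint 19 overshoots: cost is (44 - 19) + 16 = 41, worse than no hint at all
```

This is why, as the next sections argue, it pays to deliberately undershoot when guessing where to start.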

How to Build a Predictor

How do you build a predictor? This is made difficult by the fact that you don't have time to examine the packet; all you know is its length. On the bright side, you don't need to be that good. You can be hopelessly wrong quite a few times; you just need to be better, on average, than starting from scratch. What we did was, for every possible packet length, to model the offset, which is the true location of the needle, as a normally distributed random variable. That's not exactly accurate, since offsets cannot be negative but the random variable can be. As the adage goes, all models are wrong, some are useful. Next, we estimated the mean and standard deviation from historical data. Note one key difference between this and the previous example. In that case, learning was done online, in real time. In this case, it is done offline and uploaded to the server periodically as a configuration parameter.
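The offline learning step amounts to grouping historical observations by packet length and computing a mean and standard deviation per group. A minimal sketch, with invented sample data:

```python
import statistics
from collections import defaultdict

# Historical (packet_length, true_offset) pairs; the numbers are made up.
history = [(120, 40), (120, 44), (120, 38),
           (200, 71), (200, 65), (200, 69)]

by_length = defaultdict(list)
for length, offset in history:
    by_length[length].append(offset)

# model[length] = (mu, sigma), shipped to the server as configuration.
model = {
    length: (statistics.mean(offsets), statistics.stdev(offsets))
    for length, offsets in by_length.items()
}
print(model[120])   # (mu, sigma) for 120-byte packets
```

In production, sparse lengths would need some smoothing or bucketing; the sketch ignores that.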

The next thing we do, given that we have mu and sigma, is to start the search from a little bit to the left of the mean. Remember from our previous example, where we were looking for the fox, that it is better to undershoot than to overshoot when guessing the offset. We call the start position y, and write y as mu minus alpha times sigma. We had mu and sigma from the data; the question is, how do we go about finding alpha? To do so, let us model the cost of the search as a fixed cost C plus a variable cost which is proportional to the length of the string that has to be scanned. The cost is C plus beta times L, for some constant beta. We estimate C and beta by performing a number of searches and fitting a straight line to the observed costs. Let S of (x, y) be the cost when x is the true offset and y is where we start the search. The formula for S depends on whether we undershoot or overshoot. If we undershoot, it is C plus beta times (x minus y). If we overshoot, we pay twice. The first cost is C plus beta times (L minus y), which takes us to the end of the string; then we start again from the beginning and pay C plus beta times x. Given that y is some fixed starting position and x is a random variable, we can compute the expected cost of S by integrating over the distribution of x. Lastly, if we minimize the expected cost with respect to y and invoke some high school calculus, we are actually quite lucky to end up with a nice closed form solution for alpha.
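The talk does not show the algebra, but under the stated model one way the calculus can go is the following sketch, with f the normal density of the true offset x:

```latex
\mathbb{E}[S(y)]
  = \int_{y}^{\infty} \bigl(C + \beta(x - y)\bigr)\, f(x)\,dx
  + \int_{-\infty}^{y} \bigl(2C + \beta(L - y) + \beta x\bigr)\, f(x)\,dx

% Differentiating under the integral sign and collecting terms:
\frac{d}{dy}\,\mathbb{E}[S(y)] = (C + \beta L)\, f(y) - \beta = 0
\quad\Longrightarrow\quad
f(y^{*}) = \frac{\beta}{C + \beta L}

% With f the N(\mu, \sigma^2) density, taking the smaller root
% (the undershooting side):
y^{*} = \mu - \sigma \sqrt{\,2 \ln \frac{C + \beta L}{\beta \sigma \sqrt{2\pi}}\,}
\qquad\text{so}\qquad
\alpha = \sqrt{\,2 \ln \frac{C + \beta L}{\beta \sigma \sqrt{2\pi}}\,}
```

This form requires C + beta L to exceed beta sigma sqrt(2 pi) for the logarithm to be positive, which holds whenever the fixed-plus-full-scan cost dwarfs the per-character cost at the scale of one standard deviation.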


Summarizing this example, the idea was to just look for context. We found it in the fact that the location of the needle in the haystack could be learned. Note that we use learning in a very weak sense over here: we don't need to be that accurate, we just need to be better than starting from zero. Learning can be done either online or offline; in the two examples we looked at, we have one of each. The cost of the implementation involved periodic offline calculations of C and beta, and, for every packet length, mu and sigma. The benefits were unexpectedly good: we got an average speed-up of 5x.

Key Takeaway

What's the key takeaway? It is that general purpose algorithms are great. They always work. You don't have to think too much about plugging them in, and they will work just fine. What I do want us to recognize is that generality comes at a cost. Specializing algorithms means taking advantage of something that's particular to your product, to the local context in which the algorithm is going to function. The advantage is that we can get significant speed-up without undue cost, at least in our experience. The cost for us is the cost of implementation and of maintenance. The downside is that it does require us to look for the context, which brings up the question: is the benefit of specialization worth the cost? Unfortunately, the answer, like in many of these questions, is that it all depends. However, one thing that we can state unequivocally is that it is worth some time investigating and finding out whether there is local structure that can be exploited.




Recorded at:

Jun 16, 2021