How big should our thread pool be?
Some time ago a friend pinged me on Skype to ask me some question about a cluster of JVMs running on a 64-way box that was launching 30 odd thousand threads a few times a day. With 300,000 plus threads running, the kernel was spending so much time managing them that the application was completely destabilization. It was quite obvious that this application needed a thread pool so that they could throttle their clients instead of having their clients throttle them.
While the example is an extreme one, it highlights why we use thread pools. But even with a thread pool in place we can still inflict pain on our users by losing data or having transactions fail. Or we might find our application completely destabilize if that pool is either too big or too small. A properly sized thread pool should allow as many requests to run as the hardware and application can comfortably support. In other words, we don’t want requests waiting in a queue if we have the capacity to process them but we also don’t want to launch more requests than we have the capacity to manage. So, just how big should our thread pool be?
If we are to follow the “measure don’t guess” mantra we need to first look at the technology in question and ask what measurements make sense and how can we instrument our system to obtain them. We also need to bring some mathematics to the table. When we consider that a thread pool is just one or more service providers combined with a queue, then we see that this is a system that can be modeled using Little’s law. Let’s dig deeper to see how.
Little’s Law
Little’s law says that the number of requests in a system equals the rate at which they arrive, multiplied by the average amount of time it takes to service an individual request. This law is so common in our every day lives that it’s surprising that it wasn’t proposed until the 1950s and only proven in the 1960s. Here’s an example of one form of Little’s law in action. Have you ever stood in a line and tried to figure out how long you’d be standing in it? You would consider how many people are in line and then have a quick glance at how long it took to service the person at the front of the queue. You would then multiply those two values together, producing an estimate of how long you’d be in line. If instead of looking at the length of the line, you observed how often new people were joining the line and then multiplied that value by the service time, you’d then know the average number of people that are either in the line or being serviced.
There are numerous other similar games you can play with Little’s law that will answer other questions like “how much time on average does a person stand in the queue waiting to be served?” and so on.
Figure 1. Little’s Law
In a similar vein we can use Little’s law to determine thread pool size. All we have to do is measure the rate at which requests arrive and the average amount of time to service them. We can then plug those measurements into Little’s law to calculate the average number of requests in the system. If that number is less than the size of our thread pool then the size of the pool can be reduced accordingly. On the other hand if the number is larger than the size of our thread pool, things get a little more complicated.
In the case where we have more requests waiting than being executed the first thing we need to determine is if there’s enough capacity in the system to support a larger thread pool. To do that we must determine what resource is most likely to limit the application’s ability to scale. In this article we’ll assume that it’s CPU, though be aware that in real life it’s very likely to be something else. The easiest situation is that you have enough capacity to increase the size of the thread pool. If you don’t you’ll have to consider other options, be it tuning your application, adding more hardware, or some combination of both.
A working example
Lets unpack the previous text by considering a typical workflow where a request is taken off of a socket and executed. In the process of execution it needs to get data or other resources, until a result is returned to the client. In our demo a server simulates this workload by executing a uniform unit of work that vacillates between performing some CPU-intensive task and retrieving data from a database or other external data source. For our demo let’s simulate the CPU intensive activity by calculating a number of decimal places for an irrational number such as Pi or the square root of 2. Thread.sleep is used to simulate the call to the external data source. A thread pool will be used to throttle the active number of requests in the server.
To monitor any of the thread pooling choices in java.util.concurrent (j.u.c) we’re going to have to add our own instrumentation. In real life we might add this by instrumenting the j.u.c.ThreadPoolExecutor using aspects, ASM or another byte code instrumentation technique. For the purposes of our example we’ll avoid those topics by hand-rolling an extension of j.u.c.ThreadPoolExecutor to include the necessary monitoring.
public class InstrumentedThreadPoolExecutor extends ThreadPoolExecutor {
// Keep track of all of the request times
private final ConcurrentHashMap<Runnable, Long> timeOfRequest =
new ConcurrentHashMap<>();
private final ThreadLocal<Long> startTime = new ThreadLocal<Long>();
private long lastArrivalTime;
// other variables are AtomicLongs and AtomicIntegers@Override
protected void beforeExecute(Thread worker, Runnable task) {
super.beforeExecute(worker, task);
startTime.set(System.nanoTime());
}@Override
protected void afterExecute(Runnable task, Throwable t) {
try {
totalServiceTime.addAndGet(System.nanoTime() - startTime.get());
totalPoolTime.addAndGet(startTime.get() - timeOfRequest.remove(task));
numberOfRequestsRetired.incrementAndGet();
} finally {
super.afterExecute(task, t);
}
}@Override
public void execute(Runnable task) {
long now = System.nanoTime();numberOfRequests.incrementAndGet();
synchronized (this) {
if (lastArrivalTime != 0L) {
aggregateInterRequestArrivalTime.addAndGet(now - lastArrivalTime);
}
lastArrivalTime = now;
timeOfRequest.put(task, now);
}
super.execute(task);
}
}
Listing 1. The essentials bit of InstrumentedThreadPoolExecutor
This listing contains the non-trivial portions of the instrumentation code, which our server will use in place of ThreadPoolExecutor. We’ll collect the data by overriding the three key Executor methods: beforeExecute, execute and afterExecute, and then expose that data with an MXBean. Let’s look at how each of those methods work.
The executor’s execute method is used to pass a request into the Executor. Overriding that method allows us to collect all the initial timings. We can also use this opportunity to track the interval between requests. The calculation involves several steps with each thread resetting the lastArrivalTime. Since we are sharing state, that activity needs to be synchronized. Finally we delegate execution to the super Executor class.
As the name suggests, the method executeBefore gets executed just before the request is executed. Up to this point all of the time that a request accumulates is just the time waiting in the pool. This is often referred to as “dead time”. The time after this method ends and before afterExecute gets called will be considered “service time” which we can use for Little’s law. We can store the start time on a thread local. The afterExecute method will complete the calculations for “time in the pool”, “time servicing the request", and registering that the “request has been retired”.
We will also need an MXBean to report on the performance counters collected by the InstrumentedThreadPoolExecutor. That will be the job of MXBean ExecutorServiceMonitor. (See listing 2).
public class ExecutorServiceMonitor
implements ExecutorServiceMonitorMXBean {public double getRequestPerSecondRetirementRate() {
return (double) getNumberOfRequestsRetired() /
fromNanoToSeconds(threadPool.getAggregateInterRequestArrivalTime());
}public double getAverageServiceTime() {
return fromNanoToSeconds(threadPool.getTotalServiceTime()) /
(double)getNumberOfRequestsRetired();
}public double getAverageTimeWaitingInPool() {
return fromNanoToSeconds(this.threadPool.getTotalPoolTime()) /
(double) this.getNumberOfRequestsRetired();
}public double getAverageResponseTime() {
return this.getAverageServiceTime() +
this.getAverageTimeWaitingInPool();
}public double getEstimatedAverageNumberOfActiveRequests() {
return getRequestPerSecondRetirementRate() * (getAverageServiceTime() +
getAverageTimeWaitingInPool());
}public double getRatioOfDeadTimeToResponseTime() {
double poolTime = (double) this.threadPool.getTotalPoolTime();
return poolTime /
(poolTime + (double)threadPool.getTotalServiceTime());
}public double v() {
return getEstimatedAverageNumberOfActiveRequests() /
(double) Runtime.getRuntime().availableProcessors();
}
}
Listing 2. ExecutorServiceMonitor, the interesting bits
In listing 2 we can see the non-trivial methods that will provide us with an answer to the question, how is the queue being used? Note that the method getEstimatedAverageNumberOfActiveRequests() is our implementation of Little’s law. In this method the retirement rate, or the observed rate at which requests are leaving the system is multiplied by the average time it takes to service a request yielding the average number of requests in the system. The other methods of interest are getRatioOfDeadTimeToResponseTime() and getRatioOfDeadTimeToResponseTime(). Lets run some experiments and see how these numbers relate to each other and to CPU utilization.
Experiment 1
The first experiment is intentionally trivial to provide a sense of how it’s working. Settings for the trivial experiment are: set “server thread pool size” to 1 and have a single client repeatedly makes the request described above for 30 seconds.
(Click on the image to enlarge it)
Figure 2 Results from 1 client, 1 server thread
The graphics in Figure 2 were obtained by taking a screenshot of the VisualVM MBean view and a graphical CPU utilization monitor. VisualVM (http://visualvm.java.net/) is an open source project built to house performance monitoring and profiling tools. Use of this tool is outside of the scope of this article. Briefly, the MBean view is an optional plugin that gives you access to all JMX MBeans that are registered with the PlatformMBeansServer.
Returning to the experiment, the thread pool size was one. Given the looping behavior of the client we might expect the average number of active requests to be one. However, it does take the client some time to reissue the request, so the value should be slightly less than 1, and 0.98 falls inline with that expectation.
The next value, RatioOfDeadTimeToResponseTime sits at just over 0.1%. Given that there is only ever one active request and it matches the thread pool size, one would expect this value to be 0. However it is likely that the non-zero result is due to errors introduced by timer placement and timer precision. The value is fairly small so we can safely ignore it. CPU utilization tells us that less than 1 core is being used meaning we’ve got loads of headroom to handle more requests.
Experiment 2
In the second experiment we leave the thread pool size at one while the number of concurrent client requests is bumped up to 10. As expected, CPU utilization didn’t change since there is still only one thread working at a time. However, we were able to retire some more requests, most likely since the active number of requests was 10, allowing the client to not have to wait for the client to reissue. From that we can conclude that one request was always active and our queue length was nine. What is more telling is that the ratio of dead time to response time is sitting at close to 90%. What this number is saying is that 90% of the total response time seen by the client can be attributed to time spent waiting for a thread. If we want to reduce latency, the elephant in the room is dead time and since we have loads of spare CPU it’s safe to increase the size of the pool to allow requests to get to it.
(Click on the image to enlarge it)
Figure 3, Results from 10 client requests, 1 server thread
Experiment 3
Since our core count is four lets run the same experiment but with the thread pool size set to four.
(Click on the image to enlarge it)
Figure 4. Results from 10 client requests, 4 server threads
Once again we can see that the average number of requests is 10. The difference this time is that number of request retired has jumped to just over 2000. There is a corresponding increase in CPU saturation but we’re still not at 100%. Dead time is still a healthy 60% of total response time, which means there is still some room improvement. Can we soak up the extra CPU by upping thread pool size again?
Experiment 4
In this final run the pool has been configured to eight threads. Looking at the results we can see that just as expected, the average number of requests is just under 10, dead time now clocks in at just under 19%, and the number of requests retired has again had a healthy jump, to 3621. However with CPU utilization close to 100% it looks as we’re close to the end of the improvements we can achieve under these load conditions. It also suggests that eight is an ideal pool size. Another conclusion we can make is that if we need to eliminate more dead time the only way to do it is to add more CPU or improve the CPU efficiency of our application.
(Click on the image to enlarge it)
Figure 5. Results from 10 client requests, 8 server threads
Theory meets reality
One of the varied criticisms that could be made about this experiment is that it’s over simplified; large applications rarely have only one thread or one connection pool. In fact, many applications will have one more pool for each communication technology used. For example, your application may have HTTP requests handled by a servlet, along with requests from JMS, and possibly a JDBC connection pool. In this case you will have a thread pool for the servlet engine. You will also have a connection pool for the JMS and JDBC components. One thing to do in such a case would be to treat them all as a single thread pool. This implies that we’d aggregate the arrival rates along with the service times. By studying the system using Little’s law as an aggregate we can determine the total number of threads. The next step is to partition that number between the various thread groups. The partitioning logic could use one of a number of techniques, which would certainly be influenced by your performance requirements. One technique would be to balance thread pool size based on individual demand for hardware. However it might be important to give priority to web clients over requests coming in via JMS and therefore you might balance the pools in that direction. Of course as you rebalance you’ll have to adjust the size of each pool to take into account the difference in hardware demanded.
Another consideration is that this formulation focuses on the average number of requests in the system. For many reasons you may want to know the 90th percentile queue length. Using this value will give you more headroom to deal with natural fluctuations in arrival rates. Getting to this number is a lot more complex and it’s probably just as effective to add a 20% buffer on top to the average. When you do this just make sure that you have enough capacity to handle that many threads. For example, since in our examples eight threads maxed out the CPU, it is likely that adding more threads would start to degrade performance. Finally, the kernel is a lot more efficient at managing threads than a thread pool is. So while we guessed that adding more than eight threads may not be beneficial, a measurement might tell us that we do indeed have more headroom in our system that could be potentially soaked up. In other words, try running beyond capacity (i.e. stress test) to see how it affects end user response times and throughputs.
Conclusion
Getting a proper configuration for a thread pool may not be easy but it’s also not rocket science. The mathematics behind the problem are well understood and are fairly intuitive in that we meet them all the time in our every day lives. What is lacking are the measurements (as witness by j.u.c.ExecutorService) needed to make a reasonable choice. Getting to a proper setting will be a bit fiddly as this is more like bucket chemistry than a precise science but spending a little bit of time fiddling can save you the headache of having to deal with a system that has been destabilized by higher than expected workloads.
About the Author
Kirk Pepperdine works as an independent consultant specializing in the field of Java performance tuning. In addition he has authored a Performance tuning seminar that has been well received world wide. That seminar presents a performance tuning methodology that has been used to improve teams effectiveness in troubleshooting Java performance issues. Named a Java Champion in 2006, Kirk has written about performance for many publications and has spoken about performance at many conferences including Devoxx and JavaONE. He helped found the Java Performance Tuning, a site well known as a resource for performance tuning information.