## How big should our thread pool be?

Some time ago a friend pinged me on Skype to ask a question about a cluster of JVMs running on a 64-way box that was launching 30-odd thousand threads a few times a day. With 300,000-plus threads running, the kernel was spending so much time managing them that the application was completely destabilized. It was quite obvious that this application needed a thread pool so that it could throttle its clients instead of having its clients throttle it.

While the example is an extreme one, it highlights why we use thread pools. But even with a thread pool in place we can still inflict pain on our users by losing data or having transactions fail. Or we might find our application completely destabilized if the pool is either too big or too small. A properly sized thread pool should allow as many requests to run as the hardware and application can comfortably support. In other words, we don’t want requests waiting in a queue when we have the capacity to process them, but we also don’t want to launch more requests than we have the capacity to manage. So, just how big should our thread pool be?

If we are to follow the “measure don’t guess” mantra, we need to first look at the technology in question and ask what measurements make sense and how we can instrument our system to obtain them. We also need to bring some mathematics to the table. When we consider that a thread pool is just one or more service providers combined with a queue, we see that this is a system that can be modeled using Little’s law. Let’s dig deeper to see how.

## Little’s Law

Little’s law says that the number of requests in a system equals the rate at which they arrive, multiplied by the average amount of time it takes to service an individual request. This law is so common in our everyday lives that it’s surprising that it wasn’t proposed until the 1950s and only proven in the 1960s. Here’s an example of one form of Little’s law in action. Have you ever stood in a line and tried to figure out how long you’d be standing in it? You would consider how many people are in line and then have a quick glance at how long it took to service the person at the front of the queue. You would then multiply those two values together, producing an estimate of how long you’d be in line. If instead of looking at the length of the line, you observed how often new people were joining the line and then multiplied that value by the service time, you’d then know the average number of people that are either in the line or being serviced.

There are numerous other similar games you can play with Little’s law that will answer other questions like “how much time on average does a person stand in the queue waiting to be served?” and so on.

Figure 1. Little’s Law

In a similar vein we can use Little’s law to determine thread pool size. All we have to do is measure the rate at which requests arrive and the average amount of time to service them. We can then plug those measurements into Little’s law to calculate the average number of requests in the system. If that number is less than the size of our thread pool then the size of the pool can be reduced accordingly. On the other hand if the number is larger than the size of our thread pool, things get a little more complicated.
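To make the arithmetic concrete, here is a minimal sketch of the Little’s law calculation. The arrival rate and service time below are made-up numbers for illustration, not measurements from the experiments later in this article:

```java
public class LittlesLaw {

    // Little's law: L = lambda * W, the average number of requests in the system,
    // given the arrival rate (lambda) and the average service time (W).
    static double averageRequestsInSystem(double arrivalRatePerSecond,
                                          double averageServiceTimeSeconds) {
        return arrivalRatePerSecond * averageServiceTimeSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical measurements: 50 requests/second arriving,
        // 100 ms average service time.
        System.out.println(averageRequestsInSystem(50.0, 0.100)); // prints 5.0
    }
}
```

If the system averages five in-flight requests, a pool of five threads would, on average, keep up without requests queuing.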

In the case where we have more requests waiting than being executed the first thing we need to determine is if there’s enough capacity in the system to support a larger thread pool. To do that we must determine what resource is most likely to limit the application’s ability to scale. In this article we’ll assume that it’s CPU, though be aware that in real life it’s very likely to be something else. The easiest situation is that you have enough capacity to increase the size of the thread pool. If you don’t you’ll have to consider other options, be it tuning your application, adding more hardware, or some combination of both.

## A working example

Let’s unpack the previous text by considering a typical workflow where a request is taken off of a socket and executed. In the course of execution it gets data or other resources as needed, until a result is returned to the client. In our demo a server simulates this workload by executing a uniform unit of work that alternates between performing some CPU-intensive task and retrieving data from a database or other external data source. For our demo let’s simulate the CPU-intensive activity by calculating a number of decimal places of an irrational number such as Pi or the square root of 2. Thread.sleep is used to simulate the call to the external data source. A thread pool will be used to throttle the number of active requests in the server.
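A rough sketch of such a simulated request might look like the following. The digit count and sleep duration are arbitrary illustrative values (the article does not give the ones used in the experiments), and BigDecimal.sqrt requires Java 9 or later:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class SimulatedRequest implements Runnable {

    @Override
    public void run() {
        cpuIntensivePart();
        try {
            Thread.sleep(50); // stand-in for a call to an external data source
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Burn CPU by computing the square root of 2 to many decimal places.
    static BigDecimal cpuIntensivePart() {
        return new BigDecimal(2).sqrt(new MathContext(1_000));
    }
}
```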

To monitor any of the thread pooling choices in java.util.concurrent (j.u.c) we’re going to have to add our own instrumentation. In real life we might add this by instrumenting the j.u.c.ThreadPoolExecutor using aspects, ASM or another byte code instrumentation technique. For the purposes of our example we’ll avoid those topics by hand-rolling an extension of j.u.c.ThreadPoolExecutor to include the necessary monitoring.

```java
public class InstrumentedThreadPoolExecutor extends ThreadPoolExecutor {

  // Keep track of all of the request times
  private final ConcurrentHashMap<Runnable, Long> timeOfRequest =
      new ConcurrentHashMap<>();
  private final ThreadLocal<Long> startTime = new ThreadLocal<Long>();
  private long lastArrivalTime;
  // other variables are AtomicLongs and AtomicIntegers

  @Override
  protected void beforeExecute(Thread worker, Runnable task) {
    super.beforeExecute(worker, task);
    startTime.set(System.nanoTime());
  }

  @Override
  protected void afterExecute(Runnable task, Throwable t) {
    try {
      totalServiceTime.addAndGet(System.nanoTime() - startTime.get());
      totalPoolTime.addAndGet(startTime.get() - timeOfRequest.remove(task));
      numberOfRequestsRetired.incrementAndGet();
    } finally {
      super.afterExecute(task, t);
    }
  }

  @Override
  public void execute(Runnable task) {
    long now = System.nanoTime();
    numberOfRequests.incrementAndGet();
    synchronized (this) {
      if (lastArrivalTime != 0L) {
        aggregateInterRequestArrivalTime.addAndGet(now - lastArrivalTime);
      }
      lastArrivalTime = now;
      timeOfRequest.put(task, now);
    }
    super.execute(task);
  }
}
```

Listing 1. The essential bits of InstrumentedThreadPoolExecutor

This listing contains the non-trivial portions of the instrumentation code, which our server will use in place of ThreadPoolExecutor. We’ll collect the data by overriding the three key ThreadPoolExecutor methods: beforeExecute, execute and afterExecute, and then expose that data with an MXBean. Let’s look at how each of those methods works.

The executor’s execute method is used to pass a request into the Executor. Overriding that method allows us to collect all the initial timings. We can also use this opportunity to track the interval between requests. The calculation involves several steps with each thread resetting the lastArrivalTime. Since we are sharing state, that activity needs to be synchronized. Finally we delegate execution to the super Executor class.

As its name suggests, the beforeExecute method runs just before the request is executed. Up to this point, all of the time a request has accumulated is time spent waiting in the pool. This is often referred to as “dead time”. The time between the end of this method and the call to afterExecute will be considered “service time”, which we can use in Little’s law. We store the start time in a thread local. The afterExecute method completes the calculations for “time in the pool” and “time servicing the request”, and registers that the request has been retired.

We will also need an MXBean to report on the performance counters collected by the InstrumentedThreadPoolExecutor. That will be the job of MXBean ExecutorServiceMonitor. (See listing 2).
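Registration with the platform MBean server is what makes those counters visible over JMX. The sketch below uses a trivial stand-in MXBean rather than the article’s ExecutorServiceMonitor; the DemoMonitor names and the ObjectName string are arbitrary choices for illustration:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MonitorRegistration {

    // Stand-in for the article's ExecutorServiceMonitorMXBean interface.
    public interface DemoMonitorMXBean {
        double getAverageServiceTime();
    }

    public static class DemoMonitor implements DemoMonitorMXBean {
        public double getAverageServiceTime() { return 0.0; }
    }

    public static void main(String[] args) throws Exception {
        // Registering with the platform MBean server makes the bean
        // visible in VisualVM's MBean view over a JMX connection.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name =
            new ObjectName("com.example:type=ExecutorServiceMonitor");
        server.registerMBean(new DemoMonitor(), name);
    }
}
```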

```java
public class ExecutorServiceMonitor
    implements ExecutorServiceMonitorMXBean {

  public double getRequestPerSecondRetirementRate() {
    return (double) getNumberOfRequestsRetired() /
        fromNanoToSeconds(threadPool.getAggregateInterRequestArrivalTime());
  }

  public double getAverageServiceTime() {
    return fromNanoToSeconds(threadPool.getTotalServiceTime()) /
        (double) getNumberOfRequestsRetired();
  }

  public double getAverageTimeWaitingInPool() {
    return fromNanoToSeconds(this.threadPool.getTotalPoolTime()) /
        (double) this.getNumberOfRequestsRetired();
  }

  public double getAverageResponseTime() {
    return this.getAverageServiceTime() +
        this.getAverageTimeWaitingInPool();
  }

  public double getEstimatedAverageNumberOfActiveRequests() {
    return getRequestPerSecondRetirementRate() * (getAverageServiceTime() +
        getAverageTimeWaitingInPool());
  }

  public double getRatioOfDeadTimeToResponseTime() {
    double poolTime = (double) this.threadPool.getTotalPoolTime();
    return poolTime /
        (poolTime + (double) threadPool.getTotalServiceTime());
  }

  public double v() {
    return getEstimatedAverageNumberOfActiveRequests() /
        (double) Runtime.getRuntime().availableProcessors();
  }
}
```

Listing 2. ExecutorServiceMonitor, the interesting bits

In listing 2 we can see the non-trivial methods that will provide us with an answer to the question: how is the queue being used? Note that the method getEstimatedAverageNumberOfActiveRequests() is our implementation of Little’s law. In this method the retirement rate, the observed rate at which requests leave the system, is multiplied by the average time it takes to service a request, yielding the average number of requests in the system. The other methods of interest are getRatioOfDeadTimeToResponseTime() and v(), the average number of active requests per available core. Let’s run some experiments and see how these numbers relate to each other and to CPU utilization.

## Experiment 1

The first experiment is intentionally trivial, to provide a sense of how this works. The settings for the trivial experiment are: set the server thread pool size to 1, and have a single client repeatedly make the request described above for 30 seconds.


Figure 2. Results from 1 client, 1 server thread

The graphics in Figure 2 were obtained by taking a screenshot of the VisualVM MBean view and a graphical CPU utilization monitor. VisualVM (http://visualvm.java.net/) is an open source project built to house performance monitoring and profiling tools. Use of this tool is outside the scope of this article. Briefly, the MBean view is an optional plugin that gives you access to all JMX MBeans registered with the PlatformMBeanServer.

Returning to the experiment, the thread pool size was one. Given the looping behavior of the client we might expect the average number of active requests to be one. However, it does take the client some time to reissue each request, so the value should be slightly less than 1, and the observed 0.98 falls in line with that expectation.

The next value, RatioOfDeadTimeToResponseTime, sits at just over 0.1%. Given that there is only ever one active request, matching the thread pool size, one would expect this value to be 0. The non-zero result is likely due to errors introduced by timer placement and timer precision. The value is small enough that we can safely ignore it. CPU utilization tells us that less than one core is being used, meaning we have plenty of headroom to handle more requests.

## Experiment 2

In the second experiment we leave the thread pool size at one while the number of concurrent client requests is bumped up to 10. As expected, CPU utilization didn’t change since there is still only one thread working at a time. However, we were able to retire a few more requests, most likely because with 10 requests active in the system, the server never had to wait for a client to reissue a request. From that we can conclude that one request was always active and our queue length was nine. What is more telling is that the ratio of dead time to response time is sitting at close to 90%. What this number says is that 90% of the total response time seen by the client can be attributed to time spent waiting for a thread. If we want to reduce latency, the elephant in the room is dead time, and since we have loads of spare CPU it’s safe to increase the size of the pool to allow more requests to reach the CPU.
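That 90% figure can be sanity-checked with some back-of-the-envelope arithmetic. With roughly 10 requests in the system and a single worker thread, each request spends about nine service times waiting for every one being served; this is a simplifying approximation that ignores client think time, not a value taken from the experiment’s counters:

```java
public class DeadTimeRatio {

    // With n requests in the system and one worker thread, a request waits
    // roughly (n - 1) service times and is served for one, so dead time is
    // (n - 1) / n of the response time.
    static double ratioOfDeadTimeToResponseTime(int requestsInSystem) {
        double waiting = requestsInSystem - 1; // time spent in the pool
        double total = requestsInSystem;       // waiting plus service
        return waiting / total;
    }

    public static void main(String[] args) {
        System.out.println(ratioOfDeadTimeToResponseTime(10)); // prints 0.9
    }
}
```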


Figure 3. Results from 10 client requests, 1 server thread

## Experiment 3

Since our core count is four, let’s run the same experiment with the thread pool size set to four.


Figure 4. Results from 10 client requests, 4 server threads

Once again we can see that the average number of requests is 10. The difference this time is that the number of requests retired has jumped to just over 2,000. There is a corresponding increase in CPU saturation, but we’re still not at 100%. Dead time is still a substantial 60% of total response time, which means there is still some room for improvement. Can we soak up the extra CPU by upping the thread pool size again?

## Experiment 4

In this final run the pool has been configured with eight threads. Looking at the results we can see that, just as expected, the average number of requests is just under 10, dead time now clocks in at just under 19%, and the number of requests retired has again taken a healthy jump, to 3,621. However, with CPU utilization close to 100% it looks as if we’re near the end of the improvements we can achieve under these load conditions. It also suggests that eight is an ideal pool size. Another conclusion we can draw is that if we need to eliminate more dead time, the only way to do it is to add more CPUs or improve the CPU efficiency of our application.


Figure 5. Results from 10 client requests, 8 server threads

## Conclusion

Getting a proper configuration for a thread pool may not be easy, but it’s also not rocket science. The mathematics behind the problem is well understood and fairly intuitive, in that we meet it all the time in our everyday lives. What is lacking are the measurements (as witnessed by j.u.c.ExecutorService) needed to make a reasonable choice. Getting to a proper setting will be a bit fiddly, as this is more like bucket chemistry than a precise science, but spending a little bit of time fiddling can save you the headache of having to deal with a system destabilized by higher-than-expected workloads.

Kirk Pepperdine works as an independent consultant specializing in Java performance tuning. He has also authored a performance tuning seminar that has been well received worldwide. The seminar presents a performance tuning methodology that has been used to improve teams’ effectiveness in troubleshooting Java performance issues. Named a Java Champion in 2006, Kirk has written about performance for many publications and has spoken about performance at many conferences, including Devoxx and JavaOne. He helped found the Java Performance Tuning website, a well-known resource for performance tuning information.



• ##### Great article

by Bela Ban,


Hi Kirk,

Great article, really helpful to see thread pool sizing in action, and the effect of the size on performance!
Good input into my own (ergonomic) thread pool sizing in JGroups, which is a bit more complex in that some threads can be executed immediately, whereas others have to be executed FIFO per sender. In the latter case, even if we use Little's law to determine that there are 5 requests in the pool at any time, if those 5 are targeted to the same cluster node, then the max size of 5 would be wasted as only 1 thread could execute the 5 requests (sequentially).
Cheers, hope to see you at Red Hat Summit 2013!
Bela

• ##### Article is too good but reference code is missing

by Harcharanjeet Butalia,


The article is very good but the reference code is missing. Is the code available for download?

• ##### Re: Great article

by Kirk Pepperdine,


Hi Bela,

First, this article is intended to be a handwaving introduction, so it's far from complete in describing how one might address situations such as the one you've described. But I see that my handwaving description may have confused things, in that Little's law (as it's being used here) doesn't tell you how many requests are in the queue but instead how many requests are in the system. So, if I can decode what you're saying, your constraint is not hardware based but instead comes from a specific requirement that translates to one of: these 5 things are actually 1 request and should be treated as such, or my thread pools should be of size 1. In either case Little still applies. It's the law! ;-)

• ##### Re: Article is too good but reference code is missing

by Kirk Pepperdine,


Indeed, only the interesting bits of the code have been published. I will put the extended Executor on GitHub after I get the proper supporting artifacts around it.

• ##### Re: Article is too good but reference code is missing

by 徐 庆新,


Hi Kirk,

The article's great. As for the source code, have you had a chance to put the code on GitHub? Or where can I get it?

• ##### Little's law


Hi Kirk,

When you mentioned that

"Little’s law says that the number of requests in a system equals the rate at which they arrive, multiplied by the average amount of time it takes to service an individual request"

you mean that it is the total time spent in the system, I think.

Residence time (R) = W (wait time) + S (service time)

Para. 2 on page 387 (seems to be an extract) in homes.cs.bilkent.edu.tr/~tugrul/CS518/Papers/li... seems to suggest that Little does not consider service time.

Mohan

• ##### Re: Little's law

by oussama zoghlami,


Hi Kirk,

Great article. But I have some questions:

1/ In the case of JBoss EJB pool tuning, can I use a simple EJB interceptor instead of a custom thread pool (the problem being that an interceptor does not offer the before/afterExecute methods)?

2/ If I have multiple pools (in my case HTTP, EJB and JDBC pools), is it a good idea to use the same pool size for all of them?

• ##### What is this MXBean that I cannot see by JMXing to my tomcat?


What is this MXBean that I cannot see in my tomcat JMX connection?

• ##### code reference

by yao liyao,


Hi,
The article is very useful, but I need the reference code to run it. Can you provide a GitHub URL?
