Improve Your Node.js App Throughput One Micro-optimization at a Time
- Try to minimize the number of syscalls by grouping / batching writes.
- Consider the overhead of issuing and clearing the different timers in your application.
- CPU profilers give you useful information but won't tell you the whole story.
- Control your dependency tree and benchmark your dependencies.
In order to improve the performance of an application that involves IO, you should understand how your CPU cycles are spent and, more importantly, what is preventing higher degrees of parallelism in your application.
While focusing on improving the overall performance of the DataStax Node.js driver for Apache Cassandra, I've gained some insights that I share in this article, trying to summarize the most significant areas that could cause throughput degradation in your application.
- A runtime profiler that tracks how much time is spent running each part of the code and identifies code that could be worth optimizing.
- An optimizing compiler that attempts to optimize the previously identified code.
When the assumptions made by the optimizing compiler turn out to be too optimistic, V8 supports deoptimization (deopt), falling back to the non-optimized code.
You can use this ticket from Google Chrome DevTools team as a guide for patterns which will cause the code to not be optimized by V8, with possible workarounds. Some examples include:
- Functions with try-catch statements.
- Reassigning an argument value while using the `arguments` object.
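As an illustration of the second pattern, here is a minimal sketch. The function names are hypothetical, and whether a given V8 version actually refuses to optimize the first function depends on the engine version, so treat it as an example of the pattern rather than a guaranteed deopt.

```javascript
// Pattern reported to prevent optimization (in sloppy mode): the function
// reassigns a parameter while its body also uses `arguments`.
function sumWithDefault(a, b) {
  if (arguments.length < 2) {
    b = 5; // reassigns the parameter
  }
  return a + b;
}

// Possible workaround: leave the parameter untouched and use a local variable.
function sumWithDefaultFixed(a, b) {
  let second = b;
  if (arguments.length < 2) {
    second = 5;
  }
  return a + second;
}
```

Both functions return the same results; only the way the default value is applied differs.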
Even though the optimizing compiler will make your code run significantly faster, in an IO-intensive application most of the performance improvements revolve around how to reorder the instructions and use less expensive calls to allow more operations per second, as we will see in the following sections.
To discover optimizations that could impact the largest number of users, it’s important to define benchmarks using workloads for the common execution paths, simulating real-world usage.
Start by measuring the throughput and latencies of your API entry points. If you need more detailed information, you can also benchmark the performance of individual internal methods. Use process.hrtime() to get a high-resolution time reading and to compute the duration of an execution.
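A minimal sketch of such a measurement could look like the following. The `measure` helper and the JSON workload are hypothetical and only stand in for your own API entry point.

```javascript
'use strict';

// Hypothetical helper: measures how long `iterations` calls of `fn` take.
function measure(fn, iterations) {
  const start = process.hrtime();
  for (let i = 0; i < iterations; i++) {
    fn(i);
  }
  // Passing a previous reading returns the elapsed time as [seconds, nanoseconds].
  const diff = process.hrtime(start);
  const elapsedMs = diff[0] * 1e3 + diff[1] / 1e6;
  return {
    elapsedMs: elapsedMs,
    opsPerSecond: iterations / (elapsedMs / 1000)
  };
}

// Example workload (an assumption for illustration): serialize a small object.
const result = measure(() => JSON.stringify({ id: 1, name: 'test' }), 100000);
console.log(`${result.elapsedMs.toFixed(2)} ms, ${Math.round(result.opsPerSecond)} ops/sec`);
```

For asynchronous entry points, you would take the second reading inside the completion callback instead of right after the loop.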
You should try to create some limited but realistic benchmarks as early as possible in the project lifetime. Start small by measuring throughput of a method call and later you can add more comprehensive information like the latency distribution.
There are several CPU profilers, and Node.js provides one out-of-the-box that is good enough for most cases. The built-in Node.js profiler takes advantage of the profiler inside V8, sampling the stack at regular intervals during the execution. You can generate a V8 tick file by running node with the --prof flag, and then process that file with the --prof-process flag:
$ node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt
Opening the processed text file in an editor will give you information divided into sections.
Look for the "Summary" section in the file, which will look something like this:
ticks  total  nonlib  name
23548  48.3%  53.5%   C++
805    1.7%   1.8%    GC
4774   9.8%           Shared libraries
356    0.7%           Unaccounted
There is an additional section in the processed file of the profiling session, [Bottom up (heavy) profile], that is especially useful. It provides information about the primary callers of each function, in a tree-like structure. Take the following snippet for example:
223 32% LazyCompile: *function1 lib/file1.js:223:20
  221 99% LazyCompile: ~function2 lib/file2.js:70:57
    221 100% LazyCompile: *function3 /lib/file3.js:58:74
The percentage shows the share of a particular caller in the total amount of its parent calls. An asterisk before a function name means that the time is being spent in an optimized function, while a tilde means the function is not optimized.
In the example, 99% of the calls to function1 were made by function2, and function3 is responsible for 100% of the calls to function2, according to the profiling sample.
CPU profiling sessions and flame graphs are useful tools to understand what is in the stack most of the time and which methods are spending CPU time, in order to spot low-hanging fruit. But it's important to understand that they won't tell you the whole story: you could be preventing higher degrees of parallelism in your application, and asynchronous IO operations can make that hard to identify.
Libuv exposes a platform-independent API that Node.js uses to perform non-blocking IO. Your application IO (sockets, file system, ...) ultimately translates into system calls.
There is a significant cost in scheduling those system calls. You should try to minimize the number of syscalls by grouping / batching writes.
When using a socket or a file stream, instead of issuing a write every time, you can buffer and flush the data from time to time.
You can use a write queue to process and group your writes. The logic for a write queue implementation should be something like:
- While there are items to write and we are within the window size:
  - Push the buffer to the “to-write list”.
- Concatenate all the buffers in the list and write the result to the wire.
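The steps above can be sketched as follows. This is a simplified illustration, not the driver's implementation: the `WriteQueue` class, its `windowSize` parameter and the use of setImmediate() to defer the flush are all assumptions, and a production version would also need to handle backpressure and write errors more carefully.

```javascript
'use strict';

// Hypothetical write queue: groups queued buffers and issues one write per batch.
class WriteQueue {
  constructor(stream, windowSize) {
    this.stream = stream;         // any object exposing write(buffer, callback)
    this.windowSize = windowSize; // maximum number of bytes per grouped write
    this.queue = [];
    this.scheduled = false;
  }

  push(buffer, callback) {
    this.queue.push({ buffer: buffer, callback: callback });
    if (!this.scheduled) {
      this.scheduled = true;
      // Defer the flush so that writes queued in the same tick are grouped.
      setImmediate(() => this.flush());
    }
  }

  flush() {
    this.scheduled = false;
    while (this.queue.length > 0) {
      const buffers = [];
      const callbacks = [];
      let totalLength = 0;
      // While there are items to write and we are within the window size.
      // A single item larger than the window is still written on its own.
      while (this.queue.length > 0 &&
        (buffers.length === 0 ||
          totalLength + this.queue[0].buffer.length <= this.windowSize)) {
        const item = this.queue.shift();
        buffers.push(item.buffer);
        callbacks.push(item.callback);
        totalLength += item.buffer.length;
      }
      // Concatenate all the buffers in the list and write them in a single call.
      this.stream.write(Buffer.concat(buffers, totalLength), (err) => {
        callbacks.forEach((cb) => cb && cb(err));
      });
    }
  }
}
```

With a net.Socket as the stream, several `queue.push(buffer, callback)` calls made in the same tick would result in one `socket.write()` call per window instead of one per buffer.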
You can define a window size based either on the total buffer length or the amount of time that passed since the first item was queued. Defining a window size is a tradeoff between the latency of a single write and the average latency of all writes. You should also consider the maximum number of write requests to be grouped and the overhead of generating each write request.
You would generally want to flush writes of buffers in the order of kilobytes. We found a sweet spot around 8 kilobytes, but your mileage may vary. You can check out the implementation in the client driver for a complete implementation of a write queue.
Grouping or batching writes will translate into higher throughput thanks to fewer system calls.
It's likely that a large number of timeouts are scheduled at any given time in a Node.js application.
Similar to other hashed wheel timers, Node.js uses a hash table and a linked list to maintain the timer instances. But unlike other wheel timers, instead of having a fixed-length hash table, it keys each list of timers by duration.
When the key exists (a timer with the same duration exists), the timer is appended to the bucket as an O(1) operation.
When a key does not exist, a bucket is created and the timer is appended to it.
With that in mind, you have to make sure you reuse the existing buckets, trying to avoid removing a whole bucket and creating a new one. For example, if you are using sliding delays you should create the new timeout (setTimeout()) before removing (clearTimeout()) the previous one.
In our case, by scheduling the idle timeout (heartbeat) before removing the previous one, we make sure the scheduling and descheduling of idle timeouts are O(1) operations.
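A minimal sketch of that ordering, assuming a hypothetical `IdleTimer` helper that is reset on every activity of a connection:

```javascript
'use strict';

// Hypothetical helper that keeps a sliding idle (heartbeat) timeout.
class IdleTimer {
  constructor(delayMs, onIdle) {
    this.delayMs = delayMs;
    this.onIdle = onIdle;
    this.timeout = null;
  }

  // Call this whenever there is activity on the connection.
  reset() {
    const previous = this.timeout;
    // Schedule the new timeout first, while the bucket for this duration
    // still holds the previous timer...
    this.timeout = setTimeout(this.onIdle, this.delayMs);
    // ...and only then remove the previous one.
    if (previous !== null) {
      clearTimeout(previous);
    }
  }

  stop() {
    if (this.timeout !== null) {
      clearTimeout(this.timeout);
      this.timeout = null;
    }
  }
}
```

Because the new timer is added before the old one is cleared, the bucket for that duration is never empty in between, so it does not have to be removed and created again.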
There are several high-level ECMAScript features that you should avoid if you are concerned about performance. Examples of such features are: Function.prototype.bind(), Object.defineProperty() and Object.defineProperties().
The V8 team is working to improve the performance of new language features to eventually reach parity with their naive counterparts. Their work on optimizing features introduced with ES2015 and beyond is coordinated via a performance plan, where the V8 team gathers areas that need improvement, along with proposed design documents to tackle those issues.
You can follow the progress of the V8 implementation on the blog, but consider that it can take a long while for those improvements to get into a Long-term Support (LTS) version of Node.js: which V8 version lands in a major version of Node.js is normally decided before cutting the branch from master, according to the LTS plan. You will have to wait 6-12 months for a new major/minor version of the V8 engine to be included in the Node.js runtime.
A new release within a major version of Node.js will only update the V8 engine for patches.
The Node.js runtime provides a complete library for IO but, as the ECMAScript specification provides very few built-in types, you sometimes have to rely on external packages to perform other basic tasks.
There is no guarantee that published packages work correctly and efficiently, even for popular modules. The Node.js ecosystem is huge, and often those third-party modules involve just a few methods that you can implement yourself.
You should balance between reinventing the wheel and controlling the performance impact of your dependencies.
Avoid adding a new dependency whenever possible. Don’t trust your dependencies, period. An exception to this rule is where the project you depend on publishes reliable benchmarks, as the bluebird library does.
In our case, async was impacting the request latency. We used async.series(), async.waterfall() and async.whilst() extensively across our codebase. It was really hard to identify it as a culprit of poor performance, as a control flow library is a cross-cutting concern. The async utility is one of the most popular modules and, thanks to that, the issues were publicly identified. There are even drop-in replacements like neo-async, which runs significantly faster and publishes benchmarks.
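To illustrate how small such a control flow helper can be, here is a minimal sketch of a hand-written function that runs callback-based tasks in series. The `series` function below is hypothetical: it is neither the async library's implementation nor the code used in the driver, and it skips edge cases that a library handles (for example, callbacks invoked more than once).

```javascript
'use strict';

// Hypothetical minimal helper: runs tasks one after another.
// Each task is a function that receives a Node.js-style callback (err, result).
function series(tasks, callback) {
  const results = new Array(tasks.length);
  let index = 0;

  function runNext() {
    if (index === tasks.length) {
      return callback(null, results);
    }
    const current = index++;
    tasks[current]((err, result) => {
      if (err) {
        // Stop at the first error and skip the remaining tasks.
        return callback(err);
      }
      results[current] = result;
      runNext();
    });
  }

  runNext();
}
```

Note that with a long list of tasks that complete synchronously, this sketch recurses once per task; benchmark any replacement like this against the dependency before adopting it.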
In the case of our client driver, where we've applied these optimizations, the result was a more than 2x increase in throughput according to our benchmarks.
Considering that on Node.js our code runs on a single thread, how our application spends CPU cycles and the order of the instructions really matter, allowing us to support higher degrees of parallelism and improving overall throughput.
About the Author
Jorge Bay is the lead engineer for the Node.js and C# client drivers for Apache Cassandra and DSE at DataStax. When he is not writing his own bio, he enjoys solving problems and building server-side solutions. Jorge has over 15 years of professional software development experience and he implemented the community Node.js driver for Apache Cassandra that was the foundation for the official DataStax driver.