
Netflix Burned by Express.js


Yunong Xiao, a Software Engineer at Netflix, recently wrote on the company's official tech blog about the latency problems his team faced while working to move the Netflix website UI to Node.js. In the post, he described the complex engineering process used to find the root cause and how it led to the decision to replace the underlying API framework.

The early signs of trouble appeared when the team observed that request latency for some API endpoints was increasing over time (10 ms/hour) and that, during periods of high latency, the application was using more CPU than expected. The initial hypothesis was that some sort of problem (e.g. a memory leak) in the team's request handlers was causing the rising latencies. To test this, the team set up a controlled environment where they measured both the latency of their request handlers and the total latency of a request. Additionally, they increased Node.js' heap size to 32 GB. What they observed was that the handler latency and the heap size stayed constant throughout the experiment, while the overall request latencies and CPU usage continued to rise.
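
Netflix's actual instrumentation is not shown in the post, but the distinction between the two measurements can be reproduced with a minimal sketch like the one below, which times a hypothetical /foo handler separately from the request as a whole:

 var express = require('express');
 var app = express();

 // Total request latency: start the clock before any other middleware
 // runs and log once the response has been sent.
 app.use(function (req, res, next) {
    var start = process.hrtime();
    res.on('finish', function () {
       var diff = process.hrtime(start);
       console.log('total %s: %dms', req.url, diff[0] * 1e3 + diff[1] / 1e6);
    });
    next();
 });

 app.get('/foo', function (req, res) {
    // Handler latency: time only the work done inside the handler itself.
    var start = process.hrtime();
    res.send('hi');
    var diff = process.hrtime(start);
    console.log('handler /foo: %dms', diff[0] * 1e3 + diff[1] / 1e6);
 });

 app.listen(3000);

If the gap between the two numbers grows over time, as it did for Netflix, the extra latency is being spent outside the handlers, in the framework itself.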

The next approach was to profile the application’s CPU usage using CPU flame graphs and Linux Perf Events. On close inspection of the flame graph (shown in the original post), the developers saw many references to Express.js’s router.handle and router.handle.next functions.

After digging into the Express.js codebase, they found that:

  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler (see the sketch after this list).
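
A simplified sketch of the two lookup strategies, using hypothetical paths, illustrates the difference. Express.js 4 walks a single array of layers on every request, an O(n) scan; a map keyed by path would make the lookup O(1):

 // What Express.js effectively does: scan the array until an entry matches.
 var handlers = [
    { path: '/a', handle: function (req, res) { res.send('a'); } },
    { path: '/b', handle: function (req, res) { res.send('b'); } },
    { path: '/foo', handle: function (req, res) { res.send('foo'); } }
 ];

 function lookupArray(path) {
    for (var i = 0; i < handlers.length; i++) {
       if (handlers[i].path === path) return handlers[i].handle;
    }
 }

 // A constant-time alternative: index the same handlers by path.
 var handlerMap = {};
 handlers.forEach(function (h) { handlerMap[h.path] = h.handle; });
 function lookupMap(path) {
    return handlerMap[path];
 }

 console.log(lookupArray('/foo') === lookupMap('/foo')); // true

In fairness, Express.js matches routes against regular expressions in order to support path parameters such as /users/:id, which is one reason a plain map cannot simply replace the array for every route.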

As Yunong states:

A global array is not the ideal data structure for this use case. It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers. Each request requires an expensive O(n) look up in the route array in order to find its route handler. Compounding matters, the array is traversed recursively. This explains why we saw such tall stacks in the flame graphs. Interestingly, Express.js even allows you to set many identical route handlers for a route: [a, b, c, c, c, c, d, e, f, g, h].

Requests for route c would terminate at the first occurrence of the c handler (position 2 in the array). However, requests for d would only terminate at position 6 in the array, having needlessly spent time spinning through a, b and multiple instances of c.
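
The cost of that traversal, and the tall stacks it produces, can be seen in a toy reproduction of the handler array from the quote (the dispatch function below is an illustrative stand-in, not Express.js code):

 var routes = ['a', 'b', 'c', 'c', 'c', 'c', 'd', 'e', 'f', 'g', 'h'];

 // Walk the array recursively, the way Express.js chains next() calls,
 // counting how many handlers are visited before a match is found.
 function dispatch(path, i, visited) {
    if (i >= routes.length) return -1;
    if (routes[i] === path) {
       console.log('%s matched after visiting %d handlers', path, visited + 1);
       return i;
    }
    return dispatch(path, i + 1, visited + 1); // one more stack frame per miss
 }

 dispatch('c', 0, 0); // c matched after visiting 3 handlers
 dispatch('d', 0, 0); // d matched after visiting 7 handlers

Each miss adds another frame to the call stack, which is exactly what showed up as tall towers of router.handle.next calls in the flame graphs.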

To understand a little more about how Express.js stores routes, the team created the following Express.js sample:


 var express = require('express');
 var app = express();
 app.get('/foo', function (req, res) {
    res.send('hi');
 }); 
 // add a second foo route handler
 app.get('/foo', function (req, res) {
    res.send('hi2');
 });
 console.log('stack', app._router.stack);
 app.listen(3000);

Running the sample produced the following output:

 stack [ { keys: [], regexp: /^\/?(?=\/|$)/i, handle: [Function: query] },
  { keys: [],
    regexp: /^\/?(?=\/|$)/i,
    handle: [Function: expressInit] },
  { keys: [],
    regexp: /^\/foo\/?$/i,
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } },
  { keys: [],
    regexp: /^\/foo\/?$/i,
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } } ]

This experiment allowed the team to conclude that Express.js was not handling route duplication correctly, as Yunong pointed out: "Notice there are two identical route handlers for /foo. It would have been nice for Express.js to throw an error whenever there’s more than one route handler chain for a route."
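
Pending such a check in the framework itself, one could be bolted on from application code. The following is a minimal sketch, relying on Express.js 4's undocumented app._router.stack internals (visible in the output above) and a hypothetical safeGet helper:

 var express = require('express');
 var app = express();

 // Check Express.js 4's internal layer stack for an existing route.
 // This depends on undocumented internals, which may change between versions.
 function hasRoute(app, path, method) {
    if (!app._router) return false;
    return app._router.stack.some(function (layer) {
       return layer.route &&
          layer.route.path === path &&
          layer.route.methods[method];
    });
 }

 // Hypothetical helper that refuses to register the same route twice.
 function safeGet(app, path, handler) {
    if (hasRoute(app, path, 'get')) {
       throw new Error('duplicate GET route handler for ' + path);
    }
    app.get(path, handler);
 }

 safeGet(app, '/foo', function (req, res) { res.send('hi'); });
 safeGet(app, '/foo', function (req, res) { res.send('hi2'); }); // throws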

After digging into their own source code, the team found the problem. It resided in a periodic function that was executed 10 times per hour and whose main purpose was to refresh route handlers from an external source. When the team fixed the code so that the function would stop adding duplicate route handlers, the latency and CPU usage increases went away.
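
The offending code is not shown in the post, but the pattern is easy to reconstruct. The sketch below is a hypothetical reconstruction, with fetchRoutes standing in for the external source: the buggy refresh keeps appending handlers, while a fixed version swaps in a fresh router so stale handlers are dropped rather than accumulated:

 var express = require('express');
 var app = express();

 // Hypothetical stand-in for the external source of route definitions.
 function fetchRoutes() {
    return [{ path: '/foo', handler: function (req, res) { res.send('hi'); } }];
 }

 // Buggy pattern: each refresh appends another copy of every handler to the
 // router's global array, so lookups get slower with every iteration.
 function refreshBuggy() {
    fetchRoutes().forEach(function (r) {
       app.get(r.path, r.handler);
    });
 }

 // Fixed pattern: build a fresh router on each refresh and swap it in.
 var currentRouter = express.Router();
 app.use(function (req, res, next) { currentRouter(req, res, next); });
 function refreshFixed() {
    var router = express.Router();
    fetchRoutes().forEach(function (r) {
       router.get(r.path, r.handler);
    });
    currentRouter = router;
 }

 refreshFixed();
 setInterval(refreshFixed, 6 * 60 * 1000); // 10 times per hour
 app.listen(3000);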

In the aftermath of this incident, Yunong summarized the conclusions drawn by the team:

First, we need to fully understand our dependencies before putting them into production. We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue.

Second, given a performance problem, observability is of the utmost importance. Flame graphs gave us tremendous insight into where our app was spending most of its time on CPU. I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

In our bid to improve observability even further, we are migrating to Restify, which will give us much better insights, visibility, and control of our applications.