BT

Netflix Burned by Express.js

| by João Paulo Marques Follow 0 Followers on Dec 04, 2014. Estimated reading time: 3 minutes |

Yunong Xiao, a Software Engineer at Netflix, recently wrote in the company's official tech blog about the latency problems that his team faced while working to move Netflix website UI to Node.js. In the post, he described the complex engineering process used to find out the root cause and how it led to the decision of substituting the underlying API framework.

The early signs of problems appeared when the team observed that requests latency for some endpoints in the API were increasing over time (10ms/hour) and that in periods of high latency the application was using more CPU than expected. The initial hypotheses was that some sort of problem (e.g. memory leak) in the team's request handlers was causing the rising latencies. To assert this, the team setup a controlled environment where they measured both the latency of their request handlers and the total latency of a request. Additionally, they increased Node.js' heap size to 32 GB. What they observed was that the latency and the heap size stayed constant across the experiment but the overall request latencies and CPU usage continued to rise.

The next approach was to profile the application’s CPU usage using CPU flame graphs and Linux Perf Events. In a close inspection of the flame graph (shown below), the developers saw lots of references to Express.js’s router.handle and router.handle.next functions.

After digging into Express.js' codebase they found out that:

  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler

As Yunong states:

A global array is not the ideal data structure for this use case. It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers. Each request requires an expensive O(n) look up in the route array in order to find its route handler. Compounding matters, the array is traversed recursively. This explains why we saw such tall stacks in the flame graphs. Interestingly, Express.js even allows you to set many identical route handlers for a route: [a, b, c, c, c, c, d, e, f, g, h].

Requests for route c would terminate at the first occurrence of the c handler (position 2 in the array). However, requests for d would only terminate at position 6 in the array, having needless spent time spinning through a, b and multiple instances of c.

For understanding a little more about how Express.js was storing routes, the team created the following Express.js sample:


 var express = require('express');
 var app = express();
 app.get('/foo', function (req, res) {
    res.send('hi');
 }); 
 // add a second foo route handler
 app.get('/foo', function (req, res) {
    res.send('hi2');
 });
 console.log('stack', app._router.stack);
 app.listen(3000);

Which led to the following results:

 stack [ { keys: [], regexp: /^\/?(?=/|$)/i, handle: [Function: query] },
  { keys: [],
    regexp: /^\/?(?=/|$)/i,
    handle: [Function: expressInit] },
  { keys: [],
    regexp: /^\/foo\/?$/i,
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } },
  { keys: [],
    regexp: /^\/foo\/?$/i, 
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } } ]

This experiment allowed to conclude that Express.js was not handling correctly route duplication as Yunong pointed out: "Notice there are two identical route handlers for /foo. It would have been nice for Express.js to throw an error whenever there’s more than one route handler chain for a route.". 

After dig into their source code the team found out the problem. It resided in a periodic function that was being executed 10 times per hour and whose main purpose was to refresh route handlers from an external source. When the team fixed the code so that the function would stop adding duplicate route handlers, the latency and CPU usage increases went away.

In the aftermath of this incident, Yunong summarized the conclusions draw by the team:

First, we need to fully understand our dependencies before putting them into production. We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue.

Second, given a performance problem, observability is of the utmost importance. Flame graphs gave us tremendous insight into where our app was spending most of its time on CPU. I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

In our bid to improve observability even further, we are migrating to Restify, which will give us much better insights, visibility, and control of our applications.

 

 

 

Rate this Article

Adoption Stage
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Problem with Express by Tony Brown

Is this a design flaw in ExpressJS or misuse of the API?

InfoQ Headlines Up in Flames by Evan Cowden

The title of this piece makes it sound like Netflix has been somehow seriously hurt by Express.js. The source blog post, however, is about using flame graphs to diagnose a bug, not serious, insurmountable flaws in Express.js.

Sensational titles like this make difficult discussions about new technology even more difficult. Now we have to stop and deal with the guy saying, "yeah, but did you hear about how Netflix was totally burned by Express.js?"

I appreciate a good pun as much as anyone, but please, please use headlines responsibly.

Re: InfoQ Headlines Up in Flames by Tony Brown

Totally agreed

Burned by Express? by Eric Greene

Or burned by their misuse of Express?

"our misuse of the Express.js API was the ultimate root cause of our performance issue"

You cannot use map's when routing with regexp .. by Rüdiger Möller

look they use regexp to determine route, so one cannot use map's. Real sophisticated optimized software would create a finite state acceptor from the regexp given, but hey .. its js ;)

Eran Hammer had the best response to this post by David Fekke

I encourage everyone to read Eran Hammer's response to the Netflix flames post.

gist.github.com/hueniverse/a3109f716bf25718ba0e

Re: InfoQ Headlines Up in Flames by Charles Humble

Thanks for your feedback on this. We feel the headline is OK; it specifically references using flame graphs to diagnose a bug in Express.js, and it’s both milder and much more specific than the one Netflix themselves used ("Node.js in Flames").

Thanks and regards,


Charles Humble
Head of Editorial, InfoQ.com

Re: InfoQ Headlines Up in Flames by John Collins

But the bug wasn't in Express.js. The bug was in Netflix's code due to the "misuse of the Express.js API [and] was the ultimate root cause of our performance issue."

What do you think about LoopBack framework by Cuong Ngo Vi

@Yunong Xiao , @João Paulo Marques
What do you think about switching next framework ?
Will do you take a look LoopBack framework before switch to Restify?

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

9 Discuss
BT