Load Balancing Search Traffic at Algolia with NGINX and OpenResty

Algolia, a SaaS web search product, moved its infrastructure away from DNS round-robin load balancing after an uneven distribution of load led to high latency. The team introduced NGINX with OpenResty as a software load balancer, and Redis with a custom Go program to manage the list of backend servers. This solution has given Algolia a new abstraction layer on top of its infrastructure. InfoQ got in touch with Paul Berthaux, site reliability engineer at Algolia, to find out more about this exercise.

An Algolia "app" has a three-server cluster and Distributed Search Network (DSN) servers that serve search queries. DSNs are similar in function to Content Delivery Network Points-of-Presence (POPs), in that they serve data from a location closest to the user. Each app has a DNS record. Algolia's DNS configuration uses multiple top-level domains (TLDs) and two DNS providers for resiliency. Each app's DNS record is also configured to return the IP addresses of the three cluster servers in a round-robin fashion, in an attempt to distribute the load across all servers in a cluster. The common use case for the search cluster is through a frontend or mobile application, where queries come from many independent clients. However, some customers have backend applications that hit the search APIs. The latter case creates an uneven load, as all requests from such a backend arrive at the same server until the DNS record's time-to-live (TTL) expires and the name is resolved again.
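The skew described above can be illustrated with a toy simulation. This is a minimal sketch, not Algolia's setup: the IP addresses, the 60-second TTL, the request rate, and the class names are all hypothetical, and the resolver simply rotates through the three addresses on each lookup.

```python
from collections import Counter
from itertools import cycle

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical cluster IPs


class RoundRobinDNS:
    """Toy resolver that hands out the next address on every lookup."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def resolve(self):
        return next(self._rotation)


class CachingClient:
    """Client that re-resolves only after its cached answer outlives the TTL."""

    def __init__(self, dns, ttl_seconds):
        self.dns = dns
        self.ttl = ttl_seconds
        self.cached_ip = None
        self.expires_at = 0.0

    def target(self, now):
        if self.cached_ip is None or now >= self.expires_at:
            self.cached_ip = self.dns.resolve()
            self.expires_at = now + self.ttl
        return self.cached_ip


def many_frontend_clients(requests):
    """Every request comes from a different device with a cold DNS cache."""
    dns = RoundRobinDNS(SERVERS)
    return Counter(dns.resolve() for _ in range(requests))


def one_backend_client(requests, ttl_seconds, requests_per_second):
    """Every request comes from one backend that caches the DNS answer."""
    dns = RoundRobinDNS(SERVERS)
    client = CachingClient(dns, ttl_seconds)
    return Counter(client.target(i / requests_per_second) for i in range(requests))


frontend_load = many_frontend_clients(300)
backend_load = one_backend_client(300, ttl_seconds=60, requests_per_second=10)
print("frontend:", dict(frontend_load))
print("backend: ", dict(backend_load))
```

In this sketch the 300 frontend requests split evenly across the three servers, while the 300 backend requests all finish within one TTL window and land on a single server.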

One of Algolia’s apps suffered slow search queries during a Black Friday under heavy search load, and the cause was traced to an unequal distribution of queries across its servers. The team zeroed in on NGINX as a software load balancer to sit between the client applications and the app servers. While this solved the general problem of distributing the load, the issues of making the setup generic and automating operations remained. The team chose OpenResty, which adds Lua scripting support to the request-response lifecycle in NGINX. With this model, NGINX "learns" which backend servers to send a request to based on the customer. This information is cached in Redis. A custom Go daemon called lb-helper fetches the list of servers from an internal API.

Responding to a question on whether it's possible to invalidate the Redis cache, Berthaux explained that they do so "using an API endpoint in the lb-helper exposed internally for maintenance purposes." This might be required if the team had to remove a large number of backend servers and did not want the LB clients to see any difference in response time.
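The lookup-and-invalidate flow described above follows a cache-aside pattern, which might be sketched as follows. This is an illustration under stated assumptions rather than Algolia's implementation: a plain dictionary stands in for Redis, the `fake_internal_api` function stands in for the lb-helper's call to the internal API, the hostnames are made up, and the per-request round-robin choice is an assumption, since the article does not describe the selection algorithm.

```python
from typing import Callable, Dict, List


class BackendDirectory:
    """Cache-aside lookup of backend servers per app.

    A dict stands in for Redis and `fetch_backends` stands in for the
    internal API call; both are assumptions made for this sketch.
    """

    def __init__(self, fetch_backends: Callable[[str], List[str]]):
        self._fetch_backends = fetch_backends
        self._cache: Dict[str, List[str]] = {}
        self._next_index: Dict[str, int] = {}

    def backends_for(self, app_id: str) -> List[str]:
        # On a cache miss, "learn" the backends once and remember them.
        if app_id not in self._cache:
            self._cache[app_id] = list(self._fetch_backends(app_id))
        return self._cache[app_id]

    def pick(self, app_id: str) -> str:
        # Assumed policy: rotate through the cached backends per request.
        servers = self.backends_for(app_id)
        if not servers:
            raise LookupError(f"no backends known for {app_id}")
        index = self._next_index.get(app_id, 0) % len(servers)
        self._next_index[app_id] = index + 1
        return servers[index]

    def invalidate(self, app_id: str) -> None:
        # Maintenance hook: drop the entry so the next request refetches it.
        self._cache.pop(app_id, None)
        self._next_index.pop(app_id, None)


api_calls = []


def fake_internal_api(app_id: str) -> List[str]:
    """Hypothetical stand-in for the internal API; records each call."""
    api_calls.append(app_id)
    return [f"{app_id}-{n}.example.net" for n in (1, 2, 3)]


directory = BackendDirectory(fake_internal_api)
picked = [directory.pick("app42") for _ in range(6)]
print(picked)
print("API calls so far:", len(api_calls))
```

Six requests for the same app trigger a single call to the stand-in API; calling `directory.invalidate("app42")` forces the next request to fetch the list again.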

Image used with permission. Original source - https://blog.algolia.com/one-year-load-balancing/

With this change, the load balancer can become a single point of failure. Berthaux explains why this is not something to worry about yet:

We run multiple LBs for resiliency - the LB selection is made through round robin DNS. For now this is fine, as the LBs are performing very simple tasks in comparison to our search API servers, so we do not need an even load balancing across them. That said, we have some very long term plans to move from round-robin DNS to something based on Anycast routing.

The lb-helper also takes care of removing unhealthy servers from the list. In Berthaux’s words:

The detection of upstream failures as well as retries toward different upstreams is embedded inside NGINX/OpenResty. I use the log_by_lua directive from OpenResty with some custom Lua code to count the failures and trigger the removal of the failing upstream from the active Redis entry and alert the lb-helper after 10 failures in a row. I set up this failure threshold to avoid lots of unnecessary events in case of short self resolving incidents like punctual packet loss. From there the lb-helper will probe the failing upstream FQDN and put it back in Redis once it'll recover.
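The consecutive-failure logic Berthaux describes might look roughly like the sketch below. It is written in Python for illustration, not in the Lua that runs inside OpenResty, and it is not Algolia's code: in-memory sets stand in for the active Redis entry and for the notification to the lb-helper, and the class name, method names, and hostnames are hypothetical. Only the threshold of 10 failures in a row comes from the quote.

```python
FAILURE_THRESHOLD = 10  # from the quote: act after 10 failures in a row


class UpstreamTracker:
    """Tracks consecutive failures per upstream and sidelines repeat offenders."""

    def __init__(self, upstreams, threshold=FAILURE_THRESHOLD):
        self.active = set(upstreams)   # stands in for the active Redis entry
        self.pending_probe = set()     # upstreams handed over for probing
        self.threshold = threshold
        self._streak = {upstream: 0 for upstream in upstreams}

    def record(self, upstream, ok):
        """Called once per finished request, as a log-phase hook would be."""
        if ok:
            self._streak[upstream] = 0  # any success resets the streak
            return
        self._streak[upstream] = self._streak.get(upstream, 0) + 1
        if self._streak[upstream] >= self.threshold and upstream in self.active:
            self.active.discard(upstream)
            self.pending_probe.add(upstream)

    def probe(self, is_healthy):
        """Helper side: re-check sidelined upstreams and restore recovered ones."""
        for upstream in sorted(self.pending_probe):
            if is_healthy(upstream):
                self.pending_probe.discard(upstream)
                self._streak[upstream] = 0
                self.active.add(upstream)


tracker = UpstreamTracker(["a.example.net", "b.example.net"])

# Nine failures followed by a success: a short incident, nothing is removed.
for _ in range(9):
    tracker.record("a.example.net", ok=False)
tracker.record("a.example.net", ok=True)
still_active_after_blip = "a.example.net" in tracker.active

# Ten failures in a row: the upstream is removed and queued for probing.
for _ in range(10):
    tracker.record("a.example.net", ok=False)
removed_after_streak = "a.example.net" not in tracker.active

# The probe later finds the upstream healthy and restores it.
tracker.probe(lambda upstream: True)
restored_after_probe = "a.example.net" in tracker.active
print(still_active_after_blip, removed_after_streak, restored_after_probe)
```

Resetting the counter on every success is what keeps brief, self-resolving incidents from triggering a removal, which matches the reason given for the threshold.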

Algolia's search load evened out after this change. They are currently working on further improving the load balancing algorithm.
