
Dynein – an Asynchronous Background Job Service from Airbnb

The Airbnb engineering team runs time-consuming and resource-intensive tasks as asynchronous background jobs in order to improve the scalability of their web applications. This also helps prevent performance issues, since processing problems in the background jobs are unlikely to disturb the web servers handling user requests. The job scheduling system has become a very important component of this architecture, and they have therefore built Dynein, a distributed delayed job queueing service that includes a highly scalable scheduler. In a blog post, Andy Fang, who works with cloud infrastructure at Airbnb, describes the background and the challenges of designing and building Dynein.

Airbnb had been running a centralized cluster of Resque workers on top of Resque Scheduler. Fang notes that this cluster was built for their monolithic application, and while it was easy to use, it wasn’t enough for Airbnb’s move to a service-oriented architecture. One issue was reliability: with an at-most-once delivery guarantee, jobs could be lost. Other issues included scaling problems and limited scheduling abilities.

After discussing requirements with other teams at Airbnb, and identifying how to address the problems they had experienced with Resque, they listed several abilities a new job scheduling system should provide. These included guaranteed at-least-once delivery of every job, retaining all data after a failure or restart, and the capability to scale horizontally in order to support a growing business. The system should also be accurate in its timing, with most jobs running within 10 seconds of their scheduled time, and make it possible to unschedule a specific job.

To support the requested abilities they built Dynein, a distributed delayed job queueing service. From a high-level perspective, the service consists of two core components: service queues, and the workers that execute the actual jobs:

[Figure: Overview of Dynein]

For the queues they decided to use AWS Simple Queue Service (SQS), and Fang thinks that, given its set of trade-offs, it's a great choice for a job queue. It’s a simple system to reason about, and it offers many properties relevant to job queue use cases. SQS comes with at-least-once delivery, which means Dynein needs no added functionality of its own to ensure message delivery. It also includes other features that Dynein uses, such as dead letter queues and individual message acknowledgment.
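
The queue properties mentioned above (at-least-once delivery, individual message acknowledgment, and a dead letter queue) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: it does not use the SQS API, and `ToyQueue`, `drain`, `nack`, and `max_receives` are hypothetical names chosen for this example.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a queue with per-message acks and a dead letter queue.

    Hypothetical model for illustration; this is not the SQS API.
    """

    def __init__(self, max_receives=3):
        self.max_receives = max_receives
        self.pending = deque()    # (message id, body, times received so far)
        self.in_flight = {}       # message id -> (body, times received)
        self.dead_letters = []    # bodies that exhausted their receives
        self._next_id = 0

    def send(self, body):
        self.pending.append((self._next_id, body, 0))
        self._next_id += 1

    def receive(self):
        """Hand out the next message, or None if nothing is pending."""
        if not self.pending:
            return None
        msg_id, body, count = self.pending.popleft()
        self.in_flight[msg_id] = (body, count + 1)
        return msg_id, body

    def ack(self, msg_id):
        """Acknowledge a single message so it is not delivered again."""
        self.in_flight.pop(msg_id, None)

    def nack(self, msg_id):
        """Simulate a failed worker: redeliver, or dead-letter after too many tries."""
        body, count = self.in_flight.pop(msg_id)
        if count >= self.max_receives:
            self.dead_letters.append(body)
        else:
            self.pending.append((msg_id, body, count))


def drain(queue, handler):
    """Process messages until none are pending; return the bodies handled successfully."""
    done = []
    while (msg := queue.receive()) is not None:
        msg_id, body = msg
        try:
            handler(body)
        except Exception:
            queue.nack(msg_id)   # not acknowledged, so it is delivered again
        else:
            queue.ack(msg_id)
            done.append(body)
    return done
```

Because a message is only removed once it has been acknowledged, a handler may see the same job more than once, which is why at-least-once delivery usually goes hand in hand with idempotent job handlers.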

The Dynein service deals with two categories of jobs: immediate jobs and delayed jobs. Immediate jobs are sent to Dynein, which directly transfers each job to a service queue. The primary reason for this wrapping is to allow an engineer to use the same API irrespective of the type of job sent. Delayed jobs are transferred to the inbound queue, which acts as a write buffer for the scheduler. The Dynein service then reads each job from the inbound queue at its own rate, creates a trigger for the job, and stores the trigger in the job scheduler.
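
The routing described above can be sketched roughly as follows. This is a simplified in-memory model, not Dynein's actual API: `Job`, `ToyDynein`, `enqueue`, `drain_inbound`, and the field names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    service_queue: str              # name of the destination service queue
    run_at: Optional[float] = None  # epoch seconds; None means "run immediately"


class ToyDynein:
    """Hypothetical single entry point for both immediate and delayed jobs."""

    def __init__(self):
        self.service_queues = {}  # queue name -> list of jobs ready for workers
        self.inbound = []         # write buffer in front of the scheduler
        self.triggers = []        # (run_at, job) pairs held by the scheduler

    def enqueue(self, job):
        """Same call for every job; the routing depends on whether it is delayed."""
        if job.run_at is None:
            # Immediate job: forward straight to its service queue.
            self.service_queues.setdefault(job.service_queue, []).append(job)
        else:
            # Delayed job: buffer it until the scheduler picks it up.
            self.inbound.append(job)

    def drain_inbound(self):
        """Read buffered jobs at the service's own pace and store a trigger for each."""
        moved = 0
        while self.inbound:
            job = self.inbound.pop(0)
            self.triggers.append((job.run_at, job))
            moved += 1
        return moved
```

In this sketch the caller never chooses between queues; setting or omitting `run_at` is the only difference between the two job categories.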

There are job schedulers available off the shelf, but the Dynein team thought that none of them had a solid scheduling story, so they decided to build their own scheduler, one that supports only the limited set of features they need but is highly scalable. Fang points out that their query model is quite simple: they just query for jobs that are overdue and then dispatch these jobs to a service queue, and they could therefore use DynamoDB. To avoid duplicate deliveries of jobs, they use conditional updates in the database and proceed only when the update succeeds, which is basically an optimistic locking strategy. Fang points out that the process they now use is simple but also highly effective, and that it has resulted in a considerable decrease in the cost of running the service.
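
The overdue query combined with a conditional update can be sketched as below. This is a minimal illustration of the optimistic locking idea, assuming an in-memory store in place of DynamoDB; `ToyTriggerStore`, `update_if`, `dispatch_overdue`, and the `SCHEDULED`/`ACQUIRED` status values are hypothetical and not taken from Dynein's code.

```python
import threading


class ToyTriggerStore:
    """In-memory stand-in for a table that supports conditional updates."""

    def __init__(self):
        self._rows = {}                # job id -> {"run_at": float, "status": str}
        self._lock = threading.Lock()  # makes each conditional update atomic

    def put(self, job_id, run_at):
        with self._lock:
            self._rows[job_id] = {"run_at": run_at, "status": "SCHEDULED"}

    def overdue(self, now):
        """Return the ids of scheduled jobs whose run time has passed."""
        with self._lock:
            return sorted(
                job_id
                for job_id, row in self._rows.items()
                if row["status"] == "SCHEDULED" and row["run_at"] <= now
            )

    def update_if(self, job_id, expected, new):
        """Set the status to `new` only if it is currently `expected`."""
        with self._lock:
            row = self._rows.get(job_id)
            if row is None or row["status"] != expected:
                return False  # another scheduler got there first
            row["status"] = new
            return True


def dispatch_overdue(store, now, service_queue):
    """Dispatch each overdue job, proceeding only when the conditional update wins."""
    dispatched = []
    for job_id in store.overdue(now):
        if store.update_if(job_id, "SCHEDULED", "ACQUIRED"):
            service_queue.append(job_id)
            dispatched.append(job_id)
    return dispatched
```

If two scheduler instances see the same overdue job, only the one whose conditional update succeeds dispatches it; the other simply skips the job, and no lock is held across the query and the dispatch.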

Dynein is open source, with the code available on GitHub.
