
How Dropbox Created a Distributed Async Task Framework at Scale


Engineers at Dropbox recently described how they designed a distributed async task framework (ATF) that can handle tens of thousands of async tasks scheduled per second.

Dropbox's engineers needed an infrastructure system that would allow them to power all task scheduling requirements in their system, "from dragging a file into a Dropbox folder to scheduling a marketing campaign." They decided to design an in-house system since there were no open-source or off-the-shelf solutions that could meet their scaling needs.

A year after its introduction, ATF currently handles 9,000 async tasks scheduled per second and is used by 28 engineering teams internally at Dropbox.

Arun Sai Krishnan, a software engineer at Dropbox, told InfoQ:

We had two in-house systems that were replaced by ATF. One was a system for immediate task execution built on top of Cape, and another was a system used for delayed task execution. ATF replaced these to get better features (like control over a dedicated execution layer) and system guarantees (like isolation across lambdas). In addition, consolidating these systems, which performed very similar roles, reduced our maintenance overhead.

ATF's main feature is its ability to enable developers to define callbacks and schedule tasks that execute against these pre-defined callbacks. It invokes callbacks in an at-least-once manner while guaranteeing that it does not run the same task instance more than once concurrently. The engineers chose this behavior to avoid requiring their users to design their callbacks to handle concurrency and possible race conditions. Also, ATF is highly available, with 99.9% availability for receiving scheduling requests.
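The programming model described above can be sketched as follows. This is an illustrative, in-process mock-up, not Dropbox's actual API: the names `register_callback` and `schedule_task`, and the dictionary standing in for the Edgestore-backed task store, are all assumptions.

```python
import time
import uuid

CALLBACKS = {}   # callback name -> function (pre-defined callbacks)
TASK_STORE = {}  # task id -> task record (stand-in for the Edgestore task store)

def register_callback(name):
    """Decorator that registers a function as a named, pre-defined callback."""
    def wrap(fn):
        CALLBACKS[name] = fn
        return fn
    return wrap

def schedule_task(callback_name, payload, run_at=None):
    """Persist a task against a pre-registered callback and return its id."""
    if callback_name not in CALLBACKS:
        raise ValueError(f"unknown callback: {callback_name}")
    task_id = str(uuid.uuid4())
    TASK_STORE[task_id] = {
        "callback": callback_name,
        "payload": payload,
        "run_at": run_at or time.time(),  # immediate unless a delay is given
        "state": "scheduled",
    }
    return task_id

# A hypothetical callback owner defines a callback once...
@register_callback("thumbnail")
def make_thumbnail(payload):
    return f"thumbnail for {payload['file']}"

# ...and then schedules tasks against it.
task_id = schedule_task("thumbnail", {"file": "photo.jpg"})
print(TASK_STORE[task_id]["state"])  # scheduled
```

In the real system the task record would be written durably before the scheduling RPC returns, which is what backs the at-least-once execution guarantee.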

The following diagram describes ATF's architecture.


This diagram depicts the following process:

  1. An ATF Frontend service receives requests to schedule tasks via gRPC.
  2. The Frontend service registers the task in the task store, which is implemented by Dropbox's in-house Edgestore data store.
  3. A Store Consumer service periodically polls the task store. It pushes tasks ready for execution into a queue.
  4. The queue is implemented using AWS Simple Queue Service (SQS). Each callback and priority pair gets a dedicated queue. This setup allows for prioritizing more important tasks per registered callback.
  5. Controllers pull tasks for the callbacks they are responsible for and store them in local queues. This layer also prioritizes the work for the executors.
  6. Multiple executors per controller then poll the controller for work in a message processing loop and execute the required tasks.
  7. Both controllers and executors update the HeartBeat & Status Controller (HSC) regarding their progress. The HSC, in turn, updates the task store with each task's state, thus allowing clients to query various tasks' progress.
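Steps 3 through 7 above can be condensed into a minimal single-process sketch. Everything here is an illustrative assumption rather than Dropbox's implementation: the in-memory queues mirror the one-SQS-queue-per-(callback, priority) layout, and the state transition from "scheduled" to "enqueued" acts as a claim so the same task instance is never enqueued, and hence never run, twice concurrently.

```python
import queue

# Stand-in for the Edgestore-backed task store (step 2).
TASK_STORE = {
    "t1": {"callback": "thumbnail", "priority": "high",
           "run_at": 0.0, "state": "scheduled"},
}
# One queue per (callback, priority) pair, mirroring the SQS setup (step 4).
QUEUES = {}

def store_consumer_poll(now):
    """Step 3: push tasks that are due, and not yet claimed, onto their queue."""
    for task_id, task in TASK_STORE.items():
        if task["state"] == "scheduled" and task["run_at"] <= now:
            task["state"] = "enqueued"  # claim, so the task is not enqueued twice
            key = (task["callback"], task["priority"])
            QUEUES.setdefault(key, queue.Queue()).put(task_id)

def executor_run_once(callback_name, priority, callbacks):
    """Steps 5-7: pull one task, execute it, and report status to the store."""
    q = QUEUES.get((callback_name, priority))
    if q is None or q.empty():
        return None
    task_id = q.get()
    task = TASK_STORE[task_id]
    task["state"] = "running"          # heartbeat/status update to the task store
    callbacks[task["callback"]](task_id)
    task["state"] = "done"             # final status, queryable by clients
    return task_id

store_consumer_poll(now=1.0)
executor_run_once("thumbnail", "high", {"thumbnail": lambda tid: None})
print(TASK_STORE["t1"]["state"])  # done
```

In production the claim would be a durable lease with a timeout, so a crashed executor's task eventually becomes schedulable again, preserving at-least-once delivery.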

Dropbox's engineers designed ATF to be a self-serve framework for all developers at Dropbox. To promote this approach, they decided that all worker clusters (controllers and executors) would be entirely owned and managed by callback owners. This clear separation of concerns increases ATF's usability for its consumers and allows the ATF team to focus on the core system parts.

Krishnan said the following regarding the process of getting feedback from consumers:

Feedback was collected from the teams that would be using the system during the design process. It was also widely vetted during our design review process. In addition, we incorporated continuous feedback from our clients on issues such as maintenance costs, alerting, and host management.

Possible future extensions to ATF include periodic task execution, better support for task chaining, and support for dead-letter queues for misbehaving tasks.
