
Data Workflow Management Using Airbnb's Airflow

Airbnb recently open-sourced Airflow, its own data workflow management framework, under the Apache license. Airflow is being used internally at Airbnb to build, monitor and adjust data pipelines. The platform is written in Python, as are any workflows that run on it.

Airflow is a tool that allows developers of workflows to easily author, maintain, and run workflows (a.k.a. Directed Acyclic Graphs, or DAGs) on a periodic schedule. For Airbnb, this includes use cases across multiple departments such as data warehousing, growth analytics, email targeting, A/B testing and so on. The platform has mechanisms to interact with Hive, Presto, MySQL, HDFS, Postgres and S3, and hooks are provided to make the system extensible. As well as a command line interface, the tool provides a web-based UI which allows you to visualize your pipelines' dependencies, monitor progress, trigger tasks and so on.
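As a rough illustration of what "workflows as Python code" means (a sketch assuming Airflow is installed and using its early operator API; the pipeline name, task ids and echo commands are invented for this example), a DAG is just a Python file that declares tasks and wires up their dependencies:

```python
# A minimal, hypothetical DAG definition. "my_pipeline" and the
# bash commands are placeholders, not from the article.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "start_date": datetime(2015, 6, 1),
}

# Run this pipeline once a day
dag = DAG("my_pipeline", default_args=default_args,
          schedule_interval="@daily")

extract = BashOperator(task_id="extract",
                       bash_command="echo extracting",
                       dag=dag)
load = BashOperator(task_id="load",
                    bash_command="echo loading",
                    dag=dag)

# "load" runs only after "extract" succeeds
load.set_upstream(extract)
```

Dropping a file like this into the DAGs folder is enough for the scheduler to pick it up and for the web UI to render the dependency graph.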

InfoQ spoke to Airflow’s creator, Maxime Beauchemin, and to Siddharth Anand, Data Architect at Agari and one of the framework’s early adopters, to discuss Airflow, where it can be of use, and what’s planned for the future.

Could you give us a high level overview of Airflow’s architecture?

In a scalable production environment, Airflow has the following components:
- a metadata database (MySQL or Postgres)
- a set of Airflow worker nodes
- the Airflow scheduler
- a broker (Redis or RabbitMQ)
- the Airflow web server

All of this can run on a single box and scale out at will. More modest installations can use the LocalExecutor and get a fair amount of mileage out of that.
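The choice between the two modes comes down to a few settings in airflow.cfg. A sketch (key names as in the early Airflow releases; the connection strings are placeholders):

```ini
[core]
# LocalExecutor runs tasks as subprocesses on a single box;
# CeleryExecutor distributes them to remote workers via a broker.
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost/airflow

[celery]
# Only relevant when executor = CeleryExecutor
broker_url = redis://localhost:6379/0
```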

What Maxime mentions above is correct. We (Agari) run a modest installation:
* the Airflow web server
* the Airflow scheduler
* a metadata database (Postgres)

The two Airflow components (webserver and scheduler) run on a single machine, whereas the database runs on a shared database instance. We run Airflow in both QA and Production, which essentially means that the above architecture is replicated in two environments. As an early adopter, we were looking for a workflow scheduler that was easy to install, maintain, and run in the cloud. We did not want the hassle of bringing up a distributed infrastructure involving a distributed broker and a set of remote workers. This is an annoyance in a private dedicated datacenter and painful in a public cloud: in the latter, you often have to implement some tooling to handle changing IP addresses whenever EC2 instances restart, be it a worker or your distributed broker.

We wanted a simple architecture when starting out, but one that could grow as our needs grew, making the investment in a distributed worker-broker architecture more palatable.

How does Airflow compare against Azkaban (LinkedIn), Luigi (Spotify) and Oozie (Yahoo)?

A key differentiator is the fact that Airflow pipelines are defined as code (as opposed to a markup language in Oozie or Azkaban), and that tasks are instantiated dynamically (as opposed to creating tasks by deriving classes in Luigi). This makes Airflow the best solution out there for dynamic pipeline generation, which can be used to power concepts such as “analytics as a service”, “analysis automation” and computation frameworks, where pipelines are generated dynamically from configuration files or metadata of any form. Examples at Airbnb include our A/B testing framework, an anomaly detection framework, an aggregation framework and others.
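The dynamic-generation point can be sketched even without Airflow itself: because pipelines are ordinary Python, tasks can be stamped out in a loop from metadata rather than hand-written one by one. A schematic (the table names and dict structure below are invented for illustration; in real Airflow code each entry would become an operator attached to a DAG):

```python
# Schematic of dynamic pipeline generation: task definitions are
# produced from metadata at parse time. Plain dicts stand in for
# Airflow operators to show the pattern.
tables = ["users", "listings", "bookings"]  # hypothetical metadata

tasks = {}
for table in tables:
    tasks[f"load_{table}"] = {
        "command": f"load --table {table}",
        "upstream": "extract",  # every load depends on a shared extract step
    }

# Three tasks generated from three lines of metadata
print(sorted(tasks))  # → ['load_bookings', 'load_listings', 'load_users']
```

Adding a fourth table to the metadata grows the pipeline automatically; nothing about the workflow definition itself has to change.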


Refer to the section marked “Why Airflow?” on our recent blog

What are in your opinion as an early adopter and expert in this framework the major shortcomings of Airflow to date?

Airflow is still a young project and moving fast, so it’s tempting to run off trunk, and you may hit some bugs in the process. The PyPI releases, though, are fairly solid.

Agree with Maxime. There is a list of open issues on the GitHub site, just as you would expect with any project starting out. None of them are deal-breakers for us, and as the community grows and as Airbnb dedicates more engineers to supporting what is turning out to be an emerging contender in the DAG scheduling space, I expect the bug list to get shorter and the feature request list to grow. One area worth some attention is making the LocalExecutor a first-class citizen. It feels like a transient stepping stone towards the CeleryExecutor, which is what Airbnb runs. However, it will likely be a while before we (Agari) and others move to Celery; one reason is that it is not simple to set up in the cloud. To increase adoption and improve the experience of early adopters, I would strongly recommend that the LocalExecutor experience be as smooth and as polished as possible. Since Airbnb runs the CeleryExecutor, they are likely optimizing mostly for that experience. That said, the Airbnb team has been very responsive to the needs of the community, and I expect that my mentioning this will make it a reality.

We all know that in this space ease of installation is not the top priority (e.g. Oozie). How easy is Airflow to install, and how important do you think this is for developers and devops evaluating it? Is the Docker container in a stable state?

It’s extremely easy to get going: anyone is a few commands away from running an Airflow webserver and the bundled examples. Going for a production setup is a bit more challenging, but still about as easy as it gets. I’d say Airflow is one of the easiest solutions around to install.
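For context, the "few commands" quickstart looked roughly like this at the time (the package was then published on PyPI simply as "airflow"; later releases use a different package name, so treat this as a period sketch rather than current instructions):

```shell
pip install airflow            # install from PyPI
airflow initdb                 # create the metadata database
airflow webserver -p 8080      # serve the UI on port 8080
airflow scheduler              # start scheduling the example DAGs
```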

Does Airflow really feel like overkill for workflow management?

Airflow certainly isn’t overkill at Airbnb. All the features were built because we needed them, and I guarantee you that you’ll grow into using most of them quickly if you have a data team, however small it may be.

Agree. It requires the right amount of time and effort, even for small data teams. Our data team at Agari, at least the one working on the product that leverages Airflow, numbers less than a handful of people; managing Airflow requires very little of their time, yet it saves time that would otherwise be burned in manual drudgery.

Is the plugin for highcharts visualizations a worthy feature?

I agree that this may be out of scope for a workflow scheduler, but this feature can be extremely useful to have around, ready to use against all databases registered in your Airflow connections. You may have a better screwdriver in your basement, but this one is right here on your toolbelt.

Future goals: Do you think that the planned YARN executor will drive wider adoption? Do you believe that the community behind Airflow is vibrant and dynamic?

Regarding YARN, we haven’t found an obvious way to assign less than a CPU core to a task. In Airflow we typically assign 4 to 16 task slots per CPU, since most tasks execute remotely and just wait for the external system to report success or failure. As a result, a YARN slot would be more expensive than a Celery slot, and Celery has been working extremely well so far. I think resource containment will become increasingly important as we scale up, though.
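The slots-per-CPU ratio Maxime describes maps onto a couple of airflow.cfg settings (key names as in the early releases; the numbers below are illustrative, not Airbnb's actual values):

```ini
[core]
# Total task slots across the installation; on a 4-core box,
# 16-64 slots corresponds to the 4-16 slots-per-CPU ratio above.
parallelism = 32
# Cap on concurrently running tasks within a single DAG
dag_concurrency = 16
```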

Airflow’s framework is available on GitHub, and an introductory presentation from Hadoop Summit 2015 is available on YouTube. Sid describes Agari's Airflow-based infrastructure in more detail on his blog.
