PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform

PayPal recently described how it standardized on Apache Airflow and Apache Gobblin for implementing its next-gen data movement platform.

In a recent post on the PayPal engineering blog, Jay Sen, a senior member of technical staff at PayPal, details how the existing data movement platform had grown into a complex, unmanageable ecosystem of many tools and platforms. The figure below depicts the existing platform tools.

Data movement platform at PayPal

Sen elaborates on how this architecture affected PayPal:

We needed a data movement platform that can scale and cover a wide variety of storage ecosystems. (...) Also, as the amount of data being produced increased and consumers demanded more and more real-time experiences, we needed a much faster (i.e., throughput-wise), efficient, and reliable data movement platform to serve the downstream business use cases.

PayPal engineers decided to standardize on open-source components and eventually came up with the following architecture.

PayPal Data Movement Platform pipelines

An Onboarding Service exposes REST APIs to manage and orchestrate the data pipelines in the platform. This service is written with Raptor, PayPal's internal Java Spring framework. An onboarding API call results in the creation of a Directed Acyclic Graph (DAG) workflow and the deployment of its configuration on Airflow for execution. The DAG Service performs this work: it creates Airflow DAGs from the requested configuration and a template. PayPal engineers built this service because Airflow does not expose a stable API for managing Airflow DAGs.
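
The idea of generating a DAG from a configuration and a template can be illustrated with a minimal Python sketch. This is not PayPal's DAG Service: the template text, the `render_dag` function, the configuration keys, and the placeholder `echo` command are all hypothetical, and the generated file assumes Airflow 2.x import paths.

```python
from string import Template

# Hypothetical template for a generated DAG file (assumes Airflow 2.x).
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="$dag_id",
    start_date=datetime(2020, 1, 1),
    schedule_interval="$schedule",
    catchup=False,
) as dag:
    move_data = BashOperator(
        task_id="move_data",
        # Placeholder command; a real pipeline would trigger the data mover here.
        bash_command="echo 'move $source -> $sink'",
    )
''')

REQUIRED_KEYS = ("dag_id", "schedule", "source", "sink")


def render_dag(config: dict) -> str:
    """Return the Python source of a DAG file for the given pipeline config."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    return DAG_TEMPLATE.substitute({key: config[key] for key in REQUIRED_KEYS})


if __name__ == "__main__":
    print(render_dag({
        "dag_id": "orders_to_hdfs",
        "schedule": "@hourly",
        "source": "mysql.orders",
        "sink": "hdfs.orders",
    }))
```

A service built along these lines would write the rendered text into the folder that Airflow scans for DAG files. The sketch does no escaping of configuration values, so a real service would need to validate them first.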

Apache Airflow is an open-source workflow management platform created at Airbnb. PayPal engineers use Airflow to define and execute the data pipeline DAGs, where each DAG orchestrates the movement of data end-to-end from data sources to data sinks. In turn, PayPal's Airflow implementation uses Apache Gobblin, a distributed data integration framework created at LinkedIn, to move the data itself. Gobblin simplifies common aspects of big data integration and supports both streaming and batching. However, the integration of Gobblin and Airflow did not come out-of-the-box. Sen details:

We use Gobblin as a core data mover component, controlled and managed by Airflow. To achieve this architecture, we developed new components within Gobblin for better service integration — job server to start/stop jobs from Airflow and CRUD APIs to manage jobs by the onboarding service, job metadata persistence over MySQL for better job management, SignalFX integration, etc. These new additions make Apache Gobblin more generic for enterprise use-cases that we also plan to contribute back (ref: Gobblin Improvement Proposal 4).

All components in the architecture provide visibility via a metric store integration with InfluxDB. InfluxDB is an open-source time-series database developed by InfluxData. It is written in Go and optimized to store and retrieve time-series data in fields such as operations monitoring, application metrics, IoT sensor data, and real-time analytics.
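
To make the metric store integration more concrete, the following Python sketch formats a single metric point in InfluxDB's line protocol (`measurement,tags fields timestamp`). The measurement, tag, and field names are hypothetical and not taken from PayPal's platform, and the sketch handles only numeric and boolean field values.

```python
def _escape(text: str) -> str:
    """Backslash-escape commas, equals signs, and spaces."""
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def _format_field(value) -> str:
    """Format a field value: booleans as true/false, integers with an 'i' suffix."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record with a nanosecond timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    # Measurement names only need commas and spaces escaped.
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    # Tags are sorted by key, which keeps the output deterministic.
    tag_part = "".join(
        f",{_escape(key)}={_escape(str(value))}" for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key)}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line(
        "records_moved",
        {"pipeline": "orders to hdfs", "component": "gobblin"},
        {"count": 1200, "seconds": 3.5},
        1600000000000000000,
    ))
```

In practice an official InfluxDB client library would normally handle this formatting and the write itself; the sketch only shows what a metric emitted by one of the components could look like on the wire.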
