How Twitter Automated Data Quality Check Process

Twitter's engineering team recently published a blog post describing how they designed and built a data quality automation platform on Google Cloud Platform (GCP) using open-source software.

Twitter ingests and creates thousands of datasets for different data products and applications. The company recently moved to GCP and BigQuery as part of its data lake solution. With data ingested at this scale, the natural next step is to ensure its quality, because this data fuels Twitter's core ads, product analytics, and ML products, to name a few.

With datasets this large, quality checks need to be automated. Automation improves confidence in the data and in the decisions based on it, and it lets other teams at Twitter focus on solving business problems rather than fixing data. High-quality data also leads to better decision-making and reduces the risk of data loss.

They created the Data Quality Platform (DQP), which is built on several GCP services and on open-source tools such as Apache Airflow (an orchestration platform) and Great Expectations (a data quality framework). The following diagram shows the DQP architecture.

As the diagram shows, the input to the system is a YAML configuration for GCP. It triggers Airflow jobs that test different resources at different cadences and granularities. The results are sent to a Pub/Sub queue, and a Dataflow job later lands the dataset, with its related quality metrics, at the correct destination.
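The flow described above (configuration in, scheduled checks, results onto a queue, a consumer that lands them) can be illustrated with a minimal, standard-library-only sketch. It is not Twitter's implementation: the configuration fields, check types, dataset name, and function names are all hypothetical. A Python dict stands in for the parsed YAML file, a plain function for the Airflow job, and an in-memory `queue.Queue` for Pub/Sub and the Dataflow consumer.

```python
import json
import queue
from datetime import datetime, timezone

# Hypothetical configuration, shown as the dict a YAML file might parse into.
# Field names are illustrative assumptions, not DQP's real schema.
CONFIG = {
    "dataset": "ads.served_impressions",
    "cadence": "daily",
    "checks": [
        {"name": "impression_id_not_null", "column": "impression_id", "type": "not_null"},
        {"name": "cost_in_range", "column": "cost", "type": "between", "min": 0, "max": 1000},
    ],
}


def run_check(check, rows):
    """Evaluate one configured check against the rows and return a result record."""
    values = [row.get(check["column"]) for row in rows]
    if check["type"] == "not_null":
        failures = sum(v is None for v in values)
    elif check["type"] == "between":
        failures = sum(
            v is None or not (check["min"] <= v <= check["max"]) for v in values
        )
    else:
        raise ValueError(f"unknown check type: {check['type']}")
    return {
        "check": check["name"],
        "rows": len(rows),
        "failures": failures,
        "success": failures == 0,
    }


def run_quality_job(config, rows, results_queue):
    """Stand-in for a scheduled job: run every check and publish each result."""
    for check in config["checks"]:
        result = run_check(check, rows)
        result["dataset"] = config["dataset"]
        result["checked_at"] = datetime.now(timezone.utc).isoformat()
        # Stand-in for publishing a message to a queue such as Pub/Sub.
        results_queue.put(json.dumps(result))


def drain(results_queue):
    """Stand-in for the downstream job that lands results in a metrics store."""
    landed = []
    while not results_queue.empty():
        landed.append(json.loads(results_queue.get()))
    return landed


rows = [
    {"impression_id": "a1", "cost": 12.5},
    {"impression_id": None, "cost": 3.0},
    {"impression_id": "a3", "cost": 7.25},
]
results = queue.Queue()
run_quality_job(CONFIG, rows, results)
metrics = drain(results)
```

With the sample rows, the not-null check records one failure and the range check passes. In a real deployment, the checks would more likely be expressed through a library such as Great Expectations rather than hand-written, and the queue and consumer would be managed services.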

As evidence of the architecture's impact, the blog post cites three main improvements in developing data products:

  • A 20% reduction in the time needed to roll out new processing features, achieved by using DQP for automated validation of the output data.

  • Increased confidence in data being delivered to advertisers through continuous measurement.

  • Alignment metrics between the core served-impressions dataset and downstream datasets for over 400 internal customers. Before DQP, there was no automated visibility into deviance between the upstream and downstream datasets.
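The blog post summary does not say how the alignment metrics are computed. One simple way to think about such a metric is the relative deviance of each downstream total from the upstream total, flagged against a tolerance. The sketch below assumes that definition. The function names, dataset names, figures, and the 1% tolerance are hypothetical.

```python
def relative_deviance(upstream_total, downstream_total):
    """Return the downstream deviation from upstream as a fraction of upstream."""
    if upstream_total == 0:
        raise ValueError("upstream total must be non-zero")
    return abs(downstream_total - upstream_total) / upstream_total


def check_alignment(upstream_total, downstream_totals, tolerance=0.01):
    """Compare each downstream total with the upstream total.

    Returns a dict mapping each downstream dataset name to its deviance
    and to whether that deviance is within the (assumed) tolerance.
    """
    report = {}
    for name, total in downstream_totals.items():
        deviance = relative_deviance(upstream_total, total)
        report[name] = {"deviance": deviance, "aligned": deviance <= tolerance}
    return report


# Hypothetical impression counts for one day.
report = check_alignment(
    upstream_total=1_000_000,
    downstream_totals={"billing": 1_000_000, "analytics": 985_000},
)
```

Here the hypothetical `analytics` dataset is 1.5% below upstream and would be flagged, while `billing` matches exactly.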

As mentioned earlier, automating data quality checks is key to building high-quality data products. Many cloud providers, such as AWS and GCP, offer solutions for data quality automation, and there are many open-source tools in this area for further research and study.
