
Designing for Failure in the BBC's Analytics Platform


Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at the BBC, gave a session titled "Evolving Analytics in the Data Platform." During the session, Garcia-Gil focused on how her team prepared and designed for failure.

The BBC maintains a data platform to better understand its audiences and help them make the most of what the BBC has to offer. The platform runs on AWS, and the analytics pipeline is the highest-scale data pipeline the team has built: it handles billions of messages per day, and its associated data lake holds petabytes of data.

As the team designed the pipeline, they planned for two types of failure modes: "known unknowns" and "unknown unknowns." Known unknowns are failures the team can predict might happen; metrics, logging, and monitoring dashboards are the primary tools for handling them. Unknown unknowns, on the other hand, are failures they cannot control or predict. Garcia-Gil says it is inefficient to anticipate and plan for each of these failures, so the team instead needs tooling to investigate such issues as they happen. This tooling has since benefited Garcia-Gil's team in other ways, such as abstracting detail from those who do not need to be aware of it and significantly reducing the time to resolve incidents.

The following diagram depicts the pipeline architecture, which is, in part, an outcome of designing for failure.


The main ingestion pipeline components are pictured in the top row of the diagram. A third-party analytics provider uploads files to S3, where an event capturing each upload is generated and placed on a queue. A small Java application (the Job Submitter) picks up the event, performs some minimal validation, and submits a processing job via Apache Livy to an Apache Spark cluster. Processing results are stored in an internal, intermediate S3 bucket. Finally, an Apache Airflow workflow generates and loads aggregations into an Amazon Redshift data warehouse and loads the data itself into the S3-based data lake.
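The Job Submitter's handoff to the cluster can be sketched as a POST to Livy's batches endpoint. The request shape follows Livy's REST API, but the Livy URL, bucket names, jar path, and class name below are all hypothetical stand-ins, not details from the BBC's system:

```python
import json
from urllib.request import Request, urlopen

LIVY_URL = "http://livy.internal:8998"  # hypothetical Livy endpoint

def build_batch_payload(input_key: str) -> dict:
    """Build a Livy POST /batches request body for one uploaded file."""
    return {
        "file": "s3://analytics-jobs/processing-job.jar",   # hypothetical jar
        "className": "uk.co.bbc.analytics.ProcessFiles",    # hypothetical class
        "args": [
            f"s3://analytics-ingest/{input_key}",           # hypothetical input bucket
            "s3://analytics-intermediate/",                 # hypothetical output bucket
        ],
        "conf": {"spark.dynamicAllocation.enabled": "true"},
    }

def submit_job(input_key: str) -> dict:
    """Submit the batch to Livy; the response carries the batch id and state."""
    payload = json.dumps(build_batch_payload(input_key)).encode()
    req = Request(f"{LIVY_URL}/batches", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)
```

Keeping the submitter this thin is what lets it do only "minimal validation" and leave all heavy lifting to the Spark cluster.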

Garcia-Gil's team designed the above pipeline to handle several "known unknowns." For example, dividing the pipeline into multiple services with well-defined boundaries (intermediate S3 stores, queues, etc.) helped the team avoid a "big bang" type of failure. Instead, each microservice isolates failures within its boundaries and can operate independently of the others, so a failure in one service does not affect adjacent services. Another example is the Resubmitter utility, which allows the team to quickly replay data once a failure has been detected and remediated.
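The core idea behind a resubmitter like this can be sketched as re-creating the original S3 upload events for objects that are already in the bucket, so downstream consumers reprocess them as if they had just arrived. The message shape below follows S3's event-notification format; the bucket and key names are illustrative only, and the article does not describe the BBC's actual implementation:

```python
import json

def build_resubmit_events(bucket: str, keys: list[str]) -> list[str]:
    """Recreate S3 event-notification messages for already-stored objects,
    ready to be placed back on the ingestion queue for replay."""
    return [
        json.dumps({
            "Records": [{
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key},
                },
            }]
        })
        for key in keys
    ]
```

Because the pipeline is already driven by queued events, replay needs no special code path in the consumers themselves.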

To handle the "unknown unknowns," the primary tool is the lineage store, implemented on PostgreSQL, which keeps track of every event in the data processing chain. This helps the team counter situations where they would otherwise not have enough data to investigate a problem. Garcia-Gil details one such unknown that surfaced over time.
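A minimal sketch of such a lineage store is below, using SQLite as a stand-in for PostgreSQL and a hypothetical single-table schema (the article does not describe the real schema). Each pipeline stage records an event, and a simple query surfaces files that entered the pipeline but never completed processing:

```python
import sqlite3

def open_lineage_store() -> sqlite3.Connection:
    """Create the lineage table; in production this would be PostgreSQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS lineage_events (
            file_key   TEXT NOT NULL,
            stage      TEXT NOT NULL,  -- e.g. 'received', 'processed', 'loaded'
            event_time TEXT NOT NULL DEFAULT (datetime('now'))
        )""")
    return conn

def record_event(conn: sqlite3.Connection, file_key: str, stage: str) -> None:
    conn.execute(
        "INSERT INTO lineage_events (file_key, stage) VALUES (?, ?)",
        (file_key, stage),
    )

def stalled_files(conn: sqlite3.Connection) -> list[str]:
    """Files recorded as received but never recorded as processed."""
    rows = conn.execute("""
        SELECT file_key FROM lineage_events WHERE stage = 'received'
        EXCEPT
        SELECT file_key FROM lineage_events WHERE stage = 'processed'
    """).fetchall()
    return [r[0] for r in rows]
```

The value of the store is that it captures evidence continuously, before anyone knows which question an investigation will need to answer.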

We're receiving data from a 3rd-party supplier. While there was an agreed cadence and format to that data, we still found plenty of late data arrivals, and we realized we could help track data that is missing and inform our downstream business users that things were not running as expected. In this case, we had the lineage store that keeps track of events and what data is missing, but we took it one step further. We created a serverless function that regularly queries the lineage store for any missing files. If there are missing files, it immediately alerts us and creates a ticket with our supplier.
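The serverless check described above can be sketched as comparing the file keys expected for a period against what the lineage store has recorded. The hourly cadence and key naming below are assumptions for illustration, not the supplier's actual agreed schedule:

```python
from datetime import datetime
from typing import Callable

def expected_keys(day: datetime) -> set[str]:
    """Assume one file per hour, named by its timestamp (hypothetical scheme)."""
    return {f"{day:%Y/%m/%d}/data-{h:02d}.gz" for h in range(24)}

def missing_files(day: datetime, received: set[str]) -> set[str]:
    """Files expected for the day that the lineage store never recorded."""
    return expected_keys(day) - received

def check_and_alert(day: datetime, received: set[str],
                    alert: Callable[[list[str]], None]) -> set[str]:
    """Run on a schedule; alert() would page the team or open a supplier ticket."""
    missing = missing_files(day, received)
    if missing:
        alert(sorted(missing))
    return missing
```

Running this as a scheduled serverless function turns the lineage store from a passive record into an active alarm on supplier delivery gaps.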

InfoQ Live is a series of online events featuring sessions with software practitioners as well as optional roundtables. InfoQ will publicly publish Garcia-Gil's session video in the upcoming months.
