
Evolving Analytics in the Data Platform


Summary

Blanca Garcia-Gil discusses the BBC’s analytics platform architecture, the failure modes they designed for, and the investigation of the new unknowns and how they automated them away.

Bio

Blanca Garcia-Gil is a principal systems engineer at BBC. She currently works on a team whose aim is to provide a reliable platform at petabyte scale for data engineering and machine learning. She provides leadership on ensuring that the development team has the correct infrastructure and tooling required for the entire delivery and support cycles of the project.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Gil: Welcome to evolving analytics in a data platform. My name is Blanca Garcia Gil. I'm a principal engineer at the BBC. I'm going to tell you the motivation behind this talk. It's actually a story of a migration at an organizational level. Sometimes in engineering teams migrations are not seen in a positive or challenging way, because they're seen as a lift and shift of existing technology onto a new platform. In our case, we tried to see some positives in it, and we thought, this is an opportunity for us to start all over. With the things that we'd learned from our previous platform, and the frustrations we'd had, we wondered, what could we do differently this time? There were many things.

Outline

First, I'm going to introduce you to the BBC. I might surprise you with things you didn't know. Then, why a data platform in an organization like this? Then, we'll look at the architecture of the microservices platform that manages our analytics, and what we came up with after the rearchitecture. Then, we'll introduce the concept of failure modes, a term used within observability, and the different types of failure modes. Then we'll apply those to the architecture I introduced earlier, and I'll show you what things we were able to design into it, applying the lessons that we learned. Then, what things came up afterwards that we had not been able to prepare or forecast for, and what we did about them. Then, we'll summarize all the things that we learned together.

The British Broadcasting Corporation (BBC)

First things first, the BBC. If you're an international audience, especially, you might be quite familiar with our news, broadcasting, or our radio. There's a lot more than that to the BBC. We have a very popular sports site, and we broadcast live sport. We have a TV catch-up service called iPlayer for users and audiences in the UK. We now also have a podcasting service called BBC Sounds. In the last year, with everyone being at home, other services have come to the forefront of the minds of our users. One of those is BBC Bitesize, which is always there helping students revise for the national exams that run on a yearly basis. It's been popular because we added more content to help students keep up with the lessons that they were missing in school and revise on a daily basis. I always add here the BBC Weather service, because if you are like me and have lived in the UK for quite some time, you will have learned that we just love talking about the weather. Then, again, we are a global broadcaster, but we're based in the UK, and we're a public broadcaster. What this means is that we are funded by a license fee that is paid by every UK household. As part of our remit, we are meant to reach all the diverse audiences, backgrounds, and communities that make up the UK nations and regions. As part of our mission, we have three core principles: we have to inform, educate, and entertain our audiences.

The Data Platform

Now on to the data platform. Why would an organization like the BBC, which is used to doing almost one-way broadcasting, have a data platform? We've been wanting to shift from where we were into a place where we understand our audiences better. This is for various reasons. One of them is to offer everyone something that they like, not just a select few, but also to increase the breadth of services that people are aware of. In turn, they make the most of what the BBC has to offer them. After all, they're paying us a license fee. It also helps them realize the value that the money they pay us gives them.

Microservices plus Big Data

On the technical side, what does this mean for the data platform team where I work? We are in the big data space. We are fully on the cloud, on Amazon. We design microservices architectures to ingest data from various sources. One of these sources, our highest scale one, is the analytics data. This is the example that we'll be following. This ingestion pipeline means that we receive billions of messages every day. Over time, these messages accumulate and are stored within what we call a data lake. This is essentially just cloud storage, where all the data has been validated and has gone through the processing chain. This data lake is a historical archive. Right now we are already at petabyte scale.

Our Analytics Ingestion Microservices Architecture

What does the architecture that we migrated onto look like these days? At first glance, you might think it looks quite complicated. We'll be breaking it down bit by bit, and focusing on how we operate and what things we have that help us investigate issues. The first row at the top is the ingestion chain, and the main microservices that make up this architecture. If we start on the left-hand side, where there is a red Amazon icon that looks a bit like a bucket, we have cloud storage on S3 where we receive the ingest logs. Our analytics supplier delivers files to us on a regular cadence, and we know when we expect these files, and even what the file names are meant to be. Whenever a file gets into that storage, an event gets kicked off. What this event says to the next element in the system is, there's a new file that has been received and it is ready for processing. This event goes into a queue, and many events will be lined up in that queue. Then there is a Java application that we call the Job Submitter that will read each one of those events and validate the name of the file, checking it is something that we actually expected to receive. It's essentially very minimal work that it does. Then it passes it on to the next element in the chain.
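To make the Job Submitter step concrete, here is a minimal sketch of that kind of file-name validation, written in Python rather than the team's Java, with an invented naming pattern and message shape:

```python
import re
from typing import Optional

# Hypothetical naming scheme for the supplier's files, e.g.
# "analytics_20210621_0130.csv.gz". The real scheme isn't described
# in the talk, so this pattern is purely illustrative.
EXPECTED_NAME = re.compile(r"^analytics_\d{8}_\d{4}\.csv\.gz$")

def handle_new_file_event(event: dict) -> Optional[dict]:
    """Validate one 'new file received' S3 event and pass it on, or reject it."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    filename = key.rsplit("/", 1)[-1]

    if not EXPECTED_NAME.match(filename):
        # In the real system this would be logged, alerted on, and recorded
        # in the lineage store; here we simply drop the event.
        print(f"Unexpected file name, rejecting: {filename}")
        return None

    # Minimal message handed to the next stage (the validation cluster).
    return {"bucket": bucket, "key": key, "status": "ACCEPTED"}
```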

The next element is not really so much a microservice, it's actually a MapReduce cluster. It can process multiple files at a time, and it goes into the contents of each file and validates them against a model. We do what we call model-driven validation. This model defines what the different fields expected in the file are and what the accepted values are. If a file doesn't conform, it will reject it and notify us. It also compresses the data when it's written out to the next stage of the chain. It also does enrichments, which are essentially transformations on some of those fields. The fields can contain locations, postcodes, dates and times. Sometimes the format that they come in is not the most useful, so they're transformed, and then written out. These files are then stored in the intermediate data lake. We call it intermediate only because it's for use within this microservices architecture. Our users of the data don't have access to this. It's still in a raw format, even though it's been validated, and it's not in the most optimal shape to be queried and analyzed.
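As a rough illustration of model-driven validation and enrichment, here is a sketch in Python with an invented model; the BBC's real field names, rules, and enrichments are not public:

```python
from datetime import datetime

def _parses(value: str, fmt: str) -> bool:
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Illustrative model only. Each field says whether it is required, how to
# validate it, and (optionally) how to enrich it into a friendlier format.
MODEL = {
    "visit_id":  {"required": True,  "validate": str.isalnum},
    "postcode":  {"required": False, "validate": lambda v: 0 < len(v) <= 8,
                  "enrich": lambda v: v.replace(" ", "").upper()},
    "timestamp": {"required": True,
                  "validate": lambda v: _parses(v, "%d/%m/%Y %H:%M:%S"),
                  # Rewrite into an easier-to-query ISO 8601 form.
                  "enrich": lambda v: datetime.strptime(
                      v, "%d/%m/%Y %H:%M:%S").isoformat()},
}

def validate_and_enrich(record: dict) -> dict:
    """Apply the model to one record, raising on the first bad field."""
    out = {}
    for field, spec in MODEL.items():
        if field not in record:
            if spec["required"]:
                raise ValueError(f"missing required field: {field}")
            continue
        value = record[field]
        if not spec["validate"](value):
            raise ValueError(f"invalid value for {field}: {value!r}")
        out[field] = spec.get("enrich", lambda v: v)(value)
    return out
```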

There's another process down the line called the data summaries. What this does is take those files, knowing a bit more about the data and about the expected use that reports, dashboards, or data analysts will make of it, and produce aggregations and summaries. Those aggregations and summaries are then written out to our data warehouse. For that we use Amazon Redshift. That is essentially a big database, and it performs queries at scale. Then the historical archive of that data ends up in the data lake. In Redshift, we keep a subset of that data, because that normally covers the use cases we've had. There will be data analysts, data scientists, dashboards, all sorts of users internally at the BBC that will be using that data to inform decisions and to create reports. There will be machine learning algorithms reading from that data as well. It's become quite core to the BBC now.
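The kind of summary this step produces might look something like the following sketch. The table and column names are invented, and Redshift access via psycopg2 is an assumption (Redshift speaks the Postgres wire protocol); the talk only says that summaries are pre-aggregated for reports and dashboards:

```python
import psycopg2  # assumption: connecting to Redshift over the Postgres protocol

# Hypothetical aggregation: daily playback counts per service and programme.
DAILY_PLAYS_SUMMARY = """
    INSERT INTO summary.daily_plays (day, service, programme_id, plays)
    SELECT DATE_TRUNC('day', event_time) AS day,
           service,
           programme_id,
           COUNT(*) AS plays
    FROM raw.playback_events
    WHERE event_time >= %(start)s AND event_time < %(end)s
    GROUP BY 1, 2, 3;
"""

def load_daily_summary(conn_params: dict, start, end) -> None:
    """Run the aggregation for one time window; assumes the tables above exist."""
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(DAILY_PLAYS_SUMMARY, {"start": start, "end": end})
```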

When we were having discussions about the design and the makeup of this pipeline, for me there was a key question that I was looking to answer: are the metrics and the dashboards that we have telling me everything that I need to know? Because my experience with the previous architecture was that operating it was quite painful. It took ages to dive into issues. Sometimes it was frustrating, because you wouldn't find the data to actually tell you what had been going on. I had also learned from experience, as had others in the team, that not all the bugs are necessarily in the code. With microservices architectures, the more microservices you have, the more cracks can appear in your system. These microservices usually don't operate in isolation either; our architecture is connected to our supplier by way of them sending us files. At every point where you connect to another system, there's another potential crack. You have to be very aware of that when you are designing.

Failure Modes - Known Unknowns

Now we're going to focus on failure modes. This is a term used in observability. There are various failure modes, and we're going to explain them first. The first one is what is called known unknowns. What are these? These are the failures that we can predict: we look at a system or at some graph and think, this might fail here. You can provide a safety net for them, or you can log them. You can track them in a dashboard. You can alert on them. These are great, because you know them beforehand, so adding them into your system makes your life a bit easier. Then there's another set of failures that will probably happen over time that you won't know about. By catering for these known unknowns, the question is, do you have enough information for those future failures? Are those metrics giving you everything you need? In our experience, not really.

Unknown Unknowns

There's another mode of failures, and these are called unknown unknowns. We know every system will fail at some point, that is, every production system that handles some load. Just think beyond the experience that you have, and you realize that you can't control everything, you cannot predict all the failures. Once you accept that, then it opens up a new way of thinking, and a new way of tackling issues. The question is about, do you have the data, do you have the tooling that is going to allow you to investigate issues that haven't happened yet?

I like to show this image from the BBC TV series, Sherlock. It shows Sherlock and Watson. Sherlock very frequently says to Watson, you see, but you do not observe. I think this very much relates to observability, that you will have the metrics, you will have the data but sometimes you will still need to cater for that unknown thing, or look a bit beyond. What I like here is that most people will focus on both of the men at the forefront of the photo, but if you look in between them, there's actually a hidden picture of a woman in the background.

Known Unknowns We Designed For

Let's go back to the architecture and the known unknowns that we learned from and designed for when we were designing it. I've summarized them in this table. The first one was that things used to fail in a big bang fashion, and we used to just lose a whole day's processing of data. This was very painful, because when you're working at a scale of billions, it's harder to recover, it's harder to investigate. Everything is just more difficult. In this case, we said we have to be able to isolate failure. If any one of the microservices that make up the architecture comes down, or we have to investigate it, we can take it out of the processing chain without affecting the others. They can operate independently. That's been a huge bonus.

The second issue was that we would encounter problems and find that we just didn't have enough data to investigate them. We would look at the logs, we would look at the metrics, but there was nothing conclusive we could say about what caused the failure. In the data space, there's a term called a lineage store. What this is, essentially, is being able to track all the data that you receive as it goes through the different stages of your system, so that if anything happens, you are able to go into this historical archive of events and see at which point things failed.
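As a sketch of the idea, each microservice could write a small event per file and stage to the lineage store. The schema below is hypothetical; the talk only says it is a relational store of events:

```python
from datetime import datetime, timezone

def record_lineage(cursor, file_key: str, stage: str, status: str,
                   detail: str = "") -> None:
    """Write one lineage event for a file at a given processing stage."""
    cursor.execute(
        """
        INSERT INTO lineage_events (file_key, stage, status, detail, recorded_at)
        VALUES (%s, %s, %s, %s, %s)
        """,
        (file_key, stage, status, detail, datetime.now(timezone.utc)),
    )

# Each microservice would call this as data passes through, for example:
#   record_lineage(cur, "2021/06/21/analytics_0130.csv.gz", "job-submitter", "ACCEPTED")
#   record_lineage(cur, "2021/06/21/analytics_0130.csv.gz", "validation", "REJECTED",
#                  "unknown value in field 'service'")
```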

The third one goes back to failure and the ability to replay. We have built some utilities in our new architecture that help us replay data. This happens at various stages of the processing chain, and it has been tremendously useful and made our lives much easier. The fourth one is that being able to answer a simple question in a distributed system is sometimes just not that straightforward. You always have to ask, how do you know that things are running? Do you really know if things are running? In this case, we still have metrics, we still have dashboards. The lineage store has given us the added granularity that we needed to be able to answer that question at a more detailed level.

Let's go back to the architecture and see where those four items fit in. The first one is about isolating failure. I've marked the three main microservices of the architecture, and each one of those can fail independently. That's a given. We're able to investigate issues and deploy them independently. That's been huge for development and trials. The second one is knowing at which stage of the processing chain the data is. I've shown that as number two at the bottom. That's where the lineage database is. All our microservices write to this lineage datastore, so that we can keep track of and audit those events. Number three is the ability to replay data. We have a utility that we call the resubmitter. If we want to replay some data, we reprocess it. Essentially, what we do is just add an event to a queue that specifies the file and the path that we need to process. There's a Java application that just drops the event into the start of the processing chain. Separately, at the end of our processing chain where we have the data summaries, that's an Apache Airflow application. If you're familiar with Apache Airflow, it has out of the box functionality that will help you reprocess and replay data. We use that quite heavily as well; we haven't had to build anything on top of it. Then, the fourth one is being able to answer the question of whether your system is really running. I haven't added the metrics and the dashboards here. Don't forget those, you still need them. Then the lineage database does the core work.
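The resubmitter idea can be sketched very simply: replaying a file is just dropping an event with its location back onto the queue at the start of the chain. The queue URL and message shape here are assumptions, and the real utility is a Java application rather than Python:

```python
import json
import boto3

sqs = boto3.client("sqs")
INGEST_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingest-queue"  # placeholder

def resubmit(bucket: str, key: str) -> None:
    """Ask the pipeline to process this file again from the beginning."""
    sqs.send_message(
        QueueUrl=INGEST_QUEUE_URL,
        MessageBody=json.dumps({"bucket": bucket, "key": key, "replay": True}),
    )
```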

Unknown Unknowns Which Surfaced Over Time

We'll move on to unknown unknowns, and how you can prepare yourself for those. There were two main things that came up for us. One was that we were receiving data from a third party supplier. While there was an agreed cadence to that data, and agreed data formats, we still found that there were plenty of late data arrivals, and that we could help by keeping track of the data that was missing, and by being able to inform our users and our business downstream when things were not running as expected. In this case, we had the lineage store that keeps track of the events and also of what data is missing. We took it one step further, and we ended up automating it to save ourselves some more work. We created a serverless function, written in Java. What it does is periodically query the lineage datastore for missing files. If there are any missing files, it will alert us, but it will also open an incident ticket with our supplier and give them the exact data that they need to start the investigation. That has been useful for us, and also useful for them.
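A sketch of that scheduled check might look like this: query the lineage store for files that were expected but never arrived, then alert and open a ticket with the supplier. The SQL, tables, and the alert and ticketing helpers are hypothetical, and the real function is written in Java:

```python
from datetime import date

MISSING_FILES_SQL = """
    SELECT e.file_key
    FROM expected_files e
    LEFT JOIN lineage_events l
      ON l.file_key = e.file_key AND l.stage = 'job-submitter'
    WHERE e.expected_on = %s AND l.file_key IS NULL
"""

def check_missing_files(cursor, alert, open_supplier_ticket) -> None:
    """Alert and raise a supplier ticket for any files expected today but never received."""
    cursor.execute(MISSING_FILES_SQL, (date.today(),))
    missing = [row[0] for row in cursor.fetchall()]
    if missing:
        alert(f"{len(missing)} expected files not received: {missing}")
        open_supplier_ticket(missing)  # hand the supplier the exact file names to investigate
```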

The second thing that has come up over time is that when you operate software products, the term service level agreement, or SLA, always comes up. Service level agreements are essentially an expectation agreed with your stakeholders about the promised availability of your service, but also, when incidents occur, about what your promise for fixing them is, depending on the severity level. In our case, we have that like every other live product, and we have the data in the lineage store. However, having to connect to a database, knowing and understanding the data model within that database, and being able to write the SQL already narrows down the number of people that are able to make sense of that data.

There was an engineer in our team who had a great idea. He said, how about we abstract that, and create a command line utility that can answer simple business questions? Taking inspiration from the Amazon command line interface, he wrote a utility in Python to answer the service level agreement question. That command line utility has become a huge success in our team, to the point that every engineer onboarded now will install it, and even business analysts will use it at some point. It produces reports for us, and it makes accessing that data very easy, without having to remember how it all links together and how things are laid out internally. It's really helpful. Back to our architecture, where do these hang off? The lineage store at the bottom. It's been quite key for us in tackling those unknown unknowns. First, the serverless application, the Java application that queries and alerts for missing data, and then the command line tool, which I personally love, and which I recommend you look into. There are some frameworks in Python that make it really easy to do these things.
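As a rough idea of what such a tool can look like, here is a minimal Python sketch using argparse. The command names and the underlying query are invented; the point is simply that a canned subcommand hides the lineage store's data model and SQL from its users:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dataplatform")
    sub = parser.add_subparsers(dest="command", required=True)

    sla = sub.add_parser("sla", help="Report whether the ingestion SLA was met for a given day")
    sla.add_argument("--date", required=True, help="Day to report on, YYYY-MM-DD")
    return parser

def main() -> None:
    args = build_parser().parse_args()
    if args.command == "sla":
        # A real implementation would run the canned lineage-store query for
        # that day and print a short report; omitted here.
        print(f"SLA report for {args.date}: ...")

if __name__ == "__main__":
    main()
```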

Key Learnings

The key learning has been to invest in shared tooling. It's benefited us in many ways, including sharing knowledge across our team. We used to have only two or three people who understood the architecture and were able to dive into issues and triage them. That's not the way to have shared responsibility for your production system. Also, when you have different microservices architectures, like we have, you might sometimes be working on other systems. When you have to go back into a system that you haven't worked on for a while, there's a lot of detail and context that has escaped from your head. It helps to have tooling that lets us get the answers very quickly.

Being able to replay the data is another core thing. There are always going to be failures. Make your own life easier by automating this, or by building something that allows you to do it fairly quickly. The most important one for me has been the time that it now takes to triage and then solve incidents. Back in the day, when we had big failures, what that meant was not just that we lost a whole day's data, but that we had more than a billion records to dive through to understand what had been going on. It was very time consuming and energy consuming. Sometimes you didn't really find what was going on, because of the sheer amount of data. In this case, having the lineage store and having the per-field validation has really helped us. The last one is the ability to use that data to answer business questions and produce reports. It's given our team confidence in the work that we're doing. It's also given the business confidence that we have the things we can control under control.

I wanted to leave you with a question for yourself. I'm sure many of you are working on software products that are in production. If you are part of an on-call rota, you have a phone that alerts you when things go bad. Try to see, do these alerts have a pattern? Are there things that you could automate? Could you develop some tooling? Measure what the impact of automating that would be, show it to your business, and make a case for it. Because it will not only save you from perhaps being woken up, and from frustration, but it will also, in turn, save the business money. There's something to think about there. What are the things that, if you could change them, you would design differently?

Questions and Answers

Betts: I wanted to start by asking about two related concerns you brought up that I think everyone has encountered: that there are people who just don't have good knowledge about a system that might somehow be involved in an incident, and also that when you get brought back in to help triage and troubleshoot a service you haven't worked on in a while and it fails, you have to get re-familiar with it again. What did you do to specifically take care of those two problems, and to disseminate not just here's the problem that's happening, but also, here's a potential solution, here's what the system can tell you about what you should do to fix it?

Gil: For every product that we release into production, we usually have a set of documents that we put together, and one of them is called a runbook. A runbook has a defined structure. It contains things from architecture diagrams, to who the team that owns the product is, who to contact, and links to the support rota. Also, for the known issues in the system that might crop up now and then, it contains the instructions on how to fix them and how to go about triaging them. That's almost like offloading that context from our brains to a shared space we all have. We also update them over time as we go. They are incredibly useful.

Betts: I've seen a couple people already ask about the lineage store and idempotence for the replay. Can you talk about the lineage store a little bit?

Gil: The lineage store is just literally a store of events. It's called lineage because in the data space, lineage is about joining up the different stages of your data along its journey. I showed a cascade of microservices. The reason why the lineage is important for us is that, as I showed, at the end of our ingestion pipeline there is a data warehouse. Where it gets really interesting is that a person will very often be querying that data and trying to make sense of things. They might come to us and say, "Actually, I'm trying to analyze the performance of a video program in iPlayer, and I don't understand why this data is showing this. This doesn't look right." That is when we're faced, all of a sudden, with the question of explaining to them what the raw file that data came from was, and the different stages of transformation it went through. We have to be able to dig through that and join all those points. That's why it's called lineage. In other systems that are not data platforms, you might have just a stream of events or traces. It's the same concept. This is just a very specific example for a data platform.
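Tracing one file through its stages is then a single query against the lineage store. This builds on the hypothetical schema sketched earlier, not the BBC's real one:

```python
# Given one delivered file, show every stage it passed through, in order.
TRACE_FILE_SQL = """
    SELECT stage, status, detail, recorded_at
    FROM lineage_events
    WHERE file_key = %s
    ORDER BY recorded_at
"""

def trace_file(cursor, file_key: str) -> None:
    """Print the processing history of a single ingested file."""
    cursor.execute(TRACE_FILE_SQL, (file_key,))
    for stage, status, detail, recorded_at in cursor.fetchall():
        print(f"{recorded_at}  {stage:<15} {status:<10} {detail}")
```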

Betts: Then, when you have to replay those events, how do you ensure the idempotence, that you're not duplicating everything as you replay them?

Gil: There are different strategies in the different steps. I know that in some of them, there's some cleanup that has to happen before you can actually replay the data, because otherwise you'll be creating duplicates. That's why, when you're automating things, sometimes we think it's just about maybe dropping an extra event, but actually if you sit down and write that process down from start to end, you will realize that there's a lot more to it. What helped us realize that was actually those runbooks, where we had the step-by-step instructions. They became very complicated over time. It was like, this is not good. You don't want anyone to be stressed out having to go through such detailed steps.
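One hedged sketch of the "clean up before you replay" point: make the replay idempotent by removing whatever a previous, partial run wrote for that file before resubmitting it. The table name is hypothetical, and resubmit() is the queue-based replay sketched earlier:

```python
def replay_file(cursor, resubmit, bucket: str, key: str) -> None:
    """Replay one file without creating duplicates downstream."""
    # Remove any output the previous (failed or partial) run produced for this file.
    cursor.execute("DELETE FROM intermediate_records WHERE source_file = %s", (key,))
    # Then drop the event back at the start of the processing chain.
    resubmit(bucket, key)
```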

Betts: Do you use that knowledge to say the runbook has gotten too complicated, we can get the system to help us and simplify the steps we have to do to replay, or just eliminate them entirely?

Gil: Usually, yes. For more than a year now, I think, we have also been doing what we call post-incident reviews. When there's an issue in production, once we go through all the trials and the fixing, the people that have been involved in solving it get together and do almost like a retrospective analysis, to try to learn what things they did, what things maybe could be done differently, and what things actually could be fixed more permanently. For issues that are maybe a one-off, you would add them to the runbook. You usually don't want to have things in there that happen all the time. That's a smell for something that is crying out for automation, really.

Betts: I think someone did specifically want to know, what was the technology you used for the lineage store? Is it Kafka? Is it EventStore?

Gil: For us, it's not really streaming. It's just a relational database. We started with something really simple, but you can take it as far as you would like. It depends on how often you'll query it and what the use cases are going to be. For us, it started out as a place where we could just store things. I think in retrospect, we were not anticipating using it as much, or having real-time processing applications down the line. It can be anything you would like.

Betts: That's typical, I think: we make this little small thing, and then pretty soon it becomes a regularly used part of the system.

How does the analytics platform that you created, and the observability part of it as well, facilitate shared ownership? You have a distributed system, and teams all have their own problems. Is there something that helps them all see, here's the state of the system, here's how we all work together, when one little problem in one part of the system can have a catastrophic effect somewhere else?

Gil: We try to have regular knowledge sharing sessions in our team. Our team working on the data platform is, I think, about 25 people, including engineers and all the other disciplines that we need. We divide ourselves into smaller groups or squads. People sometimes shift between squads at times, but it's quite important for all of us to understand what the other squads are working on, because everything is very interrelated. We do regular demos. We collaborate, not just between our squads, but also with our stakeholders, because sometimes they might have needs, or they might be suggesting an improvement. In the past year, we have said to them, come and join us for a few weeks, and we'll build it together with you. That has been amazing, because we have learned a lot from them, because they're the experts in the data.

Betts: I think that's one of those things I like to see, cross-team ownership. The workflow you described was very clearly for the data pipeline, but some of the ideas you talked about are applicable to any other type of system. For the idea that you have unknown unknowns in the system, did you explicitly put anything in place to help you surface those as they came up, or was it all reactive, in that you saw a problem occurring and then had to go back and find a solution?

Gil: The thing with unknown unknowns is that you can get together with a group and just think of all the possible failures, but it's probably not worth your time taking steps to action them until you know whether they're going to happen. Usually, teams go through these exercises in design phases, where you get a number of people in a room together with different angles of experience; the diversity of experience is key for this. People will suggest ways that things might fail. Then it's about putting that list together and thinking, what is the possibility that this will happen? Then you begin prioritizing them, but also derisking some of them, because you realize that they're just very unlikely. I think it's about keeping those things in mind. You will not be able to prepare for everything, so we still deal with a number of live issues. It's part of the learning process, to be honest.

Betts: You mentioned the relational database, and I think people tend to think those aren't efficient. Do you have any throughput considerations when you're putting all the data into a relational database, or does that keep up just fine?

Gil: For us, until now it's kept up ok. We've been in production, I think, for over three years now. What I've learned is that, as the scale of your problem changes, the way to solve it also changes, so we might need to review that. In the last year, for example, we have had to do a second migration of our infrastructure for our data warehouse, because it's grown so big that it wasn't meeting the needs, so we've had to invest in that. It might also happen in the lineage store.

Betts: Early on, you said you had a goal of keeping things simple. How do you find the trade-off between keeping it really simple but also being useful and able to do more? Do you feel like you've achieved that with some of the solutions you've put in?

Gil: One thing that has helped me focus on keeping things simple is learning about the use cases. What are the use cases from the business and your stakeholders? Getting to know what the pain points are, and starting with those. Then, over time, you find that probably most of the things that you'd thought about were not needed. Also, there is an element that when you write the first implementation, the first version of something, everything might feel very simple, very clean. As you add to it, you begin accumulating the dreaded tech debt. As you keep working on it, you have to not just address it, but also have conversations with your stakeholders and with your business about the impact of it on the team and on the product. If you have a substantial business case, then it's time to tackle it.

 


 

Recorded at:

Jun 21, 2021
