
Beyond the Distributed Monolith: Rearchitecting the Big Data Platform



Blanca Garcia Gil talks about how the BBC re-architected a distributed monolith and shares the lessons learnt from operating it for nearly three years: how they designed their new microservices architecture so that it is easier to test, scales to cater for increasing demand, keeps track of the message flow, and replays errors without stopping the rest of the messages from being processed.


Blanca Garcia Gil is a principal systems engineer at the BBC. She currently works on a team whose aim is to provide a reliable platform at petabyte scale for data engineering and machine learning. She provides leadership on ensuring that the development team has the correct infrastructure and tooling required for the entire delivery and support cycles of the project.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Garcia Gil: I had the privilege this morning of being introduced in the first talk by Sam Newman when he said, "What can be the worst type of monolith that you can have?" That's actually the distributed one, which is what I'm going to be talking about. He also defined it as, "The distributed monolith, because life is not hard enough." If you are interested in learning what it takes in the real world to actually create a distributed monolith, you've come to the right talk.

First things first, I'm going to introduce you to the BBC and what we do in the personalization space so you can get an idea of the context of the talk. Next, I'm going to introduce the problem space that I work in, which is data analytics processing. After that, we'll dive into how we created the first version of this data analytics processing and the data platform at the BBC, and how it ended up being a distributed monolith. We'll go through the reasons why we call it a distributed monolith and what we learned. Then we actually had the opportunity to move away from this monolith, so I'm going to walk you from the very beginning through the steps that we went through in designing the next version of it. Last, I will highlight and share with you a few of the questions that we're asking ourselves as we look beyond what we have right now as a data platform.

The BBC and Personalization

First things first, I work at the BBC and we're known for doing a bit of news and radio, but we have a breadth of services available for audiences. One of them is iPlayer, a very popular TV catch-up service in the U.K. We also have BBC Sounds, which is live streaming radio but also a podcasting service. We have CBBC, which is a channel and a website dedicated to children, and we even help them with tools to revise for exams. We also have a very popular weather service because we're in the U.K., so we love to talk about the weather.

Then within the BBC, I work in the personalization space. I work in the big data team, and what we do is ingest data from various sources inside the BBC and also external ones, and we process that data and store it internally in what we call a data platform. We have about 50 million users registered with accounts at the BBC. When those users register, what they unlock is a set of personalized features, so they will get access to personalized content via recommendations. You also get notified, for example, if David Attenborough is on the radio when I usually watch him on TV, that sort of thing.

You are probably wondering why the BBC, as a public service, would go into the data space. There are different reasons why we would do this. The first of them is that it's 2020; people on the internet expect things to be tailored to them. We all have the expectation that services are going to know something about us. In order for them to give us content that we might like, we need to trust them with our data. The aim of the BBC is not to track people. It's to give them a better experience with our services, because in the U.K. people pay a license fee for us, and that's one of the main ways that we fund ourselves. We want to give value for that service that they're paying for.

Also, we have a privacy promise, which is public. You can visit it on that link. This privacy promise is made of two parts. The first one is what the audience member gets when they trust the BBC with their personal data. If, as an audience member, I register for an account with the BBC, I'm going to be asked for quite personal information, including my email, my date of birth, and my postcode. By trusting the BBC with that data, what I expect to get back is a set of personalized features. I said earlier content recommendations, but also, if a general election happens, we try to tailor the experience to the person's postcode, or at least the postcode they have given us. For example, we would give them the local information for the constituency they vote in, and that is meant to help them get better informed before they cast their votes. We're not trying to change their opinion. We're just trying to give them access to information so they can educate themselves.

Then, on the other hand, what does the BBC get from all that data from the users? As I said earlier, we're a public service, so we're meant to have useful services for the breadth of the U.K. nations and regions, for all the different audience profiles that we have within them. By being able to measure what the reach of our services is, we're meant to be able to create a better BBC for everyone. These are the two parts of the privacy promise. Then, when we think about personalization, we talk about the gradient of personalization. There's an acceptable side of it, which is what we call the Doctor Who side. This is where the BBC tries to be, so that users get a positive experience from their usage of the BBC. Then on the other side are the Daleks. This is when you're using services online where you feel that they know more about you than what you've given them, or that they're trying to hook you into using their services and just spending endless time on them. This is not where we aim to be.

What is Data Analytics Processing?

Secondly, I'm going to introduce you to what data analytics is, because I learned that not everyone understands the complexity of the problem space that we work with. This is what a data analytics pipeline looks like end-to-end. On the left-hand side, we have the users, who are the audience members. This is people like you and me going on the BBC News website, using the mobile apps, and just reading content. When people do this, there's an analytics web server that collects data on the back of that. This data is then stored as activity logs, and then it is ingested and processed. At the end, it is stored internally in what we call the data platform.

Then we have internal users at the BBC. These internal users are people such as data scientists and data analysts. What they're doing is they run reports, they're looking at the data, they're measuring the performance of different products, or they could be seeing what the performance of a given episode of "Killing Eve" was, for example. The team that I work in works mainly on the processing, storage, and access layers, and this is what we will be talking about next.

If we dive one step further into that processing and that storage and access, these are the different phases that usually the data goes through when you're ingesting it. You get the data, then you do some validation because usually your collection happens in a third-party service. You normally don't trust all that data that you're receiving in. Then we do some transformations with that data. We call that process enrichment. By enrichment, I mean, for example, if I'm watching an episode of a TV program, the enrichment process will take the identifier of that TV program and append some metadata of the TV show so at the end of it we can have the title, the description, when it was aired, the duration of the show, things like that.
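The enrichment step described above can be sketched in a few lines. This is a minimal illustration, not the BBC's actual code: the identifiers, field names, and metadata values are all assumptions made up for the example.

```python
# Hypothetical enrichment step: look up the programme identifier on an
# incoming playback event and append the programme's metadata to it.
PROGRAMME_METADATA = {
    "prog-001": {
        "title": "Killing Eve",
        "first_aired": "2018-09-15",
        "duration_seconds": 2460,
    },
}

def enrich(event, metadata=PROGRAMME_METADATA):
    """Return a copy of the event with programme metadata appended."""
    enriched = dict(event)
    enriched.update(metadata.get(event["programme_id"], {}))
    return enriched

raw_event = {"programme_id": "prog-001", "action": "play"}
print(enrich(raw_event)["title"])  # → Killing Eve
```

In practice this lookup runs inside the distributed processing framework over billions of events, but the shape of the operation is the same: join each event against a metadata table keyed on the programme identifier.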

After that, we format the data, and this is to put it in a format that can be easily queried. In the big data space, we tend to use columnar data formats very frequently. That data is put into a data lake, which is usually just big, scalable cloud storage, and then a data warehouse where only a subset of the data is usually kept and where, mainly, the reporting happens. The layer that our users use to access the data is usually the data warehouse, and also the data lake. This happens via a SQL interface. This is a bit of a different area from when we speak about microservices, where usually it's HTTP interfaces and API contracts. For us, our contract with our users is actually SQL. We put the data in a format that we think is going to make sense, but it's a lot more flexible because SQL allows them endless possibilities, so it's quite a challenge.
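To see why columnar formats suit analytical queries, here is a toy pivot from row-oriented records to a column-oriented layout; this is only a conceptual sketch of the idea behind formats such as Parquet or ORC, with invented field names, not the platform's real schema or storage code.

```python
# Row-oriented records, as they arrive from processing.
rows = [
    {"programme": "Killing Eve", "device": "tv", "seconds": 2400},
    {"programme": "Glastonbury", "device": "mobile", "seconds": 1800},
]

def to_columnar(rows):
    """Pivot row-oriented records into a column-oriented layout,
    the shape that columnar formats store on disk."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

print(to_columnar(rows)["programme"])  # → ['Killing Eve', 'Glastonbury']
```

A query that aggregates one field ("total seconds watched") can then read just that one column instead of scanning every full record, which is what makes this layout attractive at petabyte scale.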

The Distributed Monolith and Lessons Learned

The distributed monolith: how did we come about being in this space and making our lives so much more interesting? This is the first version. At a high level, this is what we care about. We identified three separate services. One of them was downloading the logs from the third party. Then we have MapReduce, which is distributed computing, at scale, processing all that data. We did it at scale because at that point, the third-party provider was providing us the logs once a day, so we had to download all the data in one go, then process the next step in one go, and then format it, so it was a big batch all over the place.

This is what the actual architecture diagram looked like. I've put a purple dotted line around it because the three separate services were actually orchestrated by a service called AWS Data Pipeline. The reason why we picked this technology was because we thought, we have these three separate phases that are very closely linked with each other, and orchestration is meant to help you recover from intermediate failures along the different steps. Our choices were a Python application, then Apache Spark as a framework, which is really powerful, and another Python application to load the data in bulk.

What we're going to be looking at now are the failures that we went through, but as I said, we're working at the intersection of microservices and big data. This is quite an interesting area because, with big data, you would never think it's microservices. Some of the services that I've just shown you in the previous slide are quite hefty, such as distributed computing. Let's look at the rest of the talk as being influenced by microservices principles. To give you an idea of the scale of what I've shown you, it's billions of messages per day, and the data that we store is on the petabyte scale. This is not the only ingestion pipeline that we have in our team. We have a few dozen of these. The one that I'm showing you today is actually the highest-scale one that we have. The rest of them, we solve in different ways, because when you're only working with millions of records, it's actually a different way of solving the same problem.

The Distributed Monolith and Lessons Learned

I'm going to show you seven lessons that we learned from this. First one, you probably spotted it: batch. It just smells bad, but this is very typical of working with big data: batch systems that did the work overnight and made reports available the next morning. The third party provided us the data every day as a batch. The assumption, or mistake, that we made was that we then assumed that all the other steps in our pipeline had to be batch. You probably guessed it: when things failed, they failed in batch as well. Imagine digging through billions of records. When things failed, it was like searching for a needle in a haystack, because each one of these billions of records had hundreds of different columns on top of it. It just takes me back in time. I don't want to go back there.

The second phase was the distributed computing. This is where we were meant to be doing validation of the input data and also doing testing, because when you're working with distributed computing, you have to understand not only the framework but how the platform splits up the different jobs and how the data gets shuffled between the different nodes doing that computing. We weren't validating the data enough. When things failed, debugging them in a distributed fashion became really hard. Also, because we had not really invested in enough testing at the upper layer of the testing pyramid, which is the end-to-end layer, we really didn't have a reliable way of putting data in through one end and validating it at the other end. We had a lot of unit tests that checked the different functions, but not so much at the top.
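The input validation that was missing here can be as simple as checking each record against an expected schema and setting the bad ones aside instead of failing the batch. The sketch below is illustrative only; the field names and types are assumptions, not the real analytics schema.

```python
# Hypothetical input validation: check each incoming record against a
# minimal schema and separate valid records from invalid ones, rather
# than letting one bad record fail the whole batch.
EXPECTED_FIELDS = {"user_id": str, "programme_id": str, "timestamp": int}

def validate(record):
    """Return True if every expected field is present with the right type."""
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return False
    return True

def partition(records):
    """Split records into (valid, invalid) lists."""
    valid, invalid = [], []
    for record in records:
        (valid if validate(record) else invalid).append(record)
    return valid, invalid

records = [
    {"user_id": "u1", "programme_id": "p1", "timestamp": 1582000000},
    {"user_id": "u2", "programme_id": None, "timestamp": "not-a-number"},
]
good, bad = partition(records)
print(len(good), len(bad))  # → 1 1
```

The same idea also gives you an end-to-end test for free: feed a known mix of good and bad records in at one end and assert on what comes out at the other.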

A third lesson: tight coupling. I hinted at this earlier. We chose this orchestration service in Amazon because we thought it was going to help us recover from the intermediate steps. The way we had the [inaudible 00:13:53] services and the layers that the data went through, in reality, they were too tightly coupled, so when one of the phases failed it was really hard to replay the data from that point, or worse, we were then having to clean up the intermediate mess and start over. That made our lives really hard. The other lesson about tight coupling was about the data itself. The data being collected was given field names which sometimes didn't really make much sense for our users downstream. There was a coupling between the data model that the analytics web server was using and the data model that we were exposing internally. Our users said, "This actually doesn't really work for us. A lot of these fields have names and I don't know what they are." Then if anything changed upstream, you have guessed it, it just broke everything underneath, especially the data warehouse, because it had static data types. You have to do all that cleansing prior to things getting into it.

Lesson number four is about alerting. When you're working with microservices, you have to have enough metrics and enough monitoring so that you know what their health is. We thought that using the orchestration would help us know when things had failed and where, but in reality, when we operated things, this wasn't the case. When things failed, we had to dive into every one of the phases, and we spent a lot of time digging into that. We didn't have enough alerts, but then we said, "This is easy to fix, we can just add more alerts." Then we ended up at the other extreme, which is when your inbox and your Slack and everything is just clogged with alerts and everyone on the team is constantly looking at it and you're being interrupted all the time. This is not where you want to be either.

Lesson number five is about understanding your traffic patterns. You usually make assumptions when you design an architecture and a service, and when you break down the service, about what your expectations of data flow are going to be. We have something quite particular at the BBC which we call Big News Days. While we're a global media organization, we have some predictable news days, which can be big concerts and festivals, such as Glastonbury, when people will go online en masse to watch them. They are [inaudible 00:16:30], they're also highly anticipated. They can also be tennis tournaments, such as Wimbledon, or even royal weddings. Then on the other side, we have unpredictable news events, and these are usually atrocities, catastrophes, and tragedies. Those happen all of a sudden and they really drive traffic to our websites, our applications, and our streaming platforms.

To give you an example, I put a screenshot there of the weather, because a couple of years ago, about this time of year, we had in the U.K. what we call the Beast From the East. This was an air mass that went over the U.K. I even have pictures of it; there was just snow and ice all over London and the rest of the country, and things just stopped. What that meant for our data platform is that we saw a 50% uptick in traffic. When you're working in the billions, 50% is quite a lot. To adapt to that very quickly, your microservices need to be able to scale, and within your architecture, there will be different scaling requirements alongside it. That's not where our data platform was back then, so it was very hard to scale.

Also, we learned that over time, our data is only increasing. Looking six months back, it just keeps going up and up and up, so we also have to change our predictions going forward to cater for more data. Then lastly, because we're a public service, the things that we build have to have some cost analysis behind them. I've just said two almost opposite things: we have increasing data volumes, but we have to keep the cost down. That's quite a challenge.

The last of the lessons was about actually talking to our users. At the other end, the very far end, we have internal users, and initially we really didn't listen to them very much. Then we began having what we call Coffee & Data. These are monthly meetings with our users. We have coffee, some biscuits, and then we hope that they have little chats amongst themselves and with us. What they told us in these chats, in the months that we've been talking to them (it's been more than a year now), was, first of all, that the data was in a shape that was very hard to use. Secondly, when they were using the SQL interface to query the data, they were having to do the same calculations over and over. Some of those calculations could be quite expensive, so the queries would take a long time. We thought maybe we could work together in the future and do some of these calculations for them, so that when they access the data, it's in a friendlier shape for them.
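The kind of pre-calculation the users were asking for can be pictured as a simple aggregation done once during ingestion rather than in every analyst's query. This is a toy sketch with invented field names, not the platform's real summary logic.

```python
from collections import defaultdict

# Illustrative pre-calculated summary: total seconds watched per
# programme, computed once so users don't repeat the expensive
# aggregation in every SQL query they run.
events = [
    {"programme": "Killing Eve", "seconds": 1200},
    {"programme": "Killing Eve", "seconds": 1260},
    {"programme": "Glastonbury", "seconds": 900},
]

def summarise(events):
    """Aggregate raw playback events into per-programme totals."""
    totals = defaultdict(int)
    for event in events:
        totals[event["programme"]] += event["seconds"]
    return dict(totals)

print(summarise(events))  # → {'Killing Eve': 2460, 'Glastonbourne': 900} is wrong; see test
```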

Designing a New Analytics Ingestion Architecture

We've seen what the distributed monolith was. How did we come to move away from it? The BBC as a whole decided to change the analytics provider, and this is when I saw the sky open and I thought, this is the big opportunity we've been looking for. We can now think about how we're going to tackle this. In our team, we were going to have to integrate with the new provider, and also the shape of the data we were going to be ingesting was going to be slightly different. There was a big exercise across the BBC about what the future of the data and how it was modeled was going to be.

I'm going to take a step back now. When you begin designing a system, you normally want to revisit the principles that you want that system to abide by. To paraphrase Sam Newman's definition of microservices, "What you want is small, autonomous services that work together, modeled around a business domain." I wish we had done that in the previous phase. Sometimes you just need to go back to basics. We thought, let's think about the different business domains again: do they still hold? Going back to that initial slide about the different phases of data ingestion, they still actually hold. We're still going to have to download data, though not as a batch anymore, because our new provider will give us data every hour. This is a big improvement for us. We still have to validate it, and we also need to transform it and enrich it, and also format it, because the more data we have, in order for the users to be able to query it quickly, the data needs to perform for them. Then at the other end, we still have the data warehouse, which holds the most recent data. If we can pre-calculate some of the fields that they use as [inaudible 00:21:30], that would also help them.

We looked at our team, we're quite a big team, about 30-something people, and we decided to divide the team amongst squads that would take responsibility for each one of those bounded contexts. Each one of the teams was pretty much going to work on one of the services, sometimes micro, sometimes not so micro, that we were going to deliver. The first one was going to download the data and then think about how we can keep track that we have received the most recent data. The next one was going to hold the validation and think about how we can decouple the data model that is coming in from what it makes sense to expose to our users, and have that stored in columnar format as well.

Then the last one was about the feedback from our users, so the data aggregations. What are the things that will help our users get the most value out of the data quicker, so what are the things that we need to pre-calculate? Also, this was a great time to think, from the very beginning, about the need to invest in testing. Testing is something that, a bit like security, if you try to bolt it on at the end, you're setting yourself up for failure. It's a lot harder to add later on. If you start from the beginning, then it just becomes something that is natural for developers to work on.

This is what the first version of the new pipeline architecture looked like. At a very high level, it's quite similar to the previous one. We introduced the notion of the intermediate data lake. What this was is just a way to decouple that model-driven validation phase from the data summaries. It's just a sink for the data, an intermediate sink. If the data is validated and formatted, we can just store it there and we know it's in a shape that we can then derive things from [inaudible 00:23:30]. The part where we receive the logs, as I said, was a lot easier this time. Instead of having to connect to a third-party API, the data was being delivered to us, and we needed to learn whether we were missing anything.

Then we had to, as microservice architectures do, evolve over time. There's also some growth that happens as you learn from them. The first question was, how do we keep track of this data? The data was being delivered to us in small files, and those file names had some logic behind them. If we stored which data had come through, and we knew how much we were waiting for, we could automate this bit and see whether we were missing anything as early as possible. We introduced a lineage store, which is just a relational database. The different steps would write an event to this relational database and say, "This file has arrived. This file has been processed." That way, all across the pipeline, we would be able to know which state the data was in. If one of the steps failed, we would also store an event there. It became a lot easier for us to triage what had failed and where. We didn't have to debug the entire thing that had been processed.
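A lineage store of this kind can be sketched with any relational database. Below is a minimal in-memory illustration using SQLite; the table layout, file names, and step names are assumptions for the example, not the BBC's actual schema.

```python
import sqlite3

# Sketch of a lineage store: each pipeline step records an event per
# file, so triage can ask "where did this file get to, and did it fail?"
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lineage (
        file_name   TEXT,
        step        TEXT,
        status      TEXT,
        recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def record_event(file_name, step, status):
    """Write one lineage event for a file passing through a step."""
    conn.execute(
        "INSERT INTO lineage (file_name, step, status) VALUES (?, ?, ?)",
        (file_name, step, status),
    )

record_event("logs-2020-02-27-09.gz", "download", "arrived")
record_event("logs-2020-02-27-09.gz", "validation", "processed")
record_event("logs-2020-02-27-10.gz", "validation", "failed")

# Triage query: which files failed, and at which step?
failed = conn.execute(
    "SELECT file_name, step FROM lineage WHERE status = 'failed'"
).fetchall()
print(failed)  # → [('logs-2020-02-27-10.gz', 'validation')]
```

The value is that a single query answers "what failed and where" without re-debugging every phase of the pipeline.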

The other part was, you're storing all this data in another relational database, but then you have to understand how you're storing it and what the format is, and the same goes for your whole microservices architecture. If there are people who are new to your team or less experienced, that's a lot of context overhead that they have to hold in their heads to be able to just triage something. We thought, how do we minimize the knowledge that everyone in the team needs in order to deal with this architecture? Someone in the team had an idea: let's develop a command line tool that all the developers can install on their laptops and that we can ask basic questions. Sometimes the questions are: am I missing any data for the last hour? It just gives you a report. Are the service level agreements being broken for the last 30 days? It just gives you a report.

That is something that has transformed our team. We have had a number of people joining in the last few months and they have been ok with doing these things. Whereas in the past, when you asked someone who joined the team, "Are you ready to get on the [inaudible 00:26:05]?" people were always just like, "No." It's just too much. Now everybody is like, "Well, if I can ask these basic questions and then, with the output that I get, ask for help, then that's ok." These command line tools work on top of the lineage store. We don't want to be running this manually all the time, as this is not productive developer time, so we added some alerts as well on top of this lineage.
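A toy version of such a command line check might look like the following; the flags, file names, and report format are invented for illustration, and a real version would query the lineage store rather than take its inputs as arguments.

```python
import argparse

def missing_files(expected, arrived):
    """Return the expected files that have not arrived, sorted by name."""
    return sorted(set(expected) - set(arrived))

def main(argv=None):
    parser = argparse.ArgumentParser(description="Data-completeness check")
    parser.add_argument("--expected", nargs="+", required=True,
                        help="files we expected for the period")
    parser.add_argument("--arrived", nargs="+", default=[],
                        help="files the lineage store has seen")
    args = parser.parse_args(argv)
    missing = missing_files(args.expected, args.arrived)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All files accounted for")
    return missing

# Example run: two of three expected files have arrived.
main(["--expected", "f1", "f2", "f3", "--arrived", "f1", "f3"])
# prints: Missing: f2
```

The point of a tool like this is exactly what the talk describes: a newcomer can ask a basic question, get a readable report, and then ask for help with the output in hand.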

One of the main alerts is: when is data missing? We automated this on top of the lineage store, and it's just a [inaudible 00:26:43] that checks and queries: is there any data missing for the last hour? But the best part is not that. It's that it will actually automatically open a ticket with the third party and tell them, "This is the data that I'm missing," listing all the data that is missing. It also emails us and says, "I've already opened a ticket. You don't need to do anything." It's been amazing.

The next bit, the last part, is about replaying and recovering from failures. Even if architectures are more resilient these days, you're never going to be able to escape failure. You're always going to have to replay data. You're always going to have to recover from it. We thought we needed an easier way to replay data that has failed processing through the architecture. We also, at the end, added a utility so that we would be able to drop the data. Once we knew from the lineage store what had failed, we would be able to pump it through at the beginning of the pipeline and just allow it to [inaudible 00:27:44]. The other really nice thing is that because we don't operate in a batch mode anymore, if anything fails in any of the steps, it is taken out of the processing pipeline. Data will keep flowing, we will not stop, and it gives us enough time to figure out what has gone wrong.
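The replay idea can be pictured as pushing the failed files (known from the lineage store) back onto the head of the ingestion queue while fresh data keeps flowing behind them. This is a conceptual sketch with made-up file names, not the real replay utility.

```python
from collections import deque

# New data still arriving at the pipeline.
ingestion_queue = deque(["logs-11.gz", "logs-12.gz"])

# Files the lineage store reports as failed and needing a retry.
failed_files = ["logs-09.gz", "logs-10.gz"]

def replay(queue, failed):
    """Re-enqueue failed files ahead of fresh data so they are retried
    first, without stopping the flow of new data behind them."""
    for file_name in reversed(failed):
        queue.appendleft(file_name)
    return queue

replay(ingestion_queue, failed_files)
print(list(ingestion_queue))
# → ['logs-09.gz', 'logs-10.gz', 'logs-11.gz', 'logs-12.gz']
```

Because failed items are handled per file rather than per batch, one bad file no longer blocks the rest of the stream while you investigate it.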

This is what the new architecture looks like. It's quite daunting. I thought that if I had shown you this at the beginning, you would have been like, "Oh my God." Things are still the same as earlier, it's just that the boxes are a bit more friendly. I'm now going to break down what the benefits of using microservices principles were in the different parts of the architecture, so that you can follow the assumptions and the thinking behind how it's working.

The first benefit I mentioned was about enabling the team to choose the technology that was fit for each one of the problems that we needed to solve. I've highlighted in yellow the three main phases: the ingestion, the validation and transformation, and then the summaries. They're very different pieces of technology, and the glue between them is also different. The first is a [inaudible 00:29:03] application, which our team has a lot of experience operating and scaling, so it was almost a no-brainer. Then for the validation and enrichment, we chose Apache Spark and its Scala implementation. This is a bit trickier to operate sometimes because our team doesn't have as much experience with it, but having it separated from the rest of the pipeline allows us to potentially change it if we wanted to. If there are any changes in any of these areas, we can easily deploy them independently. That's another big bonus.

Benefit number two, I've already mentioned it: the ability to change and replace components. The three different services are deployed independently. There's no dependency between them. Let's say that we look at the data summaries. This is where we chose Apache Airflow. It's another very popular orchestration service in the data space, and it's written in Python. We chose it because our users, who design and use these summaries, mostly know Python as a language, so it's a nice way to collaborate with them. They actually help us build the data summaries, but if something else came along that was more compelling, we could very easily tear it down and replace it with the latest language or framework that we thought of.

The third benefit is about isolating failure. You want your microservices to be able to fail independently and not for everything to go down as a result. In our architecture, we have two ways in which we isolate failure. One is in the data that we process. Before, in the batch world, if something failed, everything else failed, so there was no data delivered whatsoever. Our users were very unhappy when that happened. Now that data is taken out of the processing stream, so at least they will get some data, and we let them know when the data is complete. The other one is about the services. If the data summaries service goes down, we won't be able to calculate the summaries, but we'll still have data, just not aggregated, in the platform for our users. Because it doesn't directly affect the data warehouse or the upstream service, we can still continue processing.

The fourth benefit is about how microservices architectures allow you to adapt and grow as you operate and learn about them. For us, that meant adding different services that would help us operate the platform, but we also know that in the future, as the scale of the data that we ingest increases, it's very likely that some of the assumptions we made when building the processing pipeline will have to change; we may have to break them down into smaller problems or just tackle them differently. That is something that choosing microservices has allowed us to do. Comparing back to what we had earlier, it is quite some growth. It's been quite a leap in thinking for the team.

What are my main takeaways from this exercise? The first of them is that change is the only constant. You're always going to have to adapt to change, even if you're using monoliths. As things change and you measure them, you're going to have to adapt. Then, make sure that everyone on the team can dive into live issues, because it's a very easy way to spread knowledge in a team, but also a way for us as developers to learn whether the assumptions that we made when we built a new feature actually turned out to be ok in production. It's a way to close that feedback loop and learn as we go. The third one is about being empowered to choose the languages and tools that the whole team owns, because in parts of the architecture where you know the problem really well, perhaps you can take some risks; in other parts, you need to go with something that you're very comfortable with and that you can operate.

The Future of the Data Platform

This brings us towards where we are now, but we're already having conversations about where the future of the data platform lies. Where are we going with this? What are the challenges? For me, the main one is how we are going to change this architecture as the data load increases. Also, our internal user base is increasing, and that is a separate challenge, but that's another talk altogether. The next one is about our users' requirements of the data platform. Most of our users use SQL as the interface, but we also have machine learning use cases within the BBC, so we're going to have to enable different access modes and different ways for the users to use the data. The last one is about how we can make it easier for our users. Most of our users do not really come from engineering backgrounds. For them, SQL is just an interface and they only look at the data, so we need to make it as easy as possible for them. For me, the concern is how we keep the data secure while making it easier for our users.

To end today, I'm going to tell you something about the data, because we've been talking about pumping data all over, but are we actually getting some insight out of it? One of my favorite things that we've learned over the last year is about one of my favorite TV shows, "Killing Eve." When it was released about a year ago on iPlayer, the whole series was released to us at once. We all love binge-watching these days because that way we can spend the whole weekend watching something. What we found when the data came through was that a significant number of users were watching the last episode before anything else, and we thought there was something wrong in the data. We were like, "This can't happen." I love reading books, so I know people who actually read the end of a book before they decide whether to read the rest. We did a little poll around our team, and it turns out that there were people in our team who actually watched the last episode before going in. I thought, ok, this is a new watching pattern. With this, I will leave you today. Thank you.




Recorded at:

Apr 21, 2020