BBC Online: Architecting for Scale with the Cloud and Serverless



Matthew Clark discusses how the BBC’s website is designed in a scalable, performant, and resilient way, what the architectural solution is, and some of the technologies used.


Matthew Clark is Head of Architecture for many of the BBC’s online products. He’s been at the BBC for over 10 years, and has been involved in multiple projects such as covering the London 2012 Olympics, and getting BBC iPlayer working on the International Space Station.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Clark: We've got a whole website and set of apps, which we need to design a system for. We're talking BBC Online, where at any one moment, millions of users can be using the broad range of features it's got. This is an interesting challenge we've got. What's BBC Online? The bit you might know the best is BBC News, used massively around the world. According to Feedspot, it's the number one news website in the world. I've seen some other sites that claim Yahoo is the number one news site in the world. People still use Yahoo? Maybe they do. Either way, BBC News has huge traffic. Users in their millions can turn up for a breaking news event, around the world and obviously in the UK especially. It's available in about 40 different languages. Here are just a few of them. It's a pretty massive site. That's just the start. For people in the UK, there's BBC iPlayer, which is maybe like Netflix, but in some ways even better. Loads of live content. Huge content catalog, and millions of plays on that every day. This is just the start. There's radio, and podcasts, live events, sports scores, weather forecasts, food recipes, things to make you think and to explore. Great stuff for children to grow and revise for their exams, and so much more. There are about 15 apps, as well as TV applications, CarPlay integrations, voice applications, and all sorts more. This is really a broad range that we need our architecture to support.


Our requirements basically boil down to three things. We need to handle the scale of millions of users turning up: not just high traffic, but highly variable traffic as well. There's a breaking news alert, and large numbers turn up out of nowhere. Then we've got that breadth of content that we've just seen. We need to handle all those different content types. We're going to have multiple teams building these sites and apps, lots of stakeholders, lots of content makers, lots of feature requests. How do we scale in that way as well? Finally, this is a public service. A not for profit. The BBC spends public money, so we must spend it super wisely. We must watch out for our costs. There are obvious ones like the cloud costs. The number one cost that most of our organizations and businesses have is people, teams, engineers. How are we designing something so that we can move quickly, develop as fast as we can, and have a low operating cost as well? That's really key. Those are our requirements then. You think we can design an architecture for that? Obviously, those aren't all the requirements. There are all the features we need, and all the content. We need to be personalized, like all modern services. What else? Nonfunctionals. Security is a given, and reliability. Let's focus on that one. That one's key. This is a service that people will rely on, so high uptime is vital.


We've quite a lot to look at. At the highest of levels, let's have a go at this architecture. We can do a blank sheet of paper here, because the BBC did switch off its data centers last year, moved to the cloud. We can do a proper static, and cloud native option. We have users, a good place to start. We have millions of users, we said that, so maybe we could draw it like this. We have websites and apps that they are primarily consuming. We have an awful lot of them as well, so scale there as well. That's one side of the diagram. On the other side, we should show where the content is coming from. There are a few places but we can probably say the main one, which is people: BBC, creative people, news journalists, TV producers, those kinds of people. We have thousands of them doing all kinds of different things. We need to give them some really good tools, content management systems to create and manage all that. It can't just be one tool, because there's so much variety, and we need them to be as efficient as possible as well, don't we? Make as much great and rich content as we can. We're going to need to give them those great tools. Let's get rid of that depth bit, because we know there's a lot going on there. We have those core bits.

What are we going to add to this cloud architecture to make it work? Media maybe is the place to start, our video and audio and images. How do we handle that? There's a lot of complexity there. Basically, it's handling that media, transcoding it, preparing it, storing it, and then packaging and delivering it to the right place at the right time. At the highest of levels, let's draw it like that. Let's focus perhaps a bit more on the other type of content, the metadata describing it. The written content like articles and food recipes, or the data rich content like sports scores or weather forecasts. There's loads of that type. I'd suggest the first thing we need to do when we get content like that is to process it, understand it. What is the content? Who's it for? What does it reference? If we can understand the content and get it right, now, at the point at which the content is made, that will help us when we come to make our user facing experiences later. Then we need to store it, of course, in the content store. It's not going to be one thing. We've already said there's so much breadth. This is going to be a range of different stores, because there are different types of content. We might have key-value stores, for example, there. Nice and simple, aren't they? They work great. Sometimes things are a bit more complex than that. I'm a big believer in using the right tool for the job. A relational database might be good if your data is relational. We probably want to use things like search engines, some recommendation engines as well, because how people discover this content is really key. We'll probably do many things in the storing of our content.

Business Logic

Let's look at the other side. We've said there are websites and apps. To make websites, you need to render some HTML, server side rendering is essential for good performing websites. For apps, of course, they don't need HTML. They do need a great API from which to get the right content. We need one of them as well. Of course, both of those things will need to get the content from that content store we just drew. I'm going to add one more box in the middle called business logic. I don't like that phrase, business logic. Do you know what I mean? The actual logic of working out what to do? Who is the user who is requesting something? Are they signed in? Where are they in the world? What do we know about them? Then the content, what content do we have? What's live now? What events are on now? What programs are on now? What's happening? What's trending? All of these things can come together to help work out the right thing to share to the user. The business logic does that bit. One box there, so we can share that across our sites and apps.

Traffic Management

Let's draw one more box in here. See that one there on the right-hand side, traffic management. That's the combination of CDNs and routing layers, proxy layers, those things. If we've got a very large set of users making thousands of requests a second, to a broad range of services, we need to get really good at handling those requests. Making sure they're valid. Making sure they're successful. For example, is the user properly signed in? Is the request a valid one, or is it a DoS attack, or something like that? That's a really key thing we need to do with that traffic management layer, and we need to get it right.

User Store

Looking good, isn't it? Though, maybe a bit one way, content going from left to right, which is broadly where most stuff happens. What about those users themselves? Maybe they can interact. They can certainly sign in and comment, and maybe they can even make their own content. Let's have the concept of a user store, which again could be a bit generic. A place where users can track what they're doing. Once we have that set of content, we can start to offer amazing things such as that recommendations engine above it. Let's draw an arrow into that. With analytics, we can start reporting back to our content makers, overall, how many people are actually seeing that content, so they can make sure they're making the right thing. We have an architecture, we're done with this. Have we really hit those requirements? We had scale, didn't we? Tons of scale, and breadth of content, and good value. Have we hit them? Maybe we're on the way. We probably need a few more details to really get them sorted.


Let's have a look at a couple. Let's start with caching. Hopefully not too controversial a point. Fairly standard way of helping this stuff. To me, there are four big advantages of caching. Obviously, scale is one of them. Caches can normally respond to far more requests than if you're computing the response every time. That's what we need. That's important. It does it by being phenomenally quick, of course. Let's call out speed as another advantage, because we know performance is a really key nonfunctional for services like these. We know that for every second a page takes to load, 10% of the people head back, and go and do something else. A page taking 2 seconds, which is not that unusual with 3G, for example, that's 20% of people gone on every click. Speed is perhaps the number one thing you can do to increase the amount of traffic you have, so we'll have that advantage.

Cost is another one that's important to us. Caches on the whole are normally cheaper than computing things every time. There's a fourth benefit we don't always remember when thinking of caching, but they offer a brilliant resilience option as well. Should, for example, your database be struggling under load, cache may have something already in it from which it can respond. Serving last known good behavior, or serving stale on error, as it's sometimes called, can make the difference to a wobbly system becoming a good one. We'll take all those benefits. Caching sounds ideal. Where should we put it on this diagram? I think that traffic management there is the really obvious one. We mentioned CDNs before. That front door as people come in. If two people request the same thing and it's not personalized, why compute that twice? The second person can have a copy from the first. Let's definitely use a cache there. In terms of technology, yes, maybe CDN can play a key part. I don't think it's just that, because we said the traffic management layer does a lot of complexity around routing and understanding the user. We're going to need boxes with all that logic in as well, and we're going to want some good cache on those boxes as well. We have cache there. Where else? The content stores, databases, they can sometimes struggle under load. A thundering herd of requests, they might not be able to handle that well. A cache in front of the content store seems to make complete sense.
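The "serve stale on error" behavior described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the BBC's implementation: the class name `StaleOnErrorCache`, the `ttl_seconds` parameter, and the `fetch` callback are all assumptions made for the example, and a real deployment would sit on something like a CDN or Redis rather than a dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Tiny in-memory cache that falls back to the last known good value."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (value, time it was stored)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str, fetch: Callable[[str], str]) -> str:
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh hit: the origin is not touched
        try:
            value = fetch(key)  # miss or expired: ask the origin
        except Exception:
            if entry is not None:
                return entry[0]  # origin is struggling: serve the stale copy
            raise  # nothing cached, so the error has to surface
        self._entries[key] = (value, now)
        return value
```

A caller would pass the function that talks to the content store as `fetch`. While the entry is fresh the store is never contacted, and if the store fails after the entry has expired, the old value is returned instead of an error.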

Which technology should we use? We've got loads of options here. Let me suggest Redis. We've used that for years at the BBC. It's phenomenally fast, super flexible too. It gets my vote every time. Let's mark the technology for that as being Redis. There's maybe one more place where I think we could put a cache. I'm going to suggest here, between that business logic and the web and app API rendering, just because I think there's a lot of overlap there. A lot of different pages and apps are going to be sharing the same information about the content and who the user is. It seems an opportunity there to be efficient, too. As well as the efficiency and speed arguments, as we said before, there's a resilience argument as well. Let's say that that business logic layer, whatever it is, is having a bad day and being a bit wobbly. The cache in front of it can have a go at serving the last known good, assuming there is one, of course, and assuming it's not personalized. That can keep us on the air, keep us going until such time as we can go and fix that issue.

There is one catch with caches, of course: they can slow updates down, because they have a habit of serving old things. We need an event driven element to this as well. On the left-hand side, we've got content that we push through the process into the content store. If we can also push it through into those caches, or at the very least mark those cache entries as stale or needing to be revalidated, then we can correct that issue. Particularly if anything's fast changing, such as a breaking news event or a sports score, we can definitely get those things updated so we're not sharing old content. Caching, no-brainer. It definitely needs to be part of an architecture like this.
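To make the event driven idea concrete, here is a small Python sketch in which the content store notifies every subscribed cache whenever an item changes. The names `ContentStore`, `PageCache`, `publish`, and `invalidate` are hypothetical, chosen only for this illustration. A production system would more likely use a message queue or pub/sub service between the two than a direct callback.

```python
from typing import Callable, Dict, List


class PageCache:
    """In-memory stand-in for one of the caches in the diagram."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def get(self, content_id: str, render: Callable[[str], str]) -> str:
        if content_id not in self._pages:
            self._pages[content_id] = render(content_id)
        return self._pages[content_id]

    def invalidate(self, content_id: str) -> None:
        # Drop the entry so the next request re-renders from the store.
        self._pages.pop(content_id, None)


class ContentStore:
    """Stores content and tells subscribed caches when an item changes."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, on_change: Callable[[str], None]) -> None:
        self._subscribers.append(on_change)

    def publish(self, content_id: str, body: str) -> None:
        self._items[content_id] = body
        for notify in self._subscribers:
            notify(content_id)  # push the change event to every cache

    def read(self, content_id: str) -> str:
        return self._items[content_id]
```

After `store.subscribe(cache.invalidate)`, publishing a new sports score removes the old cached copy, so the next `cache.get("score", store.read)` returns the updated value rather than the stale one.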

Serverless (Compute and More)

Where this may be a little bit more controversial is the use of serverless. In the cloud, we've got the virtual machines, we've got the containers, and we've got the serverless, broadly in a line. It runs from the first, where you have the most control, up to the serverless, where you perhaps have the least control, and in return, you have the least to do. It's that balance. Nobody wants to be looking after a server, worrying about operating system patches, or the amount of disk space, or worrying about autoscaling. You just want your code to run or your content to be stored, whatever it is you're doing. That's serverless. The catch is that compromise. Are we willing to accept the compromises of serverless, which I think basically boil down to three things: it can be slower, it can be restrictive, and it can be expensive. As always with technology, there's rarely a right answer. Just pros and cons of different ones. How do we choose this time? Let's delve a little bit into these.

Slowness. We've got a real example here. Do you remember the HTML box we had in the diagram, the web rendering service? For real, much of BBC Online is rendered using AWS Lambda. We have thousands of web page renders a second, and this is the performance we see with them. The p50, the median, that green line there, is about 140 milliseconds to respond, which includes backend calls to fetch the data. I think that's pretty darn good for an average. The p99, the slowest 1%, that's the purple line there, is about 400 and something milliseconds, so significantly worse, but it's only 1%. Again, that might include some slow data calls under the hood, so that one is probably all right. The issue isn't those two numbers, it's the p100. It's the very slow final few requests that can't even fit on this chart. They didn't fit on the slide. They are multiple seconds long. This is the cold start problem that we get with Cloud Functions, where your code needs to be copied onto a new server under the hood, and instantiated for the first time. Under the hood, serverless is just servers. This is a genuine issue. I wish it wasn't there. If you were doing something truly performance critical, you should worry about it. But it happens so rarely, and we can mitigate against it. We can just try again after half a second. Chances are it's going to work, and we can probably cope with that one.
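The "try again after half a second" mitigation can be sketched as follows. This is one possible reading of it, written as a hedged retry: the second attempt is started while the first is still running, and whichever answers first wins. The function name `call_with_retry` and its parameters are assumptions for this example, and the talk does not say how the BBC actually implements its retry.

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T], wait_seconds: float = 0.5, attempts: int = 2) -> T:
    """Start `call`; if it has not answered within `wait_seconds`, start another attempt.

    Returns the first result to arrive from any attempt.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=attempts)
    try:
        pending = set()
        for _ in range(attempts):
            pending.add(executor.submit(call))
            done, pending = concurrent.futures.wait(
                pending,
                timeout=wait_seconds,
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            if done:
                return next(iter(done)).result()
        # Every attempt has been started; wait for whichever finishes first.
        done, _ = concurrent.futures.wait(
            pending, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return next(iter(done)).result()
    finally:
        executor.shutdown(wait=False)  # do not block on the slow attempt
```

Python threads cannot be cancelled, so a slow first attempt still runs to completion in the background and its result is simply ignored. With a cloud function that means the cold-started invocation is still billed, which is acceptable only because cold starts are rare.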

The restrictive point, number two on that slide. Yes, there are some restrictions. You have less choice of compute and storage and those things, but I think we can live with that. Particularly, for what we're doing here and handling data and building web pages and so on. That single request at once, and a stateless model and basic standard CPU. You don't need any GPU or anything like that. I think we can probably work quite well with that.

Idle Function Problem (aka Function Chaining)

That expensive point, that's a bit worrying. One of the challenges with being expensive is this function chaining problem, where you can end up being idle. You know this one? This is the way that all three of the big cloud providers handle their Cloud Functions, or Lambda as Amazon calls it. You pay based on the time the function is running. If your function is not doing anything for some of that time, you're effectively wasting your money. In this case, let's say, for example, that this storage on the right takes 300 milliseconds to respond, which is not unusual. Say this Cloud Function also needs 300 milliseconds to do its processing on the response. In total it's going to take 600 milliseconds. You're only doing 300 milliseconds of compute. For the other 300 you're paying for that function, or that Lambda, to do nothing. Just to wait for that underlying storage.

If you've chained your functions like in this one, then you might get to a point where that one on the left, for example, may need 300 milliseconds of compute itself, but it's ultimately paying for 1200 because it's waiting for those other ones downstream. In this theoretical example, we're using 900 milliseconds of compute across those three functions. We're paying for three times that. Please, cloud providers, change your pricing model. Charge us less when we're not using the compute. Until that time, we're going to be careful about this chaining problem. Which is a pity, because this is the microservice concept, with separate things each doing one thing well. It doesn't work so well in a serverless environment. Certainly not if you're running thousands a second. If you're just running a few a second, it's fine. At scale, this really costs, and so does using a slow storage solution, like in this example here. We need to design our service to not have this issue, which is a pity, but we can manage that. We can do that. That's one cost challenge out of the way.
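The arithmetic in this example can be reproduced with a short Python calculation. It assumes, as the talk's numbers imply, that each function calls the next one synchronously and is billed for its own compute plus all the time it spends waiting. The function name `billed_ms` is made up for this sketch.

```python
from typing import List


def billed_ms(compute_ms: List[int], storage_ms: int) -> List[int]:
    """Billed duration of each function in a synchronous chain.

    compute_ms[0] is the outermost function; the last one calls storage.
    Each function is billed for its own compute plus all the time it
    spends waiting on everything downstream of it.
    """
    billed = []
    downstream = storage_ms  # wait time seen by the innermost function
    for own in reversed(compute_ms):
        total = own + downstream
        billed.append(total)
        downstream = total  # the caller waits for this whole duration
    return list(reversed(billed))


chain = [300, 300, 300]
per_function = billed_ms(chain, storage_ms=300)
print(per_function)                   # [1200, 900, 600]
print(sum(chain), sum(per_function))  # 900 2700
```

The three functions do 900 milliseconds of useful compute between them but are billed for 2700 milliseconds, the factor of three mentioned in the talk. A single function in front of the same storage, `billed_ms([300], 300)`, is billed for 600 milliseconds.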

Autoscaling Is, In Effect, Permanent Overprovisioning

Fundamentally, isn't serverless just more expensive? Yes. We've done the math for all the cloud providers. If you compare the rack rate virtual machine cost (there are many of them, so pick a good example) against the rack rate serverless cost, just the actual cost of access to CPU over a period of time, then with Cloud Functions or Lambdas you're paying 2, 3, or 4 times as much as if you were using the virtual machine. Not orders of magnitude, but a significant amount to worry about. Is it really a like-for-like comparison? Maybe not, because with serverless, you only pay for it when you need it. Let's take this standard behavior of traffic over time, with quiet periods and high periods, and growing in between. In the old days when we had data centers, we needed our capacity to be about this line. We needed to have enough capacity to handle the busiest moments, and we needed to hope that, at the busiest moment, we didn't run out of capacity.

Then, when the cloud came along, we got autoscaling, so we were able to do something like this, where we could respond to the amount of traffic and scale horizontally, adding more instances to cope with it. We never could get that yellow line quite that close to the underlying blue line, the real traffic, because it took a moment to spot that traffic was increasing, and then it took another moment to gain the instances that give you that additional capacity. You always had to have that spare reserve. In effect, you were permanently overprovisioning with that capacity there. With serverless, of course, you're just paying for what you need. You're just paying for that blue line. Great. It's likely the gap between those lines is not two or three or four times, so you might still be thinking serverless is more expensive. Maybe it is in some situations.

What's that yellow line? Chances are that these virtual machines or the containers you were using are not maxing out before you autoscale. Chances are your CPU is probably only getting to 20% or 30%. Normally, you don't want to get any higher than that, because a contended CPU does not give you good performance. Same with memory and networking, you do not want to be pushing those limits too much. You are already paying for far more compute than you need. Suddenly, that two, three, or four times from a serverless point of view doesn't seem that bad. We've done a lot of math on this. It certainly depends on the variables. Different systems will have different requirements. I don't think there is a major cost issue with serverless. It may cost more per unit, and I really wish the cloud providers would make it a bit cheaper. Ultimately, you are using less of it. In many cases, it won't cost any more, overall. You can see where I'm going? I think there's a part for serverless to play in this architecture.
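The cost argument can be turned into a rough break-even calculation. The sketch below is a deliberate simplification with hypothetical function names and illustrative numbers: it compares only the price of CPU time actually used, and ignores request fees, memory pricing, and the idle waiting discussed under function chaining.

```python
def vm_cost_per_useful_hour(vm_hourly_rate: float, utilization: float) -> float:
    """Cost of one hour of CPU actually used, on a VM that runs at `utilization`."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return vm_hourly_rate / utilization


def serverless_cost_per_useful_hour(vm_hourly_rate: float, premium: float) -> float:
    """Serverless cost for the same hour, priced at `premium` times the VM rate."""
    return vm_hourly_rate * premium


def break_even_utilization(premium: float) -> float:
    """VM utilization below which serverless is the cheaper option."""
    return 1.0 / premium
```

With an illustrative VM rate of 1.0 per hour and 25% utilization, each hour of CPU actually used costs 4.0 on the VM. A serverless premium of 3 times the VM rate puts the same hour at 3.0. More generally, under these assumptions a premium of 3 breaks even at roughly 33% utilization and a premium of 4 at 25%, which is why the 20% to 30% utilization figure matters so much to the comparison.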

Let's have a look back. Let's start with that process box. Do you remember what that is? When a piece of content is made, you want to analyze it and store it. What about that? Could that be serverless? I think so. I think this is the most obvious one. It's not that time critical. If it takes a second or two, it probably doesn't matter. It's event driven. It's a perfect place for a cloud function, I think, to do that processing. We'll mark that one as serverless. What about the content store next to it? This isn't compute, of course, this is storage. Can this be serverless? All the cloud providers have some options around here, don't they? Azure's got its tables, and Amazon's got its DynamoDB, for example. Key-value stores seem to be well suited for this. Not always perfect. On the whole, there are some serverless options there. When you get to some of the more sophisticated storage, relational databases, and search engines, and recommendation engines, it becomes a little harder. The true serverless dream is where you just have infinite capacity as you need it, where you're not having to worry about how many replicas you've got, or what your sharding situation is. We're not quite there with all those. I'm sure the cloud providers are working hard. It's a bit of a mixed picture here. Let me suggest this as serverless where possible.

What about this traffic management layer? This is about handling lots of concurrent requests, so high network, high throughput, and relatively low CPU, because it's passing those things on. Is this a good serverless option, or not? There are API gateway solutions that cloud providers have, but we want a bit more control than that, and we wouldn't want to use a Cloud Function here. I'm going to suggest this is probably good for sticking with a virtual machine, or maybe you could do it with containers. We want quite a lot of control to get this right, because it's a lot of traffic.

What about these ones to the left, then, the rendering of the web pages and the app APIs? This one is harder, because we want the scale. We know that thousands of users come through, and we're going to need the compute in order to render the pages and so on. We have that issue with the function chaining, and we know that these services will be waiting on the downstream ones, such as the content store, to respond. If that's slow, we'll be paying just to sit and wait for it. But we've just put those caches in, and Redis responds in under a millisecond. We're hoping most of the time there won't be much to wait around for. We could really do with that compute on demand, and not have to worry about autoscaling, because this is the bit we really need to scale. I'm going to suggest this is a pretty good choice for serverless.

That final one in the middle is business logic. By the same argument, I think there's a good serverless option there as well. You get the principle. Looking at those white bits, how much is serverless, half, more than half? It's a fair chunk, not everything. It's a serverless-by-default philosophy: use it for all the benefits until you've proven that for that use case it doesn't make sense, that it's too expensive or too inflexible. We have some serverless in there, quite a lot. Those teams building their systems can then worry about the functionality of what they're doing, rather than working out the best way of configuring everything from an operational point of view.

Have Clear Ownership

We have an architecture. Are we ready to just print it out and deliver it to our teams, or do we think maybe we need a little bit more? I'm thinking maybe we do. When we've done this for the BBC, for real, the one thing we found that was critical, was to make sure that we then have clear ownership of everything that needs to happen. It's obvious. What that architecture diagram is, and what actually needs to be owned, not necessarily the same thing. For example, there are core capabilities that we must get right? How do we host it? How do we do our build pipelines, CI/CD? How do we do our analytics? There are common concerns we need to worry about like security and accessibility, and so on. We need people to own these issues. We can't all be experts in all of them. We need to make sure they happen. We've got common capabilities that we want to have across our sites in-house. For example, video clips, how are we handling them? We want one team to own that, and do it really well. In fact, we want a team to own all of the things on these diagrams. There's the fourth and final circle with other things in them, they all need to be owned. For example, weather, that one over there, we need someone to own weather forecasts. Say you're on that team, you're going to want to use lots of the other things in this diagram. You're going to need to consider hosting and build pipelines and maybe video clips. You don't want to have to resolve those problems. Because other teams are owning them, they can provide that as a service for you, allowing you to focus on the things that really matter to what you're doing that's different. In this case, weather forecasts.

It's obvious, but making that list of everything that needs to happen, and then organizing our teams around that to make sure everything's owned, makes the difference between this being a deliverable thing and just an architecture. By ownership, of course, we mean proper end-to-end. You design it. You build it. You operate it. It breaks, you get rung up. The full DevOps ownership lifecycle. You get that right, and you get what I love as one of the great mantras, which I think came from Perl originally: make the common things easy, and the specialist things possible. What do we mean by that? If you're a team building something, you want most of the problems that have been solved elsewhere to just be available for you to use, which could be video clips, could be monitoring. It could be anything. If 80% of the time you're on the same path as everyone else, then hopefully you'll build up this ecosystem of features and capabilities and best practices that you can all use. That way, you're not all reinventing everything. You're not all coming up with your own build pipelines, for example, because teams are sharing things where they can.

You don't want to overgeneralize. There's going to be 20% of the time, when you're doing something different. Maybe you're doing machine learning algorithms and no one else is. Maybe you're doing video transcoding, you're going to need a different process, different tools. That's going to be ok, too. We're trying to get that best of both, for that common stuff is easy, because everyone's following best practice and being as efficient as possible, but still allowing that specialism so you can have that breadth where you need it.

Let's take one final look at our lovely diagram to show how that works in practice. Say you're a new team that needs to make a new website. No worries, we have the place where we do that. You can extend that or clone it as necessary. Let's say your website needs to be radically different. Maybe it's some 3D world that needs to happen. No worries, we can go off and make something different if you want. Maybe you pick a different technology, not use serverless, whatever. We can extend our architecture some other time, at least, to work in a different way. Maybe a CMS is needed that's different as well. No worries, we'll do that. We're giving that best of both: we have that consistency and that efficiency, so that we can be as fast and optimized as possible, and we're still allowing that breadth. That's the beauty of microservice architectures, fundamentally. You get that balance right as teams, where you're allowing that golden path so everyone's sharing where they can, but being different where they need to be. We're hopefully getting that efficiency argument right as well. We're allowing everything to happen. We're not duplicating it unnecessarily. We're ultimately creating what should be a nice, fast, and reliable architecture.

Is BBC Online made like this? Pretty much. It is 25 years old. I won't lie, there's some tech debt. Old systems aren't quite like this. At its heart, this is it. If you were to go visit the BBC homepage now or read a news article, that web page would have been rendered in a Lambda. It would have been cached in Redis. This architecture you see is powering hundreds of millions of requests every day, in a reliable way, the way it should, and in an efficient way too. It's not rocket science, but it works.


We have looked at that architecture at the highest level. We've looked at some of the tech choices, particularly serverless. Very briefly, at the end, we looked at how we organize ourselves, making sure everything is owned, end-to-end, to make it happen.

BBC's Content Management System

A few questions about the content management system. The BBC basically builds its own content management system, and it's headless in a way. We have a team building that, so we can get the workflow exactly right for our content makers, who then effectively create an API or a feed. That can then go into the storage from which we make our websites and apps. Obviously, building your own content management system has a bit of a history to it. There are some brilliant ones out there. That's just what we do.

Serverless Testing

There were some questions about testing. How do we test the serverless stuff? We used to, in the old days, have a QA environment, a test and stage environment, through which things would pass. We still have that. We still have one pre-production environment. One of the nice things about serverless, of course, is that every pull request can have its own separate set of tests run, and you can create an environment just like the production one. You can have hundreds of them at any point, because they're all just copies of your function, your Lambda. We do that. The CI/CD pipeline does that automatically. We still have a moment when it gets to the QA environment where we press a button to launch, so it's not completely automated to live. It's just a one click deploy once we're happy with it. We don't do anything clever like blue-green or rolling deploys. It's something we'd love to do. We've always found that, because it's so easy to roll back to a previous version, it's just easier to roll out to 100% and give it a go. You notice pretty quickly if something is wrong, and you can roll back. It's not the smartest solution. We could do something clever there. We actually find that to be good enough for now, and it's something we'll work on.

The BBC's CI/CD Pipeline

What CI/CD pipeline do we use? Most of what we're doing is on Amazon. We basically do it there, with CodeBuild and CodePipeline, I think just because it's so managed. We used to do a lot of Jenkins, but there was quite a lot of overhead in keeping that going. We have gone with Amazon for the CI/CD stuff.

Security of the Architecture

Security in the architecture. There's a lot around access control. Obviously, serverless really helps, because you don't have the boxes where you need to worry about who has access. We have our own process by which you access those, when we do have virtual machines. We go the full way with IAM access, right, least privilege, making sure that each function and each server, and the rest of it, only have access to what they need. We do quite a lot, of course, with handling requests into our site as well. Being a high profile site, we do get some interesting attacks on us.
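As an illustration of what least privilege can look like for a single function, the Python sketch below builds an AWS IAM policy document that allows only a named set of actions on one resource. The helper name `least_privilege_policy`, the table name, the region, and the account number are all hypothetical, and the talk does not describe the BBC's actual policies. The blanket wildcard check is stricter than IAM itself requires and is there only to make the point.

```python
import json
from typing import Dict, List


def least_privilege_policy(actions: List[str], resource_arn: str) -> Dict:
    """Build an IAM policy document that allows only `actions` on one resource."""
    if any("*" in action for action in actions) or "*" in resource_arn:
        raise ValueError("wildcards defeat the point of least privilege")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": resource_arn,
            }
        ],
    }


# Hypothetical example: a rendering function that may only read one table.
policy = least_privilege_policy(
    ["dynamodb:GetItem", "dynamodb:Query"],
    "arn:aws:dynamodb:eu-west-1:123456789012:table/example-content",
)
print(json.dumps(policy, indent=2))
```

Attaching a document like this to the function's execution role means a compromised or buggy function can read that one table and nothing else.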

Serverless Restrictions

What do we find restrictive about serverless? There's a whole set of things here. You have a limited amount of compute variation. You can't use GPUs, for example. There's not a lot of storage. You can't be stateful. That's a really key one, depending on what you're doing. A function can't handle several requests at once. That also ties into why we use virtual machines for the traffic management layer, which handles high numbers of requests at once. Perhaps you could do it with serverless, but it doesn't make sense when you need to do something that involves a lot of concurrent networking requests.



Recorded at:

Oct 24, 2021