
Matthew Clark on the BBC’s Migration from LAMP to the Cloud with AWS Lambda, React and CI/CD

Mar 16, 2022

This is a re-post from March 2021.

In this podcast Matthew Clark, Head of Architecture for the BBC's Digital Products, sat down with InfoQ podcast co-host Charles Humble and discussed: the new architecture for the BBC’s online services; the challenges of using Lambda functions including cold start-up, function chaining, debugging and setting the memory profile; the role of DevOps and CI/CD; and the nature of a cloud transformation.

Key Takeaways

Pre-2012, the BBC’s digital services ran on a fairly conventional LAMP (Linux, Apache, MySQL, PHP/Perl/Python) stack in a pair of data centres based in London, both running as hot sites for resilience.
The BBC started their cloud migration around 2012 or 2013. Some things, such as video transcoding, were moved early. The migration wasn’t purely a technology shift but also changed working practices to adopt more DevOps and CI/CD.
They built their own Cloud Function-like platform, which they referred to as Nano-services, on top of AWS, before switching to AWS Lambda as it became more mature.
Performance is super important to the BBC. In their tests they found that if a page slows down by a couple of seconds they lose a quarter or more of the audience. To help they have a traffic management layer built with NGINX that runs in VMs rather than Lambda. Redis is used to handle caching and has also been used for queuing.
All the rendering happens in AWS Lambda - it is effectively one large Lambda function. The BBC does chain Lambda functions but has to be very aware of the cost of dwell time, and limits chaining to around three functions.


Transcript

Introduction [00:35]

Charles Humble: Welcome to the InfoQ podcast. I'm Charles Humble, one of the co-hosts of the show, a managing editor at Cloud Native consultancy firm Container Solutions. Today, I'm speaking to Matthew Clark. Matthew is Head of Architecture for the BBC's digital products, a role that covers the websites and apps for BBC's news, sport, weather, children's education and content discovery services. Over the past few years he and the BBC's design and engineering teams have completely rebuilt the BBC's website using modern approaches and technologies such as React and Serverless. It's that migration that we'll be talking about on the podcast today. Matthew, welcome to the InfoQ podcast.

Matthew Clark: Thank you. Thanks for having me.

How big is the BBC in terms of traffic and page types and so on? [01:19]

Charles Humble: Before we get into the architectural details, can you give us a little bit of context? How big is the BBC in terms of traffic, page types and so on?

Matthew Clark: It's pretty big. Yeah. Page types, we certainly have more than 200, last time I counted. There's a real broad set of products, as we call it, on the site. Being a public service, the BBC does a lot of things, trying to find something for all audiences. So there's a really big global news site, for example, probably one of the largest news websites in the world. We've got things more for the UK-only audience, like iPlayer, the VOD platform that gets millions of plays a day. Then there's all kinds of specialist stuff. Children's things, educational things, food recipes, all kinds of stuff. So it's broad, but also the traffic, the number of users we have, is super high as well. Tens of millions will use the services every week. Often more than that. The US election, for example, last November, we had 165 million unique users come to our coverage, either video, audio or text online. So pretty huge numbers. As a technologist, that's part of the fun, right?

Matthew Clark: How do you build something that scales and works well and gives everyone an interactive and personalized experience as appropriate at those moments? You can't just rely on CDNs or caching. There's a real fun challenge there about offering such a great service to that many people, those big moments.

Charles Humble: Can you give me an idea of the number of developers at the BBC as well?

Matthew Clark: Maybe in the order of about 1,000. A bit more, depending on what you call a developer. It's not a small team but compared to some of the big competitors that we find in the US giants and so on, it's not a big one either. So we do have a key goal to be as efficient as we can. We're a public service so we do our best to make the most of those development teams. We've got, as I say, big server workload, a lot of people we need to be as reliable as we can in those really big moments because people depend on us. So we make sure that those developers are working as efficiently as they can on building great products for us all.

Does this same architecture support all different platforms and services that you are responsible for? [03:03]

Charles Humble: Broadly speaking, does this same architecture support all different platforms and services that you're responsible for? So the various news and sports apps, internet-enabled TV, the website, BBC Sounds and so on, is that all one fundamental architecture?

Matthew Clark: Broadly, yeah. As a lot of places do, you have different teams working on different things so there's natural variation, but fundamentally, yes. We're all working the same way. They've got the data sources from which you can then build websites and apps and, as you say, connected TV is another big one. You obviously have to work different ways for different platforms, but fundamentally yes, it's the same thing.

What is the main driver for the BBC’s cloud migration? [03:36]

Charles Humble: Staying at a high level for the time being, it's my understanding that you've been on a very familiar cloud migration. So you've been moving to a public cloud, AWS in the case of the BBC, and also adopting more Continuous Delivery and DevOps. Is that basically correct?

Matthew Clark: Yeah, that's right. We started our cloud journey fairly early for a large org, 2012, 2013. We started doing it. We moved some things quite early on, such as our video transcoding capabilities. All went rather well. So we made the call two or three years into that, I guess a few years ago now, we should move all of our online services to the cloud, the websites, the apps, and all the APIs that power it. Move that fully into the cloud. We gave ourselves a deadline of 2020 to do that. That was a good moment to switch off our data centers and move across. That's what we've just completed.

Charles Humble: What was the main driver for that migration?

Matthew Clark: There are several reasons really. I mean, fundamentally you want to be in the business that you're in, you don't want to be doing more things than you have to, right? So looking after data centers, procuring servers ahead of time. If you can get away without doing that you want to do it, right? But as you go on, you realize that the cloud is more than replacing your data center, right? It just does so much more. The ability to have unlimited compute effectively at a moment's notice, and the same with storage and machine learning capabilities and everything else, is just extraordinary. It opens so many doors. So I don't think we perhaps realized that so much when we started the journey seven, eight years ago. But now it's not a correct comparison, is it? Running your own data center versus the cloud? They're just completely different ways of working. The amount of stuff that we've done since moving to the cloud that we couldn't have done before is huge. It's now such a no-brainer, I think, for anyone doing this. You've got to take advantage of that capability.

What did the architecture look like before you started to migrate? [05:08]

Charles Humble: So can you give us an idea of what your architecture looked like before you started to migrate? So I think your code was mainly written in PHP. I think you were hosted in a couple of data centers in London, in the UK, but can you describe what that architecture looked like pre the migration?

Matthew Clark: Yeah, you've got the basics. I guess fundamentally it was like a classic LAMP stack, which was the thing you did 10 years ago, a bunch of Linux servers running Apache, running PHP, and then various data sources underneath with all your contents, things like MySQL or other databases we had. Maybe a bit more than that because it's a large site. We had a set of Varnish servers doing caching, for example. We had a separate Java API layer to separate the presentation from your logic. So there's a bit more maybe than LAMP, but yeah, fundamentally it was that running on two data centers - hot hot on both so there's resilience to one failing. But it was a very sensible architecture for when we made that back in roughly 2010. Of course it seems old fashioned now as technology moves on.

What was the role of Nano-services? [06:02]

Charles Humble: Right. Yes. The constant march of technology, I suppose. I think you also have an interim step. I found a fantastic talk actually you gave at re:Invent in 2017 where you were talking about nano-services. I think of nano-service as being smaller than a microservice, but perhaps bigger than a function. So I'm assuming you moved to nano services first and then moved to functions though it's probably not quite as clean as that. But is that roughly right in terms of the journey?

Matthew Clark: Yeah, that is roughly right. When we started building things in the cloud, because as you say, we moved to a DevOps model where you build it, you own it, you maintain it. That is great for the Continuous Delivery point of view, right? Teams can be fully empowered to build out as they want, and release it as they want and that's great. But it does come with quite a lot of expense. All of a sudden teams that weren't doing it before now suddenly need to think about networking, right? How their VPCs are set up. Or best practice for patching OSs or scaling or monitoring all these things that on the data center we had separate operations teams looking after. That's totally possible of course, the full stack engineer is capable of doing all of those things. But it's quite a lot of work and a team that specialized in, I don't know, building great websites and good HTML accessibility, it's quite a lot for them to learn to be good at all those things.

Matthew Clark: So we looked at how this could happen. This was the days before Serverless really was a thing, before Amazon had launched Lambda, all those kinds of things. We wanted something similar, but we said, we want a way in which teams can write the code they want to write and have the full power of the cloud, the ability to release when they want and so on, but not have to worry about the practicalities of operating it. We built a platform which is, in effect, if you like, what Cloud Functions are today, where teams could upload their code, not worry about the servers they're running on and how they access it. It gave you the golden path for the 80%. If you're building an API or building a webpage, here's a standard way of doing it. But you've still got the whole cloud if you wanted to do something more tailored or sophisticated.

Matthew Clark: It worked well. One of the things we found was when we gave people the service, the unit of change they wanted to make was smaller than what they'd been doing before. They would change just one page or even one part of the page, or they changed just one API endpoint rather than a set of API endpoints you might have in the microservice. So, yeah, we built this term nano-service, this idea of this unit of change you're making was smaller than you classically would do if you were running your own virtual server or container. They're roughly the same size, I think, as a cloud function is today. I didn't like using the word function because we already had that in programming.

Charles Humble: Right, yes, of course.

Matthew Clark: Because the cloud function is a lot bigger than a programming function. So that's why we picked this term nano-service. I don't think it will take off now because Cloud Functions are so unique, well so common as a term. But that idea of the smaller units of change, I think we are seeing in the Serverless world that your designs are less about servers or microservice-sized things and more about these little tiny components that you then hook together in a chain.

When did you move from your custom Nano-service solution to Lambda? [08:44]

Charles Humble: So when did you move from that to Lambda when Lambda became available?

Matthew Clark: So when Lambda first became available, it was quite basic. We looked at it and we looked at what the other cloud providers were doing when they were coming along as well. It was early days for that technology. Certainly for request-driven things, building websites and APIs, the performance just wasn't there. Especially the cold start time. Sometimes you would request something and it could take several seconds to respond, which just isn't where you want to be of course when building web things. So we actually stuck with our solution for a few years. Then we got to about maybe 2018 and Serverless technology was just coming on nicely with all the cloud providers. We realized right now is the point we need to let go of what we've built. This platform we've built doing these kinds of nano-services, our own cloud function approach. It's worked well but as the bar raises with the cloud providers doing more and more for you, you've got to follow that. You've got to stop building your own things in house and move to what can be provided in a commodity, off-the-shelf way.

Matthew Clark: We reached that moment we realized, "Yep, serverless functions are good enough. We should be using them rather than building our own."

Were there specific reasons why you went to Lambda as opposed to using VMs on EC2, or Fargate? [09:44]

Charles Humble: Were there specific reasons why you went to Lambda as opposed to, I don't know, using VMs on EC2 or something like that?

Matthew Clark: It boils down to this idea that why do something that someone else can do for you? I mean, it's obviously more nuanced than that. But you want your teams to be focusing on what's really important to the organization. You don't want to run the servers, be they physical ones in the data center or virtual ones in the cloud, right? The cost of keeping them secure and appropriately scaled, or even getting the auto-scaling right, and patched, there's significant overhead to that. With Serverless, you get to a place you really want to get to as an organization. I've got my business logic, I've got what I want to happen, I just need it to run at the right moment in the right place. Serverless just takes away all those problems so you focus on what really matters.

Charles Humble: I presume it will be similar for something like, say, a Kubernetes cluster or Fargate or something like that?

Matthew Clark: Yeah, Serverless in some ways abstracts even more, doesn't it? Because you don't have to worry about ... Even less concerned with things like the scaling and the underlying infrastructure and how much capacity you've got. But yes, you could have also done this. We've not taken the container route just because of the path we've taken; it's a perfectly great route of course. But because we just started the cloud journey early, before Kubernetes was a thing, we got really good at building VM images. So from that, the natural step was to jump to Serverless to free us up even more from the underlying hosting.

Are there specific advantages that Lambda has for your particular workloads? [11:00]

Charles Humble: Are there specific advantages that Lambda has for your particular workloads?

Matthew Clark: As well as that general "less to worry about", the fact that it can scale so quickly is very useful to us because our traffic can vary so much. We can go from thousands to tens of thousands of requests per second in not that long a time, and multiple cloud providers can now provide that as a service. But the fact you can do that and you don't have to worry about auto-scaling taking a minute or two, as sometimes it can, or maybe you've not quite got every metric for when you need to scale: is it compute? Is it memory? Is it network and so on? You can rely quite a lot on the fact that you have effectively unlimited compute. Each one's going to have a guaranteed amount of CPU and memory, more or less. That does provide you with a wonderfully consistent experience for everyone. There's all kinds of other advantages too we've found.

Matthew Clark: Testing is another great one. You can create a test environment, a pre-production environment that behaves exactly the same as your production one. Scale it up for maybe a load test for a few minutes, pay just whatever a few dollars charges for that. The moment you finish your load test, it goes back to being nothing again. That's super useful for our engineering teams in making sure that our low services will behave the same as they're evolving.

How do you get observability into your Lambda production systems? [12:10]

Charles Humble: I find that really interesting because a common complaint about Lambda is that it can be quite difficult to debug. Is that something that you've found and how do you get observability into your production systems?

Matthew Clark: Not being able to SSH onto the box and do some digging around is definitely a limitation. When you get to the really gnarly things around some low-level library or network packets or something, you do miss that moment. It's not really hit us, this one, because as you build your Cloud Functions, they're naturally simpler and smaller than perhaps what you would have built as something running on the server, right? They're typically not that concurrent, don't handle parallel requests, not very long running, short-lived. They don't have a lot of state normally. So actually there's less to go wrong, they don't find themselves with memory leaks or getting themselves in a bad state. When they do, you can recreate it fairly easily in your local environments. So we've not had too much of an issue. The heart of it, of course, is because you're now more distributed, you have more separate systems. So how do you understand where in the chain the problem is? We've gone big on tracing. We give requests IDs the first moment they hit our system and pass that all the way through the stack. That is super useful.

Matthew Clark: Of course you make sure you get the logging right and the metrics and the alarms right as well so you can spot when one part of your system is not behaving as it should.
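
The tracing approach described above (an ID assigned the first moment a request hits the system, then passed through every layer) can be sketched roughly as follows. This is a minimal illustration, not the BBC's implementation: the header name `X-Request-Id`, the three layer functions and the log format are all assumptions made for the example.

```python
import uuid

# Hypothetical header name; the transcript does not say what the BBC uses.
TRACE_HEADER = "X-Request-Id"


def ensure_request_id(headers):
    """Return a copy of the headers that carries a request ID, minting one if absent."""
    out = dict(headers)
    if not out.get(TRACE_HEADER):
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def log_line(layer, headers, message):
    """Format a log line that always includes the request ID for later correlation."""
    return f"request_id={headers[TRACE_HEADER]} layer={layer} msg={message}"


def data_layer(headers):
    # Innermost (hypothetical) layer: would fetch content from a data store.
    return [log_line("data", headers, "fetched content")]


def render_layer(headers):
    # Middle (hypothetical) layer: forwards the same headers downstream.
    logs = [log_line("render", headers, "rendering page")]
    return logs + data_layer(headers)


def edge(headers):
    # Outermost (hypothetical) layer: the only place an ID is ever created.
    headers = ensure_request_id(headers)
    logs = [log_line("edge", headers, "request received")]
    return headers[TRACE_HEADER], logs + render_layer(headers)
```

Because every layer logs the same ID, searching the logs for one `request_id` value shows where in the chain a given request slowed down or failed.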

Can you give us an overview of what the new architecture looks like in terms of traffic management and how the rendering layer works and so on? [13:18]

Charles Humble: We should probably step up a level, can you give us an overview of what the new architecture looks like in terms of traffic management and how the rendering layer works and so on?

Matthew Clark: A high-level overview, when a request comes in for a web page, the first couple of layers are traffic management layers. Because we have so much traffic, reliability is super important to us so it's a really key layer for us handling that, handling any mischievous requests coming in, making sure people are signed in if necessary. Caching where appropriate to handle the load, handling any errors and so on. So that traffic management is super key. That runs on VMs. We haven't made that Serverless. Just because it doesn't make sense, right? That kind of proxying responsibility where you're handling large amounts of concurrent requests and generally waiting, lot of dwell time waiting for the underlying systems to respond. That's not a good fit for Serverless today. It'll be interesting to see if Cloud Functions evolve to make that a more reasonable thing, but today that's an example where you wouldn't put it.

Charles Humble: So is that NGINX then, the traffic management layer?

Matthew Clark: Yeah, exactly. A lot of it's based on NGINX and what you can build off that. Beyond that, then you hit Cloud Functions, the HTML rendering part. Like everyone else nowadays we use React and Node.js and the wonderful isomorphic way the same JavaScript runs on the server and then a dynamic client update. So that gets rendered as a function, that works very well that way. We separate that presentation logic from the underlying data logic of "where do you fetch the data and how do you understand it and give the right thing to the right people?" So that's a separate, kind of effectively an API, a business layer we call it, again it's just more functions. So it's functions calling functions. You have a chain of functions at this point, you have to be a little bit careful how far that goes. Of course those functions are then able to call whatever underlying data store it is to actually fetch the content to build the page. It's not a million miles off the standard, kind of back to the LAMP stack we were talking about earlier, right?

Matthew Clark: Request comes in, you do it and return. You're just using functions to do the grunt work, the CPU intense work because it scales so well and there's less to look after.

Why is your WebCore a monorepo rather than using something like Web Components? [15:07]

Charles Humble: One of the things you talked about in a blog post, which was actually one of the things that kicked this whole podcast off, and I'll be sure to link to that in the show notes. But one of the things you mention in that blog post is the fact that your WebCore is a monorepo rather than say using Web Components or something like that. Can you talk a little bit more about that particular choice?

Matthew Clark: When you build a site, we found we have lots of teams working together on building the site, right? Different teams will own different pages and sometimes different components within pages. So there's an interesting question of how do you get them? How do you allow the DevOps Continuous Delivery thing where they can all work on their own and do their own thing and release it in their own cadence, but somehow it comes together to make a really great website? We'd looked at some of the Web Component standards out there that allow you to stitch things together. We've done other clever things in the past around your edge, your traffic management stitching together things. The problem with that approach is you end up with different teams doing different things at different rates, and it doesn't create the best webpage.

Matthew Clark: If you want the consistency in your design and using the same JavaScript libraries, like for example, we've had pages in the past, we've had multiple versions of React on them just because different parts of the page were upgrading at different times. It doesn't lead to a fast site. Performance is super important to us. We've seen over the years, done a lot of measurement on this. If your page slows down by a couple of seconds, you will lose a quarter if not more of your audience. So performance is super important. Accessibility is another important thing for us. We want screen readers and so on to really understand it. So we realized to do this well we needed to have one way of building a website, certainly for the common stuff, and decided to go down that monorepo approach. So we have multiple teams owning their own part of that monorepo. We used the GitHub CODEOWNERS file to help understand who owns what, who needs to review what. But then it all comes together to make one and ultimately one deployment as a function that can render all those different pages.

Charles Humble: Okay. Then if I'm understanding you, all of that rendering is then happening in AWS Lambda, right?

Matthew Clark: That's right. Yeah. It's all happening in Lambda. It's effectively one large Lambda. We might choose to split that up in the future. But as always, you've got different projects, different stages so not quite everything happens that way. But for our new stuff, you have that one Lambda that is able to do multiple different pages from that monorepo and multiple teams effectively contributing into that one Lambda.

Why do you not need an API Gateway? [17:17]

Charles Humble: Another thing that I found interesting about your architecture was that you don't have an API Gateway between your traffic management layer and the Lambda functions, which is typically how it's done. So it was a bit unexpected. Can you talk about that?

Matthew Clark: We obviously looked at it because that's the standard way in which you call these functions. But we realized we didn't need it in our case because we had our own proprietary traffic management layer in front of it. It was able to call the Lambda API directly. So we didn't need to put in an API gateway. Then that saves a bunch of money also because it's one fewer step, bit quicker, bit simpler process as well.

What is each Lambda function responsible for? [17:48]

Charles Humble: So it's a little hard to do this without the aid of a whiteboard, but can you give me a rough breakdown of what each function would be doing? So if I look at the BBC news homepage this morning, then there's a red box with navigation, you've got various headlines, you've got some RSS-style abstracts. You've got sections for must-see, most watched and so on. That we've said is all built with functions. So how does that map? Is that one function per box? Or how does the mapping work?

Matthew Clark: In the past we've tried to separate those different parts of the page. Now, as we were saying before, we have that Monorepo for all the presentations bit. So from an HTML rendering/React point of view, there's one function doing that with different teams building the different components that come into that. But then underneath that at the API level where the data comes from, yes that's separate. So there's different parts of the page, the navigation and the main content and the most read, for example, all those bits, you say. That they will be separate functions, each responsible for their own whatever unique data type it is to power that page.

How do you avoid the cost implications of long chains of functions? [18:49]

Charles Humble: You mentioned earlier that you do chain functions, but that does have cost implications, right? So I'm presuming you have some way of managing that and not having overly long chains?

Matthew Clark: The fact that you're paying for CPU, whether you use it or not, means that if you've got a function waiting on another function, you're effectively paying for that dwell time, which isn't great. That can add up. So yeah, we try and limit the depth of those function chains to no more than about three. Just so you're not getting that too far. It's a balance of course, because you do want the isolation between them. It's nice to have the separation of concerns between them. So there's an internal debate always happening: how do you get that balance right? You want to optimize for costs, but you also want to optimize for performance and reliability and understanding of what's happening.
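
To see why dwell time adds up, consider a rough model of a synchronous chain in which each function is billed for its own work plus the time it spends waiting on everything downstream. The sketch below is an illustration under that assumption only; the function name `billed_ms`, the timings and the hard limit of three are hypothetical, with the limit taken from the "no more than about three" guideline mentioned above.

```python
# Depth guideline from the discussion above ("no more than about three").
MAX_CHAIN_DEPTH = 3


def billed_ms(own_work_ms):
    """Total billed milliseconds for a synchronous chain of functions.

    own_work_ms[i] is the time function i spends on its own work;
    function i calls function i + 1 and waits for it to return.
    Each function is therefore billed for its own work plus the
    duration of everything downstream of it.
    """
    if len(own_work_ms) > MAX_CHAIN_DEPTH:
        raise ValueError(f"chain deeper than {MAX_CHAIN_DEPTH} functions")
    total = 0
    downstream = 0
    # Walk from the innermost function outwards, accumulating wall-clock time.
    for work in reversed(own_work_ms):
        downstream += work
        total += downstream
    return total
```

With three functions that each do 20 ms of work, `billed_ms([20, 20, 20])` comes to 120 ms of billed time for 60 ms of useful work, because the outer two functions are paying while they wait.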

Did you have issues with setting the right memory profile? [19:30]

Charles Humble: Another thing that can be a bit challenging with Lambda is setting the right memory profile. Was that something you struggled with?

Matthew Clark: There was a trial and error aspect to that. It would be great if there were things to help you with that more, because you have that balance where it's a fixed ratio between CPU and memory. We had to go to a GB of memory, even though we don't actually use anywhere near that. Because we worked out that, on balance, that gives the best value overall. The fact that you respond quicker makes it cheaper overall than one of the smaller, cheaper options. So there is a bit of trial and error on that. But we did the trial and error, we found the right answer and we just leave it alone now and we'll review it periodically.
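
The trade-off described here can be made concrete with a little arithmetic. Lambda compute is charged by memory size multiplied by duration (GB-seconds), and CPU is allocated in proportion to memory, so a larger setting can be cheaper if it shortens the run enough. The durations below are invented purely for illustration and are not BBC measurements; billing granularity and per-request charges are ignored.

```python
def gb_seconds(memory_mb, duration_ms):
    """Billable compute for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * (duration_ms / 1000)


# Hypothetical measurements: memory setting (MB) -> typical duration (ms).
# These numbers are made up to show the shape of the trade-off.
measurements = {512: 130, 1024: 50, 2048: 40}


def cheapest(measured):
    """Return the memory setting with the lowest GB-seconds per invocation."""
    return min(measured, key=lambda mb: gb_seconds(mb, measured[mb]))
```

With these made-up figures the 1024 MB setting wins: it costs 0.05 GB-seconds per call, against 0.065 at 512 MB and 0.08 at 2048 MB, even though the function never needs that much memory.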

Did you have issues with cold startup? [20:05]

Charles Humble: Cold startup times is another area where Lambda can sometimes be somewhat challenging. Is that something that you've found to be tricky?

Matthew Clark: It is. It seems to have got a lot better, but it still is a problem sometimes, it can be several seconds. I mean, over half of our Lambdas will return in 50 milliseconds, we've really optimized that and that's great. But you do get some that can take several seconds. In the worst case that isn't great. As a website, you really don't want that to happen. So we watch it carefully. It's fortunately a small percentage, but it is definitely one of the disadvantages of the cloud function model.

How does the financial cost of running a cloud function model compare to what you had before? [20:37]

Charles Humble: Another potential disadvantage of the cloud function model is the financial cost. When your blog post was first published there was a lot of discussion in the community, some of it honestly fairly wild speculation around the cost with the underlying assumption that Lambda compute cost is higher. I think that underlying assumption is fair. So have you been able to offset that?

Matthew Clark: It is a totally valid argument. Because if you do the raw calculation, how much does it cost to have access to a CPU and memory for a period of time it's apples and pears, right? They're different. You can't really compare. But if you compare the virtual machine versus the function cost it's between two and five times as much to use the function. So significant, it's not orders of magnitude. But it's significant. So yeah, it was a concern I had from the start was, is this going to be a financially not great choice? I always liked the idea that if you design for Serverless in a stateless simple way, then moving that onto a VM or a container later is an achievable thing. So we always have that fallback plan. It turns out we didn't need it.

Matthew Clark: I mean, there is of course the TCO argument, the Total Cost of Ownership argument, right? You have to do less. I think that's a totally valid one, but we didn't need that one either. Because when it came to it, it didn't cost us any more to use Serverless, maybe even slightly less. Just because you're not paying for compute for any more than you have to. The millisecond you stop using it, it goes and that makes a big difference. When we run APIs and website servers, you typically run with your CPU usage at 10%-20%, right? That's best practice, that's healthy. You don't want to be contended on CPU or memory or networking or anything else to run a healthy server. Otherwise you will not stay fast and reliable. So if you think about it, say you're at 20% CPU utilization, you're effectively paying for five times as much compute as you actually need. So all of a sudden, by comparison, your serverless costs turn out to be pretty similar.

Matthew Clark: It's not quite that simple of course, because you're also not necessarily using all of your CPU in your Lambda. But if you get that right and you've got nice fast response times and lots of caching in place to make sure that your functions return quickly, what we found was that it wasn't any more expensive, as I say, maybe even slightly cheaper.
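
The comparison in this answer can be written out as a small calculation. Taking the raw cost of VM compute as 1 unit, the sketch below computes the cost per unit of compute actually used, using the figures quoted above (functions at two to five times the raw VM price, servers run at 10%-20% CPU). The function names and the `busy_fraction` parameter are assumptions added for the example.

```python
def vm_cost_per_useful_unit(utilization):
    """Cost per unit of compute actually used on a VM.

    Raw VM compute costs 1 unit; at 20% utilization you pay for
    five units of capacity for every unit you use.
    """
    return 1.0 / utilization


def function_cost_per_useful_unit(price_multiplier, busy_fraction=1.0):
    """Cost per unit of compute actually used in a cloud function.

    price_multiplier is the raw price relative to a VM (2 to 5 in the
    discussion above). busy_fraction (an assumed parameter) is the share
    of billed time spent working rather than waiting.
    """
    return price_multiplier / busy_fraction
```

At 20% utilization the VM works out at 5 units per useful unit of compute, the same as a fully busy function at the top of the quoted price range, and at 10% utilization it is 10 units. A function that spends half its billed time waiting doubles its effective cost, which is the caveat raised in the final paragraph.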

How is caching handled? [22:35]

Charles Humble: That's really interesting. Thank you. How'd you handle caching? I think you were using Amazon ElastiCache, which is basically AWS's hosted service for Redis. Is that still the case?

Matthew Clark: Yeah, we use Redis numerous places. It's fantastic. It's such a versatile service.

Charles Humble: So how does Redis work for say caching content?

Matthew Clark: That use case is really simple, Redis at its heart is a key-value store. So we have it at multiple layers and whenever content is rendered, be it the HTML or the underlying data in the API, we simply put it into the key in Redis without an expiry time. Subsequent requests can just pull it from there rather than re-render it again.

Charles Humble: Presumably, because Redis is storing everything in memory, that means that everything is ephemeral?

Matthew Clark: That's right. I think you can configure it to be more durable than that, but yes, the use case we have you store it in memory. If you lose one of the servers, you will lose the data within it.

Charles Humble: How do you cache bust if you want to force an update?

Matthew Clark: Because it's a key value that your code is writing to, you're fully in control of that. So when we do get an event in saying "something has changed" for whatever reason we can update the document within it, or simply mark it as stale. So that next time around we can re-request the data from the underlying data source.
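The "mark it as stale" option can be sketched like this. A plain `dict` stands in for the Redis key space, and the entry layout (a JSON document with `body` and `stale` fields) and all function names are assumptions for illustration rather than anything described in the conversation.

```python
import json
from typing import Callable


def store(cache: dict, key: str, body: str) -> None:
    """Write a fresh entry for key."""
    cache[key] = json.dumps({"body": body, "stale": False})


def mark_stale(cache: dict, key: str) -> None:
    """Handle a 'something has changed' event by flagging the cached entry."""
    raw = cache.get(key)
    if raw is not None:
        entry = json.loads(raw)
        entry["stale"] = True
        cache[key] = json.dumps(entry)


def read(cache: dict, key: str, fetch_from_source: Callable[[], str]) -> str:
    """Serve from the cache unless the entry is missing or flagged as stale."""
    raw = cache.get(key)
    entry = json.loads(raw) if raw is not None else None
    if entry is None or entry["stale"]:
        body = fetch_from_source()  # re-request from the underlying data source
        store(cache, key, body)
        return body
    return entry["body"]


cache = {}
store(cache, "article:123", "first version")
mark_stale(cache, "article:123")
latest = read(cache, "article:123", lambda: "second version")
```

The other option mentioned, updating the document directly, corresponds to calling `store` with the new body when the change event arrives.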

Charles Humble: Do you have problems with cross region calls? I'm guessing you might take a performance hit there?

Matthew Clark: Yeah, we don't do a lot of cross-regional stuff, so it's not caused too much of a problem. We noticed that across the availability zones, it can certainly add up. But Redis is so fast anyway: sub-millisecond within an AZ and not that many milliseconds across AZs. The performance has always been phenomenal. We haven't done much cross-regional stuff, but I suspect it's not bad anyway. You can do master/slave of course as well. So you could have local replicas of your caching if you need to.

How do you use Redis as a queuing system? [24:12]

Charles Humble: Ah, yeah, of course, you could. Yeah, that would totally solve the problem actually, if you were to ever hit it, wouldn't it? You said already that you're big fans of Redis and you use it quite extensively. I know that you were using it at least as a queuing system, which I was intrigued by. Back in the late 90s and early 2000s, I did a lot of work in both retail and investment banking, particularly around early internet banking. A lot of that was built on top of what was then called IBM's MQ Series, which was their early queuing software. It was actually a really brilliant piece of software, very solid, very reliable. The idea of using Redis for queuing, therefore, intrigued me. Do you still do that? If you do, why?

Matthew Clark: Not in our latest stuff. But we do still have that operational for some of the things we've made before. It's an example of where Redis is so flexible. It is a key-value store, but your values can be a range of different types, including lists and even streams in the latest version. So creating something like a queue in Redis is actually not that hard. You can push and pop things off that list effectively to be a queue, you can write Lua, which runs on the box to do some clever stuff like having a dead letter queue or even some more sophisticated things like spotting duplication in queues or having prioritization, for example. Why did we pick Redis? Just because it's so unbelievably fast that you can communicate between things in a millisecond or sometimes less. It's phenomenal. Whereas most other queues are tens or hundreds of milliseconds, depending on what they are. It's an unusual choice, right? Nine times out of 10, if you want a queue, use a queue, right?

Matthew Clark: Use the right tool for the job. I wouldn't recommend it most of the time. But if you did need something incredibly fast and you were willing to accept data loss, because as you were saying before these boxes can fail and you lose what's in memory, then in a world where you're maybe tolerant of that and speed is your number one thing, it is an interesting choice.
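The list-as-queue idea, with duplicate spotting and a dead letter queue, can be sketched as below. This is a single-process illustration using `collections.deque` in place of a Redis list, not the BBC's implementation: the class, its method names and the retry limit are all hypothetical. In Redis itself the push and pop would be commands such as `LPUSH` and `RPOP`, and the multi-step logic would need to run atomically on the server, for example in a Lua script, which this sketch does not show.

```python
from collections import deque
from typing import Callable, Optional


class ListQueue:
    """Hypothetical in-memory model of a queue kept in a Redis list."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.pending = deque()       # the main queue
        self.dead_letters = deque()  # messages that kept failing
        self._queued = set()         # ids currently in flight, to spot duplicates
        self._attempts = {}          # id -> number of failed attempts
        self.max_attempts = max_attempts

    def push(self, message_id: str) -> bool:
        """Add a message at the head (like LPUSH); reject duplicates."""
        if message_id in self._queued:
            return False
        self._queued.add(message_id)
        self.pending.appendleft(message_id)
        return True

    def pop(self) -> Optional[str]:
        """Take the oldest message from the tail (like RPOP)."""
        return self.pending.pop() if self.pending else None

    def process_next(self, handler: Callable[[str], None]) -> Optional[str]:
        """Run handler on the next message; retry or dead-letter it on failure."""
        message_id = self.pop()
        if message_id is None:
            return None
        try:
            handler(message_id)
        except Exception:
            attempts = self._attempts.get(message_id, 0) + 1
            if attempts >= self.max_attempts:
                self._attempts.pop(message_id, None)
                self._queued.discard(message_id)
                self.dead_letters.appendleft(message_id)
            else:
                self._attempts[message_id] = attempts
                self.pending.appendleft(message_id)  # back of the queue for a retry
            return message_id
        self._attempts.pop(message_id, None)
        self._queued.discard(message_id)
        return message_id
```

Prioritization could be modelled in the same spirit with one list per priority level, polled from highest to lowest. As in the conversation, everything here lives in memory, so a failed box loses whatever was queued.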

How many releases a day are you doing now via the CI/CD pipeline? [25:51]

Charles Humble: Yeah, it really is. I'm impressed that you've been able to do things like prioritization and dead letter queues and even handle duplication, that's more than I would have expected. It's really interesting. We're coming towards the end of our time. I'd like to touch a little bit on some of the DevOps and more cultural aspects. Can you give us an idea of how many releases a day you're doing now via CI/CD pipeline?

Matthew Clark: Yeah, that's a good question. A lot, because there's an awful lot of internal tools as well, so it's probably every few minutes. But in terms of what the user sees, the kind of website and the APIs that power it, we're making releases during the workday every 10 to 20 minutes. So fairly frequently. Some of those things you'll see directly, right? There'll be presentational things. Some of that will be under the hood things or maybe just one part of the site. But there's a continuous process throughout. We have, as everyone does nowadays, an automated CI process where as soon as things hit master and pass all the tests they're immediately launched to live.

How did you manage the organizational shift of moving to the cloud? [26:30]

Charles Humble: One of the interesting things about moving to the cloud is people tend to think of it as a technology move. But actually it's much broader than that. It's really a huge cultural shift and a huge organizational transformation in a lot of cases. For something the size of the BBC I'm imagining that could be very challenging.

Matthew Clark: There's so much we could talk about here. Teams have had to take on more and of course, different teams react in different ways to that. One of the interesting challenges we've had of course, is that the cloud gives you the opportunity to do so much and gives you so much technology at your fingertips, in a way that you really couldn't have when you had your own data centers. Before, you had a much more controlled environment. One of the challenges we've had is to find the right way to control that. Because if every team is going off and doing a different thing, that probably doesn't make sense from an organizational point of view. So how do you get that balance right between empowering teams and making them want to do that without diverging too much, and having to find a way of bringing that in later? Ultimately this is a transformation program, which [in] large organizations is always an interesting thing to do. It affects hundreds of people, who have to change how they think and change what their day-to-day job is.

Matthew Clark: That is a huge challenge. The amount of communication and ultimately persuasion you need to do is massive. One of the things was that, as we rebuilt and moved from the data center to the cloud, we didn't just want to rebuild like for like; we didn't want to just lift and shift. We wanted to change our approach and what people were responsible for and how we built and how we made sure we weren't duplicating anything across teams and all those kinds of things. That's a huge challenge. You've just got to keep plugging away at it and getting people to understand where they need to be.

Did you get pushback from individual developers as their responsibilities changed? [28:03]

Charles Humble: Right. Yeah. I mean, any large-scale organizational transformation is always incredibly difficult. Then thinking about it from the individual engineer's perspective as well, you said that you'd moved to DevOps: you build it, you run it, you're responsible for your service running in production. Again, if you're an engineer who signed up to work at the BBC before this transformation happened, that's a big shift in role, you know, out of hours and all the rest of it. Did you get pushback from that as well?

Matthew Clark: Yeah. We had that model where engineers had to be on call. Right? You build it, you own everything that's really key to it. Yeah. It was for some people and not others. We were lucky I think that each team had enough people who were willing to be on call for it to be all right. We have very good first and second line operations teams as well, keeping things going. So you know you're only going to get called if there is a genuine issue. So it turned out to be all right in the end.

Charles Humble: That's very interesting actually, I guess if you've got first and second line support that's really strong then that does reduce the risk of fatigue and burnout and that sort of thing for an engineering team. So actually a very good point. Matthew, thank you so much, I've really enjoyed speaking to you today. Really interesting conversation and thank you very much, indeed, for joining us this week on the InfoQ podcast.

About the Author

Matthew Clark

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
