APIs at Scale: Creating Rich Interfaces that Stand the Test of Time

Summary

Matthew Clark and Paul Caporn take a look at versioning, design patterns, handling different use cases, supporting high-traffic moments, and the merits of different API types.

Bio

Matthew Clark is Head Of Architecture for the @BBC's Digital Products. Paul Caporn is Lead Technical Architect, TV and Radio @BBC.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Clark: Let me introduce Paul, and myself, Matt. We both head up the architecture of the BBC's websites and apps. We have many APIs. I think we've got over 100 of them. We've learned quite a few things over the years, so it's an absolute pleasure to talk a little bit about some of the things we do. Of course, the challenges we have won't be the same as the ones you have, so what we're going to do is rattle through all kinds of things, just throw out some of the things that we've done and learned, and some of the mistakes we've made, in case they spark some ideas for solving whatever challenge you've got. We'll quickly look at how we build APIs for frontends, websites, and apps, because obviously, we do a lot of that. We'll be talking about change. How do you handle it? We all like to work agile, iterative, continually evolving our stuff. APIs generally don't work that way. They like to be standards, like to be consistent. How do you get those two things working together? We'll look a bit at that. We'll look at scale. The BBC does a lot of high-scale things, and lots of users use our stuff. We'll talk a bit about how we handle that. We'll touch on validation as well, schemas, and handling different types of data.

The BBC Online

The BBC Online is absolutely massive. Paul and I are very busy with an awful lot going on. We have all these different products of different kinds, things in different languages for different parts of the world, as well as loads of stuff predominantly for the UK. This is one of the reasons why we have so many APIs, with all kinds of data going back and forth in all kinds of directions. We also have massive traffic. BBC News, that's one of the bits I look after. It is the number one most popular news site in the world. We serve more articles than anyone else, beating MSN, CNN, and Google News. We're massive. Paul, let's look at iPlayer, one of the products you look after. It is number two. Come on, Netflix is just 2% ahead of you. Surely we can catch up. Pretty massive, whatever it is. I actually went through and tried to create a list of all of the APIs we had, got to about 100 and something, and gave up. There are just so many of them. Let's go through them. The first one there is the news app business layer.

I've got a few stats from that. How many of these are serverless? It turns out a third use serverless to some extent. I'm a big fan of serverless. It really does reduce the cost of building things, and scaling things. For over half of our APIs, we use mutual TLS for access control: client-side certificates, a version of HTTPS where the client proves who they are as well as the server. There's a little bit of access key, classic API key stuff, and a good chunk is public as well, because that's what we do. We find it's nice to have a variety of options for different situations. The third thing I thought was interesting is event versus request driven. On the whole, because we have very large audiences requesting things from us, we do bias towards request-driven things, but there is quite a lot of event driven too. Obviously there's a lot of talk at the moment about event-driven architectures. They are awesome. Request-driven architectures are awesome, too. Do pick the right tool for the job. Finally, the response formats: the classic RESTful JSON API is pretty dominant. There's some GraphQL in there, not as much as I thought, because we do have some good GraphQL uses. Again, it's about picking the right tool for the job. GraphQL is super strong in some bits, but it has its limitations like everything. Having a toolbox of options is always wonderful.

Designing for Frontends - Micro-Frontend API for TVs

Caporn: iPlayer, which is the BBC's video on-demand product, predominantly serves audiences at scale on TVs. Many moons ago, we thought we'd solved the TV problem. We'd introduced this framework called TAL, TV Application Layer, which provides a jQuery API for integrating with the long tail of TVs, which might have quirks around them. That led to us taking our eye off the ball and not designing our APIs cleanly enough for what we built on top of that TAL application. The code is broad. We had moved to doing server-side rendering for a lot of the TV application, and the boundary between the server and the client just wasn't defined, which made it very difficult to work with. We ended up in a situation where we had monolithic spaghetti code, a single page app, split across a backend for frontend and the client application.

Where were we at? We had our TV application laid out, this sprawling code base. We were at a position where we needed to launch a kids' experience: not just a reskinning, but bringing in new content items, changing the menus, making sure that it was appropriate for a different age range. We also had had enough time to go off and change the API, iterate on it, modernize it, make it more elegant and engaging. At this stage, the boundaries between the back and the front were unclear. It was horrible to work with this code; the developers just didn't like adding features because of the pain. Rather than just keep digging, they decided to stand back and address the problem by introducing a clean API. Our teams have very silly names; the team that did this work was called Lovely Horse. They wanted to bring it to one framework for components, so the One Love project was born. The idea was basically to bring in micro-frontends for the client TV application, with separate components that could be worked on independently of the client TV code base, and then just slotted in.

The API for the components that were going to get plugged in is very simple. A component needed to know when it was given control, and it needed a lifecycle method for when it needs to clean itself up. It needed to be told when user interaction happened, which is that handle key down. The hosting application's responsibility then was to compose these components. It did that in a couple of ways. It makes a request to fetch the contents of the component, which brings back some HTML and some supporting JavaScript. It then instantiates that API for interaction that we just saw, and passes into it capabilities of the hosting application: how to make requests, how to report events such as telemetry, MVT, and stats, and when to move off to another component.
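As a rough illustration of the component contract described here, the following is a minimal sketch in Python rather than the JavaScript the TV client actually runs. Every name in it (`HostCapabilities`, `on_focus`, `handle_key_down`, `on_destroy`, the key names and event strings) is an assumption made for the example, not the real iPlayer API.

```python
from typing import Callable, Protocol


class HostCapabilities:
    """Hypothetical capabilities the hosting application hands to each component."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        report_event: Callable[[str], None],
        navigate: Callable[[str], None],
    ) -> None:
        self.fetch = fetch                # how to make requests
        self.report_event = report_event  # telemetry and stats reporting
        self.navigate = navigate          # move off to another component


class Component(Protocol):
    """The small contract every pluggable component is assumed to satisfy."""

    def on_focus(self) -> None: ...
    def handle_key_down(self, key: str) -> None: ...
    def on_destroy(self) -> None: ...


class SearchComponent:
    """Example component: collects key presses into a search query."""

    def __init__(self, host: HostCapabilities) -> None:
        self.host = host
        self.query = ""
        self.active = False

    def on_focus(self) -> None:
        # Called when the host gives this component control.
        self.active = True
        self.host.report_event("search:focus")

    def handle_key_down(self, key: str) -> None:
        # Called by the host for every user interaction.
        if key == "BACK":
            self.host.navigate("home")
        elif key == "ENTER":
            self.host.fetch(f"/search?q={self.query}")
        else:
            self.query += key

    def on_destroy(self) -> None:
        # Lifecycle hook: clean up before the host swaps the component out.
        self.active = False
        self.query = ""
```

Because the component only talks to the host through the injected capabilities, it can be exercised in isolation, which is the property that made embedding it in a slide deck possible.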

One of my favorite things about this project was the team then went and built an internal presentation to other people to explain what they've done. They built it as an HTML application. Developers who thought, PowerPoint is fine for me, but they went off and built an application for it. They were talking through the similar content we just explored, and talking about how these components fit together. Then there was the mic drop moment, where you can see the slides going through. Now look, there's a nice picture in there, but it's not a picture. It is actually the iPlayer application embedded in their slides.

Clark: This is the TV application that you'd normally find on smart TVs working inside a presentation.

Caporn: Here is the current homepage, non-personalized, but the current homepage embedded in the slides. Here is the search functionality of iPlayer embedded in the same slide deck. You can see that we might be able to input these components, and here is the new search. We're able to explore new designs completely in isolation, compose them together, do MVT testing, do beta testing, get the component to a good state and then swap them out, or run them side by side if we need to. It's really enabled innovation and moving far more quickly, from a horrific code base where people didn't want to touch any features to one where the teams are excited about developing audience-facing features. That's a bit about the TV code base. How about the website?

The BBC Website

Clark: A similar story for the web really. We've got lots of separate teams building lots of separate parts, and we need to bring it all together. Take, for example, this page, live coverage of Wimbledon. You've got the bit maybe showing video, the bit showing some text, and a bit showing some onward journeys. You have different teams responsible for these, just like you had with Paul. Behind the scenes there are different APIs powering them as well. You want those APIs to be independent: different teams owning them, releasing at their own cadence, doing their own thing, having responsibility for them. That's what we do, the classic microservice thinking really: you build it, you own it, you maintain it. It works just as well for APIs as it does for applications.

In practice, you could argue that our website has one API, as you'd imagine, powering it all. What we have maybe is one page that'll be owned by one team, so they've built an API just for that page. There might be other types of pages as well, maybe a homepage, for example, or a news article page. They are different endpoints on the same API. One of the beauties of serverless is that we can keep them separate. They can be deployed at different times. Even though the endpoints are part of the same thing, they're owned by different teams, who have the responsibility for designing them as they want to. Of course, you need to do quite a lot of sharing, and a lot of the content needs to be shared between these page types. What we do is separate out those APIs that are directly connected to a presentation, which here is the website. We call those application APIs. Then we have separate domain APIs, which might be about things like a news article or a piece of video. The domain ones are agnostic of how the content is presented, so they can be used in all kinds of different ways. The application ones on the left can talk to the domain ones on the right. If something else comes along, such as a mobile app or a TV app, it can have a separate API of its own, which also uses the domain ones. You're sharing your data as you need it, but you're also allowing that uniqueness for the particular use case.
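The split between application APIs and domain APIs can be pictured with a small sketch. This is illustrative only: the function names, field names, and the in-memory article store are all assumptions, and the real services are separate deployments rather than functions in one module.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the datastore behind a domain API.
_ARTICLES: Dict[str, dict] = {
    "a1": {
        "id": "a1",
        "headline": "Final frame decider",
        "body": "Full text...",
        "published": "2022-11-20T21:00:00Z",
    },
}


def domain_get_article(article_id: str) -> dict:
    """Domain API: presentation-agnostic article data, shared by all clients."""
    return dict(_ARTICLES[article_id])


def web_article_page(article_id: str) -> dict:
    """Application API for the website: shaped for one specific page type."""
    article = domain_get_article(article_id)
    return {
        "pageType": "article",
        "title": article["headline"],
        "bodyHtml": f"<p>{article['body']}</p>",
    }


def app_article_card(article_id: str) -> dict:
    """Application API for a mobile app: same domain data, different shape."""
    article = domain_get_article(article_id)
    return {"cardTitle": article["headline"], "published": article["published"]}
```

Both application endpoints read from the same domain function, so the content is shared while each presentation keeps its own response shape.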

One of the things we do employ is the backend for frontend philosophy, where we've made those apps as dumb as possible and put as much logic in the API as possible, so that you can make changes more iteratively: deploying your API is much easier than changing an app installed in many places. We find that really helps as well. Serverless is used all over the place, along with this microservice thinking. Your APIs will be different, but there's this idea of doing one thing really well. Consider the content, the data within it: what is it doing? In this case, the ones on the right are doing that slightly different thing, being agnostic of the presentation. The ones on the left are unique to the presentation, so it's just good to keep them separate.

They have some scaling requirements as well, based on how many people are using them. Different restrictions, of course. The ones on the left are going to be more open, because they're used by users directly. They also want to be limiting perhaps on what they offer, because there's some internal stuff that stays on the right-hand side, in those domain ones. You have the security side as well, and thinking about the different clients you have. We have found that APIs that do one thing well, even if that means you have several APIs chained like this, do help, to an extent at least, to make a better design. I did try to show this with the tracing. We use Amazon X-Ray. Tracing is super cool: we can see some real numbers in here about how long things take. They're nice and small, tens of milliseconds. The automatic presentation side of this leaves much to be desired.

Multiple Teams - Inconsistency

One lesson we've definitely learned from doing this is that if you have lots of teams doing lots of different things, owning their own bits, that's wonderful, but they will make different decisions. Often no decision is wrong, they're just different. Looking through our APIs, you can see that sometimes we use the word title as the key, and sometimes the word headline. Without arguing over which one's better, it would have been nice to be consistent. The same with timestamps: do you prefer milliseconds since epoch, or an ISO-formatted timestamp? There are pros and cons of both, but picking one and standardizing on it is probably the right answer. There are other examples as well, such as handling links. The shape of the JSON, for example, just isn't as consistent as it should be. Having more conventions is absolutely a lesson we have learned there.
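To make the cost of that inconsistency concrete, here is a small sketch of the adapter a client ends up writing when two APIs disagree on key names and timestamp formats. The `title`/`headline` keys and the two timestamp styles come from the talk; the `timestamp` field name and the choice of ISO 8601 in UTC as the target convention are assumptions for the example.

```python
from datetime import datetime, timezone


def normalise(item: dict) -> dict:
    """Map either naming/timestamp convention onto one agreed shape."""
    # Accept "title" or "headline" for the same concept.
    title = item.get("title", item.get("headline"))

    raw = item["timestamp"]
    if isinstance(raw, (int, float)):
        # Milliseconds since the Unix epoch.
        moment = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
    else:
        # ISO 8601 string; older Python versions need "Z" spelled as an offset.
        moment = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        moment = moment.astimezone(timezone.utc)

    return {"title": title, "timestamp": moment.isoformat()}
```

With a shared convention agreed upfront, none of this branching would be needed in any client.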

Designing APIs for Change - IBL (Product as an API)

Caporn: What we're going to focus on is the backend API for iPlayer, which is called the iPlayer Business Layer. The responsibilities of this API are to be the source of truth for all product concepts, so that we don't have clients inventing new ideas, which we then need to maintain as the API evolves. If we want to evolve the product, we want to make sure that the product is coherent and understandable. The API is also the arbiter of what content appears on the client, and in what order it appears. It encapsulates the editorial and algorithmic choices of what is the most relevant content for the user. We really don't want our clients to then be second guessing those decisions, either hiding or duplicating content by recomposing multiple calls together. In one call, here's all the content in the right order.

On the other hand, what does the API not do? It is not opinionated about how the UI is laid out. It provides information that the clients need to be able to compose a UI, but it doesn't say, render this here using exactly this text with this image. It allows flexibility for multiple clients with different iterations to change their UI without everyone having to move in lockstep. The way that we expose this API to the clients is using GraphQL, which is great because it gives us a schema that clients can work against and validate their stub data against. They can decide what content they want to consume. It allows us to express those product concepts, those entities, explicitly in a schema. Putting things into a schema is great, but it's not a free lunch. Every property that you put into that schema has a cost once you've published it. Once the horses bolt, it takes a long time to round them up. Trying to remove content or properties from a schema is a thankless task: you're trying to get everyone else's roadmaps to line up so you can safely remove stuff. You have to think very carefully about each and every property that you expose in any API, because you will live with them for a very long time.

Clark: What you're trying to say there is GraphQL is great, because you define that schema, but it's quite easy to get carried away, adding all these wonderful fields to the schema. Then once you've launched it, your clients will start using it, and it's hard to change that data.

Caporn: Yes, think about your future selves when you're adding stuff. Always think about how you want to turn a service off, rather than just the day you created it. Key parts here are: know the purpose of your API. Who are you trying to serve with this API? We did have issues in the past where we had clients trying to walk the whole catalog of iPlayer so they could build search engines or recommendations. The purpose of this API is to serve the audience-facing experience. It is not to provide that service. What we did was split out a separate API and publish our catalog event driven, rather than request-response. That powers those different use cases. Design explicitly for your clients' use cases, understanding what data they actually need. Again, don't publish content that's internal to your workings just because you've got it; make sure you understand what their needs are. Likewise, limit the consumers you support, so that when you do want to change things, you know who you need to engage with for breaking changes, rather than thinking, I don't know who my consumers are.

GraphQL persisted queries can be really helpful, because it means you've then got a registry of what questions are being asked of your API, and who those clients are. The way that we track them is we have a GitHub repo, where each query is registered. We have an automated publishing process, which generates a hash ID for the query and gives it to the developer. The developer then passes that in when they call our API, and we look up the query, rehydrate it, and then serve it. One of the benefits as well of tracking what use there is of the GraphQL endpoint is that it reduces the request payload from callers, so we're shipping fewer bytes, which is always a good thing. Changing things is always hard. How do you go about making changes in your space?

Clark: You use persisted queries for GraphQL, you don't allow freeform queries at request time.

Caporn: We allow freeform queries in the test environment for people doing exploratory development. Once they've decided what their query is, they register it in production. We don't support non-persisted queries.
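The persisted-query flow (register a query, get back a hash ID, send only the ID at request time) can be sketched roughly as below. This is a minimal in-memory illustration: the class name, the use of SHA-256 for the ID, and the `allow_freeform` switch standing in for the test environment are all assumptions, not details given in the talk.

```python
import hashlib
from typing import Dict, Optional


class PersistedQueryRegistry:
    """Hypothetical registry mapping hash IDs to registered GraphQL queries."""

    def __init__(self, allow_freeform: bool = False) -> None:
        self._queries: Dict[str, str] = {}
        # Freeform queries are only allowed where exploration is wanted,
        # for example a test environment.
        self.allow_freeform = allow_freeform

    def register(self, query: str) -> str:
        """Store a query and return the hash ID handed back to the developer."""
        query_id = hashlib.sha256(query.encode("utf-8")).hexdigest()
        self._queries[query_id] = query
        return query_id

    def resolve(self, query_id: Optional[str] = None,
                query: Optional[str] = None) -> str:
        """Rehydrate the full query text for an incoming request."""
        if query_id is not None:
            if query_id not in self._queries:
                raise KeyError(f"unknown persisted query: {query_id}")
            return self._queries[query_id]
        if query is not None and self.allow_freeform:
            return query
        raise PermissionError("only persisted queries are accepted here")
```

A caller sends the short ID instead of the full query text, which is where the smaller request payload comes from, and the registry doubles as a record of which questions clients actually ask.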

Handling API Change

Clark: Let's look more generally then, stepping away from GraphQL a bit, at handling API change in the classic JSON REST case. Here's a simple API that we might have, for example, for BBC News. We're representing an article: you've got an ID, a URL, a date, and the title. Of course, over time, you want to add more things to this. You might want to add a category that the piece of content is in. You would hope that most of your clients are good at additive stuff, that they know to ignore extra fields that weren't originally there. It's certainly worth making sure your clients are aware that you should be able to add stuff over time; that's the easy way to extend your response. Of course, it's never that simple. You might suddenly decide one day that you don't just want one category, you want a few. Maybe you have to add other category fields alongside one another, and it all gets a bit messy. Usually, you don't want to be changing the type of a field, so maybe you end up making new ones, making version twos. That's maybe what you have to do. Eventually, you get to the point where you need to start saying some of this stuff is going to go away, we're going to deprecate some of it.
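A client that copes with additive change simply ignores fields it does not know about. The sketch below shows one way to do that, assuming the four article fields mentioned in the talk; the `Article` type and `parse_article` helper are invented for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Article:
    """The fields this client was originally written against."""
    id: str
    url: str
    date: str
    title: str


def parse_article(payload: dict) -> Article:
    """Build an Article, silently dropping any fields added later."""
    known = {f.name for f in fields(Article)}
    return Article(**{k: v for k, v in payload.items() if k in known})
```

If the API later adds `category`, or `categories` alongside it, this client keeps working unchanged, which is what makes additive changes the cheap way to extend a response.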

That second example there is another example of an internal BBC API, where we're actually using one of the properties to mark a field as deprecated, and say what it should be replaced with, which is a nice idea. Look at that date: deprecated since 2014. Eight years on, and we're still having to include that data, because although you might have told all your new clients not to use it and to use something else instead, you still may have the old ones. Depending on who your clients are, it could take a while to move them across. Another example there, jumping to GraphQL again: in the schema, you can define stuff as deprecated as well. It's a lovely idea, helping people to understand which fields they shouldn't be using, but of course it doesn't necessarily persuade your existing users to move on.

API Versioning

What else can you do? Versioning the whole API is one option. Twitter does this, for example: getting tweets in v1 and v2 uses completely different endpoints with different versioning. Twitter now has the problem of moving everyone across to v2. I don't think we've done this idea of versioning a whole API like Twitter has. I don't have anything against it, but typically if you are going to rebuild everything, you're building a whole new system working in a new way, so we haven't had to do it. I think we have sometimes had to do a version 2 of an endpoint. That white box there is a little screenshot of the Amazon S3 API. We can see there's a ListObjectsV2, where they have created that new one, and they say not to use the old one. Individual endpoint versioning is probably where we've done more of it. Both cases are perfectly valid.

Whichever way you do it, you've got this problem. You've got a v1 and a v2, of the whole thing or just a bit of it, and you need to move people across. How do you get all those clients, those green lines, onto the new one instead? There are basically three options. You can force them. You can say, this is going to be switched off at a certain date. We've done that several times inside the BBC. It's not always popular, of course, because you're putting things in other people's roadmaps. It can be quite hard to persuade people, certainly if you've got a legacy system that no one's working on anymore. Sometimes it's worth the fight. You can wait, of course, and just hope people naturally move across. If you've got good relationships with your clients, with the developers, then that really works. It's another reason why it is really worth knowing your customers, knowing your clients, and persuading them. That can take a while as well.

Providing backwards compatibility. Let's say we're in this situation: we have our v2 of our API, but all our clients are using the existing version 1. What do we do? Creating a facade has been a success; we've done it in a number of cases. It's effectively a backwards compatibility layer, like a shim, where you've made something that looks and behaves just like your initial API, but actually is calling the new API behind the scenes. You then hopefully can move everyone across by some means, either an API gateway, or changing the domain, or whatever. Then that original version 1 can be switched off, so you've got rid of quite a heavyweight system and hopefully replaced it with a much lighter one. It's not all perfect, of course, because you've got two API interfaces to maintain, but you have managed to switch off something, and hopefully, then you can begin to work on that version 2 to improve it further, and persuade your clients to move across.
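A backwards compatibility facade of this kind can be as small as a function that calls the new API and reshapes the result into the old contract. In this sketch the v1 shape has a single `category` and the v2 shape has a `categories` list, echoing the earlier article example; the function names and payloads are assumptions for illustration.

```python
from typing import Callable


def get_article_v2(article_id: str) -> dict:
    """Stand-in for the new API; a real shim would make an HTTP call here."""
    return {
        "id": article_id,
        "title": "Final frame decider",
        "categories": ["sport", "snooker"],
    }


def get_article_v1(article_id: str,
                   fetch_v2: Callable[[str], dict] = get_article_v2) -> dict:
    """Facade: looks like the old v1 response, but is served from v2."""
    v2 = fetch_v2(article_id)
    categories = v2.get("categories", [])
    return {
        "id": v2["id"],
        "title": v2["title"],
        # v1 only ever exposed one category, so take the first if present.
        "category": categories[0] if categories else None,
    }
```

Once traffic for the old interface is routed to a shim like this, the original v1 system behind it can be retired even though v1 clients have not changed.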

Designing APIs for Scale

We're going to be talking about scale. Let's learn about snooker.

Caporn: Snooker is a very popular sport in the UK. It has a very loyal, linear audience. In my area, we deal with both bringing people the on-demand IP experience of iPlayer, and bringing people from the linear broadcast into that world, which throws up its own very specific BBC challenges. Imagine this, it is the UK championships. First round, we've got the reigning champion playing the newcomer, the UK champion is in trouble. People are on the edge of their sofas watching this. It is 5-5, one last frame, the champion is 39 points behind out of a possible total maximum of 43 points. Will he survive? Will he get through? What should we do in this situation?

Announcer: "I'm really sorry to say this, but we have to leave you here on BBC 2, but hit the red button now, or the BBC Sport website, the BBC iPlayer, or the BBC Sport app for the big finish. Bye for now."

Caporn: The champion is about to take a crucial shot. What is this loyal audience going to do? They all decide to launch iPlayer at exactly the same time. How do we cope with that circumstance? You cannot autoscale your way out of this. This is not a circumstance you can have planned for, that spike is coming in the next few seconds. How do we cope?

Clark: You had millions of people watching normal telly. Then they were told by the announcer they need to press red to carry on, so they've all picked up their remote controls, they've jumped to IP.

Caporn: They've all launched.

Clark: You created your own denial of service attack.

Serverless Launch

Caporn: Yes, and unique to snooker history, I'm sure. How do we go about coping with that? First off, we leverage serverless. Serverless scales amazingly well. We leverage edge functions at the CDN, so we're making use of two different things: caching for responding, and functions to determine which type of TV is making the request and serve up the appropriate client-side application for that TV. Lambda@Edge is Amazon's offering, which combines CloudFront, their caching service, with functional serverless execution. We statically publish the responses for bootstrapping the TV application to S3. When the request turns up, we categorize it based on user agent, decide which flavor of the TV application needs to be served up for that specific request, and send it back. We're not completely limited to static responses: we're able to inject things from the query parameters or dynamic values that we can't pre-calculate.
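The edge logic described here (classify the TV by user agent, pick a pre-published bootstrap, inject a dynamic value) might look something like the sketch below. It is plain Python rather than an actual Lambda@Edge handler, and the user-agent markers, flavor names, object keys, and `{{LAUNCH_PARAMS}}` placeholder are all made up for the example.

```python
import html
from typing import Dict, Tuple

# Illustrative user-agent markers and flavor names; not the BBC's real rules.
# More specific markers come first so they win over broader ones.
FLAVORS: Tuple[Tuple[str, str], ...] = (
    ("examplevendor-2015", "legacy"),
    ("examplevendor", "modern"),
)


def classify(user_agent: str) -> str:
    """Decide which flavor of the TV application a device should get."""
    ua = user_agent.lower()
    for marker, flavor in FLAVORS:
        if marker in ua:
            return flavor
    return "default"


def bootstrap_response(user_agent: str, query: Dict[str, str],
                       static_store: Dict[str, str]) -> str:
    """Pick the statically published bootstrap and inject a dynamic value."""
    template = static_store[f"bootstrap/{classify(user_agent)}.html"]
    # Escape the query parameter before placing it into the markup.
    launch = html.escape(query.get("launch", ""))
    return template.replace("{{LAUNCH_PARAMS}}", launch)
```

Here `static_store` is a dictionary standing in for the published objects; the important property is that no origin service is called per request, so the launch path scales with the CDN.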

Fallbacks - Failure as a First Class Citizen

That's great. We've managed to launch the TV application and get it back to the user. What about the next one? The next one in our slate isn't serverless. How do we cope with this sudden thundering herd? The way we do it is we treat failure as a first-class citizen when we're designing our APIs. We think upfront about how to fall back and cope when the system is under strain. What we do is serve fallbacks. If we're really struggling, we serve a fallback to the user. What are the characteristics of the fallback? It needs to be schema compliant, so that the clients don't care whether they're getting served a fallback or a fully featured response; they can just cope with it as normal. We need to make sure that the fallback responses are context sensitive. If an adult makes the request, they're not served the kids' experience. If a kid makes a request, they're not served the adults' experience, and the same goes for any other variations that we want, whether location based or otherwise. We also want to make sure that the fallbacks don't have exactly the same load characteristics as the main service, otherwise we're just moving the load from one API to the other. We enable scaling by offloading the serving of the actual response payload to a CDN.

What were the tradeoffs with this? We have decided to favor availability of a degraded version of the service, with fewer features, over no availability. We may not have the absolute latest piece of content that was published 5 seconds ago, because we are publishing these static responses in the background. We have chosen polling over event-driven publishing, just because of the state of our estate. Not everything is in the perfect imaginary estate. Sometimes you have to wait and make the most of what you have. In this case, event publishing is not quite yet there in the roadmap.

Serving Fallbacks

How does it work? At the top, we have the publishing flow, where in the background the fallback publisher requests the real service, providing characteristics such as whether the user is a child or not, and stores those fallbacks in S3. Then the service that is actually producing the GraphQL response goes off to fetch the full personalized response. If it gets an error or a timeout, it decides to fulfill the request using the fallback. It makes a call through a fallback handler, which interprets the context of the request and picks the fallback you're looking for. It doesn't serve it. It returns a 303 response back to the client. The client is then able to go to the CDN, fetch the response, and render it for the user. The key thing here is that this setup needs to be prepared for these big spikes, which you cannot autoscale your way out of. We're overprovisioning that very front door part. We're overprovisioning the fallback handler, so it's always ready and doesn't go down when it's most needed.
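The request path can be sketched as follows: try the personalized service, and on an error or timeout return a 303 pointing at a context-appropriate, pre-published fallback. The response dictionaries, the `is_child` context key, and the CDN URL layout are assumptions made for the example; only the overall flow and the 303 status come from the talk.

```python
from typing import Callable


def fallback_url(context: dict) -> str:
    """Pick the pre-published fallback that matches the request context."""
    audience = "kids" if context.get("is_child") else "adult"
    # Hypothetical CDN layout; the real paths are not described in the talk.
    return f"https://cdn.example.com/fallbacks/{audience}/home.json"


def handle(context: dict, fetch_personalised: Callable[[dict], dict]) -> dict:
    """Serve the full response, or redirect the client to a static fallback."""
    try:
        body = fetch_personalised(context)
    except (TimeoutError, ConnectionError):
        # The handler does not serve the payload itself; the CDN does.
        return {"status": 303, "headers": {"Location": fallback_url(context)}}
    return {"status": 200, "body": body}
```

Because the fallback handler only computes a URL, its load profile is very different from the main service, which is what lets it stay up when the main service is struggling.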

AWS CDK Patterns Make Serverless API Development Straightforward

Clark: We've got some pretty big breaking news as well, maybe your anime hasn't quite broken as much as some news stories. Examples like this, web pages and app pages that show breaking news of any kind, can get massive traffic. Again, we use serverless to render pages like this, both the HTML within them and the APIs behind them. If you haven't done any serverless stuff and you're on Amazon, I would highly recommend using the CDK Patterns, which are basically libraries that ultimately wrap CloudFormation to deploy things, automatically set up in certain ways. I do think it's slightly embarrassing, but in just a few minutes, you can have something up and running that is fully serverless and therefore completely scalable, and comes with things like monitoring and tracing out of the box. Your prototypes using this might be even more production ready than some of your production stuff. We found it's a very easy way to get started.

Scaling Instantly When you Need It

We've found that serverless response times are pretty darn good. You can see the orange line there, that's the p99, so the slowest 1% of queries. At 500 milliseconds it is not especially fast, but that does include not just one serverless function but potentially a bunch of serverless functions followed by some datastore, doing some number crunching and data crunching in the backend, so we're pretty happy. You can see the p50, that green line, down at 30 milliseconds. We're pretty happy with the performance of our APIs working this way. We've got that website, and then all of those API levels with green ticks are serverless. The problem is, you do get to a point, as Paul was saying before, where something isn't serverless. The problem with being able to scale up massively horizontally when big traffic happens is that you might just be passing the problem down to a stage further on. That's one of the reasons why we cache significantly, other reasons being to save cost and improve reliability. One of the things we're working on at the moment is making sure those caches stay nice and updated as well: not just caching things and slowing updates down with their max-ages, but actually intentionally marking the content as stale when new stuff comes along, so that you've got a real-time publishing element to it as well.
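The difference between waiting for a max-age to expire and intentionally marking content as stale can be shown with a toy cache. This is a minimal in-process sketch under assumed names (`Cache`, `mark_stale`); a real deployment would be invalidating CDN or shared cache entries instead.

```python
import time
from typing import Callable, Dict, Tuple


class Cache:
    """Toy cache with a max-age plus explicit invalidation on publish."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age = max_age_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, load: Callable[[str], str]) -> str:
        """Return the cached value if still fresh, otherwise reload it."""
        entry = self._entries.get(key)
        if entry is not None and self.clock() - entry[0] < self.max_age:
            return entry[1]
        value = load(key)
        self._entries[key] = (self.clock(), value)
        return value

    def mark_stale(self, key: str) -> None:
        """Called when new content is published, ahead of max-age expiry."""
        self._entries.pop(key, None)
```

With only a max-age, readers can see old content for up to the full cache lifetime; calling `mark_stale` from the publishing path removes that delay while keeping the protection the cache gives to whatever sits behind it.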

The Page and The Onward Journeys are Cached Separately

There are other things we do to improve caching. For example, this article is the same for everyone, with one exception: there are some onward journeys on the right-hand side that we try to tailor to different people to make them more relevant to you. This is an example where we want to be able to cache the whole page. We also want to separately have an API that can handle those individual personalized requests. It's quite nice keeping that separate, of course, because of the APIs-do-one-thing-well philosophy. We want that one API on the right-hand side, doing those onward journeys, to be highly personalized, very fast, and highly scaled. It's quite nice to be able to handle that separately.

Manifest API

Another variation on this is the news app. It pulls down a manifest of all the important articles, which gives each one an ID and the point at which it was last updated. The app can compare that against what it has already got cached, and then only fetch the things that it needs. We've got two APIs, again, each doing its own thing well. There's the manifest one, which we're fetching very often because of breaking news. Then there's the actual content itself, which might be the article body, or the video, or the images, which of course you might even be able to make fully immutable if you've ID'd it right. You can then define your API to cache that very heavily.
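On the client side, the manifest comparison boils down to a small diff. The sketch below assumes each manifest entry carries an `id` and a `lastUpdated` value, and that the app keeps a map of what it already holds; both field names are assumptions for the example.

```python
from typing import Dict, List


def ids_to_fetch(manifest: List[dict], cached: Dict[str, str]) -> List[str]:
    """Return IDs whose content is missing locally or has since been updated.

    `cached` maps article ID to the lastUpdated value already stored.
    """
    return [
        entry["id"]
        for entry in manifest
        if cached.get(entry["id"]) != entry["lastUpdated"]
    ]
```

The manifest stays small and can be polled often, while the heavier content requests only happen for entries that actually changed.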

Designing APIs for Validation - Schemas for the Win

Caporn: It's a really key point about making sure that you have schemas. Schemas provide a contract: they explain to your client what they're going to get and what you're expecting from them. They enable the client teams to stub your API so they can make progress independently of you. We make quite heavy use of OpenAPI, formerly known as Swagger, to expose and document our APIs. GraphQL is a very good way of documenting what you're going to be served back. Once you've got those schemas, they unlock the opportunities to add value.
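To make the contract idea concrete, here is a deliberately tiny sketch in plain Python. It is not OpenAPI or GraphQL tooling; the schema format, the field names, and both helper functions are assumptions invented for this example. It shows the two uses mentioned above: checking a payload against the contract, and generating a stub response a client team could build against.

```python
from typing import Any, Dict, List

# Hypothetical contract for an article summary: field name -> expected type.
ARTICLE_SCHEMA: Dict[str, type] = {"id": str, "title": str, "lastUpdated": str}


def validate(payload: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of contract violations; an empty list means the payload conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


def stub_response(schema: Dict[str, type]) -> Dict[str, Any]:
    """Build a placeholder payload (empty values of the right type) for client stubs."""
    return {field: expected() for field, expected in schema.items()}
```

In practice a schema language such as OpenAPI covers far more (nesting, formats, required versus optional), but the shape of the benefit is the same: one definition drives both validation and stubbing.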

Unreliable Upstream

A concrete example is a system we've got with unreliable upstream. Why would you build a system which is bad? What it does is it provides data over HTTPS, including mutual authentication that meets the API that clients are expecting. It allows us to test good days and bad days and where things are slow and occasional errors, so that we can see how our services behave in the face of not a good day from their upstream. It gives us some levers so we can switch between the service is down, the service is now all working fine, which helps us if we use this service to run load tests, and things like that.
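A minimal sketch of that kind of test double is below. It models only the levers (down, flaky, slow), not the HTTPS or mutual authentication of the real system, and the class, mode names, and response bodies are all assumptions for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class UnreliableUpstream:
    """Fake upstream whose behaviour can be switched while a test is running."""

    mode: str = "healthy"        # one of "healthy", "flaky", "down"
    error_rate: float = 0.1      # probability of failure when mode is "flaky"
    delay_seconds: float = 0.0   # artificial latency added to every request
    rng: random.Random = field(default_factory=random.Random)

    def handle(self, path: str) -> Tuple[int, Dict[str, str]]:
        """Return (status_code, body) according to the current levers."""
        if self.delay_seconds:
            time.sleep(self.delay_seconds)
        if self.mode == "down":
            return 503, {"error": "service unavailable"}
        if self.mode == "flaky" and self.rng.random() < self.error_rate:
            return 500, {"error": "simulated failure"}
        return 200, {"path": path, "status": "ok"}
```

During a load test, flipping `mode` from "healthy" to "down" and back shows whether the services that depend on this upstream fall back and recover as expected.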

Summary and Key Takeaways

Clark: We've looked a bit at the frontends. We've looked a bit at change. We've looked a bit at scale. We just touched on validation at the end.

There's this idea of having APIs that focus on one thing well, the same as you do with a microservice philosophy. Clear ownership, and clarity on what the API is used for and what it does.

Caporn: The important thing there is knowing your clients, what do they need from the API? Who do you need to engage with if you're going to change it?

Clark: We talked a bit about serverless, this idea that serverless can get you up to speed, up and running pretty quickly. It can also help with the speed and scale of operating your system as well. We didn't talk about the costings, the fact that it's clear: it costs in proportion to the number of users you have. Like all things, it's another tool in the box. It's not always the one to use, but we have found serverless to be a super good way of building APIs.

Caporn: Whether you're serverless yourself or some of your clients are serverless, you need to really think about how you're going to cope with thundering herds, so treat failure as a first-class citizen in your APIs and work out how to serve responses when everything around you is on fire.

Questions and Answers

Reisz: When you were talking about event-driven, I think, Matthew, when you put up a slide that showed your event-driven, synchronous calls, different percentages. How do you reason about calls that are orchestrated, calls that use choreography? How do you reason about making them event-driven or making synchronous calls to things? What do you think about them?

Clark: It's pretty much case by case. It's really nice that you have both options. Don't get too caught up in one being better than the other. When you think about it, it's supposed to be logical. If you have one system that needs to tell another system something, it's often going to be event driven. If you have some concept of something happening, a purchase being made, or something like that, and you need to pass it on, then that's what you do. On the other hand, you've got situations where you want to keep things simple, and you just want to be able to fetch some content from time to time, because that's what it is. It normally starts with the user pressing a button or requesting something. Then you go for the request-response one. If that's not too much of a cop-out, I just find it fairly natural to go with what fits the particular use case.

Caporn: It depends on what context you need to be able to serve the response. If you need the context of the user, request-response is a good fit. If there's no additional context required, and it's more a question of knowing where in the timeline you have enough information to publish the answer, event driven can work really well.

Reisz: I've had an example that I've used in the past: within a bounded context, prefer orchestration, and outside the bounded context, prefer choreography. Do you find that's the case, or is that too hard a rule to just follow?

Clark: Certainly the idea that you're creating boundaries, because you can have so many different systems, whether it's serverless, or microservice, or whatever. One of the questions often of course is, how do you reason about it when you've got dozens of different systems all over the place communicating in different ways? Yes, that idea where you actually abstract them to some point. Within one system, you might have all kinds of complicated ways of working it. You don't need to worry about that if you're outside the system. You have that clarity, be it event or request driven from which you're integrating with it. Then all of a sudden, even though you have hundreds of different components, you never have to worry about them all in unison, you don't have to worry too much about whether they're being consistent or not because they were never all connected to each other.

Reisz: Paul, you talked a bit about micro-frontends. There's different ways of doing micro-frontends. Sometimes it's a slice where it's different areas of the sites, and then there's the components where you build it and aggregate it together. What types of challenges did you run into? How did you solve them? I'm thinking CDN edge, for example. What types of problems did you run in assembling these micro-frontends, and how did you solve for them?

Caporn: The problems were the interfaces with the composer. As you're gradually introducing them into this spaghetti code base, how are you creating those clean boundaries, so that you can say, this is all the information that's needed for this one, and it can operate independently? Then being able to develop independently and get different people involved with a single component, like the developers and the UX people, and not have to worry about what's happening in the rest of the application. Then building up trust that you can just slot them in, work independently, and start to move quickly. The challenge then is that people get so many ideas for each of the micro-frontends, and they want to run with them. You want to make sure that they don't extend the API in a way that breaks other components. It's about trying to get that clean break, so people know what their responsibilities are on either side.

We do make use of load shedding, where if you decide you're overloaded you return an error early so that the fallbacks kick in, and we use reserved concurrency on Lambdas. You're basically saying, this is the cap of what we've got, and then the consumer knows it needs to deal with that failure, rather than just assuming that everything can scale up.
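The "return an error early" part can be sketched in-process as below. This is only an illustration of the idea: reserved concurrency on Lambda is a platform setting rather than application code, and the class name, the 429 status, and the error body here are assumptions.

```python
import threading
from typing import Callable, Dict, Tuple


class LoadShedder:
    """Rejects work immediately once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work: Callable[[], Dict]) -> Tuple[int, Dict]:
        # Non-blocking acquire: if no slot is free, fail fast instead of queueing,
        # so the caller can switch to its fallback straight away.
        if not self._slots.acquire(blocking=False):
            return 429, {"error": "overloaded, use fallback"}
        try:
            return 200, work()
        finally:
            self._slots.release()
```

The important design choice is that the cap is explicit and the rejection is cheap, which puts the responsibility for handling the failure on the consumer.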

Reisz: Talk about that a little bit more, specifically in the context of serverless. When we're talking Kubernetes, or we're talking sidecars, we're building a lot of this stuff into the sidecar. How do you deal with things like load shedding when you deal with some of these crosscutting concerns in serverless? Is it a function provided by serverless functions? Are these libraries?

Caporn: The concurrency limits for Lambda are part of the managed service, so I'm not trying to keep a counter myself. You're basically saying this is the reserved concurrency, and it will just throttle when you hit the limits. I think it's a little bit vague. I think they can do some bursts over the limit for brief periods, but that's all under the covers. As far as we're concerned we just have to ask, how do you deal with it when you get rejected, and what's the success path?

Clark: The platform handles it. On the whole, we've found Lambdas very good at scaling, even to many thousands a second. The problem is, of course, what is it then connecting to, your database or whatever? That might not be as good at scaling, so you kind of limit the number of Lambdas that run concurrently anyway. Though that might not be guaranteed, because the platform hasn't got a complete, real-time measure of all that's happening. Your code probably also needs to think about, do I need to put some protection in if there are large numbers of concurrent things here going to whatever the underlying database is, if that itself isn't serverless?

Reisz: You talked a bit about REST. Do you leverage binary protocols? How do you attempt to decide and make standardization between if you do binary and non-binary?

Clark: We've not, mainly because of the value of simplicity. The lovely thing about JSON is it's human readable. The lovely thing about REST APIs is you can just call them from your browser. If you've got an operational incident, and you're sending links around, getting people to test things, to understand things or work out where the data is wrong, that visibility, just being so simple that anyone can call it and read it, is so powerful. That's not to say there aren't times when you want the efficiency of binary. We've yet to come across it.

Caporn: I've got one instance in the BBC Sounds product, which is the audio on-demand. The backend there uses Thrift for some of the inter-service communication. Actually, because people were trying to optimize for speed and reduced payloads being sent, it just means it's really opaque. You get new developers coming in, and they're not familiar with the protocol. They're not familiar with how to debug it. My assumption is, we're going to move away from that, because actually, it's got a mix of JSON and Thrift. The JSON bit, everyone knows what's going on. The Thrift part, it's like magic and witchcraft for people that aren't familiar.

Reisz: A tradeoff between just ease of hiring, ease of mental capacity. Your mental ability to take that.

Clark: There's a real key cognitive load thing to be building something larger. How much can you do, the more complex it is? Yes, you can be clever and do it, but is it really where you want to focus it? The biggest cost of our organization like many is people. The cloud costs, yes, you want to optimize it. End of day, you might find that just scaling up to handle the extra load or the speed is cheaper than engineers to build something.

Caporn: If we were doing something like high-frequency trading, then that is a completely different world. For the latencies we need to deal with, we can absorb that overhead of costs and things.

Reisz: Everything's a tradeoff, and cognitive load, that's what I was trying to reach for. That's the cognitive load of being able to operate these things.

You have to think about your future selves, which I think, Matthew, you actually dove into a little bit. Assuming you didn't think about your future self, or you horribly misinterpreted your future self, and your clients are using this data that you need to deprecate, what does that look like? Is it just flagged as deprecated and it never gets removed? How did you deprecate some of these fields that you've published because you failed to think about your future self?

Clark: We're very lucky that so many of the APIs are for internal use. We have some APIs that are then used by third parties, and they get so much harder, of course, because you have less control over them, depending on what you do. For us, chances are we'll have an API, and there are 10 other teams that use it. You've just got to talk to the teams and work it out, and go, how feasible is it? Are you planning on rebuilding everything anyway? How willing are they to open up on it and make those changes? End of the day, it's all people. It's all relationships. If you can build that right relationship and convince them, the likelihood is you can make it happen.

Caporn: It depends on the feature you're unlocking. If the feature you're unlocking is super valuable to a broader set of the business, you can get that tradeoff in roadmaps where you're like, "I know it's painful. I know it's like toil work, but we're blocking this outcome by not doing it." That helps to mobilize teams, especially when it's across multiple different organizational boundaries.

Reisz: The question is about WebAssembly. Let's do a little role-play. I want to use WebAssembly on some of these frontend APIs, maybe at the edge, as architects dealing with me, how do you respond or have you thought about it? Have you had this conversation? How has WebAssembly entered into your thinking, or has it with BBC?

Caporn: First, what problem are you trying to solve? Secondly, yes, BBC R&D have been looking at the use of WebAssembly for TV clients. There's a massive challenge in getting an environment to be able to run that successfully into manifest because you're suddenly engaging with like a commercial environment. There is research interest, and we're watching what others are doing. At this stage, I don't think it's solving a problem that we have in terms of going to a TV, the rendering of the video is slow enough. WebAssembly is not going to help you, you hand it off to the device to say, play the video.

Reisz: How would you like to leave everybody in our audience?

Clark: Data is everywhere. The amount of data in the world is growing exponentially, I believe. The design of APIs and the data within them is more important than ever. I just love this interesting battle between you wanting to be agile, wanting to keep changing how your data is and how you work, because that's how we are. It's really hard to know stuff without getting into the weeds. Yet you've also got your APIs that need to stay reasonably consistent over time because of all of your services. Just thinking through that, what is the lifecycle of the APIs and your data? Who are your users? How are you understanding your clients, what they are, what they're doing, and how they're going to evolve as well? Because this stuff will continually change. If you get that bit right, the rest follows naturally, I think.

Caporn: It's about ownership and people, and for every API you spin up, like Matt said, that lifecycle. There was a great comment from Jessica Kerr in her talk, about unowned systems holding the power in the organization, because everyone is blocked by them. As you spin your APIs up, you really need to think about who your users are, and that owning team, and making sure that that team is aligned to the same mission. If they've got an API that's completely unrelated to what they're working on, it's going to have a lack of love. It will be difficult to change and everyone will be blocked by it over time.

Recorded at:

Dec 04, 2022