Transcript
Picture this. It's the day of the Super Bowl and you're responsible for making sure that the system is running correctly operationally for that day. You're expecting 10 times the normal peak traffic, but you're not so worried because you've spent the last three months working heavily with the engineering team, who are excellent at scaling the system and making sure it can cope with the load. There are 800-plus microservices, so it's a nontrivial endeavor. Things are going really well and you're feeling really smug. In the last 15 months, the engineering team and the company have created a product which has been outrageously successful. It's now the leading live TV internet-delivered service in North America.
Things are going fantastically, then two minutes before the end of the game, the Patriots are down. Tom Brady grabs the ball. They're about to go for a rush. He takes the snap and the video cuts out for tens of thousands of people across the States. I want to talk to you today not just about how you win the war, how you get to that microservice architecture, that fine-grained architecture that provides all these amazing benefits. I also want to talk about how you keep the peace, how you operate these architectures, some of the pitfalls you can fall into, and some of the things we've found that can really help keep these systems running smoothly.
A little bit about me. So, I've worked in software for maybe 30 years, trained initially as an electronics engineer. I've worked in many different domains, speech research, applying neural networks to making speech sound more natural. I've worked in telecom, smart cards, digital broadcasting. And for 15 years or so, I worked on trading and risk systems in investment banking, which is another whole different world. I stumbled into a Ph.D. on software components, which was enormous fun.
About six and a half years ago, I got contacted by Riot Games. Riot Games made the enormously popular game League of Legends. How popular is it? Well, it's played at least a billion hours worldwide every month. So, I worked for four and a half years there, got a fantastic opportunity to create a lot of the server-side infrastructure that's used in over 200 of the projects there. Also got a chance to architect the game client, the new game client that's now on 120 million plus desktops around the world.
A couple of years ago, I was contacted by Hulu. They said, "Come over, be the chief architect. We're going to develop a live TV offering." And so I joined Hulu and worked with the exceptionally capable staff at both of these companies.
Microservices Work
The takeaway that I want to bring is pretty simple, to be honest, and I will go through it in detail and explain what the benefits and the challenges are. But I want you to understand that microservices work. In many ways, this is my own personal retrospective. I wasn't always a true believer in microservices. I thought we needed just enough to get by, and that if it was too granular, we were going to suffer for it. This is about me gradually learning, and making mistakes along the way, gradually learning that these architectures are very, very powerful indeed, and do convey all of these valuable benefits. But there are challenges as well. And I'm going to talk about some of those challenges, talk about some of the things that bit us. And I'm going to do it in a very open way. In many ways, these are my own mistakes, my own gaps in learning, and so on. And in many ways, the only real mistake is one that you make twice. So, in that spirit, I'm quite happy to share these findings with you.
I left Hulu around two weeks ago, so in some ways I'm freer to talk about it than I was before. So, what I want you to take away is that microservices work, but there are challenges as well. Let's prepare for the challenges, in particular, with some of this very powerful new infrastructure that has come about.
I've divided the presentation up into four parts. In the first part, I'm going to talk about microservices in gaming, "League of Legends" in particular. Then I'm going to talk about microservices in video, and these two different domains have different characteristics which color the architectures. Finally, in sections three and four, I'm going to talk about the benefits. Believe the hype. I didn't always believe, but when I saw an 800-plus service microservice architecture running at scale, I realized the power of it. Believe the challenges as well. These things are not trivial to operate. You get all these benefits, but unless you're careful, something small can catch fire and percolate up through the architecture, leading to people not watching the Super Bowl, and then you're having to talk to the "LA Times" and give them quotes. You don't want to be in that situation.
Microservices in Gaming
So, microservices in gaming. I've subtitled this, "A microservice architecture @ scale." On the bottom right there is a character from the game called Teemo. League of Legends is an enormously popular multiplayer online battle arena game, where you play the role of a champion in a team of five pitted against another team of five, competing for dominance on a map. Games last around 45 minutes, and each of the champions that people play has a dizzying array of stats regarding attack damage, cooldown factors, and a whole set of capabilities that are asymmetrical. It's an incredibly engaging sport; it's behind a lot of the popularity of e-sports. As I mentioned, people play it over a billion hours a month. To put that into context, in 2014 when we measured that, all the hours ever played on Halo up to that point, including all of the successive versions of Halo, were 2 billion hours. League of Legends is doing that every two months.
It's at scale. These figures are from 2014, so you can extrapolate out from there. We had 67 million monthly active players and more than 27 million daily active players. Finally, we had seven and a half million concurrent at peak, that was 10 million or more by the time I left. One of the things I would say as well is, because I'm currently no longer working for Hulu or Riot, I'm going to only speak of publicly available information. So I'm not going to share anything particularly private, but by the same token, a lot of these things have been talked about already in the press. So a fairly large game and it's super exciting one to work on.
So, some of the gaming particulars that make it unusual as a domain: you obviously need low latency. You want things to happen quickly, so a degree of real-time behavior is very important. When you're doing matchmaking, you're basically picking five players on one side and five players on the other, such that each side has a 50% chance of winning. That requires a lot of shared state and is often viewed as a global population problem, depending on what particular algorithm you use. That shared state, coupled with microservices, makes it a little bit harder to pull them apart. Gaming has rapid development cycles and there are lots of engineers working on one game. When I joined Riot Games, they had 80 engineers. When I left, they had well over 800. And that doubling year on year causes interesting problems in its own right. Lots of engineers working on one game means that if they're in a monolith, they're all stepping on each other's toes all the time, and it gets very, very slow.
Winning the War
So winning the war. How did we evolve from the monolith at Riot Games into a more granular microservice architecture? How did we get those benefits? Well in 2009 when the game was released, and I wasn't at Riot at this point ... And one of the things I would say as well is I'm going to talk about many things that I was peripherally associated with or wasn't necessarily at the heart of. I'm just going to talk openly about the successes and bear in mind, these are two very successful companies. Both Riot and Hulu are multibillion-dollar corporations that have been extraordinarily successful.
So in 2009, they developed and released a large service monolith. It gradually evolved. By 2012, it was 435 Maven projects. The WAR file itself was massive, in the hundreds of megabytes. And we realized that we could no longer keep working on this, because in August of 2012, we had this disastrous release where the service was down for three or four days, and they had to basically roll back the release. At that point we realized we had to start moving to microservices. There's a presentation linked there, if you get the slides afterwards, you can click on it, where I spoke at GDC, the premier game developers conference in San Francisco, on the evolution of League of Legends and how we evolved it in a more granular fashion.
So this is roughly the architecture of League of Legends with the monolith. Reading from the bottom, you can see there's MySQL, and it's holding all the state. Then we put a distributed memory cache on top of it, a product called Coherence from Oracle. Very, very powerful. It effectively turns MySQL into a very, very fast NoSQL database and allows you to handle writes in a linear fashion. On top of that, we had this platform WAR, this very large monolith I talked about. We deployed it as multiple instances with different flags, so we had a little bit of differentiated function in each instance, but it wasn't anywhere near a true microservice architecture, because we were still reading and writing the same objects in a distributed cache, using the same code, just in different silos.
So the trouble with that is, when one part of the system got hot, when MySQL was being affected, for instance, it would affect the other systems as well. There wasn't enough separation. We knew that, and so we wanted to change and introduce proper microservices. How we did it is we started with the platform on the left-hand side. It's the same thing I showed before, just in a compressed form: MySQL, Coherence sitting on top of that, and then the platform WAR. We then put an API in front of it called the service proxy, which exposed the different functions that were currently in the platform, and the first thing the teams pulled out was team builder, tb.jar, which is a very fast and very efficient matchmaking service. So they pulled matchmaking out of the monolith. That eventually became the matchmaking system for League of Legends. At this point we had a way to make new features as microservices and pull them out of the platform. So we'd won the war. We'd created some standard infrastructure.
So I'm a huge fan of REST and Swagger. I love it for APIs. It makes them very tangible, very real. I like the way you can evolve them and you can call them using curl. And we built some fundamental infrastructure. We built a set of dashboards so we could see the metrics of the services. We built service discovery, which we patterned off Netflix's Eureka. Now, Netflix is a giant in the space of microservices. I can say that now I've left Hulu, but basically you could do a lot worse than just read all of the Netflix blogs and understand some of the amazing things they've done. So we patterned this off Eureka. Very powerful architecture, if you get a chance to look at it. We called it Discoverous.
The service on the right is the configuration service, providing flags for the applications when they start up, and also changing things at runtime. We called that Configurous. We had a dinosaur motif going. We then created a whole set of libraries in Java called Hermes, and associated other libraries, to make it very easy for people to create microservices. So we've got a server-side REST library, which can sit on top of a pluggable HTTP server, and then we have metrics reporting. So people can spin up a microservice like that, or at least create one. I'll talk about spinning it up in a second.
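To make that concrete, here is a minimal sketch of the kind of service bootstrap such libraries enable, using only Java's built-in HTTP server; the class, endpoint, and metric names are illustrative, not the actual Hermes API, which is internal to Riot.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: shows the shape of "spin up a REST endpoint plus
// basic metrics reporting" that a library like Hermes reduces to a few lines.
public class MatchmakingService {
    private static final AtomicLong requestCount = new AtomicLong();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // A REST-ish resource; a real service would also register itself
        // with service discovery (Discoverous) at this point.
        server.createContext("/v1/match", exchange -> {
            requestCount.incrementAndGet();
            byte[] body = "{\"status\":\"queued\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });

        // Minimal metrics endpoint, standing in for the metrics-reporting library.
        server.createContext("/metrics", exchange -> {
            byte[] body = ("requests_total " + requestCount.get())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });

        server.start();
    }
}
```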
We then created the client-side libraries as well. In the client-side libraries, we used a REST library, which was the Apache stuff, and we used a software load balancer. This is all operating inside the Java process. So what happens is, when a client wants to know about a service, it talks to service discovery, gets introduced, and then it talks directly to the service. From then on, it's balancing over that service. The services ping the service discovery mechanism, heartbeating to it every 30 seconds. If they don't heartbeat, they eventually drop out of the load balancer and you won't see them from the client's perspective. So it's a very strong architecture. Very powerful.
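As a rough sketch of that client-side pattern, assuming invented class and method names rather than the real Hermes/Discoverous API, a software load balancer that round-robins over discovered instances and drops anything that stops heartbeating might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Illustrative only. Captures the idea described in the talk: clients balance
// over instances introduced by service discovery, and instances that stop
// heartbeating quietly disappear from the client's view.
public class SoftwareLoadBalancer {
    // Instances heartbeat every 30 seconds; drop them after three missed beats.
    private static final Duration EXPIRY = Duration.ofSeconds(90);

    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger();

    // Called when service discovery introduces an instance, and on every heartbeat.
    public void recordHeartbeat(String instanceUrl) {
        lastHeartbeat.put(instanceUrl, Instant.now());
    }

    // Round-robin over the instances that are still heartbeating.
    public String choose() {
        Instant cutoff = Instant.now().minus(EXPIRY);
        List<String> healthy = lastHeartbeat.entrySet().stream()
                .filter(e -> e.getValue().isAfter(cutoff))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
        if (healthy.isEmpty()) {
            throw new IllegalStateException("no healthy instances available");
        }
        return healthy.get(Math.floorMod(counter.getAndIncrement(), healthy.size()));
    }
}
```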
But the big problem with this one, and I did it this way so it's my own mistake, I'm putting my hand up, is that we built far too much functionality into these fat libraries. It was very common at the time. So as soon as we moved to a polyglot environment, we were suddenly having to recreate these powerful facilities inside each and every other language. Now, as we'll see, a common trend is that a lot of these shared cross-cutting concerns have since been pulled out into infrastructure, things like Istio. Infrastructure like that makes it very, very powerful in a polyglot environment and takes away a lot of the heavy lifting that used to be done by these big fat libraries.
Keeping the Peace
Keeping the peace. This is what it's like to run the service at scale, and I've subtitled it "Held back by the remains of the monolith." We weren't quite free of the monolith, because as you can imagine, we'd only pulled out enough so that we could create new features as separate microservices, but the platform still got very hot. Things like your inventory, the stats, the summoner details are still there. They're still a lot of work, and the trouble now is, whenever the platform goes down, or whenever something going on with it causes performance problems, every microservice suffers.
We got away with this for a long time, but just at the start of this year, you see that Riot had to pull back a couple of game modes. So we'd spent a long time developing these game modes, and when these game modes went out, they realized that in the wild, you get performance problems with the monolith. And this is actually from one of the Riot explanations. "Technically, the platform was written as a monolithic service, which means when things go wrong, it's difficult to debug." No kidding. It's hard to debug. It's hard to independently scale. It's hard to operate.
If We Could Redo?
So that is the consequence of not going far enough with our microservice evolution. If we could redo it, what would we do differently? Well, with the benefit of 20/20 hindsight, and remember this is an enormously successful game, even to this day, and was the largest game in the world for the longest time before the release of Fortnite, we would decouple the state completely, pull it into separate microservices, and do that work. It's not impossible work to do. It would have been painful. And to do that, we'd have to socialize it to get prioritization. We spent a lot of time pretending that we could do it without any adverse effect on feature release velocity. That's typically not the case. You need to get buy-in so that the organization sees the benefits of decoupling in this way.
I would also simplify the infrastructure. Again, these are my own personal learnings: I made the configuration system far too clever for its own good. It was hard to debug, hard to run in operations, and sometimes you'd get results you didn't expect. And I put far too many smarts in the fat libraries. It was common at the time, so I don't regard it as too bad a mess. But what's thankfully happened is those facilities have been pulled out into systems like Envoy, which make it a lot easier to run in a polyglot environment.
Microservices for Internet Video
So, moving on: Microservices for Internet Video. And I've subtitled this "Hundreds of tiny pieces ..." So if you don't know Hulu, Hulu is a very popular U.S.-only internet video service provider. Until 2016, it was video on demand only. In 2016, 2017, we released the live TV offering. Hulu is enormously popular, particularly since the release of such premium content such as "The Handmaid's Tale," and it's certainly one of the big three video distribution companies in the U.S.
So when I joined in 2016, Hulu already had a microservice architecture. It wasn't as advanced as it later became, but we decided to add live TV to it. And you can imagine: you've got video on demand, dealing with just a video catalog, and then live TV with all of the attendant changes in the programs, the ability to browse channels, etc., etc. In a 15-month development, we basically reworked or recreated 400 of the 800-plus microservices. There's an article on medium.com, if you follow that link, where I wrote about our experiences and how we created all those microservices and how they interoperate and run.
So, incredibly successful, and also, on the face of it, slightly crazy; we replaced the entire backend and all of the frontend in one fell swoop and then released it on a single day, under a lot of pressure to meet a certain date. But it has been incredibly successful. From the public figures we've got, Hulu had at least 20 million overall subscriptions and 1 million-plus live subscriptions. The numbers are higher than that. These are the official ones.
Some of the video-system particulars differentiate it from gaming. Typically, when you're looking at something like Netflix, you're spending 80% of your time browsing. I mean, I don't know about you guys, I can't find the show I want to watch half the time on these video services, and this is where recommendation engines are so important, but regardless, I still spend 80% of my time browsing and 20% watching the show. So there are a lot of people doing browsing, and in the Hulu system, it's heavily personalized. You log in, you see stuff which is relevant to you. Now, that places a big burden on caching. TV show metadata is needed everywhere. We need to know how long the show is, what the restrictions are on the content, what devices it can be played on, and that data needs to percolate through the entire set of microservices. It's a data distribution problem. We often talk about wanting to locally capture the data and the functionality together, like a good object-oriented class design, in one service. In this particular case, that's not possible, so we needed to create patterns to funnel the data through the whole architecture. There are obviously real-time playback concerns, because it's got to support live TV, and there's a huge amount of integration. We're always integrating with another billing provider, another ads provider, etc.
So, Hulu has some outstanding engineers, well, actually lots of them, and they created some fantastic infrastructure for dealing with microservices. The thing in yellow there is the Donki PaaS, or the Donki platform as a service. You can think of it like an in-house developed version of Heroku, but written in Python instead of Ruby. The Donki PaaS was retrofitted to containerize the applications and run them on Mesos Aurora. But one of the issues with Donki, and this was something we only realized later, is it didn't have infrastructure as code. It didn't have that ability to script the deployment of your application. As a result, the only way you could access and change the system was effectively through UIs. People were having to go in and make changes by hand in the UIs, and it made it very hard to just spin up another environment with all of your software in it. It was possible, of course, but the ease of use of Donki, in some senses, became its worst enemy later on.
On the left hand side, you can see Github and Jenkins used for the CICD pipeline. Hulu built its own provisioning system. And you've got a set of VMs there which it's managing. And then on the top right-hand side you can see that we use DNS for service discovery, which had its own pros and cons. Everything, including service to service traffic, is going through load balancers. And on the bottom right-hand side, you can see we built our own RDS equivalent for MySQL and Redis, so we could self-provision databases and infrastructure.
Funny story: Hulu loves building its own things. They've had a culture of build for a while, which is gradually changing. The funny story is, when the company was first founded, the first thing that Hulu did is they built a telephone conferencing system, because, why not? As I said, they're immensely capable engineers. They built a lot of this stuff from scratch because they started around 10 years ago, and it has evolved and still runs the system today.
Fine-grained microservices allowed us to perform what is known as the Inverse Conway Maneuver. Conway's Law states that your architecture tends to resemble your organization, which means that you get all of the inefficiencies associated with bad organizational structure. Because our services were so granular, we were able to reorganize our teams around the architecture and redistribute the services and the ownership of those services as we saw fit. So a very powerful approach. We were able to change the organization very fluidly. Every team owned their own microservices. They owned the operational side, all the way from start to end.
This is the data distribution pattern I was mentioning. Unique to this architecture, in my experience, is the need to funnel reference data out to every microservice. Literally every microservice needed some aspect of this data, and they needed it in a way where we couldn't possibly just centralize it all in one service. So the pattern that we used is down at the bottom there: you can see that we have a mastering system. That system is where you master the data. It's the system of record. It's where you ingest the data, and it's also where you do admin on the data and modify it if you need to make changes. It's the read/write part.
Then we publish out to these caches, and we allow these caches to be instantiated on demand, such that we can put one in front of a service that needs to do a lot of reads of the metadata, and suddenly we don't have a performance problem. So we read through the caches. Now, we had two types of expiry and notification mechanisms. The first one is we just used an HTTP proxy and set a time to live on the data; we could read through it, and it would sit in the cache for a while and be very fast. However, if we needed more responsiveness, we published out on Kafka, and the caches were a bespoke implementation which basically listened to the message, knocked the elements out of the cache, and dealt with it that way. So again, we used this pattern to funnel data through our architecture.
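As a hedged sketch of those two cache mechanisms, assuming invented class names rather than Hulu's actual bespoke implementation, a read-through cache with a time to live plus an invalidation hook that a Kafka consumer would call might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of the two expiry mechanisms described in the talk:
// a time-to-live on read-through entries, and explicit invalidation that a
// Kafka listener would trigger when the mastering system publishes a change.
public class ReadThroughMetadataCache<K, V> {
    private record Entry<V>(V value, Instant loadedAt) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. an HTTP call back to the mastering system
    private final Duration ttl;

    public ReadThroughMetadataCache(Function<K, V> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    // Read through: serve from the cache while fresh, otherwise reload from the source.
    public V get(K key) {
        Entry<V> entry = cache.get(key);
        if (entry == null || entry.loadedAt().isBefore(Instant.now().minus(ttl))) {
            V value = loader.apply(key);
            cache.put(key, new Entry<>(value, Instant.now()));
            return value;
        }
        return entry.value();
    }

    // Called by a Kafka consumer when a change message names this key, so the
    // next read falls through to the mastering system immediately.
    public void invalidate(K key) {
        cache.remove(key);
    }
}
```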
On launch day, we had no idea how many people to expect to use the service. We had some models, so we constructed a lot of data around what we could expect. And for the most part, we tracked correctly. You can see, up until that blue spike, things are going very well, and so we felt pretty proud of ourselves. Now remember as well, we switched over as many clients as we could on the first day. We literally had a big bang where, like, day one, no traffic, day two, all traffic. The blue spike that you can see in the top left there, we couldn't understand. We were getting so much browsing traffic, we were worried it was going to chew up all of our excess capacity. It turned out to be college kids falling asleep with the Xbox on. We'd forgotten to put in an idle timeout on the Xbox release, which meant that when you're watching on the Xbox and you fall asleep, it's still pinging our system every minute with personalized browse queries. Once we fixed that, the problem went away and we started tracking the predicted amounts again. The moral of the story there is you've got to overprovision, because you simply cannot understand every possible use case of the system or everything that will go wrong.
Keeping the Peace
Keeping the peace. What's it like running this system? I've subtitled that "everything gets magnified." When I was at Riot Games, we used to joke about this. We'd say, "if anything can go wrong, it will," and that was absolutely true at the scale we were operating at. Unbelievable things happened, down to the hardware level. It's a similar thing at Hulu. It's running a very large service, and remember, this is a very successful service. On every front, this has been enormously successful, but there have been problems. There have been challenges.
One of the first things we did is we built out an architectural operations dashboard. What I mean by that is that when you have 800 microservices, it's hard to see how they all connect together and achieve a certain function. So we organized it around a whole set of domains. This is basically a mockup, because I don't want to show the real dashboard we created, but along the top is a set of tabs for each of the different domains. There's login, there's browse and search, there's playback, and we're looking at playback. This is a diagram we constructed for the press for a previous thing, but it's very similar to what we had, and it shows, on the left-hand side, encoded signals coming in, being repackaged and pushed out to the CDN, and then making their way to the TV set on the right.
This is very powerful; it lets you see all of the microservices in context, maybe not in a granular fashion, but rolled up into these boxes and lines. So it became a very powerful way of understanding the system. We also had a system called Homez, which is basically a giant test runner. It's a clone of Gomez, the commercially available system. What that allows you to do is run tens of thousands of smoke tests and health checks against the running live system every hour. And we made it self-service, so if someone went into Homez and added a client gateway tag, and that test failed, it would appear on the dashboard by coloring that box red and then coloring the tab red. So in this particular case, you can see the knock-on effects; you're like, "Oh, the metadata service is okay, but the client gateway is not. It's going to start having an impact in this way, on playing videos."
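The tagging-and-rollup idea can be sketched roughly like this, with invented names and structure rather than the real Homez system: each smoke test carries domain tags, and a tab goes red if any test with that tag fails.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Illustrative sketch only: smoke tests tagged by domain, rolled up so an
// operations dashboard can color a domain tab red if any of its tests fail.
public class SmokeTestRunner {
    public record SmokeTest(String name, Set<String> tags, Supplier<Boolean> check) {}

    // Run every test and report, per tag (e.g. "playback", "client-gateway"),
    // whether all tests carrying that tag passed.
    public static Map<String, Boolean> runAndRollUp(List<SmokeTest> tests) {
        return tests.stream()
                .flatMap(t -> t.tags().stream().map(tag -> Map.entry(tag, safeRun(t))))
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.reducing(true, Map.Entry::getValue, Boolean::logicalAnd)));
    }

    private static boolean safeRun(SmokeTest test) {
        try {
            return Boolean.TRUE.equals(test.check().get());
        } catch (RuntimeException e) {
            return false; // a health check that throws counts as a failure
        }
    }
}
```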
Scaling for growth, as I mentioned, was a challenge. We initially started with the Super Bowl, and the Super Bowl was literally predicted at 10 times the traffic that we had at peak. So it's quite a scary thing. What we did to reduce some of the engineering anxiety is, firstly, we only looked at systems which were directly facing the outside world. We didn't care so much about the internal systems; we figured the teams dealing with those would handle them anyway, so we just concentrated on a key number of edge-facing systems. And then we classified the systems according to how they scaled: whether the number of queries per second scaled linearly with the number of viewers, or at 0.5 times, or at 0.3 times.
Now you can imagine, if someone says to you, "You're going to get 10 times the traffic, but you're only going to get three times the actual queries per second," it's a lot less scary than someone saying, "Oh, by the way, prepare for 10 times the traffic." And as we classified the systems, we found that a good 80% of them fitted into that 0.3 times category. So again, it took away a lot of the anxiety. We were able to load test at the right levels. We were very confident, and we actually had no scaling problems for the Super Bowl, March Madness, or the Olympics. That is an article that another engineer and I wrote on Medium, again for Hulu, talking in more detail and showing some of the graphs of how we achieved the scaling.
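The arithmetic behind that classification is simple, but it changes the conversation. A toy illustration, with made-up numbers:

```java
// Toy illustration of the load-classification arithmetic (the QPS numbers are
// made up). A service classified as "0.3 times" sees its queries per second
// grow at 0.3 times the audience growth, so 10x viewers means roughly 3x QPS.
public class LoadProjection {
    static double projectedQps(double currentQps, double audienceMultiplier, double scalingFactor) {
        return currentQps * audienceMultiplier * scalingFactor;
    }

    public static void main(String[] args) {
        // An edge-facing service currently at 2,000 QPS, with a 10x Super Bowl audience.
        System.out.println(projectedQps(2_000, 10, 1.0)); // linear: 20,000 QPS
        System.out.println(projectedQps(2_000, 10, 0.5)); // 10,000 QPS
        System.out.println(projectedQps(2_000, 10, 0.3)); // 6,000 QPS -- far less scary
    }
}
```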
One of the things I wish we'd done really early on is put in circuit breakers. I'm going to motivate why you have circuit breakers in a second; they avoid these firestorms, or what we call cascading failures. They're so important when you have this granular architecture, because eventually one part of the architecture will catch fire, and you want that fire to go out quickly, or you want those instances to be taken out of rotation.
So imagine this. You've got services A, B, and C, and they're calling service X, and X is calling service Y. Now, Y might be a tiny little service. It might literally be doing geolocation of the user, to make sure they're in the right region before playback starts. So it's something that's only very peripheral to actually continuing playback. It's just something that's used when the video starts. What will happen every now and again is service Y will die. And it might only be one instance out of 10 that dies. And you think, "Well, that's not a big deal, surely. Only one's died." What will happen, though, is X will then call out to it, and it will load balance over the available set of instances, because the dead instance hasn't been pulled out of the load balancer yet, and it will chew up all of X's threads, because we're dealing with many tens of thousands of queries per second. So suddenly X is dead: its calls haven't hit their timeouts yet, but in a couple of seconds it's literally got to the point where it can no longer run. And that will just cascade through the architecture. Once X is dead, it's exhibiting the same symptoms, and A, B, and C die.
This is how a tiny little service in one of these complex systems can flake out and catch fire, and then all of a sudden you've got the CEO on the phone to you going, "Where's the video gone? And what's died?" And you say, "Oh, it's the geolocation service," and they say, "Well, how has that affected the whole thing?" So circuit breakers prevent this. Something like Envoy basically puts a little proxy in front of every service, and that proxy records failures across the architecture and then says, "By the way, we've accrued so much failure that we're going to take this one instance out of service immediately, and every process that's talking to Y will no longer talk to those bad instances." What that means from X's perspective is it can fast fail. When it's talking to Y, it will know immediately that Y is not available, so you're not waiting for the timeout, and it can take alternative action: "I'll call service Z," which might be a cache of geolocation data. It might even be something that simply does some very basic checks: are we allowed to play this content?
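As a minimal sketch of that failure-accrual idea, with invented thresholds and names, and with the caveat that in practice you'd reach for Envoy, Istio, or a library like Hystrix rather than rolling your own:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal failure-accrual circuit breaker, for illustration only. Real
// deployments would use Envoy/Istio outlier detection or a library such as
// Hystrix rather than hand-rolling this.
public class CircuitBreaker {
    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        // Fast fail: while the breaker is open, don't even try the primary call,
        // so callers aren't stuck waiting on timeouts to a dead instance.
        if (openedAt != null && Instant.now().isBefore(openedAt.plus(openDuration))) {
            return fallback.get();
        }
        try {
            T result = primary.get();      // e.g. call the geolocation service Y
            consecutiveFailures = 0;
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();  // trip the breaker; try again after openDuration
            }
            return fallback.get();         // e.g. a cached or degraded answer (service Z)
        }
    }
}
```

A caller would then write something like breaker.call(() -> geoClient.locate(user), () -> cachedRegion(user)), where geoClient and cachedRegion are hypothetical stand-ins for the primary call and its degraded fallback.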
So circuit breakers prevent firestorms. It's probably one of the biggest things I wish we'd managed to get into our architecture early. We made use of libraries like Hystrix in certain parts of the architecture, but the real value comes when you bring this in early, so that people are thinking right from the start about: well, if that key service is not available, what's the alternative? What should I do in this particular case? Because without circuit breakers uniformly applied, the only alternative is simply to cascade the failure.
We also get cross-cutting requirements. We've got 800 microservices, and every now and again we get a requirement for a serious feature, like upsell. Upsell is where various parts of the UI can present the option to upgrade your package. That would cut through 50 to 60 microservices. Now, this is not a million miles away from OO design: if you get the class boundaries slightly wrong, someone will come in and say, "Well, now we want it to do X," and you say, "Well, I'm going to have to completely rework my class design and all of the architecture." In a microservice world, that becomes quite hard, because you're having to modify N different APIs and make changes that stripe across the architecture. I don't feel we fully solved this part. Some of it comes down to the need for creating platforms. Some of it also comes down to the need to look at the vectors of change and factor your architecture around them. I don't think we were successful in all cases.
Cloud versus data center. Hulu has a big data center presence and has moved to the cloud for its live pipeline. But we were using the Donki system, and we took an approach where Donki abstracts out the difference between the cloud and the data center. I was initially a huge fan of this. I gradually changed my thinking over time, because it tends to bring in a lowest-common-denominator approach. You want to use the serverless architecture of AWS? Sorry, we can't support that. You want to use a particular database, or a Kinesis queue in AWS, or you want to use Google Spanner? Bad luck, because it's got to run the same in the data center and the cloud.
So we had no elasticity either. We had to overprovision. As Meg Whitman, the ex-CEO of eBay, is fond of saying, when you're in the data center, every day is New Year's Eve. You have to provision for the biggest traffic you're ever going to get, and it gets expensive to run. In the cloud, you obviously have elasticity. It's a very powerful tool. We didn't have that. It also made it hard to do proper blue-green deployment, where you bring up a separate cluster, funnel traffic over on a gradual basis, and then fold that new cluster away if anything goes wrong, or kill the original cluster. It allows you to do rollbacks so quickly.
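To make the blue-green idea concrete, here is a hedged sketch of the traffic-shifting step, with invented names: route a configurable fraction of requests to the new (green) cluster and the rest to the old (blue) one, so a rollback is just setting the fraction back to zero. In practice this lives in the load balancer or service mesh rather than in application code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of traffic shifting during a blue-green deployment:
// send a configurable fraction of requests to the new cluster, and roll back
// instantly by setting that fraction back to zero.
public class BlueGreenRouter {
    private volatile double greenFraction = 0.0; // start with all traffic on blue

    // Gradually ramp this up (e.g. 0.01 -> 0.1 -> 0.5 -> 1.0) as confidence grows.
    public void setGreenFraction(double fraction) {
        this.greenFraction = Math.max(0.0, Math.min(1.0, fraction));
    }

    // Pick the cluster endpoint for this request.
    public String route(String blueEndpoint, String greenEndpoint) {
        return ThreadLocalRandom.current().nextDouble() < greenFraction
                ? greenEndpoint
                : blueEndpoint;
    }
}
```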
Believe the Hype
So, on to section three: believe the hype. What are some of the benefits we got from microservices? As I mentioned, I wasn't a true believer at the start in a lot of the things I've talked about. At Riot Games, I was like, "Oh, it's good, but I really wouldn't want to run too many of these things." At Hulu, my opinion was completely changed by seeing the benefits we got from these granular microservices.
So ownership and independence is a huge thing. You're no longer stepping on each other's toes in this giant platform monolith. You're just dealing with your own code, which is a lot smaller, so there's a lot less cognitive overhead and you get that development independence, which leads to that huge development velocity. Like, on the face of it, Hulu was able to achieve outstanding and amazing things, to completely rewrite their service in 15 months, and it's a very, very successful service. So we got that operational and development scaling as well. We could scale one small part of the system by throwing extra resources at it and spending a bit of engineering time making it operate better, without having to do what you have to do in a platform monolith, where you have to scale the whole thing for every use case.
Some other benefits. We got very granular deployment. One of the thoughts I had early on was that granular deployment would lead to more errors. It doesn't, as long as you build the right checks in. So, at Hulu, it was not uncommon for us to deploy 100 times in a day. That's very powerful, because every time you deploy something small that has been properly tested and only advances the system forward a little bit, you incur only, say, 10% of the risk. If you have to change the entire set of software, or something very coarse-grained, you may be incurring 50% of the risk. So that granular deployment really helps out.
Evolution is easy as well because you can produce a v1 of the API, and then just roll out a new service with v1 and v2 of the API. REST really helps as well with that, although systems like Thrift and Protobuf also contain versioning primitives. And as I mentioned earlier, organizational alignment. Our services were granular enough, we could just spread them out through the organization, as we reorganized to get additional benefits of scale, because Hulu has maybe got 800 engineers worldwide. We were able to redistribute the microservices in a very powerful way.
Believe the Challenges
So believe the benefits, believe the hype, but believe the challenges as well, because they're very real and they'll bite you if you don't take advantage of this new infrastructure. We're very lucky. We live in very fortunate times where people have seen the need for these things to be abstracted out into infrastructure to avoid the pitfalls of microservices, and so I've subtitled this part "Standing on the shoulders of constantly improving infrastructure …"
So here is just a template of a build and CICD pipeline, with deployment and rollback into your runtime infrastructure, sitting on operational infrastructure. The first thing I would say is, make sure you have that common CICD pipeline across your company. It's so important for auditability, for keeping track of what's been deployed across the company and when, for rolling it back quickly, and for having that uniform set of capabilities. And it's so easy to build these things these days. I mean, we've got some amazing tools and infrastructure available, so something like Jenkins for CI, these are just examples, something like Spinnaker, or a commercially available tool like Harness. And it's so important to use infrastructure as code, because you want to be able to spin up your environment and then tear it down. If you can't do that, you're in a very weak place.
So we deploy into the runtime infrastructure, and really, it's gotten so easy these days with such lovely tooling, things like Kubernetes. Something like Istio is built on top of Envoy from Lyft, and it basically abstracts out these cross-cutting concerns into these little proxies, which gives you best-in-class service-to-service monitoring and best-in-class circuit-breaking facilities. And build on something like Stackdriver for logging, etc.
Then, for operational infrastructure, I've become a huge fan of the cloud, and I say this as someone who has worked for two companies that have heavy investments in data centers. The reason why the cloud is so powerful is that you have this easy elasticity. You can bring up these environments quickly, and you can maybe even bring up a thousand machines. It's not uncommon, at some of the companies I've talked to, for them to bring up 50,000 machines. You run them for an hour, you do all your testing on them, and you fold them back down again. You're only paying for that hour. You don't have that deep investment in the data center where you need to pretend that every day is New Year's Eve. So the cloud has many advantages over the data center. You've got the elasticity, the ease of blue-green, and you've got better shared services. We could only devote, you know, 10, 20 engineers to managing the shared services at Hulu. And they were sensationally good engineers, but the bottom line is they can't compete with 10,000 people from Amazon or Google. There's simply no way they can. It doesn't matter how good they are, how hard they work. And so the cloud has these advantages.
And my preferred approach these days: I used to be very big on "let's make a set of abstractions so that we can deploy on any cloud." I've changed my mind over time. I don't think any of these clouds are going anywhere. I think they're all competing on price. I think you're going to get the price advantages even if you can't arbitrage between them. Pick one cloud provider, possibly per workflow. You might want to put your big data in a separate place from your transactional systems. Consider costs early, because you need to consider how you can save costs in this elastic environment. And go multi-region, multi-account on day one. Multi-region for that resiliency, and multi-account so you get those nice little bulkheads, so that when something goes on fire, it's constrained within that one account. In our particular case at Hulu, we created a handful of accounts, and we were doing multi-region.
Put in circuit breakers by default. I cannot emphasize this enough, and these are some graphs that come from Linkerd. We were doing some experiments with Linkerd, another service mesh. In the top right-hand corner, you can see that we've got no failure accrual. That blue line you can see just below the orange and red lines, in the top right, is showing that we're getting a consistent rate of errors; even though only one service out of ten is failing, it's affecting everything. On the bottom right-hand side, where we put in a failure accrual policy, you can see that the errors are minimized, and the line is literally bouncing along the bottom of the graph. What you're seeing is that every now and again, you get a little spike as the service mesh decides, "Oh wait, hang on a sec. I'm going to retry that one service." So again, circuit breakers by default literally transform your architecture from one that catches fire as soon as someone strikes a match, to one where you can pour petrol on something, set it on fire, and it hardly affects anything.
I would put in an API gateway earlier. By an API gateway, I'm talking about something that allows you to do rate limiting, quotas, and development keys. It can also hold your API documents, so you can see that living, breathing set of APIs that you create in REST. I'm talking about something like Google Apigee, Amazon API Gateway, something along those lines.
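Gateways like those give you this out of the box, but to make the rate-limiting idea concrete, here is a hedged token-bucket sketch with invented capacity and rate values; it's the kind of policy a gateway applies per development key, not something you'd normally write yourself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative token-bucket rate limiter of the kind an API gateway applies
// per development key. In practice you configure this in the gateway
// (Apigee, Amazon API Gateway, etc.) rather than writing it yourself.
public class TokenBucketLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
        Bucket(double tokens, long now) { this.tokens = tokens; this.lastRefillNanos = now; }
    }

    private final double capacity;        // burst size allowed per API key
    private final double refillPerSecond; // sustained request rate per API key
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public TokenBucketLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    // Returns true if the request identified by apiKey is within its quota.
    public boolean tryAcquire(String apiKey) {
        long now = System.nanoTime();
        Bucket bucket = buckets.computeIfAbsent(apiKey, k -> new Bucket(capacity, now));
        synchronized (bucket) {
            double elapsedSeconds = (now - bucket.lastRefillNanos) / 1_000_000_000.0;
            bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
            bucket.lastRefillNanos = now;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```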
Now, in the past, these gateways could create a terrible situation where you're going through the gateway even for service-to-service traffic, and you're effectively recreating an operational monolith. Well, they've become more powerful than that, because they can interface with systems like Envoy now, and the gateway can apply its policies without getting in the way of service-to-service traffic. So again, you're getting the best of both worlds. You've got that consistent set of policies, that consistent monitoring, but you no longer have a single point of failure.
And finally, one of the things that often happens in these microservices environments is that you need holistic treatment of things like load testing. You need to be testing browsing as well as playback, as well as logging in, as well as DVR functionality. But we want to make it so that the teams are not coupled to each other. We want that self-service ability. So what we do is we build platforms, such that load testing becomes a scenario that people contribute tests to. Or take billing, which is a great example of an integration pattern in microservices: we abstract all of the common logic into a set of capabilities running in microservices, and then for every new billing partner, such as Sprint, we create one microservice which handles all of the Sprint-specific logic. Browse caching, A/B testing, UI layout: these are all things that at Hulu, for instance, we built platforms around, such that teams could self-serve but still contribute to that holistic vision.
Takeaways
So the takeaways from my talk that I want to leave you with, are that microservices offer many benefits. There's that strong isolation and independence, there's that granular deployment, scaling and evolution, and the benefits are real. I've tasted them and I've been amazed by the development velocity that you get. And I never would want to go back to anything even resembling a monolith.
There are a lot of challenges too, but luckily a lot of the new infrastructure that has appeared has made dealing with these challenges very easy, or at least lets you deal with them in a very powerful way. So take advantage of the modern CICD pipelines, and particularly some of these very powerful continuous deployment tools that allow you to do things like automated canary analysis. Build out infrastructure-as-code capabilities so that you can spin up your environment and then tear it down again. Use circuit breakers to prevent firestorms, and use best-in-class tools like Istio to do your service-to-service monitoring. It gives you so much visibility that you wouldn't otherwise have. And then take full advantage of cloud elasticity. Allow the cloud to be the cloud, and use it in a way where you take full advantage of the fact that you can spin up a whole array of machines, use them for an hour, and then spin them down again.
So, thank you very much for listening. One thing I would say, I'm able to take some questions now. I think I've got five minutes. Also, we've got an AMA tomorrow at 2:55 pm in boardroom C on decomposing the monolith. Are there any questions?
Man: Hi. Thank you. It's a good presentation.
McVeigh: Thank you.
Man: Covered all the details that I wanted to get out of this one. So my question is, you talked about Istio and service mesh and how important it is to get observability and all of the features from day one. But the service mesh solutions we have today, Istio and Linkerd 2.0, are all based on Kubernetes. So for an enterprise, or for any company that is not ready for Kubernetes yet, is there any tooling available?
McVeigh: Well, I mean you've got the essential component which is Envoy, which you can certainly place as a proxy in front of all your services. So, from that perspective, you don't have to use something like the auto-injection capabilities of Kubernetes. So I'd strongly suggest looking at stuff like Envoy. It's that building block and it's very powerful. You know, it's handling gRPC, it's handling REST, it's handling HTTP/2. And it's small. It's not got a big runtime and I just place one in front of every service.
Man: I thought Envoy [inaudible]. I did not know that it is doing something [inaudible 00:45:09]
McVeigh: Yes. Oh, sorry.
Shiva: Hi, my name is Shiva. I work for Visa and thank you very much. It's a very good presentation.
McVeigh: Thank you.
Shiva: And I have one question which was not mentioned in this particular presentation. It's more on how do you manage code and how do you manage versions between the microservices? And I would like to ask one more question here. What is the best way to manage the code in terms of mono-repository, or you have a repository per service. So what's the best practice here?
McVeigh: We had a repository per service and we used the team organization structure in Github Enterprise to tie the services together. So from the perspective of a developer, you can think of it … They're just sitting in their own little repo. I mean, you get that very fast development velocity. So at Riot though, it's interesting because the platform monolith was in a single repository, and we were always stepping on each other's toes. Very common in gaming to have a single repo for the whole company. And I found it much more freeing to have that individual repo per service.
Shiva: Okay. The latest programming languages like Go, right, they actually support this mono-repo model.
McVeigh: Sorry. They support …?
Shiva: They support- a lot of new projects are coming in that model. Do you think there are any benefits to it or no?
McVeigh: You mean benefits to having those granular repositories?
Shiva: Yes.
McVeigh: Yes, I mean, I think there are huge benefits because, in essence, you're living in a small world now, and it's all about getting back to that small world where you can get that huge development velocity. And Go in particular is fascinating because it's so hard to create abstractions in it. You can't do anything super fancy, so it's very fast to develop in. And when you're keeping that small world, you've got your own repo, you don't have to step on anyone's toes, you can just get that enormous development velocity that you experience when you're just making a project for yourself.
Al: Hi, Andrew. Thanks for the presentation. My name is Al. We're actually an ecommerce company. I really liked the idea you mentioned about picking a particular platform or cloud based on the scenario. But with the outages that we see for every cloud provider, we can't afford to have any downtime, right? So we're actually highly distributed microservices on Kubernetes, and we're trying to be on all the major clouds like Azure, Google, and AWS. But that makes it very difficult for us to manage, right? GKE, EKS, AKS, being available in some DCs or not. What's your idea about that?
McVeigh: So I can't remember. I'd be interested in looking at what the downtime of every region is. Say if you create something that's multi-region, you've effectively got those bulkheads already because if Amazon dies in one region, it's very unlikely to die in another region. So in our particular case, we looked at the downtime across regions, realized that there was almost no downtime which was in common between all of them, and just created [inaudible 00:48:22] to multi-region. And that's exactly why you're going multi-region. Now, I haven't dealt with a medical system or a financial system which absolutely needs 100% uptime. So maybe, on analysis, you might say exactly what you've concluded and you just have to bear the engineering cost of that. Sorry. We got time for one last one?