Netflix Play API - An Evolutionary Architecture



Suudhan Rangarajan dives deep into how Netflix used a set of three core foundational principles to iteratively develop their architecture. He specifically talks about what patterns they observed in their previous architectures and how they arrived at a list of practices to create an Evolutionary Architecture.


Suudhan Rangarajan works on the Playback API team at Netflix, responsible for ensuring that customers receive the best possible playback experience every time they click play. Prior to Netflix, he worked on the Audio/Video decoding pipeline in Adobe Flash and Adobe Primetime products helping many partners create a great video streaming client.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


All right, let's begin. So in the year of 2016, two big events happened at Netflix. In early January, at the flip of a switch, we launched Netflix globally, enabling customers across 195 countries to instantly enjoy Netflix. In the same year, towards the end, I believe it was around December of 2016 if I'm not wrong, we released one of our most requested features: the ability to download content and play it back offline. It was motivated by the fact that across the world, if you have a spotty network connection, it comes in handy to download Netflix content and watch it offline. But according to me, it really comes in handy if you have kids who watch "Boss Baby" again and again and again in a loop.

These two events translated into two top-level goals for our services engineering organization. One was high availability: it was always primetime viewing in some part of the world. Number two was innovation velocity: the entire downloads feature was conceived, designed, developed, tested, and deployed in a matter of months. However, one key service, our API service, whose responsibility is to orchestrate all functionality flowing into the Netflix ecosystem, was struggling to keep up with these two top-level goals.

To explain things more concretely, I'm going to show you a couple of graphs. This graph indicates the number of times Netflix had some form of an outage in the year of 2016. The peaks indicate the number of customers who were impacted by the outage, and the width indicates how long the outage lasted. And specifically, the red dots indicate the number of times our API service was directly or indirectly a contributing factor to the outage. So our high availability goal was at risk.

Separately, we have plotted two metrics in this plot: deployments per week and rollbacks per month. As you can see, over the years, we are seeing a decreasing trend in deployments and an alarming increase in the number of rollbacks due to the complexity of our API service. Now, I want you folks to imagine this. Say you own and operate this API service. As I said before, it is a critical service orchestrating and acting as an access point for all requests coming into the Netflix ecosystem. On the one hand, it's increasingly becoming a bottleneck for our feature velocity. And on the other hand, our availability numbers are not where we want them to be. In order to fix this, you're tasked with the responsibility of re-architecting the service. Where do you begin?

The goal of this talk is to provide you with a framework to think and reason about how to make such a big architectural change. I'm Suudhan Rangarajan. I work on the playback API team at Netflix, and we are responsible for a dozen microservices within the playback domain. If you have been to Jessica’s talk this morning, she said she was an ex-monolithic engineer. I guess that makes me an ex-distributed-monolith engineer.

Previous Architecture Workflow

Before we dive into the specifics, let's take a quick look at the previous architecture workflow. We support the Netflix application across thousands of devices, and millions of requests originating from those devices come into a Netflix ecosystem of services run and operated by various teams at Netflix. The requests typically come via the API proxy service, whose responsibility is to provide protocol termination, monitoring, and routing. Behind the API proxy service, we have the API service, whose responsibility is to orchestrate all requests flowing into the Netflix ecosystem. And behind the API service, we have hundreds of microservices, each with a very domain-specific, niche responsibility.

Within the API service, we have three major workflows. The first is the signup workflow. It is enabled by a set of signup APIs in coordination with, say, membership, authorization, billing, and other similar microservices. Once you log in, you see rows and rows of content in your Netflix view, all personalized for your taste. This is enabled by a set of discovery APIs in coordination with personalization, artwork, title, metadata, localization, and dozens of other services. And finally, you hit the play button. From the moment you hit play till you exit playback, it is enabled by a set of APIs we call the playback API and its associated microservices.

So let's keep the view of this architecture in mind so that as we go through this talk we'll pick apart the technical aspects of the previous architecture. And we'll try and compare and contrast between the previous architecture and our current architecture choices.

At the very high level, we recommend thinking in terms of three fundamental principles. Number one, identity. Number two, Type 1 and Type 2 decisions. And number three, evolvability.


Let's begin with identity. Start by asking yourself why: why does your service exist? If you remove your service from your ecosystem, what would be the impact? Go a step further and ask why your service exists with respect to why your company exists. Let me paint a picture for you folks. Why does Netflix exist? Netflix’s goal is to lead the Internet TV revolution to entertain billions of people across the world. And within Netflix, we have the product engineering organization. Why does that exist? Its purpose is to maximize customer engagement across all Netflix functions, from signup all the way to streaming. And then you go one level down. Within the product engineering organization, we have the edge engineering organization. Its sole purpose is to enable the acquisition, discovery, and playback experience with high availability.

Within the engineering organization, we have the API service and the API service identity is to deliver acquisition, playback, and discovery functions around the clock. As we went through this process of hierarchically determining what is the identity of our service with respect to our organization, with respect to other services, the first thing which we questioned was, does it still make sense for one API service to play a role in all these major functions which enabled the Netflix application?

In retrospect, we realized that we hadn't really applied the single responsibility principle. We rolled multiple identities into one service, which made the API service unnecessarily complex. This enabled us to make the first decision in our re-architecture. We said that in order for us to grow the business for the next several years, we wanted a separate API service for each of these functions. So we split the API service into a signup API service, a discovery API service, and a playback API service.

So now, within the engineering organization, we have the Play API Service, whose responsibility is just to deliver the playback life cycle with high availability. In order to do that, the Play API Service interacts with three categories of media microservices. It talks with a set of services which decide what the best playback experience is. It talks with a set of services which authorize each and every playback. And then there's a third set of services, which collect the playback data for business intelligence.

So now, to ensure that there is a specific role for the Play API Service, we removed the Play API Service from the equation and tried to reason about what would happen in that case. Immediately, we noticed that it introduces a large number of coupling points. Before, there were three coupling points; now there would be 12 or 13. That implied that maintaining the availability of each of the other services was going to be difficult.

Similarly, without the Play API Service, each of the APIs which the media microservices expose would be exposed directly to the devices. And again, that is a point of volatility. In essence, this cartoon kind of captures why we need the Play API Service. We’ve broken out from a monolith and we have separate domain-specific playback microservices. And then we need a specific layer whose responsibility is to orchestrate amongst those services.

By doing this, we came up with the Play API’s identity. We said its purpose is to orchestrate the playback life cycle while providing stable abstractions between devices and the domain-specific playback services. So the guiding principle here is that we believe in simple, single identities; the identity must relate to your organization and to your company, and should complement the identity of all the peer services in your ecosystem. So that's the first big guiding principle.

Type 1/2 Decisions

Then let's talk about Type 1 and Type 2 decisions. Let's do a quick show of hands: how many of you guys have used Type 1 and Type 2 Decision frameworks before? Not too bad, I think like 5%. So the concept came from Jeff Bezos, actually. In his annual shareholders letter, he talks about the type of decisions which sustain innovation at Amazon. He says, and I quote, "Some decisions are consequential and irreversible, or nearly irreversible - one-way doors - and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. We call these Type 1 decisions." He goes on and says, "But most decisions aren't like that - they are changeable, reversible, two-way doors. If you've made a suboptimal Type 2 decision, you don't have to live with the consequences for that long. Type 2 decisions can and should be made by high judgment individuals or small groups." This is a great piece of wisdom, which we can apply to architectural design.

So at Netflix, we believe there are three Type 1 decisions to consider. Number one is around appropriate coupling. Number two, the choice between asynchronous and synchronous. And number three is the data architecture. Let's go over each of them. When we talk about appropriate coupling, we have to talk about shared libraries. And when we talk about shared libraries, typically, we talk about two types. There's a set of shared libraries which provide some common function; say, this could be the IDS jar or the metrics or an in-memory cache solution. And then there is a set of shared libraries which act as client libraries, enabling us to talk between microservices.

For instance, most of our microservices' code originated from a monolithic code base. So there was a high proliferation of shared libraries. In fact, there is one particular library, which we call the streaming utilities JAR. If you looked at the dependency tree of that library, it had 121 dependencies. To make things worse, the library is consumed by almost 80% of our microservices.

So we have thick shared libraries with hundreds of dependent libraries. When you have these hundreds of libraries assembled as part of the microservices, crossing boundaries across different services, what you have is a distributed monolith. Any fatal change in one of these libraries has the potential to bring multiple services down. A distributed monolith is worse than a monolith because it has all the effects of the monolith on top of having to own and operate each of these microservices separately. So that's one form of coupling.

Sam Newman captured this very well, actually, in his book "Building Microservices". He says, "The evils of too much coupling between services are far worse than the problems caused by code duplication." We have another form of coupling with respect to the client libraries. The Play API Service talks to the playback decision service via a playback decision client. Whenever the playback decision service is unavailable, it uses a fallback from within the playback decision client in order to improve the reliability of the service.

However, in one particular instance, when the playback decision service was down, it resulted in the fallback executing from within the Play API Service, and we saw that the latency of the API service went through the roof. This is because the fallback was so heavy and so CPU intensive. Even though it provided reliability for the playback decision service, from the perspective of the Play API Service the availability was still impacted, and Netflix was still down.

So what we have here is some form of operational coupling. We have two domain contexts, each with its own responsibilities: the playback decision service makes the playback decisions and the API service does just the orchestration. However, via the playback decision client, the operational context of the playback decision service is leaking into the Play API Service.

Operational coupling worked very well for us for several years, mainly because many teams were not ready to fully own and operate a highly available microservice. However, as the years progressed, the API service, being a nexus service which incorporates several client libraries, became untenable to operate.

Another interesting issue which happens with the proliferation of shared libraries is some form of language coupling. It encourages people to stay in Java. Netflix has historically been a Java shop. However, there have been use cases which we are exploring; specifically, say, we wanted to build backend-for-frontend services in Node, and we wanted to be able to take advantage of Node's functionality and the device teams' expertise to build such a service. In order to do that, any team owning the Node service would have to hand-write the list of clients in order to communicate with the rest of the ecosystem. It was such a high-friction process that it discouraged most people from even considering such an option.

With respect to client libraries, there’s another, subtle form of coupling: the communication protocol. Many services at Netflix were written on top of the Jersey framework with a REST interface, and most communication happens as REST over HTTP/1.1. It works well, except it has one limitation. As soon as the connection is established between the client and the server, all communication is initiated by the client. So it's unidirectional; that means it can only support request and response style APIs.

Drawing from these experiences, we sat down and debated what our requirements should be, and we came up with four requirements. Number one, we wanted operationally thin clients. By this we meant that whatever client libraries we incorporate should not have any heavy fallbacks or special logic, and the dependencies they bring in should be well defined and limited.

Second, we don't want, if possible, to use any shared libraries or have a well-defined set of shared libraries, which is acceptable to use within an API service. Specifically, if we talked about the streaming utilities JAR, we made a call to not incorporate that in our API service anymore.

The third requirement is around auto generated clients. This is mainly for the Polyglot support. This meant that we wanted to define the API of our service in some form of interface definition language. And we wanted some kind of tooling to generate the thin clients in multiple languages of our needs.

And finally, we wanted bi-directional communication because we wanted to explore beyond request and response style APIs. One quick point of note here: one of the things we initially discussed was, do we really care about REST versus RPC? We didn't really have this as a requirement, because at Netflix most use cases are modeled as request response, and REST was a simple and easy choice to use. But it was more an incidental choice than an intentional one. By analyzing some of the services, we realized that most of them were not really using RESTful principles anyway. The URLs didn't represent a unique resource. Instead, the parameters which came along with a URL determined the outcome of the call, effectively making them RPC calls.

So we said we're agnostic to REST versus RPC, as long as it meets our other requirements. Based on this, we settled on gRPC as our framework of choice. gRPC uses protocol buffers as its IDL, so we are able to define our APIs and have clients automatically generated in all the languages of interest to us. It also supports Netty and HTTP/2 for bi-directional communication. However, we still had to build some ecosystem compatibility within the gRPC framework to make it work within Netflix.

So this is our previous architecture compared to the current architecture with respect to coupling. We have very minimal operational coupling, limited or intentional binary coupling. We are able to go beyond Java, and we are free to explore beyond request and response style API's.
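
To make the "operationally thin client" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class name, the `transport` callable, and the returned fields are hypothetical and are not Netflix's actual client or API. The point it shows is that the client makes the call and surfaces failure, with no heavy fallback running inside the caller's process.

```python
from dataclasses import dataclass
from typing import Callable


class ServiceUnavailable(Exception):
    """Raised when the remote playback decision service cannot be reached."""


@dataclass
class ThinPlaybackDecisionClient:
    # `transport` is a hypothetical stand-in for the real network call.
    transport: Callable[[str], dict]

    def decide(self, title_id: str) -> dict:
        # No fallback and no business logic here: make the call and
        # surface failures so the caller decides what to do next.
        try:
            return self.transport(title_id)
        except ConnectionError as exc:
            raise ServiceUnavailable(title_id) from exc


def healthy_transport(title_id: str) -> dict:
    # Illustrative response shape only.
    return {"title_id": title_id, "bitrate_kbps": 3000}


def failing_transport(title_id: str) -> dict:
    raise ConnectionError("playback decision service is down")
```

With this shape, an outage in the decision service shows up in the calling service as a cheap, explicit error rather than as CPU-heavy fallback work.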

So the first Type 1 decision we made, around appropriate coupling, is to consider thin auto-generated clients with bi-directional communication and to minimize code reuse across service boundaries. The next Type 1 question is around synchronous and asynchronous choices. In order to understand this, let's consider an example. Let's say we have an API called getPlayData. It takes in the customer ID, title ID, and device ID. Using the customer ID, it talks to the customer info service to fetch the customer info. Using the device ID, it talks to the device service to get the device info. And using the customer info and the device info, it talks with a third service, let's call it the playValidationService, in order to decide the playback data and return the PlayData.

Let's see how we would code this up in both synchronous and asynchronous architectures. A typical synchronous architecture looks like this. We have a dedicated thread pool for all the incoming requests. Each request execution gets one dedicated thread. Separately, each of the clients which talk to external microservices has its own dedicated thread pool, which manages all the outgoing communication. So for the PlayData call, what happens here is that we have three calls. The first call is getCustomerInfo. Okay, let me back up a little bit. So basically, we get a request for getPlayData, and it gets a thread from the Request Handler thread pool. That thread blocks till getPlayData returns. Within the getPlayData call, the first unit of execution is to call into the customer service in order to fetch the customer info. For that, it coordinates with the client thread pool corresponding to the customer info client and makes an outbound call to the customer service.

For this entire duration, the execution thread is blocked till you get a response back from the customer service. The same thing happens with the device info and with decidePlayData, till the PlayData is available to return. So in a typical synchronous architecture, you have a blocking Request Handler and blocking client IO. It works well for a simple request/response API where latency is not that big of a concern. It also works if you only have to worry about a limited number of clients which you are communicating with.
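
The synchronous flow described above can be sketched in Python with `concurrent.futures`. This is a minimal illustration, not the actual Netflix code: the three service functions, their return fields, and the pool sizes are assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for calls to the three downstream services.
def get_customer_info(customer_id):
    return {"customer_id": customer_id, "plan": "premium"}


def get_device_info(device_id):
    return {"device_id": device_id, "max_resolution": "1080p"}


def decide_play_data(customer_info, device_info, title_id):
    return {
        "title_id": title_id,
        "resolution": device_info["max_resolution"],
        "plan": customer_info["plan"],
    }


# One pool for incoming requests, plus one dedicated pool per outgoing client.
request_pool = ThreadPoolExecutor(max_workers=8)
customer_client_pool = ThreadPoolExecutor(max_workers=4)
device_client_pool = ThreadPoolExecutor(max_workers=4)
decision_client_pool = ThreadPoolExecutor(max_workers=4)


def get_play_data(customer_id, title_id, device_id):
    # Each .result() blocks the request thread until the client call returns.
    customer_info = customer_client_pool.submit(get_customer_info, customer_id).result()
    device_info = device_client_pool.submit(get_device_info, device_id).result()
    return decision_client_pool.submit(
        decide_play_data, customer_info, device_info, title_id
    ).result()


def handle_request(customer_id, title_id, device_id):
    # The request handler dedicates one thread to the whole call.
    return request_pool.submit(get_play_data, customer_id, title_id, device_id).result()
```

Note that the three downstream calls run strictly one after another, and the request thread is held for the full duration of all three.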

I've been talking about request response for a while now. Are there other use cases which go beyond request response? Let's say, for the same PlayData call, how can we model beyond the request response pattern? The request response pattern looks like this: you request PlayData for title X and you get a response back with the PlayData for title X. So it's one request and one response. We could instead define an API which accepts a PlayData request for titles X, Y, and Z, and as and when we have the PlayData available for each of those titles, we can stream the response back to the calling device. So this would be a request, stream-response pattern.

Or we can flip it, so that devices can request PlayData as and when they think it's necessary. Then we can collect the requests and send a single response back with the PlayData for all the titles. So that would be a stream-request, response pattern. And finally, we can have streaming happening on both sides, where devices can request PlayData as and when they need it and the service can respond with PlayData as and when it's available. That would be a bi-directional streaming pattern.
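
The four call shapes just described can be sketched with Python async generators. This is only an illustration of the patterns; the function names and the `fetch_play_data` stand-in are hypothetical, and a real service would express these shapes in its RPC framework rather than in plain `asyncio`.

```python
import asyncio
from typing import AsyncIterator, List


async def fetch_play_data(title_id: str) -> dict:
    # Hypothetical lookup; sleep(0) stands in for non-blocking I/O.
    await asyncio.sleep(0)
    return {"title_id": title_id, "playable": True}


# 1. One request, one response.
async def get_play_data(title_id: str) -> dict:
    return await fetch_play_data(title_id)


# 2. One request, streamed response: results are yielded as they become ready.
async def stream_play_data(title_ids: List[str]) -> AsyncIterator[dict]:
    for title_id in title_ids:
        yield await fetch_play_data(title_id)


# 3. Streamed request, one response: collect everything, then answer once.
async def collect_play_data(requests: AsyncIterator[str]) -> List[dict]:
    return [await fetch_play_data(title_id) async for title_id in requests]


# 4. Bi-directional stream: answer each request as it arrives.
async def exchange_play_data(requests: AsyncIterator[str]) -> AsyncIterator[dict]:
    async for title_id in requests:
        yield await fetch_play_data(title_id)


async def device_requests(title_ids: List[str]) -> AsyncIterator[str]:
    # Stand-in for a device sending requests over time.
    for title_id in title_ids:
        yield title_id


async def demo():
    single = await get_play_data("X")
    streamed = [p async for p in stream_play_data(["X", "Y", "Z"])]
    collected = await collect_play_data(device_requests(["X", "Y", "Z"]))
    exchanged = [p async for p in exchange_play_data(device_requests(["X", "Y"]))]
    return single, streamed, collected, exchanged
```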

If you think any of these stream-specific API patterns will fit into one of your business domains, it's worth considering an asynchronous architecture. An asynchronous architecture looks like this. You have an event loop for all the incoming communication; call it the request/response event loop. And you typically have a specific number of worker threads, which is usually a function of the number of cores in the machine. And then we have an outgoing event loop associated with each of the clients, which manages the outgoing communication.

In order to take full advantage of this asynchronous architecture, we need to code up the PlayData call a little bit differently. We want to split the PlayData call into different independent execution units, which can run in parallel. For example, the getCustomerInfo and getDeviceInfo call can happen in parallel. And then when the results of both those calls are available, you can zip them together and send it to a third call, which will be the decidePlayData call.

Let's see how the execution workflow will work in such a call pattern. Say a request for PlayData comes in and as soon as a request unit is available off the network buffer, the event loop will trigger one of the worker threads to execute the getPlayData call. The first call, all it does is that it sets up the execution and it immediately returns. A separate worker thread will fetch the customer info and another separate worker thread will get the device info. And then there is another execution unit which will zip the results from the device info and the customer info and then pass it on to the decidePlayData execution thread. And once the PlayData call returns, we are able to return the response.

As you can see, the workflow spans multiple threads. All context is passed as messages from one processing unit to another. If you need to follow and reason about a particular request, you need some form of tooling to assemble and capture these requests so that you can reason about them. And finally, none of the calls can block, because we have a limited number of worker threads. And it is designed with the complete population in mind. So if one of those worker threads blocks, it will significantly reduce the throughput of the service.
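
The "run two calls in parallel, zip the results, then make the third call" flow described above can be sketched with `asyncio`. One assumption to flag: `asyncio` uses a single-threaded event loop rather than the pool of worker threads described in the talk, so this shows the composition of the calls, not the threading model. The service functions and their return fields are hypothetical.

```python
import asyncio


# Hypothetical non-blocking stand-ins for the three downstream services.
async def get_customer_info(customer_id):
    await asyncio.sleep(0.01)  # stands in for network I/O
    return {"customer_id": customer_id, "plan": "premium"}


async def get_device_info(device_id):
    await asyncio.sleep(0.01)
    return {"device_id": device_id, "max_resolution": "1080p"}


async def decide_play_data(customer_info, device_info, title_id):
    await asyncio.sleep(0.01)
    return {
        "title_id": title_id,
        "resolution": device_info["max_resolution"],
        "plan": customer_info["plan"],
    }


async def get_play_data(customer_id, title_id, device_id):
    # The two independent lookups run concurrently; gather returns their
    # results together, in argument order, once both are available.
    customer_info, device_info = await asyncio.gather(
        get_customer_info(customer_id),
        get_device_info(device_id),
    )
    # The third call depends on both results, so it runs afterwards.
    return await decide_play_data(customer_info, device_info, title_id)
```

A caller would run it with, for example, `asyncio.run(get_play_data("c1", "t1", "d1"))`.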

So in an asynchronous architecture, we have an asynchronous request handler and non-blocking IO. The question to ask is, do you really have a need beyond request response? If you do, then you might benefit from an asynchronous architecture. However, for the purpose of the Play API Service, we tried to tease apart what is the Type 1 decision in this and what is a Type 2 decision. And we decided to make both the incoming IO and the outgoing IO non-blocking. However, we kept the actual request processing itself blocking, so that we can reason about it.

This solved for current use cases, but it also left room for future use cases. Certainly, in one of the business use cases, we are considering a bi-directional stream pattern. And when that use case arises, we'll be able to extend this architecture to support it.

So the Type 1 Decision here between synchronous and asynchronous is: if most of your APIs fit the request/response pattern, consider a synchronous request handler, but ensure that your IO is non-blocking. So that wraps up synchronous and asynchronous choices.
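
One possible way to picture "synchronous request handler, non-blocking IO" is sketched below in Python. This is an interpretation under stated assumptions, not the Play API implementation: outgoing calls run as coroutines on a background event loop, while the request handler waits on each result so its logic reads top to bottom. The `call_service` function and its payloads are hypothetical.

```python
import asyncio
import threading

# A single background event loop multiplexes all outgoing I/O.
io_loop = asyncio.new_event_loop()
threading.Thread(target=io_loop.run_forever, daemon=True).start()


async def call_service(name, payload):
    # Hypothetical non-blocking client call; sleep stands in for network I/O.
    await asyncio.sleep(0.01)
    return {"service": name, **payload}


def call_and_wait(name, payload, timeout=1.0):
    # Schedule the non-blocking call on the I/O loop, then wait for the
    # result in the request thread.
    future = asyncio.run_coroutine_threadsafe(call_service(name, payload), io_loop)
    return future.result(timeout=timeout)


def get_play_data(customer_id, title_id, device_id):
    # Plain sequential code: easy to follow and to debug.
    customer_info = call_and_wait("customer", {"customer_id": customer_id})
    device_info = call_and_wait("device", {"device_id": device_id})
    return call_and_wait(
        "decision",
        {
            "title_id": title_id,
            "customer_id": customer_info["customer_id"],
            "device_id": device_info["device_id"],
        },
    )
```

The handler still holds a thread per request, but the outgoing calls share one event loop instead of each client owning a blocking thread pool.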

The third Type 1 Decision is around data architecture. Whether you're breaking up a monolith into microservices or redesigning one of your microservices, please consider the data architecture as a first-class citizen, because it deserves that role. Without an intentional data architecture, data becomes its own monolith.

Let's take a look at what the situation was at Netflix here. We have multiple data sources. For example, you can have encoding profile data, deployment status data, the title data, all the localization data. And there are several services which consume these data sources. This pattern looks similar to the distributed monolith situation, because any change in the data sources instantaneously impacts all the services which are consuming this data.

From within the scope of the API service, we have a subset of these data sources which we consume. Each of these data sources is loaded in memory asynchronously as and when new data becomes available. The first thing we noticed was that a very small percentage of the data was actually getting used. So it is a very inefficient use of resources, especially for a service like the API service. Secondly, because there was an assumption that all these data sources were always available across all the services, not only was the API service using the data models from these data sources freely, but the shared libraries we were consuming were also dependent on some of these data sources. So it became really non-trivial to unwind all the use cases for the data sources. In fact, we didn't even know which data sources were necessary to run the API service.

The third observation was that whenever there was a data update, we could see a correlated degradation in performance. For instance, in this particular graph, you can see that whenever there was a data update, there was an increase in CPU utilization. It also meant increasing GC pressure and increasing latencies. So if somebody comes and asks you, "What's the performance characteristic of your API service?", we are unable to say that this is our steady-state performance, because it kept varying depending on when there was a data update. And finally, some of the data updates can be catastrophic. Most of our deployments are immutable, tested, and [inaudible 00:29:11]. However, these data updates happen asynchronously, and in this particular situation, our API service stopped for almost 40 minutes.

So we debated and discussed what we wanted to do with these data sources. Without changing too much of the architecture of the data services themselves, we wanted to isolate the Play API Service from the data which it is consuming. There is the classic adage in computer science, right? "All problems in computer science can be solved by another level of indirection." So that is what we employed here. We created a service called the Data Loader Service. Instead of the Play API Service consuming these data sources directly, we let the Data Loader Service consume all these data sources and all their refreshes. Whenever there is a new refresh of any of the data sources, we compute the data which is necessary for the Play API Service, convert the original data source into a materialized view containing only what the Play API Service needs, and store that in the Data Store.

Separately, we also created an abstraction layer for the Data Store, which we call the Data Service. It is a highly available, highly cached, high-throughput service, which enables us to fetch all the data necessary to provide different business functions from within the Play API Service. Let me quickly go over the benefits of this, right? The Play API Service uses only the data it needs. It doesn't load any data in memory. And because it doesn't load any data in memory, we have very predictable operational characteristics. A nice side effect of this is that the number of dependencies which we need to assemble within the Play API Service was also significantly reduced.
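
The Data Loader, Data Store, and Data Service arrangement described above can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the record fields, the in-memory `DataStore`, and the choice of which fields the Play API needs are hypothetical and do not reflect the actual Netflix data models.

```python
from typing import Dict, Iterable, List

# Hypothetical subset of fields the Play API actually needs from a title record.
FIELDS_NEEDED_BY_PLAY_API = ("title_id", "encoding_profiles")


class DataStore:
    """In-memory stand-in for the store that holds the materialized view."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def put(self, key: str, row: dict) -> None:
        self._rows[key] = row

    def get(self, key: str) -> dict:
        return self._rows[key]


def load_refresh(source_rows: Iterable[dict], store: DataStore) -> None:
    """Data Loader: on each refresh, keep only what the Play API needs."""
    for row in source_rows:
        view = {field: row[field] for field in FIELDS_NEEDED_BY_PLAY_API}
        store.put(row["title_id"], view)


class DataService:
    """Abstraction the Play API calls instead of loading sources in memory."""

    def __init__(self, store: DataStore) -> None:
        self._store = store

    def encoding_profiles(self, title_id: str) -> List[str]:
        return self._store.get(title_id)["encoding_profiles"]


# Example refresh from a hypothetical title data source.
title_source = [
    {
        "title_id": "t1",
        "synopsis": "long text the Play API never reads",
        "artwork": "also unused by playback",
        "encoding_profiles": ["h264-1080p", "h264-720p"],
    }
]
store = DataStore()
load_refresh(title_source, store)
data_service = DataService(store)
```

In this sketch, a refresh of the source only touches the loader and the store; the calling service just queries `data_service` for the field it needs.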

So if you think building such a big indirection architecture for your data is overkill for your use case, at least consider building an abstraction or anti-corruption layer, so that whenever the need arises, you can move the data source out of the service into its own separate service. So the Type 1 decision here is that for data architecture, isolate data from the service; and if you don't want to isolate the data, at least ensure that there is a layer of abstraction.
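
A lightweight abstraction layer of the kind suggested here might look like the following Python sketch. The interface and class names are hypothetical; the idea it illustrates is that callers depend on a small interface, so an in-process data source can later be swapped for a remote one without touching the calling code.

```python
from typing import Callable, Dict, List, Protocol


class EncodingProfileSource(Protocol):
    """The only view of the data that calling code is allowed to depend on."""

    def profiles_for(self, title_id: str) -> List[str]: ...


class InMemoryEncodingProfiles:
    """Today: data loaded inside the service process."""

    def __init__(self, raw: Dict[str, List[str]]) -> None:
        self._raw = raw

    def profiles_for(self, title_id: str) -> List[str]:
        return list(self._raw.get(title_id, []))


class RemoteEncodingProfiles:
    """Later: the same interface backed by a separate data service."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        # `fetch` is a hypothetical stand-in for a network client call.
        self._fetch = fetch

    def profiles_for(self, title_id: str) -> List[str]:
        return list(self._fetch(title_id))


def best_profile(source: EncodingProfileSource, title_id: str) -> str:
    # Calling code sees only the interface, never the raw data model.
    return source.profiles_for(title_id)[0]
```

For example, `best_profile(InMemoryEncodingProfiles({"t1": ["h264-1080p"]}), "t1")` and the same call with a `RemoteEncodingProfiles` instance go through identical calling code.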

So that brings us to the close of all three Type 1 decisions which we think are necessary in order to build our architecture afresh. For Type 2 decisions, we suggest you choose a path, experiment, and iterate. It's simple because the decisions are not consequential. For instance, within the Play API, we had maybe 20 or so APIs to implement. We implemented one API. We figured out the integration pattern with the clients. We figured out a migration strategy. We figured out a shadow testing strategy, and then we learned from that and moved on to the other APIs.

So the guiding principle here is to identify what makes a Type 1 versus a Type 2 decision for your use case, and to spend 80% of your time debating and aligning on the Type 1 choices.


And the third part of my talk is around evolvability. How many of you guys have heard the term Evolutionary Architecture before? It's 20%, 25%. So Evolutionary Architecture is a term coined by Rebecca Parsons and Neal Ford from ThoughtWorks. They define it as such: an Evolutionary Architecture supports guided and incremental change as a first principle along multiple dimensions. There are three key ideas here. First, it is “designed for change”. Second, “every change is guided”. And most importantly, we should be able to evolve across multiple dimensions.

Let's start with the multiple dimensions piece first. Choosing a microservices architecture with appropriate coupling, and I think appropriate coupling is the emphasis here, allows us to evolve across multiple dimensions. The Play API Service can evolve independently of the Play Decision Service, which can evolve independently of the customer info service and the device info service. Each of the services should be able to evolve independently without impacting other services in the ecosystem too much.

With respect to change, let's understand how evolvable the Type 1 decisions are. As I mentioned before, if you wanted to try a completely asynchronous architecture, the current architecture is in a much better place than the previous architecture. Because, as I mentioned, all we would need to do is adopt an asynchronous framework and build observability tools around it.

Same with respect to polyglot services. At the very least, we are already able to accept requests from a non-Java service and talk to a non-Java service. We are much better suited for developing bi-directional APIs, and if any additional data sources come into play in order to enable a new business function, we are able to incorporate those effectively as well. In some sense, these are our known unknowns. We already designed our architecture with these in mind.

So we ensured that our architecture was extendable along these dimensions. However, there might be some future potential Type 1 decisions which may come in the next three to five years. At least within Netflix, some teams are seriously considering containers and have deployed a lot of their services in them. And serverless is something which we are probably dabbling with a little bit. Once we become really serious about these two, I'm sure it will tell us how evolvable our services are. So these are, in some sense, our unknown unknowns. We fully expect that, and only time will tell how our architecture is able to evolve with those choices.

So typically, when you start an architecture fresh and you deploy it, for the first few months things are looking rosy and nice and fine. But as new business use cases come in, complexity usually creeps in, often at the cost of the original principles which guided the architecture. So the key question to ask is: as we evolve, how do we ensure that we are not breaking our original goals? Or, if we are breaking one, it has to be an intentional choice to break that original goal. This is where fitness functions come into play. And this is what Evolutionary Architecture also suggests with respect to guided change.

As part of every architecture's goals, we typically have the usual suspects: we want it to be highly available, we want it to be low latency, we want it to be reliable and resilient. At Netflix we also care about observability, simplicity, and developer productivity. Sure, these goals themselves are interesting, but what we want is the relative importance of one goal with respect to another. For instance, this is a fitness function for our Play API Service. It ranks each of our goals by its importance relative to the others. A quick note of caution here: the fitness function for your service might be totally different, and it should be tailored for your particular business use case.

Let's go over a couple of choices here, and why we ranked one goal higher than the other. For example, we chose simplicity over reliability. As we discussed earlier in the talk, we had a Play Decision Service, and usually, in order to improve the reliability of a service, we allow for fallbacks to happen. And when the fallback happens, it increases the operational complexity of the calling service, especially if the fallback is talking to a totally different service in order to serve the fallback, or if the fallback logic itself is CPU intensive. So if the choice was between having a heavy fallback to improve reliability and keeping the calling service simple, we want to choose simplicity.

Similarly, we prefer scalability over throughput. In order to increase the throughput for any service, usually it includes some form of caching, right? In one particular case, what we noticed was that if we just introduced an in-memory caching solution, we were able to increase the throughput by 50%. However, in a dire situation where we want to quickly scale out the horizontally scalable services because there is an increase in request load, this meant that we would have to invest in some form of cache warming solution before we could allow the new instances to take traffic. So the throughput advantage came at the cost of scalability. We decided that we prefer scalability, and we decided not to go ahead with that solution.

And finally, let's quickly touch upon why we prefer observability over latency. If you design a fully asynchronous solution, then because of the advantages of maximally parallelizing all the requests, it typically results in lower latencies. However, if you have to reason about what happened during a request's workflow, you need to build separate observability tools in order to do day-to-day functions like debugging a particular issue. So when it comes to observability and latency, if it has to be a choice between the two, we prefer observability.
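To make the three trade-offs above concrete, here is a minimal sketch of how such a ranking could be written down and used to screen a proposal. It is not from the talk: the `PREFERENCES` table, the `preferred` and `accept` functions, and the idea of describing a proposal by the goals it improves and degrades are all assumptions made for illustration.

```python
from typing import Iterable, Optional

# Hypothetical encoding of the pairwise preferences described in the talk.
# Each tuple reads (preferred_goal, goal_it_outranks).
PREFERENCES = [
    ("simplicity", "reliability"),
    ("scalability", "throughput"),
    ("observability", "latency"),
]


def preferred(goal_a: str, goal_b: str) -> Optional[str]:
    """Return the goal that wins the trade-off, or None if no preference is recorded."""
    for winner, loser in PREFERENCES:
        if {goal_a, goal_b} == {winner, loser}:
            return winner
    return None


def accept(improves: Iterable[str], degrades: Iterable[str]) -> bool:
    """Reject a proposal if it degrades a goal that outranks a goal it improves."""
    degraded = list(degrades)
    for gained in improves:
        for lost in degraded:
            if preferred(gained, lost) == lost:
                return False
    return True


# The in-memory cache example: better throughput, worse scalability.
print(accept(improves=["throughput"], degrades=["scalability"]))
```

Under these assumptions the cache proposal is rejected, which matches the decision described in the talk. A real fitness function would likely be a full ordering of goals rather than isolated pairs.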

So those are the fitness functions which guide us for any intentional changes. But then, in order to ensure that our service is not degrading due to any unintentional change, we have a separate set of fitness functions. These typically take the form of alerts, metrics, monitoring, or in some cases tests. For instance, we have alerts for our availability and latency SLAs. In order to ensure that we are only taking thin clients into the Play API Service, we have written tests to ensure that the dependency tree only contains dependencies which we've already whitelisted. We also always write tests to ensure that, for any non-critical service communication, if that communication fails, it doesn't bring down the API service. And for merge-to-deploy time, we usually have a monitoring dashboard. We keep track of how much the merge-to-deploy time is increasing or decreasing over time. So the guiding principle here is: define fitness functions to act as your guide for architectural evolution.
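Two of the test-style fitness functions mentioned above can be sketched as follows. This is an illustration only, not Netflix's test code: the contents of `ALLOWED_DEPENDENCIES`, the dependency names, and the helpers `unapproved_dependencies`, `call_noncritical`, and `failing_noncritical_call` are all hypothetical, and the dependency tree is assumed to be already flattened into a list of names.

```python
# Hypothetical allow-list; the talk does not name the approved dependencies.
ALLOWED_DEPENDENCIES = {"grpc-client", "metrics-core", "logging-api"}


def unapproved_dependencies(dependency_tree):
    """Return, sorted, every dependency that is not on the allow-list."""
    return sorted(set(dependency_tree) - ALLOWED_DEPENDENCIES)


def call_noncritical(operation, default):
    """Run a non-critical call; on any failure, return the default instead of raising."""
    try:
        return operation()
    except Exception:
        return default


def failing_noncritical_call():
    """Stand-in for a non-critical downstream service that is unavailable."""
    raise ConnectionError("non-critical service unavailable")


# Fitness function 1: the build should fail if a non-approved client sneaks in.
assert unapproved_dependencies(["grpc-client", "metrics-core"]) == []

# Fitness function 2: a failing non-critical call must not fail the request.
assert call_noncritical(failing_noncritical_call, default=[]) == []
```

In practice the first check would read the resolved dependency tree from the build tool, and the second would inject a failure into the real client. The point of both is that a violation shows up as a failing test rather than as a production incident.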

So in terms of all the different attributes which we talked about, this is how the previous architecture stacks up against the current architecture. The current architecture has a singular identity, operational isolation, and limited or no binary coupling. It allows for asynchronous communication. It enables us to go beyond Java, and it has an explicit data architecture. And we have a set of fitness functions guiding our evolution.

Coming back to the initial graph which I showed you at the start of the talk: with respect to the high availability goal, I'm happy to say that in the one year since its inception, we have not had a single incident in which the Play API Service was a direct or indirect contributor. And against our goal of five deployments per week, we are averaging around 4.5. That means on almost all days, during business hours, we are shipping. And we have had to do just two rollbacks, which did not relate to a customer-facing issue but more to a data quality issue.

So to summarize, I would encourage you guys to think about building an Evolutionary Architecture. Build a strong domain-specific identity, and iterate on that identity, so that you can always keep it in the back of your mind while you're building your architecture. Invest in a Type 1 and Type 2 decisions framework. Determine what constitutes a Type 1 decision for you and spend 80% of your time debating and aligning on those choices. And finally, ensure that your architecture is evolvable across multiple dimensions, and use fitness functions to act as your guide. That's all I've got. Thank you very much.



Recorded at:

Dec 12, 2018