Airbnb at Scale: From Monolith to Microservices

Summary

Selina Liu discusses what it takes to decompose a large and complex monolith into independent, performant services, and how Airbnb continues to evolve and scale the new architecture.

Bio

Selina Liu is a senior software engineer at Airbnb, the world’s largest platform for accommodation-sharing and unique travel experiences. She’s passionate about building performant and resilient services that scale and evolve well with Airbnb’s growing business needs.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Liu: I live in the beautiful city of San Francisco. One thing to know about the city is it's very hilly. Even after three years of living in the city, I still run into this issue where every time I'm trying to get from point A to point B in the city, whether it's for dinner or for a walk, I'll look at Google Maps and be like, that looks easy. When I actually get on my way, at some point, I will realize that up ahead is a hill with a non-trivial incline. That happens a lot in SF because it's hilly in so many places. When I think about it, this uphill journey resembles Airbnb's very own journey to service oriented architecture. In a sense that we started out on this flat and straightforward path of migrating to SOA, then realized that there's a surprisingly steep and interesting learning curve beyond that. I will share with you four key lessons that we have learned along the way.

Background

My name is Selina. I'm a software engineer at Airbnb.

Lessons from Airbnb's Journey to SOA - Invest In Common Infra Early

Here are the four key lessons from our journey to SOA and beyond. Let's dive straight into the first one, which is to invest in common infrastructure early in your SOA journey, because this will help to turbocharge your migration from monolith to SOA. To give you more context, let's spend some time first on why and how we went about our migration. Back in 2008, just like most small scrappy startups, Airbnb started out with a single web application written using the Ruby on Rails framework with a model view controller paradigm. Over the years, as Airbnb expanded its global footprint, product offering, and engineering team, the single web app grew into a massive bloated monolith. Since it was developed using the Ruby on Rails framework, internally, we call it the monorail. As more engineers jumped onto the monorail, it was becoming slow, unstable, hard to maintain, and a single point of failure. In 2018, we launched a company-wide initiative to migrate our tech stack from this slow moving monorail to service oriented architecture. Similar to the MVC, model view controller paradigm, we decided to decompose the monolith into, first of all, presentation services that render data in a user friendly format for frontend clients to consume. Then we also have mid-tier services that are grouped around shared business concerns and can be used in multiple contexts. Then, under those, there are data services that encapsulate multiple databases in a uniform data layer.

A Case Study - Host Reservation Details (HRD)

To make this abstract diagram a little bit more concrete, let's use Host Reservation Details, or HRD, as a case study. Before I explain what HRD is, I want you to meet Maria. Maria is a new Airbnb host based in Colombia. She just published her listing on Airbnb last week, and today, she received her first ever booking request. When she opens her host inbox on the Airbnb website, she sees a message from the guest, Christopher, and this panel on the right here that displays key information about the trip being requested by the guest. This right panel here is the HRD or the Host Reservation Details. Taking a better look at it, we see that it has these call to action buttons that prompt the host to either accept or decline the booking request. It also pulls information from a myriad of sources in our system, including users, listings, payments, and reviews. In fact, there's a lot more information that it offers below the fold as well.

To migrate this product feature to our SOA paradigm, we basically broke down the product logic into, first of all, a presentation service that handles view layer business logic such as formatting and translations. Then we have a few mid-tier services that handle write operations such as confirming or declining a booking request. Then we have data services down the stack that encapsulate different product entities such as reservations, users, and listings. Of course, there are many other services on each layer, and we interact with many of them every time we serve a request.

Let's call our reservation presentation service Ramen, because, who doesn't like Ramen? Going back to this diagram, we do interact with a lot of services every time we serve just a single product surface like HRD. You can imagine that to serve all the product surfaces we own, we need a pretty sizable number of SOA microservices to power them. What made this monumental task of migrating all the complex logic and building out our vast SOA ecosystem possible are the building blocks that our infra team provided that allowed us to build services with confidence and speed. Let me walk you through some of them.

API Framework

First, we have this in-house API framework built on the Thrift language. It is used by all Airbnb services to define clean APIs and to talk to each other. Let's say that as part of our business logic, Ramen has to read data from the Tofu service. To serve that data, the Tofu engineer only has to define the endpoint in simple Thrift language, and the framework will auto-generate two things. First, the endpoint logic in Tofu that handles stuff like schema validation and observability metrics. Secondly, a multi-threaded RPC client that the Ramen service can use to talk to the Tofu service, which handles stuff like error propagation, circuit breaking, and request retries. This means that engineers can focus on implementing the core business logic and not spend time worrying about the plumbing details of inter-service communication. Based on the framework, we also developed productivity tools such as this API Explorer, where engineers can browse different services, figure out which endpoint to call, and even use an API playground like this one to figure out how to call the endpoints.
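To make the retry and circuit-breaking behavior of such a client concrete, here is a minimal Python sketch. It is not Airbnb's framework, which generates Thrift-based clients; the ResilientClient class, its parameters, and the thresholds are hypothetical, and every error is treated as retryable for simplicity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being sent."""


class ResilientClient:
    """Hypothetical wrapper that retries failed calls and trips a circuit breaker."""

    def __init__(self, call, max_retries=2, failure_threshold=3, reset_after=30.0):
        self._call = call                        # function performing the remote call
        self._max_retries = max_retries
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after          # seconds the breaker stays open
        self._consecutive_failures = 0
        self._opened_at = None

    def request(self, *args):
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; not calling downstream")
            self._opened_at = None               # cool-down elapsed: allow a trial call
        last_error = None
        for _ in range(self._max_retries + 1):
            try:
                result = self._call(*args)
            except Exception as error:           # sketch: every error counts as retryable
                last_error = error
                self._consecutive_failures += 1
                if self._consecutive_failures >= self._failure_threshold:
                    self._opened_at = time.monotonic()   # trip the breaker
                    break
            else:
                self._consecutive_failures = 0   # any success resets the failure count
                return result
        raise last_error                         # propagate the last downstream error
```

A caller would wrap the function that talks to the downstream service once and then call request() in place of the raw call. Once the breaker trips, further calls fail fast until the cool-down has passed, which protects a struggling downstream service from extra load.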

Powergrid

Second, we have Powergrid, which is a Java library built at Airbnb that makes it super easy to run code in parallel. Under the hood, Powergrid helps us organize our code as a directed acyclic graph, or a DAG, where each node is a function or a task. We can use it to model each of our service endpoints as a data flow with a request as the input and the response as the output of the data flow. Because it handles multi-threading and concurrency, we can easily schedule chunks of code to run in parallel. Let's take the host sending a special offer to the guest as an example. In this write operation in Ramen service, there's a bunch of checks and validation that have to be performed before we allow the host to send a special offer. Using Powergrid, we will basically first take the listing ID from the request body, and use it to fetch information about the listing from the listing data service. Then we'll pass the information down to a bunch of downstream services for validation, which will happen in parallel. Then after that, we will aggregate all the validation responses, and make sure that we got a green light from everyone, before we write back to the special offer data service to actually send it to the guest. This Powergrid library provides a few benefits. First of all, it provides low latency by performing network IO operations concurrently, which really makes a difference when your endpoint has multiple downstream dependencies. It also offers granular metrics for each node, which helps us to pinpoint bottlenecks in the data flow. These benefits help to ensure that our code is performant, observable, and easy to reason about.
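The special offer data flow can be sketched as a small graph of tasks. Powergrid itself is a Java library whose API is not shown in the talk, so the sketch below swaps in Python's asyncio as a stand-in; all function names and the stubbed service responses are hypothetical.

```python
import asyncio


async def fetch_listing(listing_id):
    # Stand-in for a call to the listing data service.
    await asyncio.sleep(0.01)
    return {"id": listing_id, "active": True, "host_id": 7}


async def check_availability(listing):
    # Stand-in for one downstream validation service.
    await asyncio.sleep(0.01)
    return listing["active"]


async def check_host_standing(listing):
    # Stand-in for another downstream validation service.
    await asyncio.sleep(0.01)
    return listing["host_id"] > 0


async def write_special_offer(listing, price):
    # Stand-in for the write to the special offer data service.
    await asyncio.sleep(0.01)
    return {"listing_id": listing["id"], "price": price, "sent": True}


async def send_special_offer(request):
    # Node 1: fetch the listing named in the request body.
    listing = await fetch_listing(request["listing_id"])
    # Validation nodes: independent of each other, so they run concurrently.
    results = await asyncio.gather(
        check_availability(listing),
        check_host_standing(listing),
    )
    # Aggregation node: every validation must give a green light.
    if not all(results):
        return {"sent": False}
    # Final node: only now write the special offer.
    return await write_special_offer(listing, request["price"])
```

Running asyncio.run(send_special_offer({"listing_id": 42, "price": 100})) walks the graph once. Because the two validations overlap, the total latency is roughly three network round trips rather than four.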

OneTouch

Third, we have OneTouch, which is a framework and a toolkit built on top of Kubernetes that allows us to manage our services transparently and to deploy to different environments efficiently. This framework has two key aspects. The first part is the principle that all service configurations should live in one place in Git. For instance, all the configurations for the Ramen service will live in an infrastructure folder that lives right alongside the source code folder. From there, we can easily configure stuff like CPU resources, deploy environments, alerts and service dependencies. Then the second aspect to the OneTouch framework is this magical k tool, which is a command line tool built on top of Kubernetes that allows us to deploy our services to different environments on the Kubernetes clusters. If I just type k all on a command line, it will automatically generate the configs, build the app, and deploy it to a remote cluster. If you think about it, it's actually just like making a bowl of ramen where first you generate the bowl, which is the configs. Then you build the ramen, which is the actual application. Then you deploy the entire bowl of ramen with garnishes, and that is the end product. Because all environments are deployed the same way, we can easily make five beautiful bowls of ramen soup for different environments. From the service governance perspective, it makes it super easy for everyone to orchestrate, deploy, and diagnose a service, because there's only one place to look and one process to learn.

Spinnaker

Lastly, we have Spinnaker, which is an open source continuous delivery platform that we use to deploy our services. It provides safe and repeatable workflows for deploying changes to production. One aspect that has been especially helpful for us is the automated Canary analysis. In this step of the deploy pipeline, we basically deploy both the old and the new snapshots to two temporary environments, namely, Baseline and Canary. Then we route a small percentage of production traffic to them. Then, key metrics about these environments, such as error rates, are automatically ingested and fed into a statistical test that produces an aggregate score for the Canary, which has the new snapshot, as measured against Baseline, which has the old snapshot. Based on the score, the analysis tool will decide whether to fail or promote the Canary environment to the next stage in the deploy pipeline, which, for example, could be deploying to production or running more integration tests. In SOA, where so many services are deployed every single day, this helps us to release code changes at scale with confidence and ease.
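The score-then-decide step can be illustrated with a deliberately simplified Python sketch. It does not reproduce Spinnaker's analysis: in place of a real statistical test it compares metric means against a relative tolerance, and the function names, the 10% tolerance, and the pass threshold of 75 are all assumptions.

```python
from statistics import mean


def canary_score(baseline, canary, tolerance=0.10):
    """Return the percentage of metrics on which the canary is not worse than baseline.

    Both arguments map a metric name to a list of samples where lower is better
    (for example error rate). A metric passes when the canary mean is within
    `tolerance` (relative) of the baseline mean.
    """
    passed = 0
    for name, baseline_samples in baseline.items():
        allowed = mean(baseline_samples) * (1 + tolerance)
        if mean(canary[name]) <= allowed:
            passed += 1
    return 100.0 * passed / len(baseline)


def decide(score, pass_threshold=75.0):
    """Promote the canary to the next pipeline stage, or fail it."""
    return "promote" if score >= pass_threshold else "fail"
```

With a baseline of {"error_rate": [0.01, 0.012, 0.011], "latency_ms": [100, 110, 105]}, a canary whose error rate jumps to around 0.05 passes only the latency metric, scores 50, and is failed.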

Thanks to all these infra pieces, we were able to migrate our core product functionality to SOA within the span of two to three years, with an architecture like this. Of course, there are many other services that are not shown in this diagram. In total, we have more than 400 services running in production with more than 1000 deploys every single day.

What Does the Post-Migration World Look Like?

After all this work, you could think that we can now take a long nap in front of the computer just like this cat. We are not done. In fact, we haven't started climbing the metaphorical hill that I mentioned in the beginning. What we realize is that with a sprawling architecture like ours, there are some important SOA tradeoffs that we need to consider before we continue walking in the same direction. For instance, it could sometimes take more time for engineers to ship a feature with SOA, because engineers need to acquaint themselves with and make changes in multiple services. Then they also have to deploy these different services in order before they can push out a feature. There's also significant overhead in maintaining each of these services. On one hand, our system has become more reliable and highly available, because if one service is down, other parts of the SOA can still function. On the other hand, the proliferation of services along the stack also led to more developer friction. Furthermore, even though now our services are loosely coupled, it means that certain patterns of logic have to be repeated across different services, since it's now harder to share code across service boundaries than before.

To continue the comparison, on one hand, our services are now individually scalable, which means that we can fine-tune our resource allocation by service instead of scaling the entire monolith. On the other hand, we face fragmentation and inconsistency in business logic, which makes it harder for us to have a full picture of what's going on in the system. Lastly, on one hand, we achieved business agility by separating different parts of the product into different services that can basically iterate in parallel at the same time. On the other hand, it's also easy to end up with a complicated dependency graph if there's no rules around who can call who. That's what happened to us where due to unconstrained call patterns between services where basically anyone can call anyone else, we have this dependency graph with unruly arrows pointing in every single direction. This is unideal and potentially dangerous, especially when there are circular dependencies, which can make relationships between services hard to visualize and understand. On a good day, this could make it hard for a new engineer to ramp up on the system. On a bad day, when there's an outage, this could make it difficult to debug errors and to mitigate user impact. In a tangled dependency graph like this, highly stable services can also be easily brought down by more volatile services. These are all flip sides of the same SOA coin that we have to grapple with.

Simplify Service Dependencies

To address these issues, we took a few steps, starting with simplifying service dependencies, which is our second takeaway. We designed our architecture to be a tiered tech stack consisting of presentation, mid-tier, and data services. The motivation was to separate services into layers as shown in this diagram, based on their technical priorities. As we go up the stack towards application and UI layers, the primary consideration here is iteration speed and schema flexibility, leading to more specific and fast changing API. On the other hand, if we go down the stack towards platform and infra layers, they need to have more generalized API and higher reliability requirements, since their blast radius is bigger, which means that if they go down, many other services will be impacted. For an SOA to be reliable and resilient, it's imperative that stable services do not depend on more volatile services. Conceptually speaking, this means that a higher tier service can call a lower tier, but not vice versa.
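The rule that a higher tier may call a lower tier but not the reverse is easy to check mechanically. The Python sketch below assumes a hypothetical ranking of the three tiers and a list of observed (caller, callee) pairs; the service names other than Ramen are made up, and same-tier calls are allowed here only because the talk does not say otherwise.

```python
# Hypothetical ranks: a larger number means higher up the stack.
TIER_RANK = {"data": 0, "mid_tier": 1, "presentation": 2}


def upward_calls(tier_of, calls):
    """Return the calls in which a lower-tier service calls a higher-tier one."""
    return [
        (caller, callee)
        for caller, callee in calls
        if TIER_RANK[tier_of[caller]] < TIER_RANK[tier_of[callee]]
    ]


tier_of = {
    "ramen": "presentation",
    "booking-writes": "mid_tier",
    "reservation-data": "data",
}
calls = [
    ("ramen", "booking-writes"),         # presentation -> mid-tier: allowed
    ("booking-writes", "reservation-data"),  # mid-tier -> data: allowed
    ("reservation-data", "ramen"),       # data -> presentation: violation
]
violations = upward_calls(tier_of, calls)
```

Here violations contains only the last pair, where a stable data service has come to depend on a more volatile presentation service.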

However, the problem with our SOA system was that there was not enough service governance and dependency management to enforce this fundamental principle and to restrict who can call who. Hence, to enforce a topology-driven layer architecture, we introduce service blocks at a platform layer. You can think of each block as a collection of services with related business functionalities. For instance, the listing block, as shown here, will encapsulate both the data and business logic that inform core listing attributes. Then it will expose a simple, consistent read/write API to upstream clients through the facade. Under the hood, the facade will help to orchestrate the coordination between the underlying data and business logic services, while providing a layer of abstraction that conceals the underlying complexity from the clients. We also enforce a strict topology by prohibiting clients from calling any of the internal services, as well as prohibiting blocks from having circular dependencies with each other. With such a high level abstraction, it's much easier for upstream clients to discover and leverage our core functionality at the platform layer. It is also much easier for us to manage block dependencies and maintain high levels of stability.
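Both topology rules, no calls from outside a block into its internal services and no circular dependencies between blocks, can be expressed as simple graph checks. This Python sketch is an illustration only: the block layout, the service names, and the two helper functions are hypothetical, not Airbnb's tooling.

```python
# Hypothetical block definitions: one facade plus internal services per block.
BLOCKS = {
    "listing": {"facade": "listing-facade", "internal": {"listing-data", "listing-logic"}},
    "user": {"facade": "user-facade", "internal": {"user-data"}},
}


def facade_violations(calls, blocks=BLOCKS):
    """Return calls that reach a block's internal service from outside that block."""
    owner = {}
    for name, block in blocks.items():
        for service in block["internal"] | {block["facade"]}:
            owner[service] = name
    internal = {svc for block in blocks.values() for svc in block["internal"]}
    return [
        (caller, callee)
        for caller, callee in calls
        if callee in internal and owner.get(caller) != owner[callee]
    ]


def find_cycle(dependencies):
    """Return one dependency cycle as a list of block names, or None if acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:                     # back edge: a cycle is closed
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for callee in dependencies.get(node, ()):
            cycle = visit(callee)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for node in dependencies:
        cycle = visit(node)
        if cycle:
            return cycle
    return None
```

Run as a build or deploy-time check, facade_violations would flag a client such as Ramen calling listing-data directly, and find_cycle would flag two blocks that depend on each other.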

Platformize Data Hydration

We also spent some time platformizing data hydration. Going back to this diagram again, you can see that we have quite a number of presentation services. If we zoom into any of these typical presentation services, there are usually three common and straightforward functions that these services perform. The first is data fetching and hydration from different downstream services. For example, Ramen service calls many services to hydrate data for Host Reservation Details. Second, there's also a simple transformation of data. Using the same example, Ramen service will transform and merge data from all these downstream services into something that the client expects. Then third, there are also permission checks that we have to perform before proceeding with more complex business logic.

As time went on, we realized that engineers were spending a lot of time on these three common functions, resulting in a lot of duplicated code and wasted productivity. Our approach to this problem is to introduce a platformized data access layer that provides a single consolidated GraphQL schema, stitching together different entities across all of Airbnb's data, for instance, listings, users, and reservations. Then, it also serves as one platform to host all the mundane data fetching and hydration logic rather than requiring duplication of this logic across many different presentation services. Together with a more complex presentation logic on the left and the write logic on the right, which we will ignore for this presentation, this common data access layer will eventually replace our old presentation services. Then the service blocks below the layer will also replace the old data and services. You can say that with this effort, we continue to simplify service dependencies.

Going back to the layer itself, it is a service that we call Viaduct, because it's like a conduit for the data flowing through our entire system. In essence, it is an enhanced GraphQL engine that's built in the language of Kotlin, that reimagines the way data is fetched in SOA, by going from service oriented to data oriented hydration. What do I mean by that? To give you a more concrete example, instead of writing code to explicitly call reservation data service to get reservation data, the caller will instead write a declarative query on the reservation entity. They can even fetch the associated listing and guest user data using the same query. Such GraphQL queries are made possible by our GraphQL schema that's enriched with special directives. For example, the ServicePoweredType annotation with its templated fields allows us to associate a GraphQL type with a service endpoint, where the response from the service will be automatically used to hydrate the GraphQL type. Here you can see that we are linking the ReservationDataService endpoint, getReservationsByIds, to the reservation GraphQL type.

As another example, the ServicePoweredType key annotation allows us to link different types together. For instance, the guestId on the reservation type can link to a fully-fledged user type that is implicitly fetched from the user block service. How the magic happens is that at build time, Viaduct translates these templated directives into auto-generated code, as represented by the field and type resolvers in this diagram, that basically takes care of the inter-service data fetching and simple data transformations. A resolver in GraphQL terminology is basically a function that outputs the value of a given field or type. In this diagram, the third-party GraphQL library is responsible for first parsing the incoming query, and then calling the resolvers for every field that we can process. In turn, the field resolvers will call the type resolvers to fetch data from downstream services. Aside from these directives, there's also a privacy directive that wires in permission checks automatically, and an ownership directive that makes it easy for us to find code reviewers and to route alerts to the right teams for all these different types at Airbnb.
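What the generated type and field resolvers do can be approximated in a few lines of Python. This is a toy model, not Viaduct: the two lookup tables play the role of the directives, queries are nested dicts instead of GraphQL text, and apart from getReservationsByIds every name and all of the data are invented.

```python
def get_reservations_by_ids(ids):
    # Stand-in for the ReservationDataService endpoint getReservationsByIds.
    data = {"r1": {"id": "r1", "guestId": "u9", "status": "ACCEPTED"}}
    return {i: data[i] for i in ids}


def get_users_by_ids(ids):
    # Stand-in for an endpoint on the user block service.
    data = {"u9": {"id": "u9", "name": "Christopher"}}
    return {i: data[i] for i in ids}


# Plays the role of the type-to-endpoint directive.
TYPE_ENDPOINTS = {"Reservation": get_reservations_by_ids, "User": get_users_by_ids}
# Plays the role of the key directive: (type, field) -> (key field, linked type).
LINKS = {("Reservation", "guest"): ("guestId", "User")}


def resolve(type_name, entity_id, selection):
    """Hydrate the selected fields of one entity, following links to other types."""
    record = TYPE_ENDPOINTS[type_name]([entity_id])[entity_id]     # type resolver
    result = {}
    for field, sub_selection in selection.items():                 # field resolvers
        if (type_name, field) in LINKS:
            key_field, target_type = LINKS[(type_name, field)]
            result[field] = resolve(target_type, record[key_field], sub_selection)
        else:
            result[field] = record[field]
    return result


query = {"status": None, "guest": {"name": None}}
reservation = resolve("Reservation", "r1", query)
```

The caller asks only for the reservation's status and the guest's name. The hop from guestId to the user service happens inside resolve, which is the shift from service oriented to data oriented hydration.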

Another cool directive is the derivedField directive, which allows developers to create new fields computed from one or more existing fields. For example, this HostFacingStatus field is used in the Host Reservation Details page to call out important stages or milestones of a reservation to a host. For instance, it can take on the string values of pending payment, checking out today, review guests. To compute this status, we need to consult listing, reviews, and payment data, which will come from other parts of the GraphQL schema. With the derivedField annotation, we can tell Viaduct to resolve this derivedField using the business logic defined in the HostFacingStatusProvider class.

Going into this class, this provider class basically implements a standardized interface, and overrides two key methods. The first one is this getFragments method that allows us to define data dependencies for this DerivedField in terms of a GraphQL query. You can see here that we are fetching data from listings, reviews, and payment. Then, second, we have this resolve method, where we will write the actual business logic for resolving the DerivedField. What actually happens when a DerivedField gets resolved is that it makes additional field resolver calls, it can be multiple calls, to satisfy its data requirements. In effect, it creates its own GraphQL engine, which can instantiate field resolvers the same way that the other fields do.
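The shape of such a provider, one method declaring data dependencies and one computing the value, might look like the following Python analogue. The real class is Kotlin and is not shown in the transcript, so the fragment's field names, the payment states, and the status rules below are all invented for illustration; only the three status strings come from the talk.

```python
from datetime import date


class HostFacingStatusProvider:
    """Hypothetical analogue of a derived-field provider."""

    def get_fragments(self):
        # Data dependencies, declared as a GraphQL fragment (field names are made up).
        return """
        fragment HostFacingStatusDeps on Reservation {
          checkoutDate
          payment { status }
          reviews { hostHasReviewedGuest }
        }
        """

    def resolve(self, data, today):
        # `data` holds the already-hydrated fields requested by get_fragments().
        checkout = data["checkoutDate"]          # ISO date string, e.g. "2024-05-02"
        if data["payment"]["status"] != "PAID":
            return "pending payment"
        if checkout == today.isoformat():
            return "checking out today"
        # ISO date strings sort chronologically, so string comparison is safe here.
        if checkout < today.isoformat() and not data["reviews"]["hostHasReviewedGuest"]:
            return "review guests"
        return None
```

The point of the split is that resolve never fetches anything itself. It receives exactly the data named in the fragment, so the engine stays in charge of how and when that data is loaded.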

Such an abstraction makes it easy for engineers to add all sorts of presentation logic without worrying about the nitty-gritty details of data fetching. You can imagine that when a query asks for multiple fields, especially multiple DerivedFields, it is easy to have overlapping data dependencies within the query. For instance, here you see that three field resolvers might end up needing data from the listing block service. To account for such scenarios, Viaduct has this built-in capability to batch the requests that it sends to downstream services in an optimized manner, making sure that it is not making more calls than necessary. The batching logic here is encapsulated by this dataloader wrapper around each of the type resolvers.

Optimized Batching

Under the hood, Viaduct relies heavily on coroutines in the Kotlin language, which you can basically think of as lightweight threads. Within a single Viaduct request, each of these field resolvers will trigger a coroutine to resolve its field value. Some of the field resolvers might end up calling the same dataloaders. For example, here, you can see that the first two field resolvers both call the listing dataloader. Same for the bottom two field resolvers, which call the user dataloader. It is important to highlight here that the loadById method calls here don't trigger immediate data loading, but are rather suspended, which means that they can be paused and resumed at a later time. This suspension feature is a feature of the Kotlin coroutines. Viaduct with its special coroutine dispatcher will basically wait until all the coroutines for this particular request have reached their suspension point, meaning that no more resolution is possible for that request, before dispatching the actual data loading requests to downstream services in batch. Here you can see that Viaduct has aggregated all the listing IDs in one call to the listing block service, and all the user IDs in one call to the user service respectively. Once the data comes back, Viaduct will pass the data back to the original field resolver coroutines, which will then resume with the results.

To go even one step further, we also added an intra-request cache, which basically keys on the global ID and the type of data to make sure that within the lifecycle of a single request to Viaduct, the same data doesn't get requested from its source more than once. For instance, the same listing ID, or the same user ID does not get requested more than once. These two measures, optimized batching and caching for data loading, prevent an unconstrained fan-out of requests to downstream services.
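Batching plus an intra-request cache can be sketched with Python's asyncio in place of Kotlin coroutines. This is an approximation under stated assumptions: the DataLoader class and load_by_id are hypothetical names, one loader instance stands for one data type within one request, and instead of a custom dispatcher the batch is simply scheduled to run after the resolver tasks that are already queued on the event loop.

```python
import asyncio


class DataLoader:
    """Hypothetical per-request, per-type loader that batches and caches lookups."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch   # async function: list of ids -> {id: value}
        self._cache = {}                  # id -> Future; lives for a single request
        self._queue = []                  # ids waiting for the next batch
        self._task = None                 # keeps a reference to the dispatch task

    def load_by_id(self, key):
        if key in self._cache:            # same id asked twice: reuse the same Future
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        self._queue.append(key)
        if len(self._queue) == 1:
            # Schedule one dispatch; it runs after the resolver tasks already queued.
            self._task = loop.create_task(self._dispatch())
        return future                     # the caller suspends by awaiting this

    async def _dispatch(self):
        keys, self._queue = self._queue, []
        values = await self._batch_fetch(keys)     # one downstream call per batch
        for key in keys:
            self._cache[key].set_result(values[key])


async def demo(batch_fetch):
    listings = DataLoader(batch_fetch)

    async def resolver(listing_id):
        listing = await listings.load_by_id(listing_id)
        return listing["id"]

    # Three field resolvers run concurrently; two of them need the same listing.
    first = await asyncio.gather(resolver(1), resolver(2), resolver(1))
    again = await resolver(2)             # served from the intra-request cache
    return first, again
```

In demo, the three resolvers suspend on their futures, the loader issues a single call carrying listing IDs 1 and 2, and the later lookup of listing 2 triggers no downstream call at all.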

Example - Listing Details Page

To give you a real-life example, this is the listing details page that gets used to check out a listing and book the listing. Naturally, this page depends a lot on listing data. Before, with our first iteration of SOA, the presentation service responsible for hydrating this page would end up triggering 11 calls altogether to the listing block service through our SOA maze. After migrating most of our presentation logic to Viaduct, the number of calls is reduced down to one. You can see how with batching and caching, Viaduct helps to improve performance and saves us computing costs by reducing the number of network requests needed to serve a single page. In addition, we also have an online IDE built on top of the open source GraphiQL library that makes it super easy for engineers to explore the schema, issue actual queries, and to inspect the data fetched. It is also the primary way that we use to test our local code changes, since the IDE can be easily hooked up to local and test environments as well.

All in all, a GraphQL schema that's enriched with these configuration driven directives, allows developers to easily create types, construct an entire graph representing Airbnb's central schema, and to fetch data easily without having to navigate our vast constellation of services. This means that product engineers can focus on product innovation, instead of writing the same data hydration code over again.

Unify Client-Facing API

Lastly, as we continue to evolve our SOA, we also decided to unify our client-facing API. In our original SOA diagram, each presentation service is usually maintained by a different product team. The implication of this is that each presentation service tends to define its own client-facing API and solve problems its own way. There wasn't really a common set of best practices, and people sometimes ended up reinventing the wheel. This often resulted in inconsistent if not buggy user experience and lots of spaghetti code that is not really reusable.

UI Platform

Our solution to the problem is UI Platform, which is a unified, opinionated, server-driven UI system. To quickly visualize how it works, this is what the user sees on HRD, and this is what the UI Platform sees where everything on the page is broken down into a standardized section. Our idea is for the content and styling of the UI within each of these sections to be driven by the backend. This leaves the frontend with just a thin layer of logic that's responsible for rendering these sections. In this diagram, on the presentation backend, you see that we expose a common schema to the client. You can see that each of the frontend clients has a UI Platform runtime that is responsible for interpreting the API responses from the backend, and rendering it into UI for the user.

Taking a deeper look into the standardized API response, you can see that it is broken down into two parts. First, we have a registry of all the sections that are needed for a page. Then, second, is the screen structure, which expresses the layout of the sections on the page. For instance, we can say that we want the header section to be at the top of the page, and the card section to be right below it. Zooming further into each of these sections, here we have the schema definition on the left with a concrete example on the right. The example here is the user highlight section in HRD, as indicated by the section ID. Focusing just on the key attributes, we first have section data, which represents the data model for the section. For example, here we have a list of user info, including where the user lives. Then, second, we have the UIComponentType field, which is an enum that is used to refer to the frontend UI component that should be used to present the data. In this case, we want to render the data as a bullet list. One thing to call out here is that it is possible for one section data type to be rendered by different UI component types, which gives us more flexibility and variation in UI design. More importantly, all these sections should be totally reusable across different surfaces. For example, this user highlight section here can be shared between a guest facing and a host facing surface.
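The two-part response and the thin client runtime can be mimicked with plain dictionaries. The actual schema is not reproduced in the transcript, so the key names, the component type values, and the text renderers in this Python sketch are assumptions; it only shows the idea of a section registry, a layout that references sections by ID, and a client that picks a renderer from the component type.

```python
# Hypothetical server response: a section registry plus a screen structure.
response = {
    "sections": {
        "header": {
            "sectionData": {"title": "Reservation details"},
            "uiComponentType": "TITLE",
        },
        "user_highlights": {
            "sectionData": {"items": ["Lives in San Francisco", "Joined in 2019"]},
            "uiComponentType": "BULLET_LIST",
        },
    },
    "screen": {"layout": ["header", "user_highlights"]},  # top-to-bottom order
}

# Client-side registry: one renderer per UI component type.
RENDERERS = {
    "TITLE": lambda data: [data["title"]],
    "BULLET_LIST": lambda data: ["- " + item for item in data["items"]],
}


def render(response):
    """Render sections in layout order using the component named by the server."""
    lines = []
    for section_id in response["screen"]["layout"]:
        section = response["sections"][section_id]
        renderer = RENDERERS[section["uiComponentType"]]
        lines.extend(renderer(section["sectionData"]))
    return lines


screen = render(response)
```

Because the client only maps component types to renderers, the server can reorder the layout or reuse the user highlight section on another page without any client code change.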

There are a few other key features of the UI Platform. First, we have support for different layouts and placement of sections on the page, which provides flexibility and range for product design needs. Second, with deferred sections, we can opt to delay the loading of more expensive sections to a second network request, which helps to improve the initial page load time and user experience overall. This is especially helpful for mobile clients that can have weaker internet signals than desktop web. Lastly, the framework also logs impressions and UI actions of each section automatically, which is helpful for measuring user engagement when we launch a new feature. To make the developer experience easier, we also built out a web tool that allows backend engineers to visualize their API response in real time by copy pasting their API response payload into the textbox in the tool.

In summary, by driving the UI from the backend server, we can basically establish a clear schema contract between frontend and backend, and maintain a repository of robust, reusable UI components that makes it easier for us to prototype a feature. In addition, with pre-built sections, we could also easily push new content to the clients without waiting for the long mobile app release cycles on app stores, since no client code changes are needed with pre-built sections. Lastly, by centralizing the business logic and presentation logic in the backend, instead of having it scattered across clients, we could also ensure a consistent user experience across all frontend platforms.

Where on the Spectrum?

On the spectrum of how server driven we are, we have been leaning very close to the side of fully server driven UI. Basically, after piloting the UI Platform for a few product launches, we noticed some issues. First of all, driving the presentation logic from the backend makes it hard for the different frontend clients to leverage native improvements or built-in capabilities from their native OS platforms. For instance, the navigation styles or haptic feedback on Android and iOS are done very differently, and it is hard to server drive these things differently in a centralized manner from the backend.

Second, the UI Platform makes it easy for us to reuse existing sections to build new pages. However, for a brand new section, designing the schema and building it out involves close collaboration between the client and backend engineers initially. As our company continues to invest in the ambitious product roadmap, we need to find a way to decouple these very tight-knit frontend and backend workflows to speed up our iteration. We basically decided to scale back our server-drivenness and move towards the middle of the spectrum where the UI Platform will become a thin screen building layer in our frontend architecture. In this version, the section will basically remain the primitive for screen building, but now sections can be orchestrated with either server-side logic or client logic. Meaning that some sections can remain server driven, but for the client-driven sections, the frontend will handle the UI presentation logic, while the backend will handle the business logic. This pivoting in our frontend strategy will hopefully give us more flexibility in supporting new products and give us more synergy and speed in developer collaboration.

Recap

First, we have the first lesson, which is to invest in common infrastructure early to turbocharge our initial SOA migration. In our case, we leveraged both open source and in-house technologies to lay a foundation for our SOA. Second, as you continue to expand and scale your architecture, consider simplifying your service dependency graph for long term stability. For us, we did it with service blocks at the platform layer. Third, platformize common patterns such as data hydration, so that product engineers can focus on solving new and important problems. For us, we built a Viaduct service to do the job. Lastly, unify the client-facing API into a robust system of flexible orchestration and reusable parts to support product iteration. For us, we use the UI Platform to standardize our API.

One overarching theme in the progression of these takeaways is that we continue to streamline and fine-tune our layers of abstraction based on the way we work and the way we build our product, starting from the infra layer with the common building blocks, to the platform layer with the service blocks, and then to the application and UI layer with Viaduct and UI Platform. What informed these stepwise improvements were the common pain points experienced by engineers and end users. It's true that sometimes it means that we have to undo our earlier work. That is fine. It is really hard to get everything right the first time anyway. The point is to keep evolving the architecture to improve developer experience and to serve prevailing business needs.

Conclusion

Going back to the metaphorical hill in SF again, when we set out to migrate to SOA, we were not expecting our path to include this steep climb up ahead. The lessons along the way were enriching, and the learning curve was in fact quite an exciting and fulfilling ride. We can't say for sure that we have made it to the top of the hill, but when we survey our current tech stack, we begin to see that SOA is not a fixed destination. Instead, like a real city, it is constantly changing and evolving into something more resilient and lasting.

Questions and Answers

Wells: What's the next thing you're thinking we need to make a change here to improve stuff?

Liu: At Airbnb, we are now focusing a lot on further improving developer productivity, now that we have this complex system of so many services being deployed and features being launched at the same time. Something that we are trying to speed up is the local development environment. We used to have all these staging environments, test environments, and developers used to just push their local changes to one of the test environments and issue some requests to make sure that their changes work as expected. Those test environments are static and they are maintained by the service owner.

Right now we are trying to pilot something different, which is allowing any developer working on any service to just spin up a private group of cloud applications to run their changes on, and those can spin up or wind down based on developer demand. It's like an on-demand test environment that is connected to the wider staging environment of the different services in the ecosystem. In that way, hopefully we can speed up the deploy time, because in the past, when we deployed to specific test environments, we still had to go through CI checks and stuff like that. With a more private group that has safety boundaries around it, we can make things faster and bypass some of the checks. That's something that we have been tinkering with and trying to get more adoption of within Airbnb.

Wells: You talk about service oriented architecture. I've been around long enough to remember we had service oriented architecture and we started talking about microservices. What do you see as the difference between the two? Why do you say service oriented architecture when you're talking about these changes?

Liu: I think it's just a process of how we socialize the idea within our company. I think tech leaders at that time really just used the term SOA to push the concept wider to every corner of the company. I think partially it might be because, at the start, we were wary of splitting our logic into tiny services. We wanted to avoid using the term microservices, just because it might be a little bit misleading or biased in how it's understood by most people when they first hear it. It's probably just the process of how we communicate it, more than anything else.

Wells: Basically, microservices sound like they're just going to be far too small.

Liu: From your perspective, is there a difference?

Wells: There isn't. I really don't think so. I think probably, it's a good instinct not to get hung up on the size of services. Because when we built a microservices architecture at the "Financial Times," I think we did build far too many services, because we thought, they're supposed to be small.

What are the biggest benefits of the presented service oriented architecture at Airbnb compared to the monorail? Scalability.

Liu: Scalability, for sure. I think the most important part is how it's mapped to how our teams are divided. In very broad strokes, we have the guest facing teams and host facing teams, and we work on the logic on our website separately for those things. In terms of just iterating and pushing out new features, breaking our monolith into different services allows us to just do things in parallel, instead of having just one long deploy train on this monolithic application, where a guest change might delay a host change and stuff like that. I think that's the main benefit.

Wells: Downsides?

Liu: Just fragmentation of logic. Right now we are moving from being fragmented and having a lot of services to consolidating some of this logic. We're trying to start from the more bottom layers of our tech stack, so starting from the platform layers, we're trying to group all these very commonly used data and logic into bigger service blocks, instead of having multiple services re-implementing the same thing.

Wells: My experience is definitely you end up moving things around as you realize you're changing them all at the same time.

In addition to Ruby, what programming languages have you added to your stacks?

Liu: Ruby was used in the monolith. Right now we use a lot of Java in our backend services, and some Kotlin sometimes as well. For the data side of things, we use Scala.

Wells: Was the monorail completely decommissioned?

Liu: Unfortunately not. That's part of the fine print in our presentation. There's a different level of challenge when it comes to migration. There is just going to be some stuff that's really hard to migrate out, or stuff that is important but not that critical, and no one has the time to migrate it out. Right now, monorail still serves some very legacy traffic. Another reason is some very old frontend clients, especially mobile clients like iOS and Android, might still be talking to monorail endpoints. For those reasons, we have to keep it around for a while longer.

Wells: Is there a plan to say we're going to get rid of it, or is it just we'll keep a certain amount of it?

Liu: Our goal is definitely to decommission it completely. I'm just not sure how long it will take. There's going to be a very long tail that might just take maybe a couple more years for the monorail to completely disappear. It's not critical anymore. It's just going to be a small component in our SOA. It will become one of the services basically.

Wells: How does the GraphQL layer talk with the service layer? Does it use Thrift?

Liu: Yes, that's correct.

 


Recorded at:

Apr 07, 2023
