
Airbnb at Scale

Summary

Selina Liu walks through what it takes to decompose a large and complex monolith into independent, performant services, and how they evolve and scale the architecture with changing business needs.

Bio

Selina Liu is a senior software engineer at Airbnb, the world’s largest platform for accommodation-sharing and unique travel experiences. She’s passionate about building performant and resilient services that scale and evolve well with Airbnb’s growing business needs.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Liu: I'm not sure how many of you live in a hilly city like San Francisco. Speaking for myself, I have lived in SF for the past three years. Even to this day, I still run into this issue where every time I try to get from point A to point B in the city, whether it's for dinner, for a walk, I'll look at Google Maps and be like, that looks easy. When I actually get on my way, at some point, I'll realize that up ahead is a hill with a non-trivial incline. That happens a lot in San Francisco because it is so hilly almost everywhere. When I think about it, this uphill journey resembles Airbnb's journey to service oriented architecture, or SOA. In the sense that we started out in a flat and straightforward path of migrating to SOA, then realized that there's an even steeper and much more interesting learning curve beyond that. I will share with you four key lessons that we have learned from our journey along the way.

Background

My name is Selina. I'm a software engineer at Airbnb.

Invest in Common Infrastructure Early

Here are the four lessons from our journey to SOA. The first is to invest in common infrastructure early in your SOA journey, because this will help to turbocharge your migration from a monolith to SOA. To give you more context, let's first spend some time on how we went about our migration. Back in 2008, just like most small, scrappy startups, Airbnb started out with a single web application written using the Ruby on Rails framework with the model-view-controller paradigm. Over the years, as Airbnb expanded its global footprint, product offerings, and engineering team, that single web app grew into a massive monolith. Since it was developed using the Ruby on Rails framework, internally we call it the monorail. As more engineers jumped onto the monorail, it became slow, unstable, hard to maintain, and a single point of failure. In 2018, we launched a company-wide initiative to migrate our tech stack from this slow-moving monorail into a service oriented architecture. Similar to the model-view-controller paradigm, we decided to decompose the monolith into presentation services that render data in a user-friendly format for frontend clients to consume, mid-tier services that are grouped around shared business concerns and can be used in multiple contexts, and, under those, data services that encapsulate multiple databases in a uniform data layer.

Case Study - Host Reservation Details (HRD)

To make this abstract diagram a little bit more concrete, let's use host reservation details, or HRD, as a case study. Before I explain what HRD is, I want you to meet Maria. Maria is a new Airbnb host based in Colombia. She just published her listing on Airbnb last week, and today she's received her first booking request. When she opens her host inbox, she sees a message from the guest, Christopher. This panel on the right here that displays key information about the trip being requested and the guest, is the host reservation details panel. It pulls a lot of information from a myriad of sources in our system, including users, payments, listings, reviews. In fact, there's a lot more information that it offers even below the fold.

To migrate this complex product feature into our SOA paradigm, we basically broke down the logic into a presentation service that handles the view-layer business logic, a few mid-tier services that handle write operations such as confirming or declining a booking request, and some data services behind them down the stack that encapsulate different product entities such as reservations, users, and listings. Of course, there are other services on each layer, and we interact with many of them every time we serve a request. Let's call our reservation presentation service Ramen, because, who doesn't like ramen? What makes this monumental task of migrating complex product logic possible are the common building blocks that our infrastructure team provided, which allowed us to build services with confidence and speed.

API Framework

First, we have an in-house API framework built on Thrift. It is used by all Airbnb services to define clean APIs and to talk to each other. Let's say that as part of the business logic, Ramen has to read data from the Tofu service. The Tofu engineer only has to define the endpoint in the simple Thrift interface definition language, and the framework will create multi-threaded RPC clients to facilitate inter-service communication and handle things like error propagation, observability metrics, and schema validation. This means that engineers can focus on implementing the core business logic and not spend time worrying about the plumbing details of inter-service communication. Based on this framework, we have also developed productivity tools such as an API Explorer, which engineers can use to browse different services and figure out which endpoints to call and how to call them.
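
As a minimal sketch of the consumer side of such a framework, assuming a hypothetical Tofu endpoint and Kotlin as the service language, the generated client might be used roughly like this (the names and signatures are illustrative, not Airbnb's actual API framework):

    // Hypothetical request/response types and client generated from a Thrift
    // endpoint definition. The framework, not the product engineer, handles
    // retries, observability metrics, and schema validation behind this call.
    data class GetTofuRequest(val tofuId: Long)
    data class GetTofuResponse(val name: String, val firmness: String)

    interface TofuServiceClient {
        fun getTofu(request: GetTofuRequest): GetTofuResponse
    }

    // Inside Ramen's business logic, calling another service reads like a
    // local function call.
    fun loadTofu(client: TofuServiceClient, id: Long): GetTofuResponse =
        client.getTofu(GetTofuRequest(tofuId = id))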

Powergrid

Second, we also have Powergrid, another in-house library that makes it easy to run tasks in parallel. Under the hood, Powergrid helps us organize our code as a directed acyclic graph, where each node is a function or a task. We can use it to model each service endpoint as a data flow, with the request as the input and the response as the output. Because it handles multi-threading and concurrency, we can schedule tasks to run in parallel.

Let's take a host sending a special offer to a guest as an example. There are a bunch of checks and validations that have to be performed before we allow the host to send a special offer. Using Powergrid, we take the listing ID from the request and use it to fetch information about the listing from the listing data service. Then we pass that information along to a bunch of downstream services for validation, which happens in parallel. After that, we aggregate all the validation responses and make sure that we got a green light from everyone before writing back to the special offer data service to send the offer to the guest. Using this library provides a few benefits. First, it provides low latency by performing network I/O operations concurrently, which really makes a difference when your endpoint has multiple downstream dependencies. It also offers granular metrics for each node in the data flow, which helps us pinpoint any bottleneck in the pipeline. These benefits help ensure that our service is performant and observable.
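
Powergrid itself is internal to Airbnb, so as a rough sketch of the same idea, here is how the special-offer flow could be expressed as a small graph of tasks using Kotlin coroutines; the downstream calls and names are stand-ins, not the real services:

    import kotlinx.coroutines.async
    import kotlinx.coroutines.awaitAll
    import kotlinx.coroutines.coroutineScope

    data class Listing(val id: Long)

    // Hypothetical downstream calls; in reality these are RPCs to other services.
    suspend fun fetchListing(id: Long) = Listing(id)
    suspend fun validatePricing(listing: Listing) = true
    suspend fun validateAvailability(listing: Listing) = true
    suspend fun validateHostPermissions(listing: Listing) = true
    suspend fun writeSpecialOffer(listing: Listing) { /* call special offer data service */ }

    suspend fun sendSpecialOffer(listingId: Long) = coroutineScope {
        // Node 1: fetch listing data from the listing data service.
        val listing = fetchListing(listingId)
        // Nodes 2..n: downstream validations fan out in parallel.
        val checks = listOf(
            async { validatePricing(listing) },
            async { validateAvailability(listing) },
            async { validateHostPermissions(listing) },
        ).awaitAll()
        // Aggregate: only write back once every validation gives a green light.
        require(checks.all { it }) { "Special offer validation failed" }
        writeSpecialOffer(listing)
    }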

OneTouch

The third building block that we have is OneTouch, a framework built on top of Kubernetes that allows us to manage our services transparently and to deploy to different environments efficiently. This framework has two key aspects. First, all service configurations are managed in one place in Git. For example, all the configs for the Ramen service live in an infrastructure folder, where we can easily configure dependencies, alerts, logging, deploy environments, and CPU resources, right alongside our source code. Second, we have the magical k tool, a command-line tool built on top of Kubernetes that allows us to deploy our service to different environments on Kubernetes clusters. If I just type k all on the command line, it will automatically generate configs, build the app, and deploy it to a remote cluster. If you think about it, it's just like making a bowl of ramen: first you make the bowl (generate the configs), then cook the ramen (build the main app), and finally add the garnish (deploy), which gives you the final end product. All environments, whether staging or production, are deployed the same way. From a service governance perspective, this makes it very easy for everyone to orchestrate, deploy, and diagnose a service, because there's only one place to look and one place to learn.

Spinnaker

Lastly, we have Spinnaker, an open source continuous delivery platform that we use to deploy our services. It provides safe and repeatable workflows for deploying changes to production. One aspect that has been especially helpful for us is the automated canary analysis. In this step of the deploy pipeline, we deploy both the old and the new snapshots to two temporary environments: a baseline environment with the old snapshot and a canary environment with the new one. Then we route a small percentage of our traffic to both of them. Key metrics such as error rates are automatically ingested and fed into statistical tests that produce an aggregate score for the new canary environment, as measured against the baseline. Based on the score, the analysis tool decides whether to fail the canary or promote it to the next stage in the deploy pipeline. In a service oriented architecture where so many services are deployed every single day, this helps us release code changes at scale with confidence.
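
As a toy illustration of the decision being made here (real canary analysis runs proper statistical tests over time-series metrics, which this does not attempt), the idea is to score the canary against the baseline and promote only if the aggregate score clears a threshold:

    // Toy canary decision: each metric "passes" if the canary is no worse than
    // the baseline plus a small tolerance; the score is the fraction that pass.
    data class MetricSample(val name: String, val baseline: Double, val canary: Double)

    fun canaryScore(samples: List<MetricSample>): Double =
        samples.count { it.canary <= it.baseline * 1.05 }.toDouble() / samples.size

    fun shouldPromote(samples: List<MetricSample>, passThreshold: Double = 0.95): Boolean =
        canaryScore(samples) >= passThreshold

    fun main() {
        val samples = listOf(
            MetricSample("error_rate", baseline = 0.0010, canary = 0.0009),
            MetricSample("p95_latency_ms", baseline = 120.0, canary = 123.0),
        )
        println(if (shouldPromote(samples)) "promote canary" else "fail canary")
    }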

What Does Post-SOA World Look Like?

Thanks to all these infra pieces, we were able to migrate our core product functionality to SOA in the span of just two to three years, and reap the benefits of higher reliability, business agility, and loose coupling between our services. After all this work, you would think that we could now finally take a nap in front of the computer. Honestly, we are not done. In fact, we hadn't even started climbing the metaphorical hill. What we realized is that sometimes it could take more time to ship a feature, due to new frictions introduced by SOA: engineers now need to acquaint themselves with more services and make changes in several of them before they can ship a change. What's more, due to unconstrained call patterns between services, where anyone can call anyone, our dependency graph ended up being complicated and started to look like a tangle of Christmas lights. This is not ideal and potentially dangerous, especially when there are circular dependencies between services, which can make it really hard to visualize and understand the intricate relationships between them. Basically, it's a complex mental model that engineers have to maintain. Also, highly stable services could easily be brought down by more volatile services, because it's an ecosystem where everyone depends on everyone else.

Simplify Service Dependencies

To address these issues, we decided to simplify service dependencies. We designed our architecture as a tiered tech stack consisting of presentation, mid-tier, and data services. The motivation was to separate services into layers based on their technical priorities. As we go up the stack towards the application and UI layers, the primary consideration is iteration speed and schema flexibility, leading to more specific and fast-changing APIs. This generally maps to our presentation services. On the other hand, as we go down the stack towards the platform and infra layers, since their blast radius is bigger, they need more generalized APIs and schemas, and higher reliability and stability requirements. For an SOA to be reliable and resilient, it is imperative that stable services do not depend on more volatile ones. Conceptually, a higher tier can call a lower tier service, but not vice versa.

However, the problem with our existing SOA system was that there was not enough service governance and dependency management to enforce this fundamental principle and to restrict who can call whom. Hence, to enforce a topology-driven, layered architecture, we introduced service blocks at the platform layer, where each block is a collection of services with related business functionality. For example, a listing block encapsulates both the data and the business logic that inform core listing attributes. It then exposes a simple, consistent read and write API to upstream clients through a facade. Under the hood, the listing facade orchestrates coordination between the underlying data and business logic services, while providing a layer of abstraction and concealing the underlying complexity from upstream clients. We also enforce a strict topology by prohibiting clients from calling any internal services and prohibiting blocks from having circular dependencies with each other. With such a higher level of abstraction, it is much easier for upstream clients to discover and leverage core functionality. It is also much easier for us to manage block dependencies and maintain high levels of reliability.
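
To make the facade idea concrete, here is a minimal Kotlin sketch, with hypothetical names, of how a listing block could expose one simple read and write API while coordinating internal services behind it:

    // Internal services of the listing block; upstream clients never call these directly.
    data class Listing(val id: Long, val title: String, val active: Boolean)

    interface ListingDataService { fun load(id: Long): Listing }
    interface ListingPolicyService { fun canActivate(listing: Listing): Boolean }

    // The facade is the only entry point into the block.
    class ListingFacade(
        private val data: ListingDataService,
        private val policy: ListingPolicyService,
    ) {
        // Read API.
        fun getListing(id: Long): Listing = data.load(id)

        // Write API: the facade coordinates business-logic checks before writing.
        fun activateListing(id: Long): Listing {
            val listing = data.load(id)
            require(policy.canActivate(listing)) { "Listing cannot be activated" }
            return listing.copy(active = true)
        }
    }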

Platformize Data Hydration

We also spent some time platformizing data hydration. Looking at this diagram again, notice that we have quite a number of presentation services. If we zoom into a typical presentation service, it usually performs three main functions. First, fetching and hydrating data from different downstream services. For example, the Ramen service alone calls 10 services to hydrate data for host reservation details. Second, these presentation services perform simple transformations of the data. For example, the Ramen service can easily have to merge data from 10 different services into something that the client expects. Third, services can also perform permission checks before proceeding with more complex business logic.

As time went on, we realized that engineers were spending a lot of time on these three functions, even though much of it was duplication, boilerplate code, and repeated patterns. Our approach to this problem was to introduce a platformized data access layer that provides a single consolidated GraphQL schema, stitching together different entities such as listings, users, and reservations across all of Airbnb's online data. It also serves as a platform to host all the mundane data fetching and hydration logic, rather than requiring duplication of this logic across many different presentation services. Together with the more complex presentation logic on one side and the write logic on the other, both of which attend to a different set of constraints and are a detail we'll ignore for now, this data access layer will eventually replace all presentation services. The service blocks below the data access layer will also replace the old data services as well as the mid-tier services. You can see that with this data access layer, we continue to simplify service dependencies.

Going back to the layer itself, in essence it is an enhanced GraphQL engine that reimagines the way data is fetched in our SOA, going from a service-oriented to a data-oriented hydration paradigm. For example, instead of writing code to explicitly call the reservation data service to get reservation data, the caller writes a declarative query on the reservation entity, and can even fetch associated listing and guest user data in the same query. Such queries are made possible by a GraphQL schema that is enriched with special annotations that we built in-house. For example, the ServiceBackedNode annotation, with its templated fields, allows us to associate a GraphQL type with a service endpoint, where the response from the service is automatically wired back to the corresponding attributes defined in the GraphQL type.
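
Assuming hypothetical field names, a caller's query against the data access layer might look roughly like the following, with the linked listing and guest data hydrated by the platform rather than by hand-written service calls:

    // Illustrative GraphQL-style query held in a Kotlin string; the field names
    // are invented and not Airbnb's actual schema.
    val hostReservationDetailsQuery = """
        query {
          reservation(id: "12345") {
            checkIn
            checkOut
            listing { name }
            guest { firstName profilePhotoUrl }
          }
        }
    """.trimIndent()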

As another example, the ServiceBackedNodeKey annotation allows us to link different types together. For instance, the guestId on the reservation type can link to the fully fledged user type, which allows callers to fetch user fields alongside the reservation fields in one query. Aside from these, there is also a privacy annotation that wires in permission checks, and an ownership annotation at the top that makes it easy to route alerts to the right teams. All in all, these annotations, with their declarative templates, allow us to easily create types, construct an entire graph, and codegen the DataLoaders for each type in a way that is configuration driven, which reduces the potential for error. In addition, we have an online IDE built on top of the open source GraphQL library that makes it easy to explore the schema and inspect the data fetched. To summarize, platformizing data hydration allows engineers to focus on product innovation, instead of writing the same data hydration code over and over.
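
The exact syntax of these annotations is internal to Airbnb, but based on the description above, an annotated type might look roughly like this sketch, where the directive arguments are guesses for illustration only:

    // Hedged sketch of an annotated GraphQL type; the directive names come from
    // the talk, but their arguments here are invented.
    val reservationTypeSdl = """
        type Reservation @ServiceBackedNode(service: "reservation-data-service") {
          id: ID!
          checkIn: String
          guestId: ID @ServiceBackedNodeKey(node: "User")
          guest: User
          listingId: ID @ServiceBackedNodeKey(node: "Listing")
          listing: Listing
        }
    """.trimIndent()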

Unify Client-Facing API

Lastly, as we continue to evolve our SOA, we also decided to unify our client-facing API. In our original SOA diagram, each presentation service was usually maintained by a different product team, by virtue of Conway's Law. An implication of this is that each presentation service tended to define its own client-facing API and solve problems its own way. There wasn't a common set of best practices, and people sometimes ended up reinventing the wheel. The result was lower developer velocity, more bugs, and sometimes an inconsistent user experience. Our solution to the problem is App Framework, an in-house, unified, opinionated, service-driven UI system. To quickly visualize how it works: this is what the user sees on host reservation details, and this is what App Framework sees, where everything on the page is broken down into standardized sections. The content and styling of the UI within each section are driven by the backend. This leaves the frontend with a thin layer of logic that is responsible for just rendering these sections. On the presentation backend, we expose a common schema to the clients, and each of the frontend clients has an App Framework runtime that is responsible for interpreting API responses from the backend and rendering them into UI for the user.

App Framework: Standardized API Response

Taking a deeper look at the standardized API response, you can see that it is broken down into two parts. First is the registry of all the sections needed for a page. Second, we have the screen structure, which expresses the layout of the sections on a page. For example, this part can dictate that the header section should go at the top of the page, and the card section should go right below it. Zooming further into each of the sections, here is the schema definition with a concrete example. Focusing just on the key attributes, we have the section data, which represents the data model itself. For example, here we have a list of user info, including where the user lives. Then we have the UI component type, which refers to the UI component that will use this data from the data model to render the UI on the frontend. In this case, we want to render the list data as a bulleted list. One thing to call out here is that it is possible for one section data type to be rendered by a multitude of different UI component types, which affords us flexibility and variation in product UI. More importantly, all these sections are reusable across different services. For example, a user highlights section can be shared between guest-facing and host-facing services.
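
As a rough Kotlin model of what such a standardized response could contain (the field names are hypothetical, not the actual App Framework schema):

    // A page response carries a registry of sections plus a screen structure
    // that says where each section goes.
    data class SectionResponse(
        val sections: List<Section>,
        val screenStructure: ScreenStructure,
    )

    data class Section(
        val sectionId: String,
        // Which UI component should render this section, e.g. "BULLETED_LIST".
        val sectionComponentType: String,
        // The data model itself, e.g. a list of user info strings.
        val sectionData: Map<String, Any?>,
    )

    data class ScreenStructure(val orderedSectionIds: List<String>)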

Key Features

There are also a few other key features of App Framework. First, we support different layouts and placements of sections on the page, which provides flexibility and range for product design needs. Second, with different sections, we can defer the loading of more expensive, and often lower, sections to a second network request, which helps improve our initial page load time and the overall user experience. This is especially helpful for mobile clients, which can sometimes have weaker internet signals and take longer to load data between requests. Lastly, the framework also logs impressions and UI actions on each section automatically, which is really helpful for measuring user engagement when we launch a new feature through App Framework. To make the developer experience easier, we also built a web tool that allows engineers to easily visualize their API response in real time by copy-pasting the payload into the tool.

In summary, with App Framework we got to isolate a robust, commonly shared schema foundation, as well as rendering components, layouts, and tooling, which are designed to evolve slowly under strict scrutiny. This separates the infra tooling from more volatile product code that changes from day to day. Second, App Framework empowers product teams to execute fast and with flexibility by providing clear patterns for reusability and customization. For example, using pre-built sections, product teams can easily launch new features across clients without any mobile app versioning or deploys on the mobile frontend. Lastly, App Framework also helps ensure a consistent user experience and maintain product quality by consolidating presentation logic that used to be scattered across all three frontend platforms into one backend.

Recap

In conclusion, we have gone through a lot of material, so let's recap the lessons we have covered. First, invest in common infrastructure early to turbocharge your initial SOA migration. Second, as you continue to expand and scale your architecture, prioritize simplifying your service dependencies for long-term stability. Third, platformize and abstract common patterns such as data hydration, so that product engineers can focus on solving new and important problems. Lastly, unify client-facing APIs into a robust system of reusable parts and safe guardrails to support fast product iteration and to launch features with confidence. One overarching theme in the progression of these takeaways is that we continue to streamline and fine-tune our layers of abstraction, based on the way we work and the way we build our products, from the infra layer, to the platform layer with the service blocks, to the application and UI layers with App Framework. What informed these stepwise improvements were the common pain points experienced by engineers and end users. It is true that sometimes it means we have to undo some of our earlier work. That is fine. It is hard to get everything right the first time. The point is to keep evolving the architecture to improve developer velocity and to serve prevailing business needs.

Going back to the metaphorical hill in SF, when we set out to migrate to SOA, we were not expecting our path to include this steep hill up ahead. The lessons along the way were rich, and the learning curve was in fact quite an exciting and fulfilling ride. We can't say for sure that we have made it to the top of the hill, but when we survey our current tech stack, we begin to see that SOA is not a fixed destination. Instead, like a real city, it is constantly changing and evolving into something more resilient and lasting.

Questions and Answers

Richardson: Why SOA and not microservices?

Liu: In our minds, we were just trying to break down our logic into components, but we didn't necessarily want to break it down so far that every service focuses on just one small task. I think in general, we tried not to use microservices, because some people might think that for every small business feature we have, we will spin up a new service for it. As it is, we already have a lot of services, so we're trying to prevent that.

Richardson: There's a lot of interesting debate around how micro is micro. How granular are your services? Do you know the ratio, number of developers versus number of services, or number of teams versus number of services? Do you have any simple ratio there?

Liu: Probably right now, every team has at least two to three services. As we have evolved, we are, in general, moving towards consolidating some of this logic, because what we're finding is that there's a lot of overhead in maintaining a service, in terms of things like SLOs, performance guarantees, test levels, and so on.

Richardson: That's a common pattern I've seen working with clients where there's this tendency to build fine-grained services almost as much as one service per developer. Then it's like, why not just consolidate? Unless you have a very good reason to have more per team.

Liu: I think at some point, it becomes a single point of failure per service when you just have one developer who knows all the context for that service. When it comes to on-call, it's really hard to keep a sustainable load for engineers when they have to cover different services that they're not familiar with. That's also one aspect of operations where we are trying to slim down and consolidate.

Richardson: The other interesting thing is the conventional wisdom around microservices, or you could say SOA, is vertical slices around business domains, yet one distinctive characteristic of your architecture is horizontal, technical slices. Am I right in thinking is that maybe the technical/horizontal division was your initial thought back in 2018, but now with this block concept, they seem to be more like business domains?

Liu: That's interesting, because in my mind, it's more of going from vertical, where in the past each vertical would have its own presentation service for derived data, and then a data service at the bottom. Now we are trying to consolidate, for example, the presentation layer into one horizontal chunk, and then towards the bottom of the stack we have different block facades, but those also use a standardized API framework, like a GraphQL gateway layer. In a way, I think we're trying to reduce the level of duplication and repeated patterns in each of the layers by making them more consistent.

Richardson: How do you achieve reliable communication? Maybe this is by consolidating services, but there seems to be a bunch of Thrift based synchronous calls. That's synchronous coupling, essentially. How do you manage to still be highly available as Airbnb obviously is?

Liu: That's a complicated issue. Over time, we have resorted to a bunch of procedural as well as technical measures. In terms of processes, we now require every service to define its own SLOs, basically performance guarantees on how fast its API should be and what the error rate should be, and to have weekly check-ins at a higher team level to make sure that all the services a team owns are performing according to their standards. Then we also have test level requirements. All of these are part of our commitment-to-craft initiative, where we are trying to make sure that each of these services has a bare minimum of service quality. In terms of technical solutions, in our service IDL framework we also have mechanisms that prevent retry storms, where if a service is unresponsive, upstream callers just retry multiple times, which might cause the service to degrade even further. We have circuit breakers that prevent things from escalating into a catastrophe. Those are a few things that we have tried. The ecosystem itself changes from day to day, just from new product development. That's also one aspect of maintaining the service: we have to keep a close eye on a lot of these services, especially the more fundamental ones, the facades and the data services at the bottom.

Richardson: You migrated to services. How did it help? Because originally, apparently, you were in monolithic hell back in 2018, and so are you in SOA Nirvana now?

Liu: No. I think it was just like the hike across SF: we're still trying to scale the learning curve. We're not quite there yet. It is very interesting, especially with the evolving business requirements at Airbnb. Initially, our mental model was very simple: we were just trying to break down this big chunk into smaller chunks, and that is service oriented. Now we are trying to think at levels of abstraction that map to the product but at the same time are technically sound. That's why we have the data access layer, where everything is represented as different parts of our data constellation, instead of thinking of things as services. In those endeavors, things get really interesting and complicated, because a product can be anything. There's no hard definition of what a product feature is. It can be a user, or it can be some user feature that is abstract. Trying to navigate that as a team has sparked very interesting conversations about how we want to organize our data schema and how we want to map that to the underlying services. That process of mapping from how we represent our product to how we build our services has been an ongoing conversation across different teams.

Richardson: I think building complicated software is challenging, no matter what your architecture is.

What patterns did you apply in the data access layer to improve performance?

Liu: In the data access layer? We do have things like caching. In a GraphQL request, there might be different parts of the query that end up hitting the same service. In that case, we cache the exact request that we know will get an idempotent response back. Sometimes we also batch requests into a single request to the downstream service, things like that. There's a lot of underlying logic that the team behind the unified data layer builds so that other engineers don't have to worry about it.
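
As a minimal sketch of those two patterns (deduplicating identical requests within one query and batching per-key lookups into a single downstream call), assuming a generic loader rather than Airbnb's actual implementation:

    // Toy loader: load() records keys and dedupes against the cache, dispatch()
    // makes one batched downstream request, get() reads the cached result.
    class BatchingLoader<K, V>(private val batchFetch: (Set<K>) -> Map<K, V>) {
        private val cache = mutableMapOf<K, V>()
        private val pending = linkedSetOf<K>()

        fun load(key: K) {
            if (key !in cache) pending += key
        }

        fun dispatch() {
            if (pending.isNotEmpty()) {
                cache += batchFetch(pending.toSet())
                pending.clear()
            }
        }

        fun get(key: K): V? = cache[key]
    }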

Richardson: Presumably, this migration to SOA, actually improved how you work, though. Never mind it's still challenging, but it improved deployment frequency and reliability, and it made a big difference.

Liu: Yes, I think it did. Initially, there was just one deploy train on this one monolith, so everything else would be blocked if you had a failure in just one of the APIs. It's also because we have separated the presentation layer from the bottom layers. The presentation layer tends to evolve much faster, so if things break there, they tend not to affect the more fundamental services, which have a bigger blast radius.

 


Recorded at: Mar 24, 2022
