
Airbnb’s Great Migration: Building Services at Scale


Summary

Jessica Tai recaps her QCon SF 2018 “Great Migration” presentation, then continues the story with a focus on how Airbnb is building, operating, and scaling its expanding network of services. Though the re-architecture to SOA is still ongoing, Airbnb is already seeing various benefits, including improved performance, developer productivity, build and deploy times, and site reliability.

Bio

Jessica Tai has worked at Airbnb for 4 years, starting as a full-stack product engineer for the guest and host booking flow; she is now an infrastructure engineer on the Core Services team. She leads the user data service, which is one of Airbnb’s highest-QPS services and integrates with all business verticals. She is a member of the Leadership & Development committee for women in tech at Airbnb.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Tai: When I joined Airbnb in 2014, the engineering team was much smaller then, around 100 people. We had a tradition of welcoming our new hires by having them run through what we call a human tunnel. At the end of the human tunnel was a large beanbag for the new hire to jump onto while the rest of us clapped and cheered. After engineering new hires completed their human tunnel run, we gave them an onboarding task to get them familiar with our monolithic code base. At the time, almost all of our engineers were coding in the monolith daily. As our engineering team grew, so did the length of the human tunnel. At a certain point, it was no longer practical for us to uphold this tradition, as now, in 2019, we have over 1,200 engineers spread across the world.

My name is Jessica [Tai], and I'm an ex-monolith engineer. I pair program with my corgi, and for the first two and a half years at Airbnb I was working as a full-stack engineer in the monolithic code base. I then moved to a more infrastructure role, where I am now on our core services team. Core services is responsible for building out the foundational layer of services as we move away from the monolith.

I'll begin by describing a brief motivation for why we decided to migrate to microservices. I'll then cover some of the service design tenets that we created as a way for us to make scalable, reliable services. I'll then discuss the sequence in which we decomposed out of the monolith, and how we reduced the number of dependencies we created. I'll cover our incremental comparison frameworks and how we ensured that we didn't break functionality when migrating from the monolith to services. I'll then dive deep into some best practices that we learned and developed along the way, and conclude with some of the results we've seen so far from our migration.

A monolith is a single-tier unit responsible for both server- and client-side functionality. This means that the model, view, and controller layers are together in a single repository. At Airbnb, we have a Ruby on Rails monolith, which we call monorail. Monoliths are really easy to get started with, as they're quick to bootstrap and good for agile development. It makes sense that Airbnb started with monorail. During my monolith engineering days, I worked on various parts of the product, including the home description page as well as the home checkout page. After I completed my human tunnel run, I was given an onboarding task to implement a feature where the guest was required to send a message to their host. The model, view, and controller parts of this new hire task were all done in monorail.

Why Decide to Migrate?

If monorail could handle everything, why did we decide to migrate? This talk will also inform you of various animal migrations that happen in nature. Many birds migrate, with the Arctic tern having the longest migration path. It migrates over 1.5 million miles, or 2.5 million kilometers, throughout its lifetime. That's like going from here to the moon and back, three times. Migrating your architecture is a similar million-mile journey, so why did we embark on ours? The short answer was, our engineering team was growing rapidly. We were often doubling year over year, which meant hundreds more engineers were adding code to the monolith.

The code became more tightly coupled, ownership was unclear, and we began to see more production incidents. Our deploy trains became slower as well. With hundreds of engineers working in the monolith, each trying to deploy their code, it became a big traffic jam. At Airbnb we have a value known as democratic deploys, where we expect each engineer to be responsible for testing their changes and deploying them to production themselves. However, with this traffic jam, I liked to deploy my code early in the morning, before the other engineers got into work, so I could avoid reverts due to incidents or merge conflicts.

Engineering productivity was low, frustration was high, so we needed to figure out a way to alleviate these pain points. The solution we landed on was service-oriented architecture, or SOA. SOA is a network of loosely coupled services, where a client makes its request to an API gateway, and the gateway then routes the request to multiple services and databases. This addresses our pain points because services are built and deployed separately, we can scale them independently, and ownership is more clearly defined to be within the scope of a service's supported API.

If we look at that checkout page and my new hire change again, but now through an SOA lens, there could be a homes service, a reservation service, a pricing service. Now this is looking like a lot of different services. Are we just trading one kind of complexity for another?

Service Design Tenets

Knowing that could be a possibility, we wanted to be disciplined as we went about this migration, so we created service design tenets to help guide us while building services. Penguins have a migration path where the various penguin colonies have a shared understanding and know when and where to meet at the end of their migration journey. We wanted our engineers to have a shared understanding when building their services, to ensure that they could build them in a reliable and consistent way.

The first tenet we had was that services should own both the reads and the writes to their data. That means a particular data store should have only one service directly accessing it. This is good for data consistency, as well as encapsulation and isolation. Another tenet is that services should address a specific concern. We wanted to avoid decomposing our monolith into another service so large that it became a new monolith. We also wanted to take an approach a little different from the traditional microservices architecture. In microservices, the services are really granular and good at one thing. However, we wanted our services to be a bit larger and focused on specific business functionality.

Services should avoid duplicating functionality. For shared pieces of infrastructure or code, we create shared libraries and shared services, which makes maintainability easier. Another design principle we have is that data mutations should propagate via standard events. It's inevitable that a service may want to know when a particular piece of data has changed, even though it doesn't own that data. We enable this by using SpinalTap, a change data capture service that was open sourced by Airbnb. SpinalTap listens to various data sources, and when there are changes, it emits them as standard events, in our case via a Kafka queue. Other services can then consume those events and act accordingly.

For example, if I made a reservation on Airbnb, the reservation service creates a row in its database, and we want our availability service to know about this new reservation, so it consumes the standard event and marks the home's availability as busy for those dates. The tenet that ties the others together is building for production. We know it's tempting to cut corners if you're building a service that was due yesterday, or that serves admin-only traffic, or is only a prototype. However, we wanted each service to be built as if it were mission critical. That means having appropriate alerting, observability, and infrastructure best practices.
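To make the standard-events tenet concrete, here is a minimal sketch of what a consumer like that availability service might look like. It assumes the events land on a Kafka topic; the topic name, group ID, and payload handling are hypothetical, not Airbnb's actual code.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical availability-service consumer of reservation mutation events.
public class ReservationEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "availability-service");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("reservation-mutations")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // React to the data mutation owned by another service:
                    // block the home's calendar for the reserved dates.
                    markDatesBusy(record.value());
                }
            }
        }
    }

    static void markDatesBusy(String eventPayload) {
        // Parsing and the availability-store update are elided for brevity.
    }
}
```

The key property is that the availability service never touches the reservation service's database directly; it only reacts to the emitted event.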

Decompose by Request Life Cycle

With these design principles in place, we sought to figure out how to actually begin the migration. We looked at the request life cycle to decide how to decompose out of the monolith. Monarch butterflies have an interesting migration path, because their migration journey is longer than the lifespan of any single butterfly. This means it takes multiple iterations of the butterfly life cycle to complete the migration journey.

We created multiple iterations of the service request life cycle to complete our migration journey, beginning with version one, where everything goes through monorail. Monorail was responsible for the presentation view, the business logic, and accessing the data. Version two is a hybrid, where monorail is de-scoped to only the routing and view layers. It sends API traffic to our network of services, where the services are responsible for the business logic, modeling, and data access.

If we look a little closer at what these services are, we defined four different service types. We have these strict definitions to allow for a specific flow of dependencies as a request goes through its life cycle. Beginning at the bottom layer, we have the data service. The data service is the gatekeeper for a particular data entity's reads and writes. A data service does not have dependencies on other services; it only accesses data stores. The derived data service is one layer above; it can read from data services as well as its own stores, such as an offline store. It applies some basic business logic, often in a shared product context used across various product use cases.

The middle tier was the service type we defined as we started decomposing and realized that there were large pieces of business logic that didn't quite fit at the data service level or the derived data service level. We encapsulate these more complex pieces of business logic in middle tier services. At the highest level, we have the presentation service, which synthesizes from all the services below it. Its responsibility is to aggregate data from data services and derived data services, and apply some business logic to return data to the front end, which the user can then see in our product.

With these service definitions, we wanted to begin at the bottom layer, and we decided to start with the homes data service. We picked homes because homes were foundational to the Airbnb business. At the time, Airbnb's business was purely in homes, and when we looked at how the data was accessed, it was mostly via Active Record, the library that accesses data in Rails. Wanting to migrate this out into its own service, we intercepted at the Active Record level: instead of routing to the database, we routed to our homes data service. The homes data service was then responsible for routing to the data store.

Once we had validated the read and write paths to that data store, we later moved the shared data for homes out into a separate homes database. After creating some core data services, we then migrated core business logic. An example of this is the pricing derived data service. It requires some information about a home, which it gets from the homes data service, as well as from other stores, such as its offline price and trends statistics. We then moved on to migrating core product views. An example of this is that checkout page. There's a checkout presentation service, which needs information about pricing and homes, which it gets from the derived data and data services. When migrating writes, we found that some validation logic was more complex, so we encapsulated it into a homes validation middle tier service, which would then be responsible for writing to our homes data service. That was version two: monolith and services within the same request life cycle.

Version three eliminates monorail completely. Instead, the mobile client makes a request to the API gateway, a service layer responsible for middleware and routing. The API gateway populates the request context from a series of middleware services, such as a session data store or a risk signal store. It then routes the request to our SOA network, where the services, again, are responsible for the presentation logic and data access.

Our web clients are handled a little differently. We have a service layer specific to web rendering. Its responsibility is to return HTML back to the web client. It populates this data by making a request to the API gateway, which has the exposed external API endpoints, populates the request with the middleware services, and then propagates the rest of the request throughout the SOA network, where it flows in a very specific way due to how we defined our services and their dependencies.

Compare for Differences

Going from this monolith world to a world of services seems great, but it's not something that's done overnight. A lot of time is spent in the middle of this migration period, where both the monolith and the services need to be supported as first-class-citizen tech stacks. With these two routes, we wanted to ensure that we didn't break functionality along the way. Walruses have a migration path where they can either swim to their end destination or float on ice sheets. Our requests can either go through monorail or through our services, but we wanted to make sure that the responses were of equal value.

To do this, we used comparison frameworks. We began with reads, because reads are idempotent, meaning you can issue multiple identical requests and get the same response. If we compare read path A, monorail going directly to the database, against read path B, flowing through our service, we can then compare the responses of these two paths. We emit them as standard events that can be consumed and sent to an offline comparison framework. We do the comparison offline for performance reasons, so we do not impact the online request cycle.

We put this comparison framework and its traffic behind an admin tool that is easily configurable via a web UI. This is important because we can ramp up or completely shut off the comparison traffic with the click of a button, instead of requiring code changes and a code deployment. We then start the ramp very slowly, with a cautious 1% of traffic, looking at the comparison framework for differences. We continue to gradually ramp up the traffic, and once we're at 100%, we continue to wait. This is important to ensure that we gather enough traffic patterns to cover the different ways this data can be accessed. We also want to ensure that our new service can handle the full load that monorail was supporting.
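Here's a minimal sketch of how such a dual-read comparison might be wired up, with hypothetical names: the monolith path stays the source of truth, a configurable percentage of requests shadow-reads the new path, and both payloads are emitted as an event for the offline comparison framework.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical dual-read shim: path A keeps serving; path B is shadow-read
// for a ramp-controlled fraction of traffic and compared offline.
public class DualReadComparator<T> {
    interface ReadPath<T> { T read(long id); }
    interface EventEmitter { void emit(String topic, String payload); }

    private final ReadPath<T> monolithPath;  // path A: monorail -> database
    private final ReadPath<T> servicePath;   // path B: new data service
    private final EventEmitter emitter;
    private volatile int rampPercent;        // adjusted from an admin web UI

    DualReadComparator(ReadPath<T> a, ReadPath<T> b, EventEmitter e, int rampPercent) {
        this.monolithPath = a;
        this.servicePath = b;
        this.emitter = e;
        this.rampPercent = rampPercent;
    }

    T read(long id) {
        T primary = monolithPath.read(id); // still the response we serve
        if (ThreadLocalRandom.current().nextInt(100) < rampPercent) {
            // Shadow-read asynchronously so the online request isn't slowed;
            // both payloads go out as a standard event, and an offline
            // framework consumes these events and reports differences.
            CompletableFuture.runAsync(() -> {
                T shadow = servicePath.read(id);
                emitter.emit("read-comparisons",
                    "id=" + id + " a=" + primary + " b=" + shadow);
            });
        }
        return primary;
    }
}
```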

Once the comparison looks clean, we move over to serving all of our reads through this particular service. Writes are done a little differently, because we cannot dual-write to the same database, so instead we utilize a shadow database. Monorail making a request to some production data service is path A. Say we want to introduce a checkout presentation service and a middle tier service. Let's call this write path B, where the request flows through the two new services, and the middle tier service writes to a data service that is connected to a shadow database, separate from production.
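A minimal sketch of that shadow-write setup, again with hypothetical names: production writes keep flowing through path A, while a copy flows through the new path into the shadow database; as described next, the payloads each path receives are emitted for offline comparison.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical write shadowing: path A is the write of record, path B writes
// to a data service backed by a shadow database, never production data.
public class ShadowedWriter {
    interface DataService { void write(String payload); }
    interface EventEmitter { void emit(String topic, String payload); }

    private final DataService productionPath; // path A: monorail -> production DB
    private final DataService shadowPath;     // path B: new services -> shadow DB
    private final EventEmitter emitter;

    ShadowedWriter(DataService a, DataService b, EventEmitter e) {
        this.productionPath = a;
        this.shadowPath = b;
        this.emitter = e;
    }

    void write(String payload) {
        productionPath.write(payload); // the only write users observe
        // Shadow asynchronously; a failure here must never affect production.
        // In practice each data service would emit the request it received,
        // and the offline framework joins and compares the two payloads.
        CompletableFuture.runAsync(() -> {
            shadowPath.write(payload);
            emitter.emit("write-comparisons", payload);
        });
    }
}
```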

We compare the requests sent to these two different data services and send the payloads, again, via standard events that can be consumed by an offline framework. The asynchronous comparison helps with performance. With a similar ramp-up process, we're able to ensure that we have a clean comparison and then cut the traffic over to writing purely through our new presentation and middle tier services. When migrating to the API gateway, we also do comparisons. We send the original request via a no-op through monorail, and monorail is responsible for adding the request context and handling the rest of the request.

At the API gateway level, we make a copy, and the shadow request gets propagated to the middleware layer to pick up the request context. With this request context, the shadow request can be sent throughout the SOA network, where, again, we compare the responses against the monorail path. Once the comparisons are clean, we're able to cut the no-op traffic to monorail and serve the API purely through our SOA network.

A lot of this comparison work is done carefully, but we also made the migration itself more incremental in a few different ways. One is migrating per endpoint. This allows a service to get into production, with traffic being sent through it, even if it doesn't yet support all of the APIs within its scope. An example of this is when I worked on the user service. In the beginning, it supported only one batch API endpoint, load users. Load users was very simple: it loaded users from one MySQL table, only by user ID. That may seem pretty limited, but it was able to unblock several clients, including presentation services and derived data services. This allowed multiple services to be built in parallel while I added more query patterns and data sources to the user service.
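To make that concrete, here's a minimal sketch of the kind of single batch endpoint a data service can launch with; the names (UserDataService, loadUsers, UserStore) are illustrative, not Airbnb's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical first iteration of a user data service: one batch endpoint,
// one MySQL table, lookups by user ID only.
public class UserDataService {
    record User(long id, String name) {}
    interface UserStore { List<User> selectByIds(List<Long> ids); } // one table

    private final UserStore store;

    UserDataService(UserStore store) {
        this.store = store;
    }

    // The only query pattern supported at launch; more patterns and data
    // sources can be added later without blocking the service's clients.
    Map<Long, User> loadUsers(List<Long> ids) {
        return store.selectByIds(ids).stream()
            .collect(Collectors.toMap(User::id, u -> u));
    }
}
```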

Another way we were able to do incremental migration was migrating per attribute. Some presentation service may require multiple attributes to be populated from services, but perhaps not all of those attributes have been migrated yet. For the attributes that have been migrated, we send requests to the services, and for the attributes that have not been migrated yet, we fall back to calling monorail. These requests are made in parallel and allow the presentation service to get production traffic without needing to wait for all of its dependencies to be migrated to the SOA world.
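A minimal sketch of that per-attribute split, with hypothetical names: one attribute is served by a migrated service, another still comes from monorail, and both are fetched in parallel.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical per-attribute aggregation during partial migration.
public class HomeAttributesAggregator {
    interface HomesDataClient { String description(long homeId); } // migrated
    interface MonorailClient { String amenities(long homeId); }    // not yet
    record HomeAttributes(String description, String amenities) {}

    private final HomesDataClient homesService;
    private final MonorailClient monorail;

    HomeAttributesAggregator(HomesDataClient homesService, MonorailClient monorail) {
        this.homesService = homesService;
        this.monorail = monorail;
    }

    HomeAttributes load(long homeId) {
        // Both sources are queried in parallel, so the presentation service
        // can take production traffic before every dependency is migrated.
        CompletableFuture<String> desc =
            CompletableFuture.supplyAsync(() -> homesService.description(homeId));
        CompletableFuture<String> amen =
            CompletableFuture.supplyAsync(() -> monorail.amenities(homeId));
        return new HomeAttributes(desc.join(), amen.join());
    }
}
```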

SOA Best Practices

With these comparison techniques and incremental migration, it may seem like we had our bases covered, but there were some incidents along the way, and from them we developed best practices. Wildebeests have a very dangerous migration path, but they've developed the best practice of keeping their young in the center of the pack to keep them alive and healthy. We've developed various best practices to keep our services alive and healthy, and to build out our services in a more scalable fashion.

In service building, it's really important to focus on standardization to get the benefits of consistency. One form of standardization is frameworks, which auto-generate a lot of the code for us. Taking out the manual typing is a way to ensure the code is generated more consistently. Another is making sure the testing and deployment processes are consistent. We heavily use replay traffic from production to simulate an environment very similar to production. Our observability practices have become more standardized as well. We use a lot of templating, which gives us templated metrics as well as templated graphs.

Previously, if I wanted to build a service focused on some particular piece of business logic, I would create an endpoint layer to expose it. I would then need to write clients. At Airbnb, we support both Java and Ruby as first-class-citizen languages, so I would need to manually write a Java client and a Ruby client. I would probably want some server diagnostics, and metrics everywhere in the client, endpoint, and server layers. I would also want data validation, resilience features, error handling, and, in Ruby, I'd probably want type checking.

Because we're building for production, this means we need to add runbooks, alerts, and dashboards. This is looking like a lot of overhead work just to create a service focused on business logic. Recognizing a lot of common pieces and boilerplate, we invested in creating a service framework team. This team aligned on using Thrift as our IDL, or interface definition language. Now, instead of writing all that boilerplate myself, I just need to wrap the service in a simple IDL layer, and the rest gets auto-generated for us for free.

If we look at what this Thrift layer might be, it's written in a configuration-based way. We can have a Thrift struct specifying the request. It's strongly typed, which we're able to use in our communication protocols and storage. We can provide extra comments and additional annotations, such as, "This is personal data." A response is defined similarly to a request: the struct includes the various fields populated in the response, and it's then used when defining a particular endpoint. We give the endpoint the response struct, its name, and the request struct, and we can also specify the exceptions it can throw. On the endpoint, we can add annotations such as, "This endpoint accepts replay production traffic," or, "This endpoint applies rate limiting per client," and these features are auto-generated for us.
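As a rough illustration of the style being described, here is a hypothetical Thrift IDL sketch; the struct names, fields, and annotation keys are made up for illustration, not Airbnb's actual schema.

```thrift
// Hypothetical Thrift IDL for a user service endpoint.
struct LoadUsersRequest {
  1: required list<i64> userIds;
}

struct User {
  1: required i64 id;
  2: optional string firstName (personal_data = "true"); // field annotation
}

struct LoadUsersResponse {
  1: required map<i64, User> users;
}

exception UserServiceException {
  1: required string message;
}

service UserService {
  // Endpoint-level annotations can switch on generated features.
  LoadUsersResponse loadUsers(1: LoadUsersRequest request)
      throws (1: UserServiceException error)
      (accepts_replay_traffic = "true", rate_limit_per_client = "1000");
}
```

From a file like this, clients in each supported language, metrics, validation, and the resilience features described below can all be generated rather than hand-written.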

If we look here, we can see that this documentation gets generated for us for free as well. This Service API Explorer is generated from our Thrift files and is updated every time we deploy our service. Helpful information, such as which Slack channel to go to or where the tech design doc lives, is simply defined as annotations in the Thrift file. Other features of this API explorer include comments about the endpoint or particular fields, a really simple way to search for other structs and fields, and a list of the other services that depend on and call this particular service. Again, this is all simply defined within the Thrift file, so it's a really lightweight way for our developers to keep their documentation up to date. And because it gets updated upon every deploy, we know that the documentation accurately reflects the API that's used in production.

We get a lot of benefits from fail-fast mechanisms that are automatically built into our IDL services. Failing fast is better for performance, and we get retries and timeouts that are easily configurable within the IDL. There are also more complex features, such as circuit breaking and back-pressure queues. Back pressure allows us to more proactively shed load off our services. We get rate limiting on a per-client basis, meaning if one client goes particularly rogue, we start throttling its requests so it doesn't impact the other clients of that service.

Another feature we get is that the dependencies of each service are automatically put into separate asynchronous worker thread pools. This technique, known as bulkheads and described in Michael Nygard's book "Release It!", allows a particular dependency to exhaust all of its threads without impacting the other dependencies. Combining this with graceful degradation allows us to mark dependencies as optional and provide smarter default values. This way, our service can return a 200 successful response back to the client even if it wasn't able to get successful responses from all of its dependencies.
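Here's a minimal sketch of those two ideas together, with hypothetical names: each dependency gets its own thread pool (the bulkhead), and an optional dependency that fails or times out degrades to a default value instead of failing the whole request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical handler combining bulkheads with graceful degradation.
public class BulkheadedHandler {
    interface HomesClient { String home(long homeId); }      // required dependency
    interface ReviewsClient { String reviews(long homeId); } // optional dependency
    record Page(String home, String reviews) {}

    // Separate pools: if reviews exhausts its threads, homes is unaffected.
    private final ExecutorService homesPool = Executors.newFixedThreadPool(8);
    private final ExecutorService reviewsPool = Executors.newFixedThreadPool(8);

    private final HomesClient homes;
    private final ReviewsClient reviews;

    BulkheadedHandler(HomesClient homes, ReviewsClient reviews) {
        this.homes = homes;
        this.reviews = reviews;
    }

    Page handle(long homeId) {
        CompletableFuture<String> homeF =
            CompletableFuture.supplyAsync(() -> homes.home(homeId), homesPool);
        CompletableFuture<String> reviewsF =
            CompletableFuture.supplyAsync(() -> reviews.reviews(homeId), reviewsPool)
                .completeOnTimeout("[]", 200, TimeUnit.MILLISECONDS) // fail fast
                .exceptionally(t -> "[]"); // degrade to a default value

        // The response can still be a 200 even if the optional dependency failed.
        return new Page(homeF.join(), reviewsF.join());
    }
}
```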

Testing and deployment have undergone standardization as well. Previously, our dev environment was completely different from production, so it was difficult to be confident when merging and deploying a change. Now, we've introduced a whole tier of services at the local development level. After a change is merged, it goes to a staging tier, a pre-production tier that uses replay traffic and calls out to other staging services. This lets us know if we broke any downstream or upstream dependencies at the staging level, before moving to production.

At the staging level, we also get to use a nifty tool called Diffy, open sourced by Twitter. It compares the responses of staging against production. If we take a little detour into what this looks like: it takes replay traffic as input and sends it to three targets: staging, primary, and secondary. We compare the responses of staging and primary, which gives us the raw response differences. We also pairwise compare the responses from primary and secondary; these two are running the exact same code, so any differences we see here we determine to be non-deterministic noise.

The Diffy tool allows us to filter out that noise, and we're left with the response differences that can be attributed to the new code we just introduced on the staging tier. This has been really helpful for catching regressions, as well as for confirming that the changes we expect to see in the response are actually there. Diffy is not specific to SOA or microservices, but it's much more practical with them. Our services support a much smaller set of endpoints and have a much smaller set of engineers shipping changes to them, so it's easier to map changes to any differences we see in Diffy. Monorail, on the other hand, has thousands of endpoints, and due to the tight coupling, it wouldn't be practical to map differences in Diffy to a particular change in monorail.
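The noise-filtering idea can be sketched in a few lines, with hypothetical names; real Diffy works on HTTP responses with structural field-level diffing, but the subtraction logic below is the core of it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical Diffy-style comparison: diffs between primary and secondary
// (same code) are non-deterministic noise, subtracted from the raw diffs.
public class DiffyStyleComparator {
    interface Target { String respond(String replayRequest); }
    interface Differ { Set<String> diff(String a, String b); } // field-level diffs

    private final Target staging, primary, secondary;
    private final Differ differ;

    DiffyStyleComparator(Target staging, Target primary, Target secondary, Differ differ) {
        this.staging = staging;
        this.primary = primary;
        this.secondary = secondary;
        this.differ = differ;
    }

    Set<String> meaningfulDiffs(String replayRequest) {
        String s = staging.respond(replayRequest);
        String p = primary.respond(replayRequest);
        String q = secondary.respond(replayRequest);

        Set<String> raw = differ.diff(s, p);   // staging vs production code
        Set<String> noise = differ.diff(p, q); // same code: nondeterminism

        Set<String> result = new HashSet<>(raw);
        result.removeAll(noise); // what remains is attributable to the new code
        return result;
    }
}
```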

After our Diffy results look clean, we then deploy to a single instance of production that we call canary. This is the last test to ensure that our code is ready for the full production fleet. After observing our graphs and dashboards, we can then confidently deploy to the rest of production. Here's an example of one of the templated graphs that we get for free from our IDL. It has basic information about the service, such as the request rate, the error rate, and the clients calling it. If we look closely, all we need to do here is define the name of the service; in this demo service, its name is Banana.

Empowering the Migration

It seems like a lot of work, right? It was, and empowering the migration often felt challenging. Salmon have a challenging migration, as they move from saltwater to freshwater and often need to swim upstream. Getting SOA adopted throughout an engineering org in the beginning felt like swimming upstream. We had a small infrastructure team at the beginning of 2016 focused on building that homes data service, essentially bootstrapping our SOA efforts.

However, we found that it was really difficult to get the rest of the engineering teams to build services. This is because we have a product culture of wanting to ship things quickly. Building services in the beginning took weeks, whereas building in monorail was much faster. We knew we needed to invest in making service building a lot faster for other teams to be willing to adopt it. We also needed to provide additional benefits to incentivize people to adopt services. Some benefits we have are improved resiliency, faster deployment, and quicker testing.

However, with the incremental migration and more teams building services, a particular feature could now be spread across multiple services, and those services could be owned by multiple teams. So, building services in parallel was good for unblocking the creation of the SOA network, but it did introduce some overhead complexity when a feature required coordinating between various teams.

We needed to make some changes to our on-call as well. Our product front-end engineers previously worked in monorail, and our infrastructure engineers also worked in monorail, as well as on some other smaller services. We had a small group of engineers who volunteered to be sysops, on call for the entire Airbnb site and all of its tech stacks. However, as we moved to services, this was no longer practical. Instead, we shifted the on-call rotation to be per team.

Each service is owned by one team, and that team has multiple engineers assigned as service owners. A team can own multiple services, and the on-call rotation covers all the engineers on that team supporting all the services the team owns. When we create a service now, with a simple script that enables it to be bootstrapped within an hour, there are some required fields, among them: what is your team's Slack channel? What is your team's email? How can we page your team should the service go down? It's important that each service has a team assigned, so we can auto-route the alerts.

Progress So Far

This has been a lot of work, so how is the progress so far? Humpback whales have the longest migration journey of any mammal. Our migration journey has been long as well, but we're making progress. Our whole engineering team is building services now, and we've seen some initial promising results. We have faster build and deploy times. Previously, monorail deploys were on the order of hours, whereas now, on services, they're on the order of minutes, and we don't get deploy-locked as often, since each service is built and deployed separately. We're seeing fewer reverts, as well as bug fixes shipping more quickly, both of which are beneficial for our user experience.

Through various surveys, we've seen that our developer productivity has gone up, as has developer happiness. So it seems our SOA has been addressing the pain points we experienced in monorail. We also saw improvements in performance, largely due to parallelization giving us lower latency than monorail. Monorail used Ruby, which is more naturally single-threaded, whereas many of our new services are built in Java, which is more naturally multi-threaded.

We have internal libraries that help make multi-threaded coding easier, which resulted in our search page being over three times faster and our home description page over 10 times faster. We now have a critical mass of teams building services, such that we've introduced what we call a monorail freeze. This is happening at the end of this month, and it means that no new features can be added to monorail; only migration changes can touch the monorail code base.

Migration has become more highly valued and been given higher visibility. It's part of our company-wide goals and has the support of both product teams and engineering teams. In 2016, our engineering team was around 500 people, with around two thirds of deploys being in monorail. Now, in 2019, we have over 1,200 engineers, and as of last week, less than 7% of our deploys were in monorail. Forty percent of our traffic has been migrated to use our API gateway, and we have over 400 services built using the IDL framework, with the latest best practices auto-generated for us for free.

If we look at that checkout page again, there are a lot of services: data services, derived data services, a presentation service. But if I were to make a change similar to my new hire fix, it would be only in the presentation service, and I could use our Service API Explorer to easily find out what the messaging data service's API is.

SOA Has Its Challenges

Even though SOA seems to have worked really well for Airbnb, I do like to caution that there are challenges we ran into along the way. Building an SOA means that a single request now fans out to multiple services, and multiple services mean multiple chances of failure. It also means higher latency as a potential drawback of making more remote calls. Separating into different databases is good for isolation, but it makes transactionality and strong consistency more difficult to enforce. When we were migrating, we needed to break all the joins within a particular data set before we could move it to a separate database, which is a non-trivial amount of work.

Service orchestration becomes more complex as well. With every engineer now being a service owner, we needed to onboard them on service operability, and with hundreds of engineers on services, we had many more EC2 instances in use. We're moving towards using Kubernetes to help build out our SOA network, and my colleague Melanie is actually speaking tomorrow afternoon about Kubernetes at Airbnb, so I highly encourage you to attend her talk.

Some takeaways: be prepared for a long commitment. Airbnb is at the beginning of its third year, and we're not quite done. Be sure to decompose incrementally, with comparison frameworks along the way; it's important not to break functionality when migrating to your services. Scaling out services is more easily done with auto-generation of code, via frameworks and tools. Also be prepared to shift your development culture. Migrating to services is not just a technical challenge; it requires the support of the whole organization to ensure a successful migration. So, look both ways before and during your migration. Airbnb is having a positive experience so far, and I look forward to the migration completing. Thank you so much for listening.

Questions & Answers

Participant 1: We work on a Ruby project as well. The biggest problem we had (we used our own web framework for new services and kept away from generating code) was having service stacks, maintaining them and all their dependencies, and being able to generate a new service with a CLI or something like that. What was your take on that?

Tai: So the question is around having all dependencies and generating new services from a script?

Participant 1: Yes. And being an architect and having an overview of what's happening.

Tai: Yes. A lot of it begins by having good observability, to first know what your dependencies are without needing to look at the code. We have automated ways and distributed tracing that allow us to see where requests fan out to, so we know which services are being called from a particular service. Then, if we wanted to deprecate a service, we can see the QPS going to it and the callers, and then contact those callers and ask, do you need this service? If no one owns a service, do you really need it? We have sunsetted some of our services, which is an important part of the work, but we don't do that automatically within our script.

Participant 2: Thanks for the talk. I was just wondering, how are you managing the contracts between your services? For example, if you're changing one of the APIs, how are you ensuring the other teams managing other services keep to that contract? Are you using Pact, or are you doing it through automated testing?

Tai: So the question was, how do we manage the contract between services? Our IDL services have a feature called an API deadline. You can specify your SLO or SLA for how long a particular endpoint should take, and if it exceeds that time, it will automatically fail. We then track these metrics with a dashboard that shows which services are meeting their SLAs and which are not. Another way to ensure that a service is handling the QPS expected of it is that we have different ways of autoscaling with Kubernetes. But usually it's the latency that defines the contract for one service calling another.

Participant 3: Thank you for the presentation. I have a question about how you encourage your product teams not to build on top of a deep chain of microservices, where calls sometimes go horizontally across your microservices, and not to build an RPC on top of one microservice that just reuses the fields from a previous microservice. How do you keep the structure of the tiers flat?

Tai: The question is about how we ensure the call chain doesn't get super deep, and how fields get reused. Defining those service types helps each service know what its responsibility is. And we do have core, large data services (a homes data service, a reservation data service, a user data service) where a lot of services are interested in that particular data. We don't have an automated way right now to detect that one service is using the same information as another service. But it's in the service owner's interest to reduce the latency of their service, so if they're able to make fewer calls to their dependencies, it lowers their service's latency. We have a cool graph that shows the call chain and how it fans out, so you can easily see if there's a cycle or something that's taking way too long. But a lot of it is still manual right now; there's work to be done on that.

Participant 3: So, as I understand it, it's kind of natural growth, in that teams don't want to have a huge number of dependencies on their microservice, yes?

Tai: Natural. Yes, that can apply.

Participant 4: Thanks a lot for your talk. I have a question more related to your organizational structure. You had to make a few architectural decisions along the way, like using Diffy, doing comparisons, generating code. How do you actually make these decisions at scale? You are now at 1,200 developers; if you were to centralize these decisions, I would guess that would create some bottlenecks, right? So how do you do that at scale?

Tai: That's a good question. In the beginning, the larger data services were the ones that often piloted these new tools, and it was done in a more manual way. Then, as those tools proved more valuable, the service framework team would automatically enable them at service creation, so you don't need to connect everything manually yourself; you just say, "Yes, I'd like replay traffic." Almost all of our services on the IDL use Diffy, and we've built an internal tool to help scale out the replay traffic instead of using the open source solution. In terms of other architectural patterns, it usually starts with a smaller prototype that's really fundamental to our SOA network and then propagates out. But it does introduce challenges to have many different services in partial migration states, so we're trying to align on a single set of best standards that you need to adhere to, versus trying out different types of architecture.

Participant 4: Who's responsible in the organization for the alignment of all these things?

Tai: We have a group called our Infrastructure Working Group that meets twice a month and reviews all the services being built. So, if you're building a new service, you send a proposal, and the group, which a lot of senior engineers attend, makes sure that what you're doing makes sense within the current architecture, that you're not duplicating something that's already been built, and that you're using the latest best practices.

Participant 5: Thanks for the talk. When you want to run stuff in the local development environment, do you have tooling that enables you to spin up the systems that you need, without necessarily knowing how they're implemented and what dependencies they need, to test your own service?

Tai: So, the question is, in the dev environment, do we have a way to load up services that are dependencies without needing to know how they're implemented? Yes. We have shared development services that each service owner is responsible for maintaining. So, if I'm on the user service, I have a shared development user service and then my production user service. That shared development service is kind of like a black box: anyone can call it, but they don't need to know its internal workings. That code doesn't need to be local on your machine; you just need to be able to hit that particular API. And then, if you want, you can enable that service to be developed locally, if you want to make some change to it. But by default, the shared development services are black boxes whose APIs you just hit.

Participant 6: Thanks for the talk. Did the decision to move from monolith to microservices come naturally to the engineering team as well as management, or was there pushback, like, "Oh no, we shouldn't do this"?

Tai: The question was, did the migration to services come naturally, or was there pushback? There was pushback; notably, our VP of engineering came from Facebook, and they've done pretty well scaling out their more monolithic-type structure. But given the statistics of how long it took to deploy a single change, it came to a breaking point. We were no longer being productive, and the site was not as stable. So, we needed to change something, and we decided to pilot this. The initial results have been really promising, so we were able to get more of the org to move over. But there was pushback at first, since it's a big infrastructure investment and requires the whole company to move.

Participant 7: That was very interesting, thanks. We had a question a minute ago about dev-level services for testing, and you also talked about how you have isolated data stores now for separate concerns. How do you manage or choose what data goes into your different development-level databases? How do you make sure, in cases where database A might need to reference something, that database C has an equivalent row in it? How may I explain this? Can you see what I'm trying to say? How do you test that something arrived in the other service? Or at least, how do you make sure that, if you need information in databases A, B, and C to test a service, it's all orchestrated correctly?

Tai: The question was, how do we ensure that local development has all the databases set up to test end-to-end?

Participant 7: Yes. Especially where you have dependencies between the various databases.

Tai: Especially for dependencies between various databases. For our shared dev environment, we require the engineers to populate the databases. So, if they know they need some particular flow that requires a reservation, a home's availability, and a user, then they're required to populate that. We've also created some IDL factory tools that allow us to more easily specify, say, one user with a given status, or a reservation with a given status, and it will automatically make all those database entries for you, because we recognize that it's difficult when you need multiple sets of data. Our local development is separate from our staging database, which is a snapshot of our production data where we can access the sensitive information. But because local development isn't a copy of prod, those dependencies aren't there.

 


Recorded at:

Apr 23, 2019

