Key Takeaways
- Stitch Fix, a clothing retailer, employs nearly as many data scientists as engineers. The data scientists work on algorithms critical to the company's success, and require a substantial amount of data to succeed
- Although microservices may be necessary for achieving a highly scalable solution, do not start with the complexity of a highly distributed system until the company is successful enough that microservices become justified and necessary
- All major companies that are now using microservices, including eBay, Twitter, and Amazon.com, have gone through a migration that started with a monolithic system
- A true microservices platform requires each microservice to be responsible for its own data. Creating separate data stores can be the most challenging aspect of a microservices migration
- Separating out a monolithic database involves a repeatable process of isolating each service's data and preventing other services from accessing it directly
Adapted from a presentation at QCon San Francisco 2017, by Randy Shoup, VP of engineering at Stitch Fix
I'm Randy Shoup, VP of engineering at Stitch Fix, and my background informs the following lessons about managing data in microservices.
Stitch Fix is a clothing retailer in the United States, and we use technology and data science to help customers find the clothing they like. Before Stitch Fix, I was a roving "CTO as a service", helping companies work through technology decisions in situations much like these.
Earlier in my career, I was director of engineering at Google for Google App Engine. That is Google's platform as a service, like Heroku, or Cloud Foundry, or something like that. Earlier, I was chief engineer for about six-and-a-half years at eBay, where I helped our teams build multiple generations of search infrastructure. If you have ever gone to eBay and found something that you liked then, great, my team did a good job. And if you didn't find it, well, you know where to put the blame.
Let me start with a little bit about Stitch Fix, because that informs the lessons and the techniques of our breaking monoliths into microservices. Stitch Fix is the reverse of the standard clothing retailer. Rather than shop online or go to a store yourself, what if you had an expert do it for you?
We ask you to fill out a really detailed style profile about yourself, consisting of 60 to 70 questions, which might take you 20 to 30 minutes. We ask your size, height, and weight, what styles you like, whether you want to flaunt your arms or hide your hips; we ask very detailed and personal things. Why? Anybody in your life who knows how to choose clothes for you has to know these things about you, so we ask them explicitly and use data science to make it happen. As a client, you receive five items, delivered to your doorstep and hand-picked for you by one of 3,500 stylists around the country. You keep the things that you like, pay us for those, and return the rest for free.
A couple of things go on behind the scenes, among both humans and machines. On the machine side, we look every night at every piece of inventory, reference that against every one of our clients, and compute a predicted probability of purchase: what is the conditional probability that Randy will keep this shirt if we send it to him? Imagine that there's a 72 percent chance that Randy will keep this shirt, a 54 percent chance for these pants, and a 47 percent chance for the shoes; for each of you in the room, the percentages will be different. We have machine-learned models that we layer in an ensemble to compute those percentages, which compose a set of personalized algorithmic recommendations for each customer that go to the stylists.
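To make the idea of layering models in an ensemble concrete, here is a minimal, hypothetical sketch of blending per-model keep probabilities into a single score. The model names and weights are illustrative assumptions, not Stitch Fix's actual algorithm.

```python
# Hypothetical sketch only: a weighted-average ensemble of per-model scores.
# The model names and weights below are illustrative, not Stitch Fix's actual models.
def ensemble_probability(model_scores: dict[str, float],
                         model_weights: dict[str, float]) -> float:
    """Blend per-model keep probabilities into a single predicted probability."""
    total_weight = sum(model_weights.values())
    return sum(model_weights[name] * p for name, p in model_scores.items()) / total_weight

# "What is the chance Randy keeps this shirt?" according to three hypothetical models:
print(ensemble_probability({"style_match": 0.80, "fit": 0.65, "recent_feedback": 0.70},
                           {"style_match": 0.5, "fit": 0.3, "recent_feedback": 0.2}))
```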
As the stylist is essentially shopping for you, choosing those five items on your behalf, he or she is looking at those algorithmic recommendations and figuring out what to put in the box.
We need the humans to put together an outfit, which the machines are currently not able to do. Sometimes, the human will answer a request such as "I'm going to Manhattan for an evening wedding, so send me something appropriate." The machine doesn't know what to do with that, but the humans know things that the machines don't.
All of this requires a ton of data. Interestingly and, I believe, uniquely, Stitch Fix has a one-to-one ratio between data science and engineering. We have more than a hundred software engineers in the team that I work on and roughly 80 data scientists and algorithm developers that are doing all the data science. To my knowledge, this is a unique ratio in the industry. I don't know any other company on the planet that has this kind of near one-to-one ratio.
What do we do with all of those data scientists? It turns out that, if you are smart about how you use them, it pays off.
We apply the same techniques to deciding what clothes we're going to buy. We make algorithmic recommendations to the buyers, and they figure out that, okay, next season we're going to buy more white denim, cold shoulders are out, or Capri pants are in.
We use data analysis for inventory management: what do we keep in what warehouses and so on. We use it to optimize logistics and selection of shipping carriers so that the goods arrive on your doorstep on the date you want, at minimal cost to us. And we do some standard things, like demand prediction.
We are a physical business: we physically buy the clothes, put them in warehouses, and ship them to you. Unlike eBay and Google and a bunch of virtual businesses, if we guess wrong about demand, if demand is double what we expect, that is not a wonderful thing that we celebrate. That's a disaster because it means that we can only serve half of the people well. If we have double the number of clients, we should have double the number of warehouses, stylists, employees, and that kind of stuff. It is very important for us to get these things right.
Again, the general model here is that we use humans for what the humans do best and machines for what the machines do best.
When you design a system at this scale, as I hope you do, you have a bunch of goals. You want to make sure that the development teams can continue to move forward independently and at a quick pace; that's what I call "feature velocity". We want scalability: as the business grows, the infrastructure should grow with it. We want the components to scale to load, to scale to the demands that we put on them. We also want those components to be resilient, so that failures are isolated and do not cascade through the infrastructure.
High-performing organizations with these kinds of requirements share some common practices. The DevOps Handbook features research from Gene Kim, Nicole Forsgren, and others into the differences between high-performing organizations and lower-performing ones. Higher-performing organizations both move faster and are more stable. You don't have to choose between speed and stability; you can have both.
The higher-performing organizations are doing multiple deploys a day, versus maybe one per month, and have a latency of less than an hour between committing code to source control and deploying it, while in other organizations that might take a week. That's the speed side.
On the stability side, high-performing organizations recover from failure in an hour, versus maybe a day in a lower-performing organization. And the rate of failures is lower. The frequency of a high-performing organization deploying, having it not go well, and having to roll back the deployment approaches zero, but slower organizations might have to do this half the time. This is a big difference.
It is not just the speed and the stability. It is not just the technical metrics. The higher-performing organizations are two-and-a-half times more likely to exceed business goals like profitability, market share, and productivity. So this stuff doesn't just matter to engineers, it matters to business people.
Evolving to microservices
One of the things that I got asked a lot when I was doing my roving CTO-as-a-service gig was "Hey, Randy, you worked at Google and eBay — tell us how you did it."
I would answer, "I promise to tell you, and you have to promise not to do those things, yet." I said that not because I wanted to hold onto the secrets of Google and eBay, but because a 15,000-person engineering team like Google’s has a different set of problems than five people in a startup that sit around a conference table. That is three orders of magnitude different, and there will be different solutions at different scales for different companies.
That said, I love to tell how the companies we have heard of have evolved to microservices — not started with microservices, but evolved there over time.
eBay
eBay is now on its fifth complete rewrite of its infrastructure. It started out as a monolithic Perl application in 1995, when the founder wanted to play with this thing called the Web and so spent the three-day Labor Day weekend building the thing that ultimately became eBay.
The next generation was a monolithic C++ application that, at its worst, was 3.4 million lines of code in a single DLL. They were hitting compiler limits on the number of methods per class, which is 16,000. I'm sure many people think that they have a monolith, but few have one worse than that.
The third generation was a rewrite in Java — but we cannot call that microservices; it was mini-applications. They turned the site into 220 different Java applications. One was for the search part, one for the buying part… 220 applications. The current instance of eBay is fairly characterized as a polyglot set of microservices.
Twitter
Twitter has gone through a similar evolution, and is on roughly its third generation. It started as a Rails application, nicknamed the Monorail. The second generation pulled the front end out into JavaScript and the back end into services written in Scala, because Twitter was an early adopter of Scala. We can currently characterize Twitter as a polyglot set of microservices.
Amazon.com
Amazon.com has gone through a similar evolution, although not as clean in the generations. It began as a monolithic C++ and Perl application, of which we can still see evidence in product pages. The "obidos" we sometimes see in an Amazon.com URL was the code name of the original Amazon.com application. Obidos is a place in Brazil, on the Amazon, which is why it was named that way.
Amazon.com rewrote everything from 2000 to 2005 in a service-oriented architecture. The services were mostly written in Java and Scala. During this period, Amazon.com was not doing particularly well as a business. But Jeff Bezos kept the faith and forced (or strongly encouraged) everyone in the company to rebuild everything in a service-oriented architecture. And now it's fair to categorize Amazon.com as a polyglot set of microservices.
These stories all follow a common pattern. No one starts with microservices. But, past a certain scale (a scale that maybe only 0.1 percent of us will ever reach), everybody ends up with something we can call microservices.
I like to say that if you don't end up regretting your early technology decisions, you probably over-engineered.
Why do I say that?
Imagine an eBay or Amazon.com competitor in 1995. This company, instead of finding product-market fit, a business model, and things that people will pay for, has built the distributed system it is going to need in five years. There is a reason we have not heard of that company.
Again, think about where you are in your business, where you are in your team size. The solutions for Amazon.com, Google, and Netflix are not necessarily the solutions for you when you are a small startup.
Microservices
I like to define the micro in microservices as not about the number of lines of code but about the scope of the interface.
A microservice has a single purpose and a simple, well-defined interface, and it is modular and independent. The critical thing to focus on, and to explore the implications of, is that effective microservices, as Amazon.com found, have isolated persistence. In other words, a microservice should not share data with other services.
For a microservice to reliably execute business logic and to guarantee its invariants, we cannot have other people reading and writing its data behind its back. eBay discovered this the hard way. eBay spent a lot of effort with some very smart people to build a service layer in 2008, but it was not successful. Although the services were extremely well built and the interfaces were quite good and orthogonal (they spent a lot of time thinking about them), underneath them was a sea of shared databases that were also directly available to the applications. Nobody had to use the service layer in order to do their job, so they didn't.
At Stitch Fix, we are on our own journey. We did not build a monolithic application, but our version of the monolith problem is the monolithic database we built.
We are breaking up our monolithic database and extracting services from it, but there are some great things about the monolithic database that we would like to retain.
Figure 1 shows a simplified view of our situation. We have way more than this number of apps, but there are only so many things that fit in one image.
Figure 1: Stitch Fix's Monolithic, shared database.
We essentially have a shared database that includes everything that is interesting about Stitch Fix. This includes clients, the boxes that we ship, the items that we put into the boxes, metadata about the inventory like styles and SKUs, information about the warehouses, and so on, totaling about 175 different tables. We have on the order of 70 or 80 different applications and services that use the same database for their work. That is the problem. That shared database is a coupling point for the teams, causing them to be interdependent as opposed to independent. It is a single point of failure and a performance bottleneck.
Our plan is to decouple applications and services from the shared database. There is a lot of work here.
Figure 2: Breaking up the shared database.
Figure 2 shows the steps taken to break up the shared database. Image A is the starting point. The real diagram would be too full of boxes and lines, so let's imagine that there are only three tables and two applications. The first thing that we're going to do is build a service that represents, in this example, client information (B). This will be one of the microservices, with a well-defined interface. We negotiated that interface with the consumers of that service before we created the service.
Next, we point the applications to read from the service instead of using the shared database to read from the table (C). The hardest part is moving the lines. I do not mean to trivialize, but an image simply cannot show how hard it is to do that. After we do that, callers no longer connect directly to the database but will instead go through the service. Then we move the table from the shared database and put it in an isolated private database that is only associated with the microservice (D). There's a lot of hard work involved, and this is the pattern.
The next task is to do the same thing for item information. We create an item service, and have the applications use the service instead of the table (E). Then we extract the table and make it a private database of the service. We then do the same thing for SKUs or styles, and we keep rinsing and repeating (F). By the end, the boundary of each microservice surrounds both its application box and its database, such as the client service paired with its "core client" database (F).
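As a rough illustration of that pattern, here is a minimal sketch of what a caller's data access might look like before and after the extraction. The service name, endpoint, and schema are hypothetical, not Stitch Fix's actual code.

```python
# Hypothetical sketch of the extraction pattern; names, endpoints, and schema are illustrative.
import requests

# Before: an application reads the clients table directly from the shared database.
def get_client_before(db_conn, client_id: int) -> dict:
    row = db_conn.execute(
        "SELECT id, name, address FROM clients WHERE id = ?", (client_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1], "address": row[2]}

# After: the application goes through the client service's well-defined interface;
# the table now lives in the service's own private "core client" database.
def get_client_after(client_id: int) -> dict:
    response = requests.get(f"https://client-service.internal/clients/{client_id}")
    response.raise_for_status()
    return response.json()
```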
We have divided the monolithic database with everything in there so that each microservice has its own persistence. But there are a lot of things that we like about the monolithic database, and I don't want to give them up. These include easily sharing data between different services and applications, being able to easily do joins across different tables, and transactions. I want to be able to do operations that span multiple entities as a single atomic unit. These are all common aspects of monolithic databases.
Events
There are various database features that we can and cannot keep through the next part of the migration, but there are workarounds for those we can't have. Before going into that, I need to point out an architectural building block that perhaps you know about but don't appreciate as much as you should — namely, events. Wikipedia defines an event as a significant change in state or a statement that something interesting has occurred.
In a traditional three-tier system, there's the presentation tier that the users or clients use, the application tier that represents stateless business logic, and the persistence tier that is backed by a relational database. But, as architects, we are missing a fundamental building block that represents a state change, and that is what I will call an event. Because events are typically asynchronous, maybe I will produce an event to which nobody is yet listening, maybe only one other consumer within the system is listening to it, or maybe many consumers are going to subscribe to it.
Having promoted events to a first-class construct in our architecture, we will now apply events to microservices.
A microservice's interface includes the front door, right? It obviously includes the synchronous request and response. This is typically HTTP, maybe JSON, maybe gRPC or something like that, but it clearly includes that access point. What is less obvious, and I hope I can convince you that this is true, is that it includes all of the events that the service produces, all of the events that the service consumes, and any other way to get data into and out of that service. Doing bulk reads out of the service for analytic purposes or bulk writes into the service for uploads are all part of the interface of the service. Simply put, I assert that the interface of a service includes any mechanism that gets data into or out of it.
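As an illustration of treating events as first-class parts of a service's interface, here is a minimal, hypothetical sketch of a service defining and publishing a domain event. The event fields and the event-bus object are assumptions for illustration; any message-broker client could stand in for it.

```python
# Hypothetical sketch: a domain event as a first-class construct in the architecture.
# The event fields and the `event_bus` object are illustrative assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class ClientAddressChanged:
    client_id: int
    new_address: str
    occurred_at: str  # ISO-8601 timestamp of the state change

def publish_address_changed(event_bus, event: ClientAddressChanged) -> None:
    # The producer emits the event asynchronously; zero, one, or many consumers
    # may be subscribed, and the producer does not need to know which.
    event_bus.publish(topic="client-events", payload=json.dumps(asdict(event)))
```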
Now that we have events in our toolbox, we will start to use events as a tool in solving those problems of shared data, of joins, and of transactions. That brings us to the problem of shared data. In a monolithic database, it is easy to leverage shared data. We point the applications at this shared table and we are all good. But where does shared data go in a microservices world?
Well, we have a couple of different options — but I will first give you a tool or a phrase to use when you discuss this. The principle, or that phrase, is “single system of record”. If there's data for a customer, an item, or a box that is of interest in your system, there should be one and only one service that is the canonical system of record for that data. There should be only one place in the system where that service owns the customer, owns the item, or owns the box. There are going to be many representations of customer/item/etc. around (there certainly are at Stitch Fix), but every other copy in the system must be a read-only, non-authoritative cache of that system of record.
Let that sink in: read only and non-authoritative. Don't modify the customer record anywhere and expect it to stick around in some other system. If we want to modify that data, we need to go to the system of record, which is the only place that can currently tell us, to the millisecond, what the customer is doing.
That's the idea of the system of record, and there are a couple of different techniques to use in this approach to sharing data. The first is the most obvious and most simple: synchronously look it up from that system of record.
Consider a fulfillment service at Stitch Fix. We are going to ship a thing to a customer’s physical address. There's a customer service that owns the customer data, one piece of which is the customer's address. One solution is for the fulfillment service to call the customer service and look up the address. There’s nothing wrong with this approach; this is a perfectly legitimate way to do it. But sometimes this isn't right. Maybe we do not want everything to be coupled on the customer service. Maybe the fulfillment service, or its equivalent, is pounding the customer service so often that it impedes performance.
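A minimal sketch of that synchronous lookup, assuming a hypothetical HTTP endpoint on the customer service, might look like this.

```python
# Hypothetical sketch of the synchronous-lookup approach; the URL is illustrative.
import requests

def ship_box(order_id: int, customer_id: int) -> None:
    # Ask the system of record for the address at the moment we need it.
    resp = requests.get(f"https://customer-service.internal/customers/{customer_id}")
    resp.raise_for_status()
    address = resp.json()["address"]
    print(f"Shipping order {order_id} to {address}")
```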
Another solution involves the combination of an asynchronous event with a local cache. The customer service is still going to own that representation of the customer, but when the customer data changes (the customer address, say), the customer service is going to send an update event. When the address changes, the fulfillment service will listen to that event and locally cache the address, then the fulfillment center will send the box on its merry way.
The caching within the fulfillment service has other nice properties. If the customer service does not retain a history of address changes, we can remember that in the fulfillment service. This happens at scale: customers may change addresses between the time that they start an order and the time that we ship it. We want to make sure that we send it to the right place.
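A minimal sketch of the event-plus-local-cache approach, with the cache modeled as an in-memory dict for illustration (a real fulfillment service would persist it), might look like this.

```python
# Hypothetical sketch of the asynchronous-event-plus-local-cache approach.
# The in-memory dict stands in for the fulfillment service's own durable store.
address_cache: dict[int, str] = {}

def on_customer_address_changed(event: dict) -> None:
    # Consume the customer service's update event and cache the new address locally.
    # This copy is read-only and non-authoritative; the customer service remains
    # the system of record.
    address_cache[event["customer_id"]] = event["new_address"]

def ship_box(order_id: int, customer_id: int) -> None:
    # Read from the local cache instead of calling the customer service at ship time.
    address = address_cache[customer_id]
    print(f"Shipping order {order_id} to {address}")
```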
Joins
It is really easy to join tables in a monolithic database. We simply add another table to the FROM clause in a SQL statement and we’re all good. This works great when everything sits in one big, monolithic database, but it does not work in a SQL statement if A and B are two separate services. Once we split the data across microservices, the joins, conceptually, are a lot harder to do.
We always have architecture choices, and there is more than one way to handle joins. The first option is to join in the client: have whatever is interested in A and B do the join. In this particular example, let's imagine that we are producing an order history. When a customer comes to Stitch Fix to see the history of the boxes that we've sent them, we might provide that page in this way. The order-history page calls the customer service to get the current version of the customer's information: maybe her name, her address, and how many things we have sent her. Then it goes to the order service to get details about all of her orders. It gets a single customer from the customer service, then queries the order service for the orders that match that customer.
This is a pattern used on basically every webpage that does not get all of its data from one service. Again, this is a totally legitimate solution to this problem. We use it all the time at Stitch Fix, and I'm sure you use it all over the place in your applications as well.
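Here is a minimal sketch of joining in the client, with hypothetical service URLs standing in for the customer and order services.

```python
# Hypothetical sketch of joining in the client; the service URLs are illustrative.
import requests

def order_history(customer_id: int) -> dict:
    # One call to the system of record for the customer...
    customer = requests.get(
        f"https://customer-service.internal/customers/{customer_id}").json()
    # ...and one call for all of that customer's orders.
    orders = requests.get(
        f"https://order-service.internal/orders?customer_id={customer_id}").json()
    # The "join" happens here in the caller, not in a database.
    return {"customer": customer, "orders": orders}
```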
But let's imagine that this doesn't work, whether for reasons of performance or reliability or maybe we’re querying the order service too much.
For approach number two, we create a service that does what I like to call, in database terminology, “materializing the view”. Imagine we are trying to produce an item-feedback service. At Stitch Fix, we send boxes out, and people keep some of the things that we send and return some. We want to know why, and we want to remember which things are returned and which are kept. This is something that we want to remember using an item-feedback service. Maybe we have 1,000 or 10,000 units of a particular shirt and we want to remember all customer feedback about that shirt every time we sent it. Multiply that effort by the tens of thousands of pieces of inventory that we might have.
To do this, we are going to have an item service, which is going to represent the metadata about this shirt. The item-feedback service is going to listen to events from the item service, such as new items, items that are gone, and changes to the metadata if that is interesting. It will also listen to events from the order service. Every piece of feedback about an order should generate an event — or, since we send five items in a box, possibly five events. The item-feedback service is listening to those events and then materializing the join. In other words, it's remembering all the feedback that we get for every item in one cached place. A fancier way to say that is that it maintains a denormalized join of items and orders together in its own local storage.
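A minimal sketch of that materialized join, with the item-feedback service's local storage modeled as in-memory dicts and the event shapes assumed for illustration, might look like this.

```python
# Hypothetical sketch of materializing the view: the item-feedback service listens to
# item and order-feedback events and maintains a denormalized join in its own storage.
from collections import defaultdict

item_metadata: dict[str, dict] = {}                      # item_id -> item metadata
feedback_by_item: dict[str, list] = defaultdict(list)    # item_id -> feedback records

def on_item_event(event: dict) -> None:
    # Keep a local copy of the item metadata we care about.
    item_metadata[event["item_id"]] = event["metadata"]

def on_order_feedback_event(event: dict) -> None:
    # A box of five items can produce up to five of these events.
    feedback_by_item[event["item_id"]].append(
        {"order_id": event["order_id"], "kept": event["kept"], "reason": event["reason"]})

def item_feedback(item_id: str) -> dict:
    # Serve the pre-joined view without querying the item or order services at read time.
    return {"item": item_metadata.get(item_id), "feedback": feedback_by_item[item_id]}
```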
Many common systems do this all the time, and we don't even think that they are doing it. For example, any sort of enterprise-grade (i.e., we pay for it) database system has a concept of a materialized view. Oracle has it, SQL Server has it, and a bunch of enterprise-class databases have a concept of materializing a view.
Most NoSQL systems work in this way. The Dynamo-inspired data stores, like Amazon's DynamoDB, Cassandra, Riak, and Voldemort, all of which come from a NoSQL tradition, force us to do it up front. Relational databases are optimized for easy writes: we write to individual records or to individual tables, and on the read side we put it all together. Most NoSQL systems are the other way around: the tables that we store are already the queries that we wanted to ask. Instead of writing to an individual sub-table at write time, we write, say, five times, once to each of the stored queries that we want to read from. Every NoSQL system forces us to do this sort of materialized join up front.
Every search engine that we use almost certainly is doing some form of joining one particular entity with another particular entity. Every analytical system on the planet is joining lots of different pieces of data, because that is what analytical systems are about.
I hope this technique now sounds a little bit more familiar.
Transactions
The wonderful thing about relational databases is this concept of a transaction. In a relational database, a single transaction embodies the ACID properties: it is atomic, consistent, isolated, and durable. We can do that in a monolithic database. That's one of the wonderful things about having THE database in our system. It is easy to have a transaction cross multiple entities. In our SQL statement, we begin the transaction, do our inserts and updates, then commit, and either it all happens or none of it happens.
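As a reminder of what that looks like, here is a minimal sketch of a multi-entity transaction in a single database, using sqlite3 purely for illustration.

```python
# A minimal sketch of an ACID transaction spanning two entities in one database.
# sqlite3 is used purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])

try:
    with conn:  # BEGIN ... COMMIT, or automatic ROLLBACK if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE name = 'b'")
except sqlite3.Error:
    pass  # neither update is applied if either one fails

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
```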
Splitting data across services makes transactions hard. I will even replace "hard" with "impossible". How do I know it's impossible? There are techniques known in the database community for doing distributed transactions, like two-phase commit, but nobody uses them in practice. As evidence of that, consider that no cloud service on the planet implements a distributed transaction. Why? Because it is a scalability killer.
So, we can't have a transaction — but here is what we can do. We turn a transaction where we want to update A, B, and C, all together as a unit or not at all, into a saga. To create a saga, we model the transaction as a state machine of individual atomic events. Figure 3 may help clarify this. We re-implement that idea of updating A, updating B, and updating C as a workflow. Updating the A side produces an event that is consumed by the B service. The B service does its thing and produces an event that is consumed by the C service. At the end of all of this, at the end of the state machine, we are in a terminal state where A and B and C are all updated.
Figure 3: Workflows and sagas.
Now, let's imagine something goes wrong. We roll back by applying compensating operations in the reverse order. We undo the things we were doing in C, which produces one or several events, and then we undo the set of things that we did in the B service, which produces one or several events, and then we undo the things that we did in A. This is the core concept of sagas, and there's a lot of great detail behind it. If you want to know more about sagas, I highly recommend Chris Richardson's QCon presentation, Data Consistency in Microservices Using Sagas.
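Here is a minimal, hypothetical sketch of that saga orchestration logic, collapsed into a single process for readability. In a real system each step would run in its own service, driven by the previous step's event, and each compensation would itself be an atomic local action.

```python
# Hypothetical sketch of a saga: a state machine of steps, each paired with a
# compensating action that undoes it. On failure, completed steps are undone in
# reverse order. In practice each step runs in its own service, driven by events.
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)

def run_saga(steps: list[Step]) -> bool:
    completed: list[Step] = []
    for action, compensate in steps:
        try:
            action()                      # e.g. update A, emit an event consumed by B
            completed.append((action, compensate))
        except Exception:
            for _, undo in reversed(completed):
                undo()                    # apply compensations in reverse order
            return False                  # saga ended in a rolled-back state
    return True                           # terminal state: A, B, and C all updated
```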
As with materializing the view, many systems that we use every day work in exactly this way. Consider a payment-processing system. If you want to pay me with a credit card, I would like to see the money get sucked out of your account and magically end up in my wallet in one unit of work. But that is not what actually happens. There are tons of things behind the scenes that involve payment processors and talking to the different banks and all of this financial magic.
In the canonical example of when we would use transactions, we would debit something from Justin's account and add it to Randy's account. No financial system on the planet actually works like that. Instead, every financial system implements it as a workflow. First, money gets taken out of Justin's account, and it lives in the bank for several days. It lives in the bank longer than I would like, but it ultimately does end up in my account.
As another example, consider expense approvals. Probably everybody has to get expenses approved after a conference. And that does not happen immediately. You submit your expenses to your manager, and she approves it, and it goes to her boss, and she approves it... all the way up. And then your reimbursement follows a payment-processing workflow, where ultimately the money goes into your account or pocket. You would prefer this to be a single unit, but it actually happens as a workflow. Any multi-step workflow is like this.
If you write code for a living, consider as a final example what would happen if your code were deployed to production as soon as you hit return on your IDE. Nobody does that. That is not an atomic transaction, nor should it be. In a continuous-delivery pipeline, when I say commit, it does a bunch of stuff, the end result of which is, hopefully, deployed to production. That's what the high-performing organizations are doing. But it does not happen atomically. Again, it's a state machine: this step happens, then this happens, then this happens, and if something goes wrong along the way, we back it out. This should sound familiar. Stuff we use every day behaves like this, which means there is nothing wrong with using this technique in the services we build.
To wrap up, we have explored how to use events as tools in our architectural toolbox. We've shown how we can use events to share data between different components in our system. We have figured out how to use events to help us implement joins. And we have figured out how to use events to help us do transactions.
About the Authors
Randy Shoup is a 25-year veteran of Silicon Valley, and has worked as a senior technology leader and executive at companies ranging from small startups, to mid-sized places, to eBay and Google. He is currently VP Engineering at Stitch Fix in San Francisco. He is particularly passionate about the nexus of culture, technology, and organization.
Thomas Betts is a Principal Software Engineer at IHS Markit, with two decades of professional software development experience. His focus has always been on providing software solutions that delight his customers. He has worked in a variety of industries, including retail, finance, health care, defense and travel. Thomas lives in Denver with his wife and son, and they love hiking and otherwise exploring beautiful Colorado.