Building and Running Applications at Scale in Zalando


Summary

Pamela Canchanya shares practices and lessons learned when building and running critical business applications at scale.

Bio

Pamela Canchanya is a software engineer based in Berlin, working at Zalando, a fashion e-commerce company with operations across Europe. Before that, she was a developer at ThoughtWorks Brasil.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Canchanya: I'm Pamela [Canchanya], I'm a software engineer. In this presentation, I'm going to talk about my lessons learned and experiences building and running applications at scale at Zalando. I'm particularly going to talk about the online fashion store and the area where I work, the checkout. Zalando is the biggest online fashion retailer in Europe. Our key difference from other competitors is that, in most countries, we offer free delivery, 100-day returns, and also, conveniently, free returns.

At Zalando, we take care of all the operations, from managing suppliers, to the customer applications, to everything behind the warehouses where the packaging happens, and also the delivery of the orders to our customers. Since we take care of all these operations, we're operating at a really big scale. Zalando is present in 17 countries in Europe, we have more than 250 million visits per month, and we have more than 26 million active customers. We are more than 15,000 employees. Last year, we had around €5.4 billion in revenue, so this is really big. The event where we operate at an even higher scale is Black Friday.

In our last Black Friday, we broke all the records of previous years, with around 2 million orders. In the peak hour, we reached more than 4,200 orders per minute. This was really huge, and it was a great challenge. Of course, this couldn't be done without the tech behind the scenes. We have come a long way: we migrated from a monolith to microservices around 2015. Nowadays, in 2019, we have more than 1,000 microservices. Our tech organization is composed of more than 1,000 developers in more than 200 teams. Every team is organized strategically to cover a customer journey and a part of the business. Every team can have members with multidisciplinary skills, like frontend, backend, data science, UX research, product, whatever the team needs.

Since we have all of these things, we also have end-to-end responsibility for the services that every team manages. When we migrated from the monolith to microservices, it also meant empowerment for every team to take care of its own software development. We also found out that it's not easy when every team does things its own way, so we ended up with standard processes for how we develop software. This was enabled by the tools that our developer productivity team provides. Every team can easily start a new project, set it up, start coding, build it, test it, deploy it, monitor it, and so on, across the whole software development cycle.

The Checkout Landscape and Architecture

I particularly work in the checkout. Our goal is to allow customers to buy seamlessly and conveniently. For those who are not familiar with it, the checkout is the last experience of the customer in the shopping journey. This is the step where the customer decides how they want the order delivered, where to deliver it, how they want to pay, and finally places the order. In our team, we take care of many microservices. We have different flavors: microservices in Java, Scala, and Node.js. Our main communication is through REST, and we also communicate with some dependencies through messaging. For our stateful applications, we use Cassandra as data storage. For our configurations, we use etcd.

All our microservices run in AWS and Kubernetes. When we migrated from the monolith to microservices, we also migrated to the cloud. We started to use AWS, with EC2 instances and CloudFormation. For the last two years, we have also been working with Kubernetes. My team in particular is in the middle of this migration, maintaining services in both AWS and Kubernetes. All our microservices, not only checkout's but Zalando's microservices in general, run in containers, so every microservice environment is abstracted from the underlying infrastructure.

For our customer-facing applications, we use React. There are also many more technologies that we use. This is the landscape of how our own microservices interact in the scope of checkout, and how they interact with other dependencies. On the right side, we have the checkout service, which is a stateful application. It holds all the business rules for the checkout, it interacts with other Zalando dependencies, and it has Cassandra as its data storage. Then we have the backend for the frontend. This is a type of microservice that aggregates data from the checkout service and from other Zalando services that provide data not available through the checkout. For example, to do the checkout we want the detailed information about the articles, but this is not data that should belong to checkout, so we have to communicate with other microservices.

After this, we also have frontend fragments, which are frontend microservices. These are services that provide server-side rendering of what we call fragments. A fragment is a piece of a page, for example a header, a body, some content, or a footer. You see one page as one thing, but every piece can be owned by a different team. Once we have all these different pieces, we have one service, Tailor, which composes them. It's open-sourced by Zalando, and we use it to create one page. Skipper is another Zalando project, also open source, which we use as an HTTP router. This way, we build all the pages of the online fashion store.

All these components are around the checkout context, and all of them are very critical for our business. Checkout is a very critical component in the customer journey because it has a direct impact on our customers and on our business. We have so many microservices in Zalando that when a microservice without the criticality of checkout is down, it can be ok. If checkout or one of our microservices is down, the whole company may be looking at it. We have a lot of responsibility.

Checkout Challenges and Lessons Learnt

The main challenge of checkout in a microservice ecosystem is that, as I explained before, we have many microservices interacting, and we also have other Zalando microservices interacting with us. We have an increased number of points of failure, and many combinations of touchpoints where something can go wrong. At the same time, we own some microservices, but we don't control the evolution of the other dependencies. If some other dependency is breaking, it's possible that our checkout is at risk as well.

We are going to review the lessons learned around reliability patterns, which help us avoid these types of failures and improve the availability of checkout. We're going to review how we scale, because we have to serve 26 million active customers and also be ready for high-traffic events like Black Friday. And we are going to review monitoring, which provides the right signals to know how our services are running and, if there is an incident, how we can respond to it.

Building microservices with reliability patterns starts with the following. This is an example of the checkout confirmation page. This is the last step, where the customer decides to place an order. To be able to do the checkout, there are many services and dependencies involved. For example, we need to know the delivery destination the customer is going to use; it could be their home or office address. Then we have the delivery service: do you want standard, express, or same-day delivery? There are also the items you want to check out, and how you want to pay. All of these dependencies and all of these interactions with the checkout are very important. If one of them is not working, the customer cannot check out, and therefore we have, right there, a big impact on our business.

Let's take one example. The delivery service provides all the available delivery options for one checkout, based on the items we have and on where the customer lives. If this service is down, the customer cannot select which delivery option they want, and they cannot check out. To improve the situation, we apply some patterns to improve the interaction with this service. We want to make sure we avoid this type of error; we don't want to show some ugly error page. Especially for me, since I'm not a native German speaker, German sometimes looks scary, so it's even less pleasant to see this error.

To improve this, we do retries. This is a very simple example (please don't put it in production), but it gives you the idea: if you have an operation, such as getting the delivery options for checkout, you can make a loop and retry it as many times as you want when it fails. Still, this can be improved. We want to make sure we only retry transient errors, like network issues or service overload. We do retries because we hope that the next time we try, the error will have disappeared. If the error disappears, it means we can potentially get a successful response and, therefore, a successful checkout.
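
As a rough illustration of that naive loop (a sketch in Scala, the language of the later example; DeliveryOptions and getDeliveryOptionsForCheckout are hypothetical stand-ins, not the actual checkout code):

```scala
import scala.util.{Try, Failure}

// Hypothetical stand-ins for the real checkout types and remote call.
case class DeliveryOptions(options: List[String])
def getDeliveryOptionsForCheckout(): DeliveryOptions =
  sys.error("remote call to the delivery service (stubbed here)")

// The naive version: retry every failure a fixed number of times,
// with no pause in between. The "please don't put it in production" loop.
def getDeliveryOptionsWithRetry(attemptsLeft: Int = 3): Try[DeliveryOptions] =
  Try(getDeliveryOptionsForCheckout()) match {
    case Failure(_) if attemptsLeft > 1 => getDeliveryOptionsWithRetry(attemptsLeft - 1)
    case result                         => result
  }
```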

Errors that eventually disappear like this are transient errors. We also want to make sure we don't retry every error because, for example, if a service introduces a change and breaks the contract with another service, retrying won't fix it; it's a broken business rule or a broken contract. These errors are never going to go away. We make sure we only retry the errors that will eventually disappear.
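
To make "retry only transient errors" concrete, one possible classification is sketched below; the specific exception types are illustrative assumptions, not checkout's actual error model:

```scala
import java.net.SocketTimeoutException

// Illustrative error classification: network hiccups and overload responses
// may disappear on a later attempt; anything else (for example a broken
// contract) will not, so retrying it is pointless.
final case class ServiceOverloadedException(message: String) extends RuntimeException(message)
final case class ContractViolationException(message: String) extends RuntimeException(message)

def isTransient(error: Throwable): Boolean = error match {
  case _: SocketTimeoutException     => true
  case _: ServiceOverloadedException => true
  case _                             => false // permanent: handle it, don't retry
}
```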

This is another example of code. Here I tried some Scala code using pattern matching. We get the delivery options for checkout: if we get a success, we do something with the result; if we hit a transient failure, we retry; if there is any other error, we throw it and handle it.
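
The slide's code isn't in the transcript, but a sketch in that spirit, reusing the stubs and isTransient from the previous sketches, could look like this:

```scala
import scala.util.{Try, Success, Failure}

// Success: use the result. Transient failure: retry. Anything else: give up
// and let the caller handle the error (for example by showing a proper message).
def deliveryOptionsForCheckout(retriesLeft: Int = 3): Try[DeliveryOptions] =
  Try(getDeliveryOptionsForCheckout()) match {
    case success @ Success(_)                            => success
    case Failure(e) if isTransient(e) && retriesLeft > 0 => deliveryOptionsForCheckout(retriesLeft - 1)
    case failure @ Failure(_)                            => failure
  }
```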

Still, given the scale of Zalando, this is not a single request; we could have 1,000 requests per second. If we have, let's say, 3 retries, it means we will have 3,000 more requests. If we are retrying against a service that is, for example, overloaded, we are contributing to the overload, because before there were 1,000 requests and now there are 3,000 more. It's a never-ending error, and we are also preventing the service from recovering. For this, we make sure we do retries with exponential backoff. We want to make sure that after a failure we do not retry right away; rather, we wait some time and then retry. Between each retry, we increase the waiting time exponentially, so we give the remote service the opportunity to recover and don't overload it. For example, after the first failed attempt we wait 100 milliseconds; if the second attempt also fails, we wait a longer time, and so on.
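
A minimal blocking sketch of such a backoff, reusing isTransient from above (the 100 ms base delay mirrors the example; a real service would schedule retries asynchronously instead of sleeping a request thread):

```scala
import scala.util.{Try, Failure}

// Wait 100ms, 200ms, 400ms, ... between attempts so the remote service
// gets a chance to recover instead of being hammered by retries.
def withExponentialBackoff[T](maxRetries: Int = 3, baseDelayMillis: Long = 100)
                             (operation: => T): Try[T] = {
  def attempt(retry: Int): Try[T] =
    Try(operation) match {
      case Failure(e) if isTransient(e) && retry < maxRetries =>
        Thread.sleep(baseDelayMillis * (1L << retry)) // exponentially longer pause
        attempt(retry + 1)
      case result => result
    }
  attempt(0)
}

// Usage: withExponentialBackoff() { getDeliveryOptionsForCheckout() }
```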

Still, things can get really bad. We can exhaust the retries, and then the errors become permanent. We have already done our best; if we are still in this situation, why should we allow other requests to keep trying this remote service? To avoid this, we prevent calling operations that are likely to fail. We do this through the circuit breaker pattern. The idea of this pattern is that we wrap our operations in a circuit. As long as the remote service is healthy, we keep sending it operations. If the service is not healthy anymore, depending on the metrics we use, we stop sending them and fail fast instead. This is better than waiting.

One of the most important parts of the circuit is when it's open. This is an example of a threshold, the default threshold of Hystrix. If the error rate reaches 50% or more, there is no need to allow more requests to go to the remote service. We just open the circuit, and then getting the delivery options for checkout fails immediately. We fail immediately, but our customers still cannot buy. There is still hope, though: we can go from an unwanted failure, no checkout, to having a fallback. This is where it gets very interesting because, to be honest, I didn't really think outside the box before. I was happy doing my work as a developer, handling errors and writing nice code, but why not fallbacks?
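
As an illustration of wrapping the call in a circuit breaker, here is a hedged sketch using Hystrix, whose default 50% threshold is mentioned above; the group key, timeout, and the command itself are assumptions for the example, not the actual checkout code:

```scala
import com.netflix.hystrix.{HystrixCommand, HystrixCommandGroupKey, HystrixCommandProperties}

// Wrap the remote call in a command; Hystrix tracks the error rate and
// opens the circuit once it crosses the threshold (50% is the default).
class GetDeliveryOptionsCommand extends HystrixCommand[DeliveryOptions](
  HystrixCommand.Setter
    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("checkout"))
    .andCommandPropertiesDefaults(
      HystrixCommandProperties.Setter()
        .withCircuitBreakerErrorThresholdPercentage(50)  // open the circuit above 50% errors
        .withExecutionTimeoutInMilliseconds(500))) {     // and don't wait forever on a sick service

  override protected def run(): DeliveryOptions = getDeliveryOptionsForCheckout()
}

// While the circuit is open, execute() fails immediately instead of
// calling the overloaded delivery service:
// val options = new GetDeliveryOptionsCommand().execute()
```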

Of course, not all fallbacks can be designed by developers, but since we have a team with all the parties involved in making the best checkout, as a team, and with an informed decision from our product owners, we know for sure that standard delivery can always be offered. We can also offer it with a default delivery promise. It's still a degraded service, but it's better than nothing.

Putting it all together: we retry operations with exponential backoff, we wrap operations with the circuit breaker, and we handle failures with fallbacks when possible. Otherwise, we make sure to handle the exceptions to avoid unexpected errors.
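
Putting those pieces into one hedged sketch, reusing the helpers above: retries with exponential backoff run inside the circuit-breaker-protected call, and the fallback always offers standard delivery with a default promise (the fallback value is a placeholder, not the real business configuration):

```scala
import com.netflix.hystrix.{HystrixCommand, HystrixCommandGroupKey}

class GetDeliveryOptionsWithFallback extends HystrixCommand[DeliveryOptions](
  HystrixCommandGroupKey.Factory.asKey("checkout")) {

  // Transient failures are retried with exponential backoff; if retries are
  // exhausted, .get rethrows so Hystrix counts the error for its threshold.
  override protected def run(): DeliveryOptions =
    withExponentialBackoff() { getDeliveryOptionsForCheckout() }.get

  // Degraded but usable: standard delivery with a default delivery promise.
  override protected def getFallback(): DeliveryOptions =
    DeliveryOptions(List("Standard delivery (3-5 working days)"))
}

// execute() returns real options when the dependency is healthy and the
// fallback when it is not, instead of an error page for the customer:
// val options = new GetDeliveryOptionsWithFallback().execute()
```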

Scaling Microservices

The next topic that also taught us a lesson is scaling microservices. This is a typical traffic pattern for our services. You can see that from 4:00 p.m. until midnight there is a big peak; then people go to sleep, and the traffic drops drastically until the morning, around 7:00 a.m. We want to make sure our services are able to handle the different traffic levels throughout the day. Over time we can see a pattern. Of course, this changes from country to country, because every country has its own details in how customers behave, and there can be peaks depending on campaigns or sales that are happening.

This is an overview of how we handle the traffic. Every microservice we have has the same infrastructure. We have a load balancer that handles the incoming requests and distributes them across the replicas of our microservice, running in multiple instances, or, if we are using Kubernetes, in multiple pods. Every instance runs with a Zalando base image. This base image contains a lot of things needed to be compliant and secure, and to make sure we have the right policies implemented, because we are a serious company and we take our business seriously.

On top of this, we run the container where our microservice is running. Inside of it can be a Node environment, a JVM environment, whatever; in our case, it's mostly Node and JVM. That is the overview of how we handle requests. When it comes to scalability, we have two options. The first option is to scale horizontally; this is the most convenient one and, in my experience, the easy one. At the beginning you have a service deployed with three replicas, three instances. If you get more traffic, then based on some signal, for example above 80% CPU usage, these instances are not healthy enough; they cannot handle the next incoming operations nicely. We know that at this point we need more instances, so we use auto scaling policies, which are available in AWS and also in Kubernetes, to create more instances or more pods.

The second option is scaling vertically, which we apply depending on the case. It means that when we have scalability problems and our current setup cannot handle more requests, we change the instances and make them more powerful: for example, by changing the resources, adding more CPU or more memory. If we are using AWS, we can pick different instance types with more or fewer cores, depending on what we need. We just make them bigger. Both options have their benefits and limitations. I think the horizontal way is better in the long run, because we don't depend on the type of machines we have, and we can easily scale up.

Of course, scalability also comes with consequences. For Black Friday, we did our capacity planning, the business forecast, and load testing, and we figured out that we needed to scale 10 times more. We went there and said, "This is simple. We just change the auto scaling policy. Let's say the minimum is now going to be 60; before it was 6, now it's 60. If we have more traffic, it will just scale up." What we didn't realize is that more instances also means more database connections. Before, even with 26 million active customers using the website in different patterns, this was not a problem. Now, we had 10 times more instances creating connections to our Cassandra database.

Poor Cassandra was not able to handle all of these connections. Eventually, with so much traffic, Cassandra could become unhealthy, with very high CPU usage. Then if one Cassandra node is down, and then maybe another one, and then another one, even though we had scaled our service, without a data storage our whole microservice would fail. We learned that just scaling can also bring more problems, and it can bring more problems to our dependencies. We fixed this by improving the configuration of our Cassandra cluster, and also by changing the way we handle connections and resources towards Cassandra. It's something that caught us by surprise.
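
The transcript doesn't show the actual fix. As one hedged illustration of limiting the connection load that each scaled-out instance puts on Cassandra, here is a sketch using the DataStax Java driver's pooling options (3.x API); the values and contact point are placeholders, not Zalando's settings:

```scala
import com.datastax.driver.core.{Cluster, HostDistance, PoolingOptions}

// Cap connections and in-flight requests per Cassandra node so that
// scaling the service 10x does not multiply the load on the database.
val poolingOptions = new PoolingOptions()
  .setConnectionsPerHost(HostDistance.LOCAL, 1, 2)        // core / max connections per node
  .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)  // back-pressure per connection

val cluster = Cluster.builder()
  .addContactPoint("checkout-cassandra.example.internal") // placeholder contact point
  .withPoolingOptions(poolingOptions)
  .build()
```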

It doesn't matter if you scale one microservice. If you have a journey like the checkout, or any journey composed of multiple microservices and multiple dependencies, and one of them doesn't scale, it's likely that the whole journey is not going to be scalable and is therefore going to fail. We should consider not just microservices as silos, but the whole ecosystem.

Low Traffic Rollouts

The last thing is scaling during rollouts. This is a very nice example, because not everything happens nicely. Every time we do a deployment, we say what the minimum number of instances or pods is that we want to start with. Say we have four, for example, because we identified, from the traffic patterns and empirical data, that four is always enough as a minimum. The current service, version one, which has 100% of the traffic, has four, and version two is also created with four initially. Four and four, the capacity is the same. When we switch traffic gradually, in percentages, the load on version one can easily be handled by the second version, because they have the same capacity.

Then what happens when we do rollouts while we have higher traffic? Version one has had to scale up because the auto scaling policy kicked in; now it's running with six instances. Version two, which was just freshly created, is running with four. We can clearly see that if we switch traffic, even incrementally, we risk overloading the new instances and potentially having a drop in the availability of the service. This is an example where you can see it: this is a graph of our request rates per country, and you can see that for all countries there is a drop. This drop happened because the capacity was not enough.

Since we have a load balancer, which is constantly getting signals from the different instances to know if they are healthy, the load balancer stopped sending them traffic. Once it stopped sending traffic, it returned errors, service unavailable, to our clients. We also had the situation where we didn't see this in the microservice itself; we could only see that our request rates across all countries had dropped.

When doing rollouts, consider giving the new version the same capacity as the version currently handling the traffic. Otherwise, your service is likely to become unavailable just because you introduced a new feature; you have to make sure this is handled as well.

Monitor Microservices

The next topic is monitoring microservices. When we monitor microservices, we make sure we monitor the whole microservice ecosystem. It's not only about the microservice itself; it's also about the application platform, the communication, and the hardware where our microservices run. We can classify these metrics into infrastructure metrics and microservice metrics. They provide us different information that is useful for knowing where we need to improve our services.

The first example is a microservice with a JVM tech stack, a stateful application that runs in AWS. For this microservice, we monitor the AWS metrics, like CPU usage, network traffic, memory, and disk usage. Then we also monitor the communication metrics. How is the load balancer doing? How many healthy instances do we have? Do we have connection errors between the ELB and the instances? What is the response rate on the load balancer side, because this can be very different from the service itself? What is the request rate at the ELB: are requests getting to the instances, or are they failing to connect and returning errors to our clients?

Then we have the microservice metrics themselves. Since it's a backend service, we have API endpoints. We monitor every endpoint, the rate of requests and the responses, so we know how they are behaving. We also have the dependency metrics. As you could see, checkout has a lot of dependencies, so for us it's really important to know how the interactions with our dependencies are going. We monitor the dependency error count. I didn't put it here, but we also monitor the fallbacks, which circuit breakers are open, and so on.
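
The talk doesn't name a metrics library, but as a sketch of these kinds of microservice metrics (assuming Dropwizard Metrics; the metric names and the handler are illustrative):

```scala
import com.codahale.metrics.MetricRegistry

val metrics = new MetricRegistry()

// Per-endpoint request rate and latency, plus dependency error and
// fallback counts for the delivery service.
val confirmationRequests = metrics.timer("checkout.api.confirmation.requests")
val deliveryErrors       = metrics.meter("checkout.dependency.delivery-service.errors")
val deliveryFallbacks    = metrics.meter("checkout.dependency.delivery-service.fallbacks")

def handleConfirmationPage(): Unit = {
  val timing = confirmationRequests.time()  // request rate and response time
  try {
    // ... call the checkout service and its dependencies ...
  } catch {
    case _: Exception =>
      deliveryErrors.mark()                 // dependency error count
      deliveryFallbacks.mark()              // a fallback was served instead
  } finally {
    timing.stop()
  }
}
```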

We also have the language-specific metrics. For the JVM, we want to monitor the number of threads, to know if our application is making good use of them or if some thread is having trouble finishing. We want to monitor the memory heap, so we know whether we have the right JVM settings, or whether this information reveals a problem we need to fix.

The second example is a frontend microservice. This frontend microservice has Node.js as its environment and runs in Kubernetes. We monitor the Kubernetes metrics, like the number of pods, CPU usage, memory, and network. It's pretty much like AWS, but it's still infrastructure. Then we have the Node.js metrics: since we do server-side rendering, we want to make sure the current setup of our Node environment uses the heap in a healthy way, and we watch how the event loop is running, because in Node we only have one thread.

Then we have the frontend microservice metrics. Similarly to a backend service, we have endpoints; the difference is that these endpoints return HTML. We also monitor the request rate of every endpoint and the time every endpoint takes to respond. All these dashboards are great; I chose them just to give some examples of the metrics we have. What we avoid is using the dashboards as outage detection.

For that, we do alerting: you get a notification, you need to stop what you're doing, take action, and then identify the problem. This is an example of an alert: unhealthy instances, one out of five. There could be many reasons; maybe there is no more memory, or the JVM was misconfigured. Another: the checkout service is returning 400 errors above the threshold of 25%. Again, there could be many reasons; one of them could be that a recent change broke the contract with an API and we didn't have enough test coverage for it. Another: no orders in the last five minutes. A downstream service is experiencing connectivity issues and therefore, even with all our beautiful reliability patterns, checkout was down.

Another: checkout database disk utilization is at 80%. A possible reason is saturation of the data storage due to an increase in traffic. All these alerts are just symptoms of what is going on in our services, and all of them should be actionable. It doesn't matter if an alert tells us something when we are not able to do anything about it, because if we cannot act, we let it pass. Then, when something really happens, we will not know what to do and, worse, we might not take action, and the outage may not be resolved for hours, in checkout's case, or at least not in the time you were expecting. We want to make sure the alerts are actionable. If you read an alert and think, "I have no idea what to do when this happens," maybe it's not the right alert for you.

Once an alert triggers, we have to assess the symptoms that the alert provides. We coordinate, depending on whether it's something we can fix within our team or something we need to escalate to other teams. We try to stop the bleeding as soon as possible, using the runbooks for our alerts. After we have stopped the bleeding, we have the post-mortem, where we try to identify the root cause. We have to make sure the post-mortem helps us fix the real problem, and that we follow up on the other findings. This is a very simple example of a post-mortem.

Another thing about post-mortems: since they are going to be available to the whole company, they have to be very clear. For example: no orders in the last five minutes; impact on customers: 2K customers could not complete checkout; impact on business: €50K lost from orders that could not be completed; analysis of the root cause. We typically use "The Five Whys". You start by asking, "Why were there no orders?" You can say, "Because the service was overloaded and the auto scaling policy didn't work." "Why didn't that work?" and so on, until you identify and gather all the factors that led to this incident. Then the most important part is the action items, because these are the actions the team or the organization has to take to make sure this doesn't happen again.

We want to make sure that every incident has a post-mortem. I know it's a little bit boring to write documents, but if we don't have clear information about the impact, we cannot go to the business the next day and say, "We need to improve our test coverage, we need to improve this and that," without empirical evidence that this is important for the business, or of higher priority than maybe a new feature. It's very important to follow up with the post-mortem.

Wrapping up

Wrapping up: reliability patterns, scalability, and monitoring all come together for Black Friday. For our Black Friday preparation, we have a business forecast that tells us we want to make this or that number of orders, and then we also do load testing of the real customer journey.

Given that we are running in a microservice ecosystem, we identify all the teams involved in the journey the customer might take on Black Friday: for example, go to a catalog page, select a product, add it to the cart, do the checkout, all the steps a customer could take on Black Friday. Then all the services involved in this journey are identified, and we do load testing on top of it. With this, we are able to do capacity planning, so we can scale our services accordingly, and we can also identify bottlenecks or things we might need to fix for Black Friday.

For every microservice involved in Black Friday, we also have a checklist that we review: are the architecture and dependencies reviewed? Are the possible points of failure identified and mitigated? Do we have reliability patterns for all the microservices involved? Are configurations adjustable without the need for a deployment?

I showed you before that making a deployment under high traffic can be risky. Especially on Black Friday, we don't want to make deployments, not only because we are afraid, but because when we run at this scale, we are one company doing Black Friday, and there are another hundred companies or more also doing Black Friday. What already happened to us on one Black Friday, I think, or two, was that AWS ran out of resources. We don't want to make a deployment and start new instances, because we might get into the situation where there are no more resources available in AWS. We want to avoid these situations and make sure that everything we've identified, whether it's a business rule, a customer-facing microservice configuration, or any feature toggle we may want to turn off, is changed not through a deployment, but through a configuration that doesn't require one.

Then we also review our scaling strategy. The typical auto scaling strategy that works every day might need to be adjusted for Black Friday. Then: is our monitoring in place, and are all alerts actionable? Is our team prepared to be on call 24/7 and able to respond to incidents?

On the day of Black Friday, we have a situation room. All the teams involved in the services relevant for Black Friday gather in one situation room, with one person per team. We are all together in this space where we monitor, and we support each other in case there is an incident or something we need to handle.

This is how the Black Friday request pattern looks. You can see that it starts to grow, eventually peaks, and then drops again. This is where we reached more than 4,200 orders per minute.

My summary of learnings is: we need to think outside the happy path and consider the different types of errors we might face, not just handle errors; we should implement reliability patterns. Services are only as scalable as their dependencies, so let's consider our dependencies as well. Monitoring the microservice ecosystem is important to pick up all the signals that tell us our microservices and environment are running as expected and our services are healthy. Here are some resources about the topics I talked about.

 


Recorded at: Sep 18, 2019
