HelloFresh's Migration to a New API Gateway to Enable Microservices
HelloFresh recently migrated their applications to a new API gateway with zero downtime. Their Director of Engineering, Ítalo Lelis de Vietro, shared the challenges and the migration process in a recent article.
Before the migration, HelloFresh’s existing API was monolithic. To move to a microservices architecture and make it easier to create microservices and integrate them with their infrastructure, they built a new API gateway to cover existing as well as new services. Their infrastructure already had components that made microservices simpler, like service discovery, Ansible-based automation, and extensive logging and monitoring. To achieve zero downtime and backwards compatibility, the team wrote an import script to convert the older services to the new model. The first migration attempt failed; the second went as planned.
HelloFresh’s existing infrastructure used Consul for service discovery and HAProxy for client-side load balancing. These two factors facilitated a move to microservices. The team also had a convention that any new service had to start with Ansible automation, which covered almost the entire stack including networking, DNS, CI and machine provisioning. Visibility is paramount when multiple services talk to each other in a distributed system, and HelloFresh had extensive logging and monitoring: StatsD and Grafana for metrics and dashboards, and ELK for detailed analysis of a service’s behaviour.
Along with the new gateway, a new authentication service was also planned, which would take over the authentication module in the older API. This required older applications to be refactored.
The team’s pre-migration challenges were maintaining backward compatibility for mobile apps, routing all service calls through the new gateway, securing all calls in transit, and enforcing the existing API rules in the new gateway. Mobile application updates cannot be forced on users, hence the APIs had to remain backward compatible. The APIs in use consisted of public as well as private endpoints, and all of these had to be registered with the new gateway. Service calls in transit between the microservices and the gateway had to be secured. The API was documented in the OpenAPI format, a standard for describing REST APIs. This enabled the team to write an import script in Go that would translate between the old and the new, and preserve rules like quotas, CORS settings and rate limiting.
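The article does not show the import script itself, but the idea can be sketched in Go: read per-endpoint rules from an OpenAPI-style document and emit equivalent gateway route definitions. The `x-rate-limit` and `x-cors-origins` extensions and the `Route` shape below are assumptions for illustration, not the team's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// openAPIPath is a minimal slice of an OpenAPI path item. The
// x-rate-limit and x-cors-origins extensions are hypothetical
// stand-ins for wherever the old API kept per-endpoint rules.
type openAPIPath struct {
	RateLimit int      `json:"x-rate-limit"`
	CORS      []string `json:"x-cors-origins"`
}

type openAPIDoc struct {
	Paths map[string]openAPIPath `json:"paths"`
}

// Route is a hypothetical gateway route definition that preserves
// the same rules under the new gateway's configuration.
type Route struct {
	Path        string   `json:"path"`
	RateLimit   int      `json:"rate_limit_per_minute"`
	CORSOrigins []string `json:"cors_origins"`
}

// importRoutes translates OpenAPI paths into gateway routes,
// carrying the quota/CORS/rate-limit rules across unchanged.
func importRoutes(raw []byte) ([]Route, error) {
	var doc openAPIDoc
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	routes := make([]Route, 0, len(doc.Paths))
	for path, spec := range doc.Paths {
		routes = append(routes, Route{
			Path:        path,
			RateLimit:   spec.RateLimit,
			CORSOrigins: spec.CORS,
		})
	}
	return routes, nil
}

func main() {
	spec := []byte(`{"paths":{"/recipes":{"x-rate-limit":100,"x-cors-origins":["https://www.hellofresh.com"]}}}`)
	routes, err := importRoutes(spec)
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(routes, "", "  ")
	fmt.Println(string(out))
}
```

Because the routes come out as plain data, they can be written to files and, as the team notes later, versioned in git.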
The first migration attempt consisted of replacing the old API with the new one, first on staging and then on production, with relevant test suites at each step. This attempt failed due to the authentication database getting overloaded from the high number of requests. The team had underestimated the load and the database started refusing connections. In addition, there were a few CORS misconfigurations in spite of the import script. The monitoring system was able to catch the problem and the migration was rolled back.
The second deployment attempt built on lessons learnt from the first. The team planned a blue-green deployment with a replica of the production environment created with the new gateway in place. This setup made it easy to switch between the old and the new with a configuration change. Machine capacity was replanned based on running application metrics and the load metrics from the first deployment. Gatling, an open source load testing tool, was used to run performance tests. Some known issues were fixed in the auth service.
After the migration, the API gateway forms the frontline of the HelloFresh infrastructure (the original article includes a diagram of the resulting setup). Why did the team build their own gateway instead of adopting an existing solution? de Vietro responded in the comments section:
We've tried out the Amazon API Gateway and Tyk but since we had our own authentication provider, integrating it with the AWS Gateway was not ideal. We had to deal with lambdas to add a custom auth provider. Shipping metrics to grafana would be a bit more complicated and of course we would be locked down to the same provider. Tyk didn't give us (at least at the time) the option to have our own auth provider, we had to use the built-in policies, user management and ACLs that was something that we didn't want. I think that the product today is very different, but that was the main reason at the time. Also with our own gateway, we can version our route configuration files on git and have the change log of it which for us is extremely important.
The API gateway is available as open source on GitHub.