How Heroku Manages High Availability - QCon London Talk summary
For a PaaS solution, high availability is quite a challenge, because not only the hardware but also the software infrastructure services have to be available at all times. In his QCon London talk (PDF slides download), Mark McGranaghan explained the patterns Heroku uses to provide a PaaS that is always available.
Mark explained that, in his view, those patterns fall into two categories:
- Architectural, being certain design concepts.
- Socio-technical, describing how people interact with the software.
According to Mark, the human factor is often underestimated when it comes to how HA applications are deployed and operated. On the architectural side, many concepts have been taken from Erlang, as some of the examples make clear.
Applications deployed to Heroku get many services from the platform and do not need to provide them themselves. Routing HTTP requests is one example. In fact, an application could not do this on its own, because applications on Heroku run as multiple instances that are monitored by supervisors, which can restart them. The application therefore does not know where it is running, but the Heroku routing mesh knows the location of each instance and directs HTTP traffic accordingly. The routing mesh itself consists of many cloud instances that are also supervised and scaled or restarted as required. This follows Erlang's error kernel concept: all services are supervised, and the supervisors are supervised in turn. Everything is allowed to fail and will be restarted. This simplifies error handling for both Heroku and the applications running on it.
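The supervision idea can be illustrated with a minimal sketch. It is not Heroku's or Erlang's implementation; the `Supervisor` class, the `max_restarts` parameter, and the flaky worker are all hypothetical names chosen for illustration. A supervisor restarts its child when it fails and, once it runs out of restarts, lets the failure escalate to the supervisor above it.

```python
class Supervisor:
    """Restarts a child callable whenever it raises, up to max_restarts times."""

    def __init__(self, child, max_restarts=3):
        self.child = child
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self):
        while True:
            try:
                return self.child()
            except Exception:
                if self.restarts >= self.max_restarts:
                    raise  # escalate to the supervisor above this one
                self.restarts += 1  # otherwise restart the child


def make_flaky_worker(failures):
    """Hypothetical worker that crashes `failures` times, then succeeds."""
    state = {"left": failures}

    def worker():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("worker crashed")
        return "ok"

    return worker


# The inner supervisor watches the worker; the outer one watches the supervisor.
inner = Supervisor(make_flaky_worker(2), max_restarts=3)
outer = Supervisor(inner.run, max_restarts=1)
result = outer.run()
```

Here the worker crashes twice, the inner supervisor restarts it both times, and the outer supervisor never has to intervene.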
Because everything is allowed to fail, applications should be designed to degrade gracefully. Mark gave a few examples of how that looks in practice:
In a traditional message broker architecture there is a single point of failure: the endpoint to which messages are sent has to be available. Heroku uses a pattern called publish one / subscribe many, in which each sender knows multiple possible message brokers and receivers simply subscribe to all of them. This makes the messaging system resilient to individual broker failures.
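A rough sketch of publish one / subscribe many, using in-process objects in place of real brokers. The `Broker` class and the `publish_one` and `subscribe_many` helpers are assumptions made for this example and do not reflect Heroku's actual messaging code.

```python
import random


class Broker:
    """Stand-in for a message broker that may be down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.subscribers = []

    def publish(self, message):
        if not self.up:
            raise ConnectionError(f"broker {self.name} is down")
        for callback in self.subscribers:
            callback(message)


def publish_one(brokers, message):
    """Try the known brokers in random order; stop at the first that accepts."""
    for broker in random.sample(brokers, len(brokers)):
        try:
            broker.publish(message)
            return broker.name
        except ConnectionError:
            continue  # this broker failed, try the next one
    raise ConnectionError("no broker available")


def subscribe_many(brokers, callback):
    """Receivers subscribe to every broker, so any broker can deliver."""
    for broker in brokers:
        broker.subscribers.append(callback)


brokers = [Broker("a"), Broker("b"), Broker("c")]
received = []
subscribe_many(brokers, received.append)

# Two of the three brokers fail; the message still arrives exactly once.
brokers[0].up = False
brokers[1].up = False
used = publish_one(brokers, {"event": "deploy"})
```

Because the sender publishes to only one broker, the receiver does not need to deduplicate messages in this simplified version.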
The messages themselves are simple key-value pairs, which can easily be extended, allowing systems to be upgraded without downtime. Incompatible changes can be made by introducing new attributes that only instances of the new version use, while the remaining old instances can still work with the new messages passed to them.
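As an illustration of extensible key-value messages, the sketch below shows an old and a new consumer handling the same messages. The keys `app`, `dynos`, and `dyno_size` are made up for this example; the talk summary does not specify the actual message schema.

```python
def handle_v1(message):
    """Old consumer: reads only the keys it knows and ignores anything else."""
    return f"scale {message['app']} to {message['dynos']}"


def handle_v2(message):
    """New consumer: uses the new key when present, with a default otherwise."""
    size = message.get("dyno_size", "standard")
    return f"scale {message['app']} to {message['dynos']} x {size}"


old_message = {"app": "demo", "dynos": 2}
new_message = {"app": "demo", "dynos": 2, "dyno_size": "large"}

# Old code keeps working with new messages, and new code with old messages.
old_on_new = handle_v1(new_message)
new_on_old = handle_v2(old_message)
new_on_new = handle_v2(new_message)
```

Since neither version rejects unknown or missing keys, old and new instances can run side by side during an upgrade.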
Whenever an application depends on a service to read data, it should be able to degrade gracefully when that service is not available.
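One possible way to degrade gracefully on reads is to fall back to the last known value, or to a default, when the service cannot be reached. The sketch below assumes this cache-based strategy, which the talk summary does not prescribe; `fetch_with_fallback` and `FlakyService` are hypothetical names.

```python
def fetch_with_fallback(fetch, cache, key, default=None):
    """Return (value, source); source is 'live', 'stale', or 'default'."""
    try:
        value = fetch(key)
        cache[key] = value  # remember the last good answer
        return value, "live"
    except ConnectionError:
        if key in cache:
            return cache[key], "stale"
        return default, "default"


class FlakyService:
    """Stand-in for a remote read service that can be switched off."""

    def __init__(self):
        self.up = True

    def get(self, key):
        if not self.up:
            raise ConnectionError("service unavailable")
        return {"user": key}


service = FlakyService()
cache = {}
first = fetch_with_fallback(service.get, cache, "u1")

service.up = False  # the dependency goes away
second = fetch_with_fallback(service.get, cache, "u1")
third = fetch_with_fallback(service.get, cache, "u2", default={})
```

The caller always gets an answer plus a hint about its freshness, so it can decide whether to show reduced functionality.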
Whenever data is to be written to a service, it can be buffered while the receiving service is unavailable. The writer automatically retransmits the write request when the service becomes available again. For simplicity, Heroku is moving most services to always buffer first before trying to send the data across. This pattern is called write de-synchronizing.
Interestingly, they still use PostgreSQL for this intermediate buffering rather than Doozer, their own distributed storage. According to Mark, the reason is that PostgreSQL's limitations are well understood and there is good operational tooling and experience for it, both of which are missing for Doozer. As we covered in November 2011, Heroku even offers PostgreSQL as a service to non-Heroku applications.
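The buffer-first write pattern can be sketched as follows. An in-memory `deque` stands in for the durable buffer here, so this example does not survive a process restart the way a PostgreSQL-backed buffer would. `BufferedWriter` and `Sink` are hypothetical names.

```python
from collections import deque


class BufferedWriter:
    """Buffers every record first, then tries to drain the buffer in order."""

    def __init__(self, send):
        self.send = send
        self.buffer = deque()

    def write(self, record):
        self.buffer.append(record)  # always buffer before sending
        self.flush()

    def flush(self):
        """Send buffered records in order; stop at the first failure."""
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return False  # keep the records and retry on a later flush
            self.buffer.popleft()
        return True


class Sink:
    """Stand-in for the receiving service."""

    def __init__(self):
        self.up = True
        self.stored = []

    def receive(self, record):
        if not self.up:
            raise ConnectionError("sink unavailable")
        self.stored.append(record)


sink = Sink()
writer = BufferedWriter(sink.receive)

sink.up = False
writer.write("a")
writer.write("b")
pending_while_down = len(writer.buffer)

sink.up = True
drained = writer.flush()  # in practice a timer or worker would call this
```

A record is removed from the buffer only after the send succeeds, so an outage delays writes instead of losing them.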
That led to the second part of the talk, the "socio-technical" aspect.
Because humans interact with the system, it is important to make that interaction as easy as possible. The most common cause of system problems is human interaction. In a system designed for change, these interactions naturally involve a lot of change.
Deployments and Roll Outs
Using custom tooling, Heroku can deploy incrementally. Each deployment runs through various stages. Unlike traditional stages, which are separate environments, these are subsets of the production infrastructure, such as "internal", "beta" and "all". The same concept applies to the roll out of new features. Heroku uses feature flags to enable and disable features in production in much the same way as deployments are made. This also allows deployment and feature roll out to be decoupled: first a new deployment is made that delivers the same functionality, with new features hidden behind feature flags. Once the deployment has proven OK, the features can be turned on gradually, which makes it much easier to localize problems.
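A staged feature flag could look roughly like the sketch below. The stage names come from the talk summary; the `FeatureFlag` class, its methods, and the idea of tagging each request with a group are assumptions made for illustration, not Heroku's tooling.

```python
STAGES = ["internal", "beta", "all"]  # ordered from smallest to largest audience


class FeatureFlag:
    """A flag that is rolled out stage by stage and can be switched off again."""

    def __init__(self, name):
        self.name = name
        self.stage = None  # None means the feature is hidden for everyone

    def advance(self):
        """Widen the roll out by one stage, if there is a next stage."""
        next_index = 0 if self.stage is None else STAGES.index(self.stage) + 1
        if next_index < len(STAGES):
            self.stage = STAGES[next_index]

    def disable(self):
        self.stage = None

    def enabled_for(self, group):
        """A group sees the feature once the roll out has reached its stage."""
        if self.stage is None:
            return False
        return STAGES.index(group) <= STAGES.index(self.stage)


flag = FeatureFlag("new-router")
hidden = flag.enabled_for("internal")  # deployed but still hidden

flag.advance()  # now at "internal"
internal_only = (flag.enabled_for("internal"), flag.enabled_for("beta"))

flag.advance()  # "beta"
flag.advance()  # "all"
everyone = flag.enabled_for("all")

flag.disable()  # a problem was found, so hide the feature again
after_disable = flag.enabled_for("internal")
```

The code ships once and only the flag changes afterwards, which keeps each step small and easy to reverse.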
Especially in a system with many moving parts, monitoring is essential. Heroku continues to invest heavily in monitoring and alerting because the visibility they provide is so valuable for production systems. Problems are detected using asserts. Examples of such asserts are:
- 99 Percentile Latency < 50
- Active connections > 10
Mark gave two examples of outages that those asserts could have warned them about, had they already been in place at the time.
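Asserts like these can be expressed as simple threshold checks over collected metrics. The sketch below is only an illustration: the metric names `p99_latency` and `active_connections`, the nearest-rank percentile method, and the sample data are assumptions, and the thresholds are kept unitless as in the examples above.

```python
import operator

# (metric name, comparison that must hold, threshold)
ASSERTS = [
    ("p99_latency", operator.lt, 50),
    ("active_connections", operator.gt, 10),
]


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def check(metrics, asserts=ASSERTS):
    """Return the names of violated asserts; a missing metric also counts."""
    violations = []
    for name, holds, threshold in asserts:
        value = metrics.get(name)
        if value is None or not holds(value, threshold):
            violations.append(name)
    return violations


healthy_latencies = [10] * 99 + [40]
slow_latencies = [10] * 90 + [200] * 10

healthy = check({
    "p99_latency": percentile(healthy_latencies, 99),
    "active_connections": 25,
})
unhealthy = check({
    "p99_latency": percentile(slow_latencies, 99),
    "active_connections": 3,
})
```

Treating a missing metric as a violation means a silent monitoring pipeline raises an alert too.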
A typical problem in a distributed system is that problems in a downstream component affect everything that invokes it. While the architecture already handles unavailable components through graceful degradation or write de-synchronizing, a further concept is needed to avoid overloading subsystems. Heroku uses "Flow-Control" for that: certain code paths can be limited to a maximum number of invocations. These limits are configurable at runtime so they can be adjusted for changed code or hardware.
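A minimal sketch of such a limiter, assuming the limit applies to concurrent invocations of a code path and that excess calls are rejected rather than queued. The summary does not say which variant Heroku uses, and `FlowControl`, `Overloaded`, and `set_limit` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when a protected code path is already at its limit."""


class FlowControl:
    """Caps concurrent invocations; the cap can be changed at runtime."""

    def __init__(self, limit):
        self._lock = threading.Lock()
        self.limit = limit
        self.in_flight = 0

    def set_limit(self, limit):
        with self._lock:
            self.limit = limit

    @contextmanager
    def slot(self):
        with self._lock:
            if self.in_flight >= self.limit:
                raise Overloaded("limit reached")
            self.in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self.in_flight -= 1


flow = FlowControl(limit=1)

rejected = False
with flow.slot():
    try:
        with flow.slot():  # a second concurrent call exceeds the limit
            pass
    except Overloaded:
        rejected = True

flow.set_limit(2)  # adjusted at runtime, for example after adding capacity
accepted_after_raise = False
with flow.slot():
    with flow.slot():
        accepted_after_raise = True
```

Rejecting excess calls early protects the downstream component, and the caller can then fall back to graceful degradation or buffering.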