Lessons Learned Building Distributed Systems at Bitly
At the Bacon Conference last May, bitly Lead Application Developer Sean O'Connor explained the most relevant lessons bitly developers learned while building a distributed system that handles 6 billions clicks per month.
What is a distributed system?
The three most defining characteristics of a distributed system, states Sean, can be easily found on Wikipedia:
- Real concurrency of component nodes, with the related cost and complexity of coordination among them.
- Lack of a common clock, making it impossible to order in time events happened at different nodes.
- Independent failure, which should be understood as the capacity of a failing node of not affecting other nodes in the system.
Building a distributed system, then, should aim at handling those characteristics. Sean deems outdated the approach consisting in hiding the complexity deriving from the distributed nature of a system under an abstraction aiming at making the system appear as if it were not actually distributed in the first place. Such an approach is bound to failure, says Sean, because sooner or later the distributed nature of the system will leak through the abstraction. Instead, abstractions should model the distributed nature of the system and allow thus to handle any leaks.
Services as the building blocks
Bitly architecture is defined around the concept of service. A service is an abstraction specialised in providing some function clearly defined through its API. Services are to distributed systems what functions are to code, according to Sean, allowing to reason in higher-level terms without having to know all the implementation details. This idea brings a shift in the way of thinking about things.
Among the benefits of modelling the system around small services Sean mentions:
- Reduced size in lines of code, which makes a service readily comprehensible.
- Independent failure, since if a service goes down that only means that the system has lost the ability to deliver that specific feature, while keeping working on the overall.
- Ease of identifying which part of the system is having problems.
A key point in guaranteeing that a system can scale is using asynchronous messages, so a node does not have to wait on another node to provide its reply back, together with message queues between nodes to improve node isolation.
One benefit of doing this is that if a node has some issues and cannot temporarily handle all the incoming messages, they will be kept in the queue to be processed as soon as possible. Moreover, a failure at a node will not directly affect any of the other nodes.
Asynchronous messaging has its complexities, though, and in many occasions it can be more natural to handle a certain kind of operations synchronously. As examples of this, Sean mentioned that URL shortening is implemented at bitly as a fully synchronous operation, due to the requirement for it to be as fast as possible and consistent, meaning that the same shortened URL should not be returned to different users. On the other hand, analytics have different requirements altogether that make it a suitable candidate for going fully asynchronous. So, when bitly wants to collect and process some metrics data related to a user action on a link, it just enqueues it downstream, where it will be eventually dealt with without much concern for how long this will take. Trying to model an intrinsically synchronous operation as an asynchronous one can become very complex, according to Sean, so it is better to understand the nature of the operation beforehand.
Sean gave a final remark about node interaction hinting at the benefits that modelling messages as events provides, as opposed to thinking of them as command flowing from one node to another. An event is a description of something that has happened somewhere and does not require that the sending node knows anything about the receiving nodes. On the other hand, if a message is a command, then the sending node shall know about the receiving node's ability to execute that command. For this reason, thinking of messages as events greatly contributes to node isolation and it lends itself naturally to supporting multiple consumers and to dynamically adding or removing new consumers, without the producer node to know about that.
Thinking of messages as events also leads to another realization: annotating messages is better thatn filtering them at the producer level. As an example of this, Sean mentions private vs. public links handling. Private links could be filtered out by the producer so they do not reach uninterested parties downstream. This, though, requires the producer to make assumptions about what the downstream cares for. Instead, what bitly does is annotating a private link as such and let the associated message flow downstream freely, trusting downstream nodes that they will treat it appropriately.
Making services play nicely together
Bitly ensures that services play nicely together by making them:
Use back pressure and let requesting nodes know that a given service is busy or overloaded, so they can throttle down further requests. This can help maintain the system healthy and prevent cascading errors. As an example of this, Sean mentioned a service cache warmup: if during cache warmup requests continue to flow in, there is a risk of data corruption.
Load balance requests according to service healthiness. Bitly uses its own hostpool library for load balancing and keeping track of failing services to give priority to healthier services.
Sean introduces the monitoring topic by quoting Leslie Lamport saying that "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable". With over 400 working servers at bitly, monitoring is therefore a fundamental task, since it is the only way to know if something is not working as expected.
Sean provides a few recipes about monitoring:
- Use Nagios to check server status.
- Run integrity checks to ensure services return correct data.
- Log in a central location, since having all logs from different nodes in one place can help analyze and diagnose what is happening.
- Make the relevant information get to the right people at right time. Bitly is using its own nsq messaging platform at that aim coupled with Web UI to easily see what is happening during deployment.
Bitly is a URL shortening service for use in social networking, SMS, and email. In addition to shortening URLs, bitly also collects analytics about referral links, which are central to its business model.
Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin May 21, 2015
Kai Kreuzer, Olaf Weinmann May 21, 2015