Scalability Worst Practices
Over the course of building out a number of large, distributed systems I've had the opportunity to observe (and implement) some worst practices. Most of these worst practices start innocently enough, but if neglected they will jeopardize the growth and scalability of a system. A number of articles focus on best practices to ensure an easily maintainable and scalable system, but in this article I'll highlight some worst practices to avoid.
No one technology or architecture can fulfill all requirements. Understanding when to re-think existing ideas, how to look beyond the local scope or how to exercise diligent control of dependencies all are key attributes to scalability. Let's look more closely at each.
The Golden Hammer
The Golden Hammer refers to the old adage: if all you have is a hammer, everything looks like a nail. Many developers fall prey to the idea of using only one technology - the cost of this is having to build and maintain an infrastructure in the chosen technology which may be readily available in another technology which is more suited to the specific problem area's functionality or abstractions. Forcing a particular technology to work in ways it was not intended is sometimes counter-productive.
As an example, a common solution to the problem of persisting key-value pairs is to use a database. This choice is usually made because the organization or developer has a well-established database practice and naturally follows the same solution path for many problems. The problems arise when the features of a database (relational integrity, locking, joins and schemas) become bottlenecks or inhibit scaling. This occurs because the cost of growing a database-based solution for this application is generally much more expensive than other available technologies. As the access rate of the key-value store increases, the concurrency model of a database can begin to degrade performance while the advanced features that the database provides are not used. Many alternatives to the traditional relational database are available to address these short-comings such as CouchDB, SimpleDB or BigTable.
Another common Hammer is always using threads to program for concurrency. While threads do address concurrency, they come at the cost of increased code complexity and an inherent failure for real composability with the locking and access models currently available. Thousands of lines of code contain race conditions, potential deadlocks and inconsistent data access management because the most popular programming languages today manage concurrency with threads. There is a growing community demonstrating alternatives methods for concurrency without the scaling issues of threads, namely the concurrency models promoted by Erlang or Stackless Python. Even if alternative languages are not chosen for production code, studying them for their concepts, such as message passing or asynchronous I/O, is good practice.
Developers tend to be very comfortable addressing issues at a small scale: using a profiler, understanding the space and time complexity of an algorithm or knowing which list implementation to use for a particular application. Not everyone, however, is as adept understanding the constraints of a larger system such as identifying the performance characteristics of a shared resource, knowing the clients of services or understanding database access patterns.
The common practice for scaling applications is for continual horizontal deployment of redundant stateless, share-nothing services as the optimal architecture. However, an often overlooked aspect of this build-out, in my experience, is not addressing the shared resources that are utilized by the additional services.
For example, if a particular service uses a database for persistent storage it usually manages the connections to the database through a thread pool. The use of the pool is a good practice and one which helps shield the database from excessive connection management. However, the database is still a shared resource and the pools, while perhaps configured locally, must also be managed in the aggregate. There are two practices which result in failures:
- Continually increasing the number of services while not decreasing the maximum pool sizes
- Increasing the individual pool sizes without decreasing the number of services
In both cases, the aggregate number of connections must be managed in addition to configuring the application for the performance characteristics required. Additionally, database capacity must be continually monitored to keep the connectivity in balance.
It's crucially important to manage the availability of shared resources because when they fail, by definition, their failure is experienced pervasively rather than in isolation.
Big Ball of Mud
Dependencies are a necessary evil in most systems and failure to manage dependencies and their versions diligently can inhibit agility and scalability.
Dependency management for code has different flavors:
- Compile the entire codebase together
- Pick and choose components and services based on known versions
- Publish models and services comprised of only backwards-compatible changes
Let's look at these scenarios. First, in the Big Ball of Mud pattern the entire system is compiled and deployed as one unit. This has the obvious advantage of moving the dependency management to the compiler and catching some of the issues up-front, but it brings the scaling issues involved with deploying an entire system (including the testing, provisioning and risk involved with whole-sale changes) every time. It becomes significantly more difficult to isolate changes to a system in this model.
In the second model the dependencies are picked up as desired, however changes in transitive dependencies could mirror the difficulties of the first model.
In the third model the service is responsible for versioning dependencies and providing the clients backwards-compatible interfaces. This significantly decreases the burden on the clients, allowing gradual upgrades to newer models and service interfaces. Additionally, when the data requires transformation, it's done by the service and not the client -- this further establishes isolation. The requirement for backwards-compatible changes means patching, upgrading and rolling back changes all behave in similar matters with respect to client interaction.
Adopting a service architecture with backwards-compatible changes ensures the dependencies issues are minimized. It also facilitates independent testing in a controlled environment and isolates versioned data changes from clients. All three of these benefits are important for isolating change. The recently released Google Protocol Buffers project is another advocate for backwards-compatible service models and interfaces.
Everything or Something
Another thing to consider when managing dependencies is how to package the contents of the application.
In some scenarios, such as Amazon Machine Images or Google AppEngine applications, the entire application and all of its dependencies are bundled and distributed together. This all-inclusive bundled approach keeps the application self-contained, however it increases the overall size of the bundle and also forces a redeployment of the entire application bundle for a small change anywhere in the application (even to a shared library use by many applications on the same physical machine).
An alternative scenario pushes the dependencies for an application out to the host systems and the application bundle itself contains only some portion of the dependency graph. This keeps the bundle size small but increases the deployment configuration by requiring certain components be delivered to every machine prior to an application being serviceable. Not deploying the entire bundle in a self-containedmanner inhibits the speed at which applications can be pushed to alternative, non-standard machines because the dependencies might not be immediately available or because the machine is infrequently tested or the dependencies incorrect.
The latter approach, managing dependencies across different scopes (global, machine, application) is ripe for oversight and complication. It increases the complexity of operations by decreasing configuration and dependency isolation. While there exceptions to the rule, generally speaking isolation increases scalability, so choose the all-inclusive approach whenever possible.
In both code and application dependency management, the worst practice is not understanding the relationships and formulating a model to facilitate their management. Failure to enforce diligent control is a contributing scalability inhibiter.
Forgetting to check the time
In a distributed system it is often the goal to isolate as much of the sophisticated machinery responsible for distributing the invocations from the developer as possible. This leaves the majority of development to focus on core business logic and not worry about the requirements of failover, timeouts and other necessities of distributed systems. However, providing the means to make a remote invocation look local means the developer will code as though the invocation is local.
Too often I've seen code with the unreasonable expectation that all remote requests succeed in a timely matter. Java, for example, only introduced a read timeout for its
HTTPURLConnection class in JDK 1.5 leaving generations of developers either spawning threads to kill the process or naively waiting for a response.
DNS lookups in Java are another example of potential time mismanagement. In a typical long-running system, the initial DNS lookups are executed, and unless configured explicitly not to, are cached for the lifetime of the JVM. If the external system changes the IP address for the hostname the entry will now resolve incorrectly and in many cases the connect will hang because no connect timeout is programmatically set.
To properly scale a system it is imperative to manage the time alloted for requests to be handled. There are many means for doing so, some built-in into the language (as with Erlang) and others as libraries, e.g.
libevent or Java's
NIO. Regardless of implementation language or architecture, proper latency management is imperative.
Establishing a cost effective scaling model, managing dependencies and anticipating failure are all aspects of a fantastic architecture. However, the ability to easily deploy and operate the system in a production environment must be held in equal regard. There are a number of worst practices which jeopardize the scalability of a system.
One popular solution to the operation issue is a Hero who can and often will manage the bulk of the operational needs. The Hero Pattern can work in a small environment when one individual has the talent and capacity to understand an entire system, including the many nuances required to keep it functioning properly. For a large system of many components this approach does not scale, yet it is one of the most frequently-deployed solutions.
The Hero often understands service dependencies when no formal specification exists, remembers how to toggle features on and off or knows about systems everyone else has forgotten. The Hero is crucial but the Hero should not be an individual.
I believe the best solution to the Hero Pattern is automation. However, it also helps simply to rotate individuals from team to team if the organization has the capacity. In banking, it's sometimes mandatory to take vacation so any "I'll just run this from my workstation" activities can be brought to light.
A system too dependent on human intervention, frequently the result of having a Hero, is dangerously exposed to issues of reproducibility and hit-by-a-bus syndrome. It's important that a particular build, deployment or environment be reproducible and automation with explicit metadata is a key to successful reproducibility.
On some open source projects, the artifact release process was dependent on individual developers creating artifacts on their own workstation, with nothing that ensured that the version of the artifact produced actually mirrored a working branch of the code in source control. In these situations, it was entirely possible to release software which was based on code that was never committed to source control.
As described above, the activities of the Hero are ripe for automation to ensure that it is possible for one person (or many people) to replace another with relative ease. An alternative approach to automation is additional process -- Clay Shirky offers an entertaining definition of process:
Process is an embedded reaction to prior stupidity.
Prior stupidity is inevitable – automation should capture the education.
Monitoring, like testing, is often one of the first items sacrificed when time is tight. Sometimes when I ask specific questions about the runtime characteristics of a component, no answer exists. The lack of visibility into the internals of a running system and the ability to quickly dive into answers to such questions jeopardizes the ability to make accurate and critical decisions about where to look and what to look for.
Orbitz is fortunate to have very sophisticated monitoring software which offers both fine-grained details about service invocations and the ability to visualize data to pinpoint problem areas. The metrics available from the monitoring infrastructure provide for fast and affective triaging of a problem.
After Amazon's recent S3 outage, Jeff Bezos remarked:
When we have a problem, we know the proximate cause, we analyze from there and find the root cause, we will find the root fix and move forward.
Software and system development is an iterative process which is ripe with opportunities for both failure and success. Solutions which are simple but more difficult to scale have their place, especially as ideas or applications work through their immature phases. Perfect should not be the enemy of good. However, as systems reach maturity, the worst practices described above should be weeded out, and success will be yours too.
Many thanks to Monika Szymanski for feedback on a draft of this article.
About the Author
Brian Zimmer is an Architect at travel startup Yapta, a respected member of the open source community and member of the Python Software Foundation. He worked previously as a Senior Architect at Orbitz. He maintains a blog at http://bzimmer.ziclix.com.
Anti patterns also
DNS Caching - Non Deterministic Behaviour
Nice job outlining some very salient issues.
Your comments on DNS lookups in Java hit a sore point I have been wrestling with. We have been doing some planning, preparation and testing for disaster recovery and business continuity fail over events and one of the issues that keeps coming up is the prototypical scenario of a Java app server that communicates with a remote service ( a DB, legacy service, messaging server etc.) that fails and the Java app must quickly switch over to what [usually] amounts to a new IP address.
Sounds simple enough, but there are some subtleties that I have been trying to definitively iron out as to what exact behaviour we can expect under different scenarios. As I understand it (and anyone please correct me if this is wrong) that the JVM absolutely does cache a DNS lookup so neither a host file update or a fully propagated DNS change will invalidate that cache. However, if the original IP address is detected as "invalid" because it is unreachable, the JVM will ditch the cached entry and refresh from the configured naming service. So in theory, provided your remote service has a cold, hard failure, your Java app should automatically pick up the new address. The problem is that frequently failovers are driven by partially impacted servers and not cold, hard failures. In these cases, the original address is still reachable and the only way to pick up the new address to to either deliberately kill the old service's IP connectivity (not always an option since this stalls triage and remediation) or recycle the Java app which extends the outage window. It would be nice if the JVM had a management interface (through JMX) that could flush all or specified cached addresses.
Having said that, even if the above was completely addressed, each app has its own issues. For example, hopefully you are using validating database connection pools which will automatically ditch failed connections and create new ones (to the failover target) in the process of handing out connections from the pool. Failing that, hopefully your connection pools have a management interface that allows you to flush the pool and repopulate.
Unfortunately, there are all too many apps around that do not have this richness built in and once a remote resource is impacted, a recycle is the only way to re-engage. The long and the short of all this is that business continuity plans, or more specifically, the detailed procedures to handle an event, are challenging to compile since they may be different for each individual application and the exact nature of the failure may further dictate what the procedure should be, making what should be an absolutely unambiguous set of procedures into a series of decision making and analysis at a time when one can least afford to do so.
Do you see it in any more black-and-white than I do ?
Re: DNS Caching - Non Deterministic Behaviour
How do someone is able to run any modern application server with a security manager installed, when application server vendors don't even care to provide a starting policy file that will grant the minimal set of grant declarations to start their servers is beyond my understanding.
As for the failover options, as long as the libraries have configurable timeouts, you could go on with SRV records on your DNS, instead of As or CNAMEs. Google for "DNS SRV FAILOVER". SRV records were designed exactly to address this issue.
You can find a good description on SRV records and their usage here on:
Anyway, it's a good article. It's good to see people starting to realize that caring for non-functional as performance and scalability can really make or save your day.
Other day, after a very long session of profiling and code archaeology, I crafted a phrase, which I pledge the readers to take with some sense of humour before crucifying me.
The Great Knuth said:
"Early Optmization is the root of all evil"
But Marcos Eliziario, who is a poor programmer, known of no one, said at 2 AM after two sleepless days:
"Reckless regard for optimization is the cube of an even greater evil"
Developers are not taking advantage of the platform enabled technologies
ticking all the boxes...
Anyway, just wanted to say it hit the nail on the head for me - I look forward to your next article.