
Scalability Worst Practices


Introduction

Over the course of building out a number of large, distributed systems I've had the opportunity to observe (and implement) some worst practices. Most of these worst practices start innocently enough, but if neglected they will jeopardize the growth and scalability of a system. A number of articles focus on best practices to ensure an easily maintainable and scalable system, but in this article I'll highlight some worst practices to avoid.

Technology

No one technology or architecture can fulfill all requirements. Knowing when to re-think existing ideas, how to look beyond the local scope and how to exercise diligent control over dependencies are all key to scalability. Let's look more closely at each.

The Golden Hammer

The Golden Hammer refers to the old adage: if all you have is a hammer, everything looks like a nail. Many developers fall prey to the idea of using only one technology, at the cost of building and maintaining infrastructure in the chosen technology when another technology, better suited to the specific problem area's functionality or abstractions, may offer it ready-made. Forcing a particular technology to work in ways it was not intended is sometimes counter-productive.

As an example, a common solution to the problem of persisting key-value pairs is to use a database. This choice is usually made because the organization or developer has a well-established database practice and naturally follows the same solution path for many problems. Problems arise when the features of a database (relational integrity, locking, joins and schemas) become bottlenecks or inhibit scaling: the cost of growing a database-backed solution for this kind of application is generally much higher than that of other available technologies. As the access rate of the key-value store increases, the concurrency model of a database can begin to degrade performance while the advanced features the database provides go unused. Many alternatives to the traditional relational database are available to address these shortcomings, such as CouchDB, SimpleDB or BigTable.
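The contract such an application actually needs is often far narrower than what a relational database offers. As an illustrative sketch (the interface and its names are mine, not from any particular product), code written against a minimal key-value contract can be backed by a database today and by a dedicated store tomorrow without touching its callers:

    // A hypothetical, minimal key-value contract: callers depend only on
    // put/get/remove, so the backing store (relational database, CouchDB,
    // SimpleDB, BigTable) can change without rippling through the code.
    public interface KeyValueStore {
        void put(String key, byte[] value);
        byte[] get(String key);   // returns null when the key is absent
        void remove(String key);
    }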

Another common Hammer is always using threads to program for concurrency. While threads do address concurrency, they come at the cost of increased code complexity and an inherent lack of real composability under the locking and access models currently available. Thousands of lines of code contain race conditions, potential deadlocks and inconsistent data access management because the most popular programming languages today manage concurrency with threads. There is a growing community demonstrating alternative methods for concurrency without the scaling issues of threads, namely the concurrency models promoted by Erlang or Stackless Python. Even if alternative languages are not chosen for production code, studying them for their concepts, such as message passing or asynchronous I/O, is good practice.
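To make the message-passing concept concrete, here is a minimal Java sketch (illustrative only, not drawn from any production system): one worker thread owns the mutable state, and producers communicate with it solely by enqueuing messages, so the state itself needs no locks.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Message-passing sketch: the worker thread alone owns `total`;
    // producers interact with it only through the queue, so the state
    // requires no locks and no shared-memory reasoning.
    public class MessagePassingSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> inbox = new ArrayBlockingQueue<>(1024);

            Thread worker = new Thread(() -> {
                long total = 0; // owned exclusively by this thread
                try {
                    while (true) {
                        int msg = inbox.take();  // block until a message arrives
                        if (msg < 0) break;      // negative value acts as a poison pill
                        total += msg;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("total = " + total);
            });
            worker.start();

            for (int i = 1; i <= 100; i++) {
                inbox.put(i);  // producers never touch the worker's state
            }
            inbox.put(-1);     // request shutdown
            worker.join();
        }
    }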

Resource Abuse

Developers tend to be very comfortable addressing issues at a small scale: using a profiler, understanding the space and time complexity of an algorithm or knowing which list implementation to use for a particular application. Not everyone, however, is as adept at understanding the constraints of a larger system, such as identifying the performance characteristics of a shared resource, knowing the clients of services or understanding database access patterns.

The common practice for scaling applications is continual horizontal deployment of redundant, stateless, shared-nothing services. In my experience, however, an often-overlooked aspect of this build-out is the shared resources those additional services utilize.

For example, if a particular service uses a database for persistent storage, it usually manages its connections to the database through a connection pool. Using a pool is a good practice, one which helps shield the database from excessive connection management. However, the database is still a shared resource, and the pools, while perhaps configured locally, must also be managed in the aggregate. There are two practices which result in failures:

  1. Continually increasing the number of services while not decreasing the maximum pool sizes
  2. Increasing the individual pool sizes without decreasing the number of services

In both cases, the aggregate number of connections must be managed in addition to configuring the application for the performance characteristics required. Additionally, database capacity must be continually monitored to keep the connectivity in balance.
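The arithmetic of that aggregate is simple enough to check mechanically; a sketch with hypothetical numbers:

    // Hypothetical numbers: the aggregate connection demand on a shared
    // database is roughly instances x per-instance maximum pool size.
    public class ConnectionBudget {
        public static void main(String[] args) {
            int serviceInstances = 40;        // assumed instance count
            int maxPoolPerInstance = 25;      // assumed per-instance pool ceiling
            int databaseMaxConnections = 800; // assumed database capacity

            int aggregate = serviceInstances * maxPoolPerInstance;
            System.out.println("aggregate demand: " + aggregate
                    + " of " + databaseMaxConnections + " available");
            if (aggregate > databaseMaxConnections) {
                System.out.println("over budget: shrink pools, add capacity, or deploy fewer instances");
            }
        }
    }

With these numbers the aggregate demand is 1,000 connections against a capacity of 800: exactly the kind of quiet overcommitment that horizontal build-outs introduce.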

It's crucially important to manage the availability of shared resources because when they fail, by definition, their failure is experienced pervasively rather than in isolation.

Big Ball of Mud

Dependencies are a necessary evil in most systems and failure to manage dependencies and their versions diligently can inhibit agility and scalability.

Dependency management for code has different flavors:

  • Compile the entire codebase together
  • Pick and choose components and services based on known versions
  • Publish models and services comprised of only backwards-compatible changes

Let's look at these scenarios. First, in the Big Ball of Mud pattern the entire system is compiled and deployed as one unit. This has the obvious advantage of moving dependency management to the compiler and catching some issues up-front, but it brings the scaling issues involved with deploying an entire system (including the testing, provisioning and risk involved with wholesale changes) every time. It becomes significantly more difficult to isolate changes to a system in this model.

In the second model the dependencies are picked up as desired; however, changes in transitive dependencies can mirror the difficulties of the first model.

In the third model the service is responsible for versioning dependencies and providing clients with backwards-compatible interfaces. This significantly decreases the burden on the clients, allowing gradual upgrades to newer models and service interfaces. Additionally, when the data requires transformation, it's done by the service and not the client -- this further establishes isolation. The requirement for backwards-compatible changes means patching, upgrading and rolling back changes all behave similarly with respect to client interaction.

Adopting a service architecture with backwards-compatible changes ensures that dependency issues are minimized. It also facilitates independent testing in a controlled environment and isolates versioned data changes from clients. All three of these benefits are important for isolating change. The recently released Google Protocol Buffers project is another advocate for backwards-compatible service models and interfaces.
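As a minimal Java sketch of what a backwards-compatible change looks like (the service and all names here are hypothetical), the old entry point survives and delegates to the new one, so existing clients continue to work unchanged while new clients opt in to the added parameter:

    // A hypothetical service evolved in a backwards-compatible way: the
    // v1 entry point is kept and delegates to v2, so existing clients
    // continue to work while new clients pass the added parameter.
    public class FareService {
        public static class Fare {
            final long amountCents;
            final String currency;
            Fare(long amountCents, String currency) {
                this.amountCents = amountCents;
                this.currency = currency;
            }
        }

        // v1 signature: never removed, never changed
        public Fare lookup(String origin, String destination) {
            return lookup(origin, destination, "USD"); // old default behavior
        }

        // v2 signature: the optional currency is the only addition
        public Fare lookup(String origin, String destination, String currency) {
            // real lookup elided; a fixed fare stands in for it here
            return new Fare(19900, currency);
        }
    }

The same discipline applies whether the interface is a Java API, a wire protocol or a Protocol Buffers message definition.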

Everything or Something

Another thing to consider when managing dependencies is how to package the contents of the application.

In some scenarios, such as Amazon Machine Images or Google AppEngine applications, the entire application and all of its dependencies are bundled and distributed together. This all-inclusive bundled approach keeps the application self-contained; however, it increases the overall size of the bundle and also forces a redeployment of the entire application bundle for a small change anywhere in the application (even to a shared library used by many applications on the same physical machine).

An alternative scenario pushes the dependencies for an application out to the host systems, and the application bundle itself contains only some portion of the dependency graph. This keeps the bundle size small but increases the deployment configuration by requiring certain components to be delivered to every machine before an application is serviceable. Not deploying the entire bundle in a self-contained manner inhibits the speed at which applications can be pushed to alternative, non-standard machines, because the dependencies might not be immediately available, the machine may be infrequently tested, or the dependencies may be incorrect.

The latter approach, managing dependencies across different scopes (global, machine, application), is ripe for oversight and complication. It increases the complexity of operations by decreasing configuration and dependency isolation. While there are exceptions to the rule, generally speaking isolation increases scalability, so choose the all-inclusive approach whenever possible.

In both code and application dependency management, the worst practice is not understanding the relationships and formulating a model to facilitate their management. Failure to enforce diligent control is a contributing scalability inhibitor.

Forgetting to check the time

In a distributed system the goal is often to hide from the developer as much as possible of the sophisticated machinery responsible for distributing invocations. This leaves the majority of development free to focus on core business logic rather than the necessities of distributed systems such as failover and timeouts. However, making a remote invocation look local means the developer will code as though the invocation is local.

Too often I've seen code with the unreasonable expectation that all remote requests succeed in a timely manner. Java, for example, only introduced a read timeout for its HttpURLConnection class in JDK 1.5, leaving generations of developers either spawning threads to kill the process or naively waiting for a response.
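On JDK 1.5 and later the fix is two explicit settings; a minimal sketch (the URL is a placeholder):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Explicit timeouts on an HTTP call (both available since JDK 1.5);
    // without them a connect or read can block indefinitely.
    public class TimeoutExample {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/"); // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(2000); // give up if no connection within 2s
            conn.setReadTimeout(5000);    // give up if the server goes quiet for 5s
            System.out.println("status: " + conn.getResponseCode());
            conn.disconnect();
        }
    }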

DNS lookups in Java are another example of potential time mismanagement. In a typical long-running system, the initial DNS lookups are executed and, unless explicitly configured not to, are cached for the lifetime of the JVM. If the external system changes the IP address for the hostname, the entry will resolve incorrectly, and in many cases the connection will hang because no connect timeout was set programmatically.
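The cache lifetime can be bounded with the networkaddress.cache.ttl security property, which must be set before the first lookup is cached; a minimal sketch (the TTL values are arbitrary illustrative choices):

    import java.security.Security;

    // Bounding the JVM's DNS cache: set early, before the first lookup,
    // so entries are re-resolved periodically instead of cached forever.
    public class DnsCacheConfig {
        public static void main(String[] args) {
            Security.setProperty("networkaddress.cache.ttl", "60");
            Security.setProperty("networkaddress.cache.negative.ttl", "10");
            // ... application start-up continues here
        }
    }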

To properly scale a system it is imperative to manage the time allotted for requests to be handled. There are many means of doing so, some built into the language (as with Erlang) and others available as libraries, e.g. libevent or Java's NIO. Regardless of implementation language or architecture, proper latency management is essential.
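One library-level technique, sketched minimally in Java: run the work on an executor and bound the wait with a deadline (the 500 ms budget is illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Bounding the time allotted to a request: run it on an executor
    // and refuse to wait past a deadline.
    public class BoundedCall {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Future<String> result = pool.submit(() -> {
                Thread.sleep(100); // stands in for a remote invocation
                return "response";
            });
            try {
                System.out.println(result.get(500, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                result.cancel(true); // interrupt the straggler rather than wait
                System.out.println("gave up after 500ms");
            } finally {
                pool.shutdown();
            }
        }
    }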

Runtime

Establishing a cost effective scaling model, managing dependencies and anticipating failure are all aspects of a fantastic architecture. However, the ability to easily deploy and operate the system in a production environment must be held in equal regard. There are a number of worst practices which jeopardize the scalability of a system.

Hero Pattern

One popular solution to the operation issue is a Hero who can and often will manage the bulk of the operational needs. The Hero Pattern can work in a small environment when one individual has the talent and capacity to understand an entire system, including the many nuances required to keep it functioning properly. For a large system of many components this approach does not scale, yet it is one of the most frequently-deployed solutions.

The Hero often understands service dependencies when no formal specification exists, remembers how to toggle features on and off or knows about systems everyone else has forgotten. The Hero is crucial but the Hero should not be an individual.

I believe the best solution to the Hero Pattern is automation. However, it also helps simply to rotate individuals from team to team if the organization has the capacity. In banking, it's sometimes mandatory to take vacation so any "I'll just run this from my workstation" activities can be brought to light.

Not automating

A system too dependent on human intervention, frequently the result of having a Hero, is dangerously exposed to issues of reproducibility and hit-by-a-bus syndrome. It's important that a particular build, deployment or environment be reproducible and automation with explicit metadata is a key to successful reproducibility.

On some open source projects, the artifact release process was dependent on individual developers creating artifacts on their own workstation, with nothing that ensured that the version of the artifact produced actually mirrored a working branch of the code in source control. In these situations, it was entirely possible to release software which was based on code that was never committed to source control.

As described above, the activities of the Hero are ripe for automation to ensure that it is possible for one person (or many people) to replace another with relative ease. An alternative approach to automation is additional process -- Clay Shirky offers an entertaining definition of process:

Process is an embedded reaction to prior stupidity.

Prior stupidity is inevitable – automation should capture the education.

Monitoring

Monitoring, like testing, is often one of the first items sacrificed when time is tight. Sometimes when I ask specific questions about the runtime characteristics of a component, no answer exists. The lack of visibility into the internals of a running system, and of the ability to quickly dive into answers to such questions, jeopardizes making accurate and critical decisions about where to look and what to look for.
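Even a bare-bones measurement layer answers many such questions. The following is a minimal Java sketch of the idea (the names are mine; it is not meant to represent any particular monitoring product):

    import java.util.concurrent.Callable;
    import java.util.concurrent.atomic.AtomicLong;

    // A bare-bones metrics wrapper (hypothetical, for illustration):
    // counts invocations and accumulates latency so that questions like
    // "how often is this called, and how slow is it?" have answers.
    public class ServiceMetrics {
        private final AtomicLong calls = new AtomicLong();
        private final AtomicLong totalNanos = new AtomicLong();

        public <T> T timed(Callable<T> body) throws Exception {
            long start = System.nanoTime();
            try {
                return body.call();
            } finally {
                calls.incrementAndGet();
                totalNanos.addAndGet(System.nanoTime() - start);
            }
        }

        public String snapshot() {
            long n = calls.get();
            long avgMicros = n == 0 ? 0 : totalNanos.get() / n / 1000;
            return "calls=" + n + " avgLatency=" + avgMicros + "us";
        }
    }

Wrapping each service invocation in timed() makes call counts and average latency available the moment someone asks.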

Orbitz is fortunate to have very sophisticated monitoring software which offers both fine-grained details about service invocations and the ability to visualize data to pinpoint problem areas. The metrics available from the monitoring infrastructure provide for fast and effective triaging of a problem.

Summary

After Amazon's recent S3 outage, Jeff Bezos remarked:

When we have a problem, we know the proximate cause, we analyze from there and find the root cause, we will find the root fix and move forward.

Software and system development is an iterative process, ripe with opportunities for both failure and success. Solutions which are simple but more difficult to scale have their place, especially as ideas or applications work through their immature phases. Perfect should not be the enemy of good. However, as systems reach maturity, the worst practices described above should be weeded out; success will follow.

Many thanks to Monika Szymanski for feedback on a draft of this article.

About the Author

Brian Zimmer is an Architect at travel startup Yapta, a respected member of the open source community and member of the Python Software Foundation. He worked previously as a Senior Architect at Orbitz. He maintains a blog at http://bzimmer.ziclix.com.
