Article: Scalability Worst Practices
In this article, former Orbitz lead architect Brian Zimmer discusses scalability worst pratices. Topics covered include The Golden Hammer, Resource Abuse, Big Ball of Mud, Dependency Management, Timeouts, Hero Pattern, Not Automating, and Monitoring.
From the article:
One popular solution to the operation issue is a Hero who can and often will manage the bulk of the operational needs. The Hero Pattern can work in a small environment when one individual has the talent and capacity to understand an entire system, including the many nuances required to keep it functioning properly. For a large system of many components this approach does not scale, yet it is one of the most frequently-deployed solutions.
The Hero often understands service dependencies when no formal specification exists, remembers how to toggle features on and off or knows about systems everyone else has forgotten. The Hero is crucial but the Hero should not be an individual.
Read the full article here.
Anti patterns also
DNS Caching - Non Deterministic Behaviour
Nice job outlining some very salient issues.
Your comments on DNS lookups in Java hit a sore point I have been wrestling with. We have been doing some planning, preparation and testing for disaster recovery and business continuity fail over events and one of the issues that keeps coming up is the prototypical scenario of a Java app server that communicates with a remote service ( a DB, legacy service, messaging server etc.) that fails and the Java app must quickly switch over to what [usually] amounts to a new IP address.
Sounds simple enough, but there are some subtleties that I have been trying to definitively iron out as to what exact behaviour we can expect under different scenarios. As I understand it (and anyone please correct me if this is wrong) that the JVM absolutely does cache a DNS lookup so neither a host file update or a fully propagated DNS change will invalidate that cache. However, if the original IP address is detected as "invalid" because it is unreachable, the JVM will ditch the cached entry and refresh from the configured naming service. So in theory, provided your remote service has a cold, hard failure, your Java app should automatically pick up the new address. The problem is that frequently failovers are driven by partially impacted servers and not cold, hard failures. In these cases, the original address is still reachable and the only way to pick up the new address to to either deliberately kill the old service's IP connectivity (not always an option since this stalls triage and remediation) or recycle the Java app which extends the outage window. It would be nice if the JVM had a management interface (through JMX) that could flush all or specified cached addresses.
Having said that, even if the above was completely addressed, each app has its own issues. For example, hopefully you are using validating database connection pools which will automatically ditch failed connections and create new ones (to the failover target) in the process of handing out connections from the pool. Failing that, hopefully your connection pools have a management interface that allows you to flush the pool and repopulate.
Unfortunately, there are all too many apps around that do not have this richness built in and once a remote resource is impacted, a recycle is the only way to re-engage. The long and the short of all this is that business continuity plans, or more specifically, the detailed procedures to handle an event, are challenging to compile since they may be different for each individual application and the exact nature of the failure may further dictate what the procedure should be, making what should be an absolutely unambiguous set of procedures into a series of decision making and analysis at a time when one can least afford to do so.
Do you see it in any more black-and-white than I do ?
Re: DNS Caching - Non Deterministic Behaviour
How do someone is able to run any modern application server with a security manager installed, when application server vendors don't even care to provide a starting policy file that will grant the minimal set of grant declarations to start their servers is beyond my understanding.
As for the failover options, as long as the libraries have configurable timeouts, you could go on with SRV records on your DNS, instead of As or CNAMEs. Google for "DNS SRV FAILOVER". SRV records were designed exactly to address this issue.
You can find a good description on SRV records and their usage here on:
Anyway, it's a good article. It's good to see people starting to realize that caring for non-functional as performance and scalability can really make or save your day.
Other day, after a very long session of profiling and code archaeology, I crafted a phrase, which I pledge the readers to take with some sense of humour before crucifying me.
The Great Knuth said:
"Early Optmization is the root of all evil"
But Marcos Eliziario, who is a poor programmer, known of no one, said at 2 AM after two sleepless days:
"Reckless regard for optimization is the cube of an even greater evil"
Developers are not taking advantage of the platform enabled technologies
ticking all the boxes...
Anyway, just wanted to say it hit the nail on the head for me - I look forward to your next article.
Roy Rapoport Aug 28, 2014