In this article, former Orbitz lead architect Brian Zimmer discusses scalability worst practices. Topics covered include The Golden Hammer, Resource Abuse, Big Ball of Mud, Dependency Management, Timeouts, Hero Pattern, Not Automating, and Monitoring.
From the article:
One popular solution to the operation issue is a Hero who can and often will manage the bulk of the operational needs. The Hero Pattern can work in a small environment when one individual has the talent and capacity to understand an entire system, including the many nuances required to keep it functioning properly. For a large system of many components this approach does not scale, yet it is one of the most frequently-deployed solutions.
The Hero often understands service dependencies when no formal specification exists, remembers how to toggle features on and off or knows about systems everyone else has forgotten. The Hero is crucial but the Hero should not be an individual.
Read the full article here.
Community comments
Anti patterns also
by mohd yaqub,
Impressive article. Some of these refer back to the documented anti-patterns we find in the architecture community.
DNS Caching - Non Deterministic Behaviour
by Nicholas Whitehead,
Hi Brian;
Nice job outlining some very salient issues.
Your comments on DNS lookups in Java hit a sore point I have been wrestling with. We have been doing some planning, preparation and testing for disaster recovery and business continuity failover events, and one of the issues that keeps coming up is the prototypical scenario of a Java app server that communicates with a remote service (a DB, legacy service, messaging server, etc.) that fails, so the Java app must quickly switch over to what [usually] amounts to a new IP address.
Sounds simple enough, but there are some subtleties that I have been trying to definitively iron out as to what exact behaviour we can expect under different scenarios. As I understand it (and anyone please correct me if this is wrong), the JVM absolutely does cache a DNS lookup, so neither a host file update nor a fully propagated DNS change will invalidate that cache. However, if the original IP address is detected as "invalid" because it is unreachable, the JVM will ditch the cached entry and refresh from the configured naming service. So in theory, provided your remote service has a cold, hard failure, your Java app should automatically pick up the new address.
The problem is that failovers are frequently driven by partially impacted servers rather than cold, hard failures. In these cases, the original address is still reachable, and the only way to pick up the new address is either to deliberately kill the old service's IP connectivity (not always an option since this stalls triage and remediation) or to recycle the Java app, which extends the outage window. It would be nice if the JVM had a management interface (through JMX) that could flush all or specified cached addresses.
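A minimal sketch of at least bounding that cache with the networkaddress.cache.ttl security property (this is an assumption on my part about the setup, and it only takes effect if it runs before the first InetAddress lookup):

    import java.security.Security;

    public class DnsCacheConfig {
        public static void main(String[] args) {
            // Bound the positive DNS cache to 30 seconds (the "cache forever"
            // default only applies when a security manager is installed).
            Security.setProperty("networkaddress.cache.ttl", "30");
            // Also bound the negative (failed-lookup) cache.
            Security.setProperty("networkaddress.cache.negative.ttl", "10");
        }
    }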
Having said that, even if the above were completely addressed, each app has its own issues. For example, hopefully you are using validating database connection pools, which automatically ditch failed connections and create new ones (to the failover target) in the process of handing out connections from the pool. Failing that, hopefully your connection pools have a management interface that allows you to flush the pool and repopulate it.
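For illustration only, a hedged sketch of such a validating pool using Apache Commons DBCP (the driver class and validation query below are assumptions; other pool implementations expose equivalent settings):

    import org.apache.commons.dbcp.BasicDataSource;

    public class ValidatingPool {
        public static BasicDataSource create(String url, String user, String pass) {
            BasicDataSource ds = new BasicDataSource();
            ds.setDriverClassName("org.postgresql.Driver"); // example driver, swap for your own
            ds.setUrl(url);
            ds.setUsername(user);
            ds.setPassword(pass);
            // Validate each connection as it is borrowed; connections pointing at the
            // failed host are discarded and recreated against the current target.
            ds.setValidationQuery("SELECT 1");
            ds.setTestOnBorrow(true);
            return ds;
        }
    }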
Unfortunately, there are all too many apps around that do not have this richness built in, and once a remote resource is impacted, a recycle is the only way to re-engage. The long and the short of all this is that business continuity plans, or more specifically the detailed procedures to handle an event, are challenging to compile, since they may differ for each individual application, and the exact nature of the failure may further dictate what the procedure should be. This turns what should be an absolutely unambiguous set of procedures into a series of decision-making and analysis steps at a time when one can least afford them.
Do you see it any more black-and-white than I do?
Cheers.
//Nicholas
Re: DNS Caching - Non Deterministic Behaviour
by Marcos Santos,
When a security manager is not set, the default for networkaddress.cache.ttl is 30 seconds; at least, that is the default behaviour on modern Sun JDKs.
How anyone manages to run a modern application server with a security manager installed is beyond my understanding, when application server vendors don't even care to provide a starting policy file with the minimal set of grant declarations needed to start their servers.
As for the failover options, as long as the libraries have configurable timeouts, you could use SRV records in your DNS instead of A or CNAME records. Google for "DNS SRV FAILOVER". SRV records were designed exactly to address this issue.
You can find a good description of SRV records and their usage at:
www.zytrax.com/books/dns/ch8/srv.html
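By way of a rough sketch, an SRV lookup can also be done from Java through the JNDI DNS provider (the service name below is hypothetical):

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.directory.Attributes;
    import javax.naming.directory.DirContext;
    import javax.naming.directory.InitialDirContext;

    public class SrvLookup {
        public static void main(String[] args) throws Exception {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
            DirContext ctx = new InitialDirContext(env);
            // Each SRV value is "priority weight port target"; clients try the
            // lowest-priority target that answers, which is what enables failover.
            Attributes attrs = ctx.getAttributes("_db._tcp.example.com", new String[] {"SRV"});
            System.out.println(attrs.get("SRV"));
        }
    }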
Anyway, it's a good article. It's good to see people starting to realize that caring about non-functional requirements such as performance and scalability can really make or save your day.
The other day, after a very long session of profiling and code archaeology, I crafted a phrase, which I beg the readers to take with some sense of humour before crucifying me.
The Great Knuth said:
"Early Optmization is the root of all evil"
But Marcos Eliziario, who is a poor programmer, known to no one, said at 2 AM after two sleepless days:
"Reckless regard for optimization is the cube of an even greater evil"
Developers are not taking advantage of the platform enabled technologies
by Sarma Pisapati,
Especially for .NET apps, apart from making the app stateless, it is always a best practice to host the app/web service under IIS on Windows 2003, where the platform provides the technologies for scalability. This is pretty easy with Network Load Balancing for a scale-out solution. Actually, the developer has to focus on cluster-aware applications to automate the configuration when a new node joins the cluster. In fact, more development effort is needed to provide a scale-up solution than a scale-out one. On another occasion, I observed developers spending time implementing SSL in code. They should take advantage of IIS configuration, which is easy to manage. After all, we don't want to write another framework or operating system. I think developers should focus on functionality and improving presentation techniques.
ticking all the boxes...
by Brendan Caffrey,
I'd echo the praise of the previous commenters. Major work recently in my world has included introducing/improving/increasing monitoring, removing runtime dependencies which cause untold problems, and pulling apart the big-ball-of-mud build and deployment process. Even the way monitoring goes by the wayside rang true, and the JVM DNS-cache issue came up recently during a disaster-recovery test, just as it did for Nicholas above. Pretty much all the stuff you've described was familiar, so I got that warm fuzzy feeling reading each paragraph.
Anyway, just wanted to say it hit the nail on the head for me - I look forward to your next article.
Regards.