
# Application Failover using AOP

AOP has recently been at the center of controversy, with Gavin King dismissing it as a totally overhyped, failed technology, and Cedric Beust expressing serious doubts about aspects making it into mainstream programming, branding AOP a great idea that will remain the privilege of a few expert developers. This article does not try to build a fan following for Aspect Oriented Programming (AOP). It describes how the technology delivered real value on a Java EE project for a large financial organization by addressing critical crosscutting concerns that arrived as last-minute changes in project requirements. The scenarios described and the solutions implemented show how AOP complements OOP in addressing the orthogonal concerns of modeling a business application.

## The Problem

We have been working on the development of a securities trading back office solution for a large financial organization, using Java EE technology with Oracle 10g RAC as the database cluster and WebSphere MQ as the messaging middleware. The project was already in the UAT phase when management decided that we needed to implement transparent application failover services over the clustered architecture.

Oracle 10g RAC supports FCF (Fast Connection Failover) offering an elegant way for JDBC applications to take advantage of the connection failover facilities. But the real challenge here is to handle the failover at the application layer and be transparent to the user through a retry-and-recover scheme.

In case of an Oracle node failover, the following sequence of events occurs:

• A database instance fails, leaving several stale connections in the connection cache.
• The RAC mechanism in the database generates a RAC event which is sent to the JVM containing JDBC.
• The daemon thread inside the JVM finds all the connections affected by the RAC event, notifies them of the closed connection through SQL exceptions, and rolls back any open transactions. When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction.

If Fast Connection Failover (FCF) is enabled, when one RAC node goes down, the connection cache is automatically invalidated and all unused connections are re-established with a new node. However, that is not true for the connections that are already being used by the application. When the application tries to use a connection that was established before the node failover, a SQLException is thrown (ORA-17008, Closed Connection). The application has to retry the connection manually, and FCF guarantees that the next connection attempt will succeed.
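To make the retry decision concrete, here is a minimal sketch of how application code could recognise the closed-connection failure described above. The names `FailoverErrors` and `isClosedConnection` are hypothetical and do not come from the project; only the error code 17008 is taken from the text. Walking the exception chain with `getNextException()` is an added precaution, not something the article says is required.

```java
import java.sql.SQLException;

/** Hypothetical helper: tells a failover-related SQLException apart from other database errors. */
final class FailoverErrors {

    // Vendor error code reported together with "Closed Connection" (ORA-17008 in the text above).
    static final int CLOSED_CONNECTION = 17008;

    private FailoverErrors() { }

    /** Returns true if the exception, or any exception chained to it, carries the closed-connection code. */
    static boolean isClosedConnection(SQLException ex) {
        for (SQLException e = ex; e != null; e = e.getNextException()) {
            if (e.getErrorCode() == CLOSED_CONNECTION) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would use this check to decide between retrying with a fresh connection and reporting an ordinary database error.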

## The Brute Force Solution

The solution to the above problem needs to address the manual retry issue through a suitable retry-and-recover scheme. We realized that, at the application level, we needed to treat ORA-17008 specially, incorporating handlers that enable automatic retries. The problem is that the code base, which has evolved over the last 2 years, has more than 2 million lines of Java and JSP, with more than 6000 classes and 500 database tables, and there are zillions of instances like the following:

```java
long id = ...;
try {
    Instrument instr = new Instrument(id, conn);
} catch (SQLException ex) {
    throw new KeyedException("cam.error.failed.retrieve.instrument", ex);
}
...
```

In snippets like the one above, SQLException is a generic checked exception covering all database-related failures, so it has to be caught at sites littered all over the code base (gosh, the pains of checked exceptions :-( we should have wrapped them in unchecked ones, as Spring does). A brute force approach would be to add specialized handlers at every site where SQLException is caught. We had to rule this out: it was UAT season, and the impact on the entire code base would have been monstrous. The client would definitely not have been amused!

## Enter Aspects

After careful analysis of the code base, we found that the main areas of impact were the large number of service components and console applications, where we needed to implement the retry-and-recover scheme. For historical reasons we are not using EJB; instead, all service components and console applications have launcher base classes as drivers. Still, the functionality we needed to address covered more than half of our total code base: a real crosscutting concern.

It was then that we thought of incorporating aspects to address this issue. The following solution was proposed and implemented:

1. Define a pointcut that matches the handlers of SQLException.
2. Define an advice to be executed on the pointcut that intercepts the exception and throws a typed exception.
3. Handle this typed exception in only 2 base classes: one for the Service Components and the other for the Console Applications. These two classes implement the retry-and-recover functionality within their handler for this error.

The following aspect implements the skeleton of the above scheme:

```java
import java.sql.SQLException;

public aspect AspectFastConnFailOver {

    // matches every catch block for SQLException or one of its subclasses
    pointcut sqlHandler(SQLException exception):
        handler(SQLException+) && args(exception);

    // advice executed before each handler of SQLException
    // and its derived exceptions
    before(SQLException exception): sqlHandler(exception) {
        ...
        // handle only if non-UI
        if (!Application.getInstance()
                .getContext()
                .getCallerIdentity()
                .isInteractiveUser()) {
            if (exception.getErrorCode() == Globals.FCF_SQLEX_ERRORCODE) {
                throw new DatabaseNotAvailableError();
            }
        }
    }
    ...
}
```
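The aspect above only converts the failure into a typed error. The retry-and-recover loop itself lives in the two base classes, which the article does not show. What follows is a minimal sketch of what such a loop could look like, not the project's actual code. It assumes `DatabaseNotAvailableError` is an unchecked `Error`, and the names `RetryingLauncher` and `runWithRetry`, the attempt limit and the pause between attempts are all hypothetical.

```java
import java.util.function.Supplier;

// Assumption: unchecked, so it can propagate out of any catch site without being declared.
class DatabaseNotAvailableError extends Error {
    DatabaseNotAvailableError() {
        super("database node unavailable, retry required");
    }
}

/** Hypothetical launcher base class; the real base classes are not shown in the article. */
class RetryingLauncher {
    private final int maxAttempts;   // hypothetical limit on how often a unit of work is replayed
    private final long pauseMillis;  // hypothetical wait that gives the failover time to complete

    RetryingLauncher(int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.pauseMillis = pauseMillis;
    }

    /** Runs one unit of work and replays it from the start if the database was lost mid-flight. */
    <T> T runWithRetry(Supplier<T> unitOfWork) {
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.get();
            } catch (DatabaseNotAvailableError e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and let the launcher report the failure
                }
                pause();
            }
        }
    }

    private void pause() {
        try {
            Thread.sleep(pauseMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

Because the database has already rolled back the local transaction by the time the error surfaces, the sketch replays the whole unit of work rather than resuming midway.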

## Implementing MQ Failover

Once transparent application failover was achieved for the database server, we decided to try a similar implementation scheme for the WebSphere MQ services, which are deployed over a cluster configured using Storage Foundation 4.0/HA and the MQ agent server framework, both from Veritas.

With Oracle 10g RAC, whenever a failover takes place, an event is generated and propagated to the application's JDBC layer with a particular error code. Based on the error code, JDBC invalidates all the unused connections in the pool and rolls back the transaction associated with the current connections. At the application level, we need to trap the error code and retry acquiring a connection. The next call creates a new connection, which should succeed if the failover has completed by that time.

The situation with MQ failover is complicated by the fact that no event is sent back to the application, since the Veritas cluster does not handle the failover intrinsically. Hence the application needs to detect the failover, invalidate all connections and sessions in the pool, and roll back the transaction. As with SQLException, the JMS specification allows JMSException to be thrown from potentially every method of interfaces such as Connection, Session, Receiver, Sender and Browser. Hence the retry-and-recover logic needs to be implemented in a centralized handler through appropriate pointcuts.

Here is a similar aspect fragment that does the interception:

```java
import javax.jms.JMSException;

public aspect AspectFailOver {

    // matches every catch block for JMSException or one of its subclasses,
    // except in the excluded types and methods
    pointcut jmsHandler(JMSException exception):
        handler(JMSException+) && args(exception)
        && !within(...)
        && !withincode(...);

    // advice executed before each handler of JMSException
    // and its derived exceptions
    before(JMSException exception): jmsHandler(exception) {
        ...
        if (!Application.getInstance()
                .getContext()
                .getCallerIdentity()
                .isInteractiveUser()) {
            if (isMQFailoverException(exception)) {
                throw new MQNotAvailableError();
            }
        }
    }
    ...
}
```

As with the Oracle failover, the error MQNotAvailableError is trapped in the two base classes for Service Components and Console Application Launchers, which implement the retry-and-recover loop.
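Unlike the database case, the MQ recovery step has to clean up pooled resources itself before retrying. The following is a rough sketch of that idea under stated assumptions, not the project's code: `ResourcePool`, `MQRecovery`, `withFailover` and the rollback callback are hypothetical names, `MQNotAvailableError` is assumed to be an unchecked `Error`, and plain `AutoCloseable` objects stand in for JMS connections and sessions so the example stays free of external libraries.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Assumption: unchecked, so it can propagate out of any catch site without being declared.
class MQNotAvailableError extends Error { }

/** Hypothetical pool of messaging resources (stand-ins for JMS connections or sessions). */
class ResourcePool<R extends AutoCloseable> {
    private final Deque<R> idle = new ArrayDeque<>();
    private final Supplier<R> factory;

    ResourcePool(Supplier<R> factory) {
        this.factory = factory;
    }

    synchronized R acquire() {
        R resource = idle.poll();
        return resource != null ? resource : factory.get();
    }

    synchronized void release(R resource) {
        idle.push(resource);
    }

    /** Closes and discards every idle resource; close failures are ignored because the broker is gone. */
    synchronized void invalidateAll() {
        for (R resource; (resource = idle.poll()) != null; ) {
            try {
                resource.close();
            } catch (Exception ignored) {
                // the stale resource is being thrown away anyway
            }
        }
    }

    synchronized int idleCount() {
        return idle.size();
    }
}

/** Hypothetical recovery loop: roll back, drop stale pooled resources, then replay the work. */
final class MQRecovery {
    private MQRecovery() { }

    static <T> T withFailover(ResourcePool<?> pool, Runnable rollback, int maxAttempts, Supplier<T> work) {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.get();
            } catch (MQNotAvailableError e) {
                rollback.run();        // the application, not the cluster, must roll back
                pool.invalidateAll();  // nothing else marks the pooled resources as stale
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

The ordering is the point of the sketch: since no event reaches the application, the rollback and the pool invalidation both happen in the handler before the unit of work is replayed.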

## Minimum Impact, Maximum Effect

The result was great! We achieved the goal with minimal impact on the existing codebase, thanks to the power of AOP. We used AspectJ with compile-time weaving. The build time went up, but the client was happy that AOP, as the enabling technology, spared the codebase a major impact.

Debasish Ghosh, CTO, Anshin Software has more than 17 years of experience in the global IT industry and specializes in leading delivery of enterprise scale solutions for various clients ranging from small ones to Fortune 500 companies. He is the technology evangelist of Anshin Software and takes pride in institutionalizing best practices in software design and programming. He loves to program in Java, Ruby and Scala and has been trying desperately to get out of the unmanaged world of C++. As part of the core management team of Anshin Software, Debasish has been one of the major players in taking the company from a group of 4 people to its current strength of 150. These days he has been blogging extensively at http://debasishg.blogspot.com.


• ##### Sure

by Rickard Öberg

I've said this a bunch of times before, but I'll say it again: we created our CMS/portal from scratch in 2002 based entirely on AOP for both services and our domain model (as described in quite some detail on my blog). Some of the things could possibly be done using so-called "simpler" things like interceptors, but some things simply cannot be done using any other technology, such as the domain model whereby each "class" is composed by adding reusable aspects. Some things, like typed aspects as Adrian pointed out to Gavin, are things that are critical to have in a so-called real system.

So to say that AOP is a "failed technology" is simply uninformed and based on limited experience. There are many places where it does not fit, and there are some places where no other technology fits.

• ##### Re: Sure

by Corby Page

> but some things simply cannot be done using any other technology, such as the domain model whereby each "class" is composed by adding reusable aspects.

I certainly agree that AOP is an important and successful technology, but I do get a bit confused about why it is good to compose your domain model from aspects.

The joy I get from AOP is from its flexibility to recompose the system on the fly. I can make a broad, consistent change by introducing or editing a pointcut without touching my core domain or services.

If you decide to compose your domain out of aspects, then it puts a lot of pressure on you to get the aspect separation right the first time. This is something I probably wouldn't trust myself (or most other architects) to do. When you change your mind later, recomposing pointcuts is very easy. Recomposing all of the code that you used to construct your domain model is very hard.

It seems that in a non-trivial system, such as your CMS, composing your domain out of aspects puts too much pressure on getting the upfront design exactly right the first time, at the expense of being able to make flexible changes down the line.

• ##### Re: Sure

by Rickard Öberg

> I certainly agree that AOP is an important and successful technology, but I do get a bit confused about why it is good to compose your domain model from aspects.
>
> The joy I get from AOP is from its flexibility to recompose the system on the fly. I can make a broad, consistent change by introducing or editing a pointcut without touching my core domain or services.
>
> If you decide to compose your domain out of aspects, then it puts a lot of pressure on you to get the aspect separation right the first time. This is something I probably wouldn't trust myself (or most other architects) to do. When you change your mind later, recomposing pointcuts is very easy. Recomposing all of the code that you used to construct your domain model is very hard.
>
> It seems that in a non-trivial system, such as your CMS, composing your domain out of aspects puts too much pressure on getting the upfront design exactly right the first time, at the expense of being able to make flexible changes down the line.

I guess it depends on what tools you use to do schema migration. In our case we simply dump the database to a big fat XML file, run an XSL on it to migrate it to whatever new model or refactorings we've done, and import it back. We can move fields between aspects, split aspects, merge aspects, move fields from one object to a hashmap in a referenced object, etc. etc. And not only *can* we do this, *we have*.

In fact, using XML+XSL is so neat that even people using relational databases might want to consider it as a means for schema migration.

The benefit of being able to use AOP for domain model construction is that you don't have to repeat fields that you want everywhere (last modified timestamps, metadata, ACLs, etc.). The alternative would be to use plain OO composition, but that's worse from a lifecycle management point of view, and the code becomes more complicated (been there, tried that). Being able to do "obj instanceof ACL" anywhere to check its capabilities is very powerful and intuitive.

• ##### Good article showing a pragmatic use of AspectJ


Good article...

These kinds of applications show the real power of AOP and allow discussion surrounding it to move from theoretical arguments to practical facts. Basically, we can discuss merits of AOP in context of applications such as this; if someone hates AOP, they will have to show an alternative solution that is better than this :-).

Automatic failover and other kinds of error-handling are critical to high-availability applications. Yet, we don't see many applications implementing a failover scheme, much less a consistent scheme. The reason, I suspect, is the sheer cost of implementation. Even when you are ready to pay the cost, scattered code leads to an inconsistent implementation lacking any overall purpose and understanding. AOP-based solutions significantly reduce the cost, produce a consistent solution, and lower maintenance cost.

-Ramnivas
ramnivas.com

• ##### Modularized Failover

by Brian Sletten

It's nice to see this idea being discussed further. This was exactly the problem that made me yearn for AOP in 1999 even before I was familiar with the tools. I considered hand-rolling something with reflection but abandoned the approach because I had other things that needed to get done. The moment I read about AspectJ in 2000, I knew that it would be a great solution for this problem. Since then, with a new way of thinking, I have been able to solve several problems that would have been significantly more difficult or impossible without AspectJ. If there has been a failure, it is a messaging failure, not technical.

• ##### Re: Modularized Failover

by ARI ZILKA

JDBC failover and MQ failover can be achieved w/o AOP. I feel the example is on a good track, but can be trivially achieved with a "fail-over" capable JDBC driver since JDBC is already behind a clean interface that requires the developer to wrap everything in try/catch. Check out general-purpose clustering w/ fault tolerance. AOP concepts are powerful, I agree. When applied to achieve something like working around Java native Serialization, then AOP delivers where no other solution can.

www.terracottatech.com/

--Ari

• ##### Re: Modularized Failover

by Ken Carroll

Using AOP is a perfect match for dealing with failover. It simply cannot, except in very simple cases, be done with a failover-capable JDBC driver, because the driver is completely unaware of the full context, especially the full transaction context. If, for example, there are multiple SQL statements all contained within a single transaction then, with FCF and TAF (and others such as DB2 ACR), the client application must roll back and replay all of the SQL statements. Although this article is a summary and glosses over such details, they are vital to a reliable and robust failover solution. Far too many people have gone down the dead-end road of trying to use a failover-aware JDBC driver, only to experience seriously bad consequences.
