Application Failover using AOP

AOP has recently been at the center of controversy, with Gavin King dismissing it as a totally overhyped, failed technology and Cedric Beust expressing serious doubts about aspects ever reaching mainstream programming, calling AOP a great idea that will remain the privilege of a few expert developers. This article does not try to build a fan following for Aspect Oriented Programming (AOP). It describes how the technology delivered real value on a real-life Java EE project for a large financial organization, where critical cross-cutting concerns arrived as last-minute changes to the project requirements. The scenarios described and the solutions implemented show how AOP complements OOP in addressing the orthogonal concerns of modeling a business application.

The Problem

We have been working on the development of a securities trading back office solution for a large financial organization using Java EE technology, with Oracle 10g RAC as the database cluster and Websphere MQ as the messaging middleware. The project was already deployed in the UAT phase when management decided that we needed to implement transparent application failover services over the clustered architecture.

Oracle 10g RAC supports FCF (Fast Connection Failover) offering an elegant way for JDBC applications to take advantage of the connection failover facilities. But the real challenge here is to handle the failover at the application layer and be transparent to the user through a retry-and-recover scheme.

In case of an Oracle node failover, the following sequence of events occurs:

  • A database instance fails, leaving several stale connections in the connection cache.
  • The RAC mechanism in the database generates a RAC event which is sent to the JVM containing JDBC.
  • The daemon thread inside the JVM finds all the connections affected by the RAC event, notifies them of the closed connection through SQL exceptions, and rolls back any open transactions. When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction.

If Fast Connection Failover (FCF) is enabled, when one RAC node goes down, the connection cache is automatically invalidated and all unused connections are re-established with a new node. However, that is not true for connections that are already in use by the application. In this scenario, when the application tries to use a connection that was established before the node failover, a SQLException is thrown (ORA-17008, Closed Connection). The application has to retry the connection manually, and FCF guarantees that the next connection attempt will succeed.
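The manual retry described above can be sketched in plain Java. This is a minimal illustration, not the project's code: the class `FcfRetrySketch`, the `DbWork` interface and the `withRetry` helper are hypothetical names, and the failure is simulated with a hand-built SQLException instead of a real RAC node going down. It assumes the unit of work acquires a fresh connection from the cache each time it runs.

```java
import java.sql.SQLException;

// Hypothetical sketch: retry a unit of database work when the driver
// reports the "Closed Connection" error code (17008) after a failover.
class FcfRetrySketch {
    static final int CLOSED_CONNECTION = 17008;

    // A unit of work that is assumed to fetch its own connection on every run.
    @FunctionalInterface
    interface DbWork<T> {
        T run() throws SQLException;
    }

    static <T> T withRetry(DbWork<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.run();
            } catch (SQLException ex) {
                // Only the failover error is retried; anything else propagates.
                if (ex.getErrorCode() != CLOSED_CONNECTION) {
                    throw ex;
                }
                last = ex;
            }
        }
        throw last; // every attempt hit a closed connection
    }

    public static void main(String[] args) throws SQLException {
        int[] calls = {0};
        // Simulated work: the first call fails as if the node had gone down.
        String result = withRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SQLException("Closed Connection", "08003", CLOSED_CONNECTION);
            }
            return "recovered";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Writing this loop at one call site is easy. The difficulty discussed next is that the same handling is needed wherever SQLException is caught.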

The Brute Force Solution

The solution to the above problem needs to address the manual retry issue of the application through a suitable retry-recover scheme. We realized that at the application level, we need to address ORA-17008 specially, incorporating special handlers to enable automatic retries. The problem is that the code base, which has evolved over the last 2 years, has more than 2 million lines of Java and JSP consisting of more than 6000 classes and 500 database tables and there are zillions of instances like the following:

long id = ...;
try {
    Instrument instr = new Instrument(id, conn);
} catch (SQLException ex) {
    throw new KeyedException("cam.error.failed.retrieve.instrument", ex);
}
...

In all snippets like the above, SQLException is a generic checked exception covering every database-related failure, so handlers that catch it are littered all over the code base (gosh .. the pains of checked exceptions :-( .. we should have wrapped them in unchecked ones like Spring does). A brute force approach would be to add specialized handlers at every site where SQLException is caught. This scheme had to be ruled out: it was UAT season, and the impact on the entire code base would have been monstrous - the client would definitely not have been amused!

Enter Aspects

After careful analysis of the code base, we found that the main areas of impact were the large number of service components and the console applications, where we needed to implement the retry-and-recover scheme. For historical reasons we are not using EJB - instead, all service components and console applications have launcher base classes as drivers. Still, the functionality that we needed to address covered more than half of our total code base - a real cross cutting concern.

It was then that we thought of incorporating aspects to address this issue. The following solution was proposed and implemented:

  1. Define a pointcut to handle SQLException
  2. Define an advice to be executed on the pointcut that intercepts the exception and throws a typed exception
  3. Handle the typed exception in only 2 base classes - one for the Service Components and the other for the Console Applications. These two classes implement the retry-and-recover functionality within the handler for this error.

The following aspect implements the skeleton of the above scheme:

public aspect AspectFastConnFailOver {

    pointcut sqlHandler(SQLException exception):
        handler(SQLException+) && args(exception);

    // advice to be executed before the handler of SQLException
    // or any of its derived exceptions
    before(SQLException exception): sqlHandler(exception) {

        ...

        // handle only if non-UI
        if (!Application.getInstance()
                        .getContext()
                        .getCallerIdentity()
                        .isInteractiveUser()) {
            if (exception.getErrorCode() == Globals.FCF_SQLEX_ERRORCODE) {
                throw new DatabaseNotAvailableError();
            }
        }
    }
    ...
}
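The article does not show the two base classes that catch this error, so the following is only a sketch of what step 3 might look like. Everything except the name DatabaseNotAvailableError is an assumption: the stand-in error class, the `RetryingLauncher` base class, its `execute` hook and the attempt and pause parameters are hypothetical, and the real launchers presumably do more (logging, releasing resources, and so on).

```java
// Stand-in for the project's error type; only its name comes from the article.
class DatabaseNotAvailableError extends Error {
    DatabaseNotAvailableError() {
        super("database node unavailable");
    }
}

// Hypothetical launcher base class that centralizes retry-and-recover.
abstract class RetryingLauncher {
    private final int maxAttempts;
    private final long pauseMillis;

    protected RetryingLauncher(int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.pauseMillis = pauseMillis;
    }

    // The unit of work; assumed to acquire a fresh connection on every run.
    protected abstract void execute();

    public final void launch() {
        for (int attempt = 1; ; attempt++) {
            try {
                execute();
                return; // success, no retry needed
            } catch (DatabaseNotAvailableError e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                try {
                    Thread.sleep(pauseMillis); // give the failover time to complete
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

Because the advice throws an unchecked error from inside existing catch blocks, none of the intermediate code needs to change. Only the launcher at the top of the call stack has to know about it.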

Implementing MQ Failover

Once transparent application failover was achieved for the database server, we decided to try out a similar implementation scheme for the Websphere MQ Services deployed over a cluster configured using Storage Foundation 4.0/HA from Veritas and MQ agent server framework from Veritas.

With Oracle 10g RAC, whenever a fail-over takes place, an appropriate event is generated and propagated to the application's JDBC layer with a particular error code. Based on the error code, JDBC invalidates all the unused connections in the pool and rolls back the transactions associated with the connections currently in use. At the application level, we need to trap the error code and retry acquiring a connection. The next call to acquire a connection creates a new connection, which should succeed if the fail-over has completed by that time.

The situation with MQ fail-over is complicated by the fact that no event is sent back to the application level, since the Veritas cluster does not handle the fail-over intrinsically. Hence the application itself needs to detect the fail-over, invalidate all connections and sessions in the pool and roll back the transaction. As with SQLException, the JMS specification allows JMSException to be thrown from potentially every method of interfaces like Connection, Session, Receiver, Sender and Browser. Hence the retry-and-recover needs to be implemented in a centralized handler through appropriate pointcuts.

Here is a similar aspect fragment that does the interception:

public aspect AspectFailOver {

    pointcut jmsHandler(JMSException exception):
        handler(JMSException+) && args(exception)
        && !within(...)
        && !withincode(...);

    // advice to be executed before the handler of JMSException
    // or any of its derived exceptions
    before(JMSException exception): jmsHandler(exception) {
        ...
        if (!Application.getInstance()
                        .getContext()
                        .getCallerIdentity()
                        .isInteractiveUser()) {
            if (isMQFailoverException(exception)) {
                throw new MQNotAvailableError();
            }
        }
    }
    ...
}

As with the Oracle fail-over, the error MQNotAvailableError is trapped in the same two base classes for Service Components and Console Application Launchers, which implement the retry-and-recover loop.
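Since MQ sends no fail-over event, the pool invalidation mentioned earlier has to be done by the application before the retry loop runs again. The sketch below shows the idea with a generic pool of AutoCloseable resources. It deliberately does not use the JMS or Websphere MQ APIs: `InvalidatingPool` and its methods are hypothetical names, and a real implementation would hold JMS connections and sessions and would also track resources that are currently checked out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical pool that can discard every idle resource after a fail-over.
class InvalidatingPool<T extends AutoCloseable> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;

    InvalidatingPool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse an idle resource if there is one, otherwise create a new one.
    synchronized T acquire() {
        T resource = idle.pollFirst();
        return resource != null ? resource : factory.get();
    }

    synchronized void release(T resource) {
        idle.addFirst(resource);
    }

    // Close and drop all idle resources; returns how many were discarded.
    synchronized int invalidateAll() {
        int discarded = 0;
        T resource;
        while ((resource = idle.pollFirst()) != null) {
            try {
                resource.close();
            } catch (Exception ignored) {
                // The resource is already stale, so a failed close is expected.
            }
            discarded++;
        }
        return discarded;
    }
}
```

In this sketch the handler for MQNotAvailableError would call `invalidateAll()` once and then re-run the unit of work, so that the next `acquire()` creates resources against the surviving node.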

Minimum Impact, Maximum Effect

The result was great! We achieved the goal with minimal impact on the existing codebase - thanks to the power of AOP. We used AspectJ with compile time weaving. The build time went up, but the client was happy that AOP, as the enabling technology, spared the codebase a much larger change.

About the author

Debasish Ghosh, CTO, Anshin Software has more than 17 years of experience in the global IT industry and specializes in leading delivery of enterprise scale solutions for various clients ranging from small ones to Fortune 500 companies. He is the technology evangelist of Anshin Software and takes pride in institutionalizing best practices in software design and programming. He loves to program in Java, Ruby and Scala and has been trying desperately to get out of the unmanaged world of C++. As part of the core management team of Anshin Software, Debasish has been one of the major players in taking the company from a group of 4 people to its current strength of 150. These days he has been blogging extensively at http://debasishg.blogspot.com.
