Fault Tolerance and the Grid

Arjuna Technologies, a spin-off from Hewlett-Packard and (most of) the team behind the world's first Java Transaction Service and the Web Services products, has recently been looking at the world of Grid and how to apply their expertise. According to a recent white paper:

The trade-off for providing agility and greater resource utilisation is the increased complexity of the IT infrastructure. [...] However, as a result, data sharing is more common and its effects are harder to predict. It is much more difficult for a user to clearly understand the behaviour of a dynamically evolving IT infrastructure. This can result in severe problems, particularly if the data resources were originally designed to be accessed within a siloed environment, or if the subtleties of shared access were overlooked in the design of the infrastructure.

As the paper points out, there are trade-offs to be made from the move to SOA and greater agility: there's no such thing as a free lunch. Although data sharing is increased, the lack of overall control and knowledge concerning the execution of the environment, makes it difficult to reason about the degree and nature of the sharing. This complicates things from a reliability and fault tolerance perspective. The paper goes on to discuss how increased data sharing, particularly in enterprise Grids, present problems to infrastructure providers when considering how to guarantee consistency and coherency of data in the presence of failures and concurrent access. Furthermore as they point out, although replication, caching and partitioning of the data may help to improve performance and availability, but:

... addressing the performance problems introduces new issues. Synchronisation must be introduced to ensure coherent or consistent views, and this requires a protocol which communicates between the distributed parties.

Arjuna believes that popular Grid solutions assume little or no data sharing, which makes fault tolerance relatively straightforward. Otherwise the infrastructure must ensure that inconsistent state is not allowed to pollute the application.

... a system with the appropriate support could either ‘rollback’ the state to a previously consistent state (backward recovery), or ‘compensate’ to create a new consistent state (forward recovery). Without this, there is the real danger of corrupting the data on which the enterprise relies.

Which is essentially what we take for granted in JEE, CORBA, .NET and Web Services, so nothing fundamentally new here. However, the contention that the only kind of fault tolerance many existing Data Grid solutions provide is by restarting applications, which is not sufficient for applications that share data, is interesting. As is the belief that they are inadequate because they only focus on data coherency without providing scoping mechanisms for data consistency. It would be the same as using transactions in JEE without any form of concurrency control.

There has been a lot of work in recent times on adding fault tolerance to the Grid, some of which share the conclusions in the paper concerning necessary infrastructure updates:

Identify data sharing and protect applications from the consequences: fault-prevention
Monitor data sharing: fault-detection
Record data changes to aid recovery: fault-recovery

Several questions remain: are users of existing Data Grid infrastructures feeling the pain of missing these components? If not, why not, since these are critical capabilities within other distributed systems? Is data sharing within Grids a minority use case? Maybe compensation transactions are best handled within the application rather than the infrastructure?

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Enterprise Architecture topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter