Fault Tolerance and the Grid

Arjuna Technologies, a spin-off from Hewlett-Packard and (most of) the team behind the world's first Java Transaction Service and the Web Services products, has recently been looking at the world of Grid and how to apply their expertise. According to a recent white paper:
The trade-off for providing agility and greater resource utilisation is the increased complexity of the IT infrastructure. [...] However, as a result, data sharing is more common and its effects are harder to predict. It is much more difficult for a user to clearly understand the behaviour of a dynamically evolving IT infrastructure.  This can result in severe problems, particularly if the data resources were originally designed to be accessed within a siloed environment, or if the subtleties of shared access were overlooked in the design of the infrastructure.
As the paper points out, there are trade-offs to be made in the move to SOA and greater agility: there's no such thing as a free lunch. Although data sharing is increased, the lack of overall control over, and knowledge of, the executing environment makes it difficult to reason about the degree and nature of the sharing. This complicates things from a reliability and fault tolerance perspective. The paper goes on to discuss how increased data sharing, particularly in enterprise Grids, presents problems for infrastructure providers when considering how to guarantee consistency and coherency of data in the presence of failures and concurrent access. Furthermore, as they point out, replication, caching and partitioning of the data may help to improve performance and availability, but:
... addressing the performance problems introduces new issues. Synchronisation must be introduced to ensure coherent or consistent views, and this requires a protocol which communicates between the distributed parties.
Arjuna believes that popular Grid solutions assume little or no data sharing, which makes fault tolerance relatively straightforward. Otherwise the infrastructure must ensure that inconsistent state is not allowed to pollute the application.
... a system with the appropriate support could either ‘rollback’ the state to a previously consistent state (backward recovery), or ‘compensate’ to create a new consistent state  (forward recovery). Without this, there is the real danger of corrupting the data on which the enterprise relies.
This is essentially what we take for granted in JEE, CORBA, .NET and Web Services, so there is nothing fundamentally new here. However, the contention that the only kind of fault tolerance many existing Data Grid solutions provide is restarting applications, which is not sufficient for applications that share data, is interesting. So is the belief that those solutions are inadequate because they focus only on data coherency without providing scoping mechanisms for data consistency. It would be the same as using transactions in JEE without any form of concurrency control.
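
To make the distinction between backward and forward recovery concrete, here is a minimal sketch. It is not taken from the white paper: the ledger representation, the function names and the simulated failure are all assumptions chosen for illustration. Backward recovery discards the partial work and returns to the checkpoint, while forward recovery keeps the history and appends a compensating entry to reach a new consistent state.

```python
def balance(ledger, account):
    """Sum every ledger entry recorded for one account."""
    return sum(delta for name, delta in ledger if name == account)


def transfer(ledger, src, dst, amount, fail_after_debit=False):
    """Debit src, then credit dst; optionally fail in between (hypothetical fault)."""
    ledger.append((src, -amount))
    if fail_after_debit:
        raise RuntimeError("simulated failure between debit and credit")
    ledger.append((dst, amount))


def backward_recovery(ledger, src, dst, amount):
    """Roll back: truncate the ledger to the checkpoint taken before the transfer."""
    checkpoint = len(ledger)
    try:
        transfer(ledger, src, dst, amount, fail_after_debit=True)
    except RuntimeError:
        del ledger[checkpoint:]


def forward_recovery(ledger, src, dst, amount):
    """Compensate: keep the history and append an entry that reverses the debit."""
    try:
        transfer(ledger, src, dst, amount, fail_after_debit=True)
    except RuntimeError:
        ledger.append((src, amount))


rolled_back = [("alice", 100)]
backward_recovery(rolled_back, "alice", "bob", 30)

compensated = [("alice", 100)]
forward_recovery(compensated, "alice", "bob", 30)
```

In both cases the balances end up consistent, but only the compensated ledger still shows that a transfer was attempted. That difference matters once other parties may already have observed the intermediate state.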

There has been a lot of work in recent times on adding fault tolerance to the Grid, some of which shares the paper's conclusions concerning necessary infrastructure updates:

  • Identify data sharing and protect applications from the consequences: fault-prevention
  • Monitor data sharing: fault-detection
  • Record data changes to aid recovery: fault-recovery
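
The three capabilities listed above can be sketched together in a toy key-value store. This is only an illustration under stated assumptions: `MonitoredStore`, its methods and the client names are hypothetical and do not correspond to any real Data Grid product or to the design in the white paper.

```python
class MonitoredStore:
    """Hypothetical in-memory store; not a real Data Grid API."""

    def __init__(self):
        self._data = {}
        self._writers = {}   # key -> set of client ids that have written it
        self._undo_log = []  # entries of (client, key, existed_before, old_value)

    def write(self, client, key, value):
        # Fault-recovery: record the previous value before changing it.
        self._undo_log.append((client, key, key in self._data, self._data.get(key)))
        self._data[key] = value
        # Fault-detection: track which clients have touched each key.
        self._writers.setdefault(key, set()).add(client)

    def read(self, key):
        return self._data.get(key)

    def shared_keys(self):
        """Keys written by more than one client."""
        return {k for k, clients in self._writers.items() if len(clients) > 1}

    def recover(self, failed_client):
        """Undo the failed client's writes, newest first.

        Fault-prevention: writes to shared keys are not undone blindly;
        they are reported back so the caller can decide what to do.
        """
        shared = self.shared_keys()
        skipped, remaining = set(), []
        for entry in reversed(self._undo_log):
            client, key, existed, old_value = entry
            if client != failed_client:
                remaining.append(entry)
            elif key in shared:
                skipped.add(key)
                remaining.append(entry)
            else:
                if existed:
                    self._data[key] = old_value
                else:
                    self._data.pop(key, None)
                self._writers.pop(key, None)
        remaining.reverse()
        self._undo_log = remaining
        return skipped


store = MonitoredStore()
store.write("app-a", "x", 1)
store.write("app-a", "y", 2)
store.write("app-b", "y", 3)
skipped = store.recover("app-a")
```

Here the unshared key `x` is restored automatically, while the shared key `y` is left untouched and returned in `skipped`, because undoing it would overwrite another application's update.
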
Several questions remain: are users of existing Data Grid infrastructures feeling the pain of missing these components? If not, why not, since these are critical capabilities within other distributed systems? Is data sharing within Grids a minority use case? Maybe compensation transactions are best handled within the application rather than the infrastructure?

Community comments

  • data grids

    by Cameron Purdy,

    Oracle's Coherence Data Grid already provides a stable, resilient, single system image with information consistency and once-and-only-once processing guarantees, and it maintains that stability and those qualities of service even during and after server failure.

    Having gone down the path before, viewing a grid as multiple servers requiring 2PC transactions to maintain consistency is an extremely poor model for both scalability and availability. A Data Grid composed of multiple nodes should have its consistency maintained no differently than the RAM in a multi-CPU NUMA server; the requirements, the goals and the conceptual model are the same.


    Cameron Purdy
    Oracle Coherence: The Java Data Grid

  • Transaction in highly scalable environment

    by Nati Shalom,

    "Several questions remain: are users of existing Data Grid infrastructures feeling the pain of missing these components? If not, why not, since these are critical capabilities within other distributed systems? Is data sharing within Grids a minority use case? Maybe compensation transactions are best handled within the application rather than the infrastructure?"

    Transactions in highly distributed systems should be dealt with completely differently from the way they are currently dealt with in the J2EE model.

    I've recently reviewed Amazon's approach to this challenge in the following post: Lessons from Pat Helland: Life Beyond Distributed Transactions

    Below is a short summary from that analysis:

    Distributed Transactions are a common pattern used to ensure ACID properties are met between distributed resources. Since the late 70s, the first generation of this model has been widely used in many large-scale applications that struggle with the difficulties of multiplexing many online terminals. This led to the emergence of the 1st generation TP Monitors (TPMs), such as Tuxedo (now owned by BEA). The emergence of web based applications in the late 90s drove the creation of 2nd generation TPMs, in the form of JEE application servers, to address similar needs using more open and modern technologies.

    "The increased business demand for greater scalability led many to the realization that the current transaction model is a bottleneck due to its inherently centralized approach"

    This challenge leads us to the emergence of a third generation of TPMs, or what Gartner calls Extreme Transaction Processing (XTP).

    I also gave an online presentation on the topic at the last SpringWorld event; the presentation can be found here

    Nati S.
    Write Once Scale Anywhere

  • Re: Transaction in highly scalable environment

    by Brian Oliver,

    I completely agree. If someone wants to break free of the typical constraints and high latencies of standard multi-phase transactions within any environment (not just JEE), they have to do something very different... that means being transaction-free, aka Amazon / eBay style!

    Unfortunately the linked presentation fails to suggest a third generation TPM, but instead suggests a mechanism for using Gigaspaces to perform processing using a "Space", which, as it points out in Slide #15:

    Space provides Messaging and Data using the same underlying technology
    [as SOA on J2EE], which typically means the use of TCP/IP, Transaction Managers, JMS, JDBC and RMI? Hardly "different". Hardly "extreme". Potentially "scalable".

    Importantly, the presentation completely fails to mention that each use of the Space verbs / methods (Read, Write, Take, Notify) must be "transactional" if one requires any level of data integrity or guarantee. (from the Space spec) ie: No transaction on a Space = No data guarantee = Potential loss of work / trades / money in a financial system.

    So if a processing unit is performing work (a transaction) against a space and the said space activity is not synchronously written to disk AND/OR synchronously written to a backup space (and probably to disk on the backup space) (all managed by a TM), any changes in the environment (like a really long GC, process death etc), will cause independent client-side fail-over and unrecoverable loss of transactions. This is not theory.

    Having to perform multi-phase (ie: hand-shaking) transactional commits/rollbacks across many devices (including spaces) to guarantee integrity of data is in no way "extreme". It's not moving away from a "centralized approach" as it is still dependent on a TM / disk.

    Supporting Spring is great. It's a huge community and Open Spaces looks really cool. People should, however, be aware of the underlying risks and requirements for ensuring data integrity with the approaches suggested in the presentation, especially within financial systems. They should also be made aware that they are responsible for handling all of the underlying Space verb RemoteExceptions for recovery (i.e. every Space verb may throw an RMI RemoteException). Simply wrapping a solution with Spring does not absolve the developer of any underlying work. As they say, "beauty is only skin deep" or perhaps "API deep".

    There's theory and then there's practice.

    -- Brian Oliver

    Oracle Coherence: The Data Grid

  • Re: data grids

    by Steve Caughey,


    Arjuna aren’t proposing the use of 2PC across distributed servers – just pointing out that applications have a wide variety of coherency and consistency requirements. Application-controlled distributed 2PC is just one mechanism for satisfying particularly stringent requirements (so long as the user is willing to sacrifice availability under certain circumstances). DSM systems are another solution, as are centralised databases. But all have inherent strengths and weaknesses. I agree that infinitely-sized, cross-internet, shared memory which is completely coherent and consistent is the dream solution! It would be nice to model memory as if differences in latency, availability and reliability don’t exist, but for many applications those differences can be significant and become increasingly important for an agile enterprise which is dynamically reconfiguring its infrastructure and dynamically deploying applications. At Arjuna, we’re particularly interested in how the end-user in an agile enterprise can specify consistency and coherency requirements for their application data (in a technology-independent way) and how the underlying infrastructure can apply a variety of techniques to collectively satisfy those requirements.


    Steve Caughey

  • Re: Transaction in highly scalable environment

    by Steve Caughey,

    Nati and Brian,

    I (partially) agree!

    As I’ve pointed out to Cameron (above), Arjuna aren’t advocating the use of distributed 2PC as the solution to all of these problems. Our white paper just talks about the requirements for consistency and coherency without advocating any particular solution. I guess that, as we’ve been utilising distributed 2PC in our products and our research for twenty years now, it’s understandable that everyone assumes it’s our preferred solution. In fact distributed 2PC is just one technology that works if you have strict coherency and consistency requirements and are willing to accept that (as it’s built on a blocking protocol) your data won’t be available until the protocol completes. This is not suitable for many applications (particularly ones where availability is paramount, including those you mentioned, Brian) but try telling some of IBM and BEA’s biggest customers that they don’t need transactioning! Werner Vogels (Amazon’s CTO) talks about the need for a range of solutions for data management in this excellent presentation:


    Steve Caughey
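
The blocking behaviour of two-phase commit mentioned in the discussion above can be illustrated with a minimal, single-process sketch. It assumes a simplified protocol with no timeouts, logging or recovery, and all class, function and parameter names are hypothetical rather than taken from any product discussed here.

```python
class Participant:
    """Toy resource manager that takes part in a two-phase commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. After voting yes the participant is "prepared" and
        # must hold its locks until it learns the coordinator's decision.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

    def data_available(self):
        # Data stays locked while the outcome is still in doubt.
        return self.state != "prepared"


def two_phase_commit(participants, coordinator_fails_after_prepare=False):
    """Run both phases; optionally simulate a coordinator crash between them."""
    votes = [p.prepare() for p in participants]
    if coordinator_fails_after_prepare:
        return "in-doubt"  # prepared participants remain blocked
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok = [Participant("a"), Participant("b")]
ok_outcome = two_phase_commit(ok)

stuck = [Participant("a"), Participant("b")]
stuck_outcome = two_phase_commit(stuck, coordinator_fails_after_prepare=True)
```

In the `stuck` run every participant has voted yes but never hears the decision, so `data_available()` stays false for all of them. This is the availability cost being weighed against strict consistency.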

  • Atomikos=3rd generation TP monitor

    by Guy Pardon,

    Interesting discussion. However, the comments seem to repeat the common misconceptions about transaction managers:

    -they do NOT have to be centralized (as claimed)
    -they do NOT have to cause unneeded overhead (as claimed)

    See my blog entry on how our product (ExtremeTransactions™) addresses these issues to yield perfect scalability.

    Also check the referenced Nancy Lynch (MIT) paper: it has interesting implications for those who claim they can do distributed services without a transaction manager. I am not saying that transactions are the holy grail, just that you don't have to discard them based on outdated assumptions.

