InfoQ

News

Fault Tolerance and the Grid

Posted by Mark Little on Sep 18, 2007 03:50 PM

Community
Architecture,
SOA
Topics
Transactions Processing,
Grid Computing
Tags
Transactions
Arjuna Technologies, a spin-off from Hewlett-Packard and (most of) the team behind the world's first Java Transaction Service and the Web Services products, has recently been looking at the world of Grid and how to apply their expertise. According to a recent white paper:
The trade-off for providing agility and greater resource utilisation is the increased complexity of the IT infrastructure. [...] However, as a result, data sharing is more common and its effects are harder to predict. It is much more difficult for a user to clearly understand the behaviour of a dynamically evolving IT infrastructure.  This can result in severe problems, particularly if the data resources were originally designed to be accessed within a siloed environment, or if the subtleties of shared access were overlooked in the design of the infrastructure.
As the paper points out, there are trade-offs to be made from the move to SOA and greater agility: there's no such thing as a free lunch. Although data sharing is increased, the lack of overall control and knowledge concerning the execution of the environment, makes it difficult to reason about the degree and nature of the sharing. This complicates things from a reliability and fault tolerance perspective. The paper goes on to discuss how increased data sharing, particularly in enterprise Grids, present problems to infrastructure providers when considering how to guarantee consistency and coherency of data in the presence of failures and concurrent access. Furthermore as they point out, although replication, caching and partitioning of the data may help to improve performance and availability, but:
... addressing the performance problems introduces new issues. Synchronisation must be introduced to ensure coherent or consistent views, and this requires a protocol which communicates between the distributed parties.
Arjuna believes that popular Grid solutions assume little or no data sharing, which makes fault tolerance relatively straightforward. Otherwise the infrastructure must ensure that inconsistent state is not allowed to pollute the application.
... a system with the appropriate support could either ‘rollback’ the state to a previously consistent state (backward recovery), or ‘compensate’ to create a new consistent state  (forward recovery). Without this, there is the real danger of corrupting the data on which the enterprise relies.
Which is essentially what we take for granted in JEE, CORBA, .NET and Web Services, so nothing fundamentally new here. However, the contention that the only kind of fault tolerance many existing Data Grid solutions provide is by restarting applications, which is not sufficient for applications that share data, is interesting. As is the belief that they are inadequate because they only focus on data coherency without providing scoping mechanisms for data consistency. It would be the same as using transactions in JEE without any form of concurrency control.

There has been a lot of work in recent times on adding fault tolerance to the Grid, some of which share the conclusions in the paper concerning necessary infrastructure updates:

  • Identify data sharing and protect applications from the consequences: fault-prevention
  • Monitor data sharing: fault-detection
  • Record data changes to aid recovery: fault-recovery
Several questions remain: are users of existing Data Grid infrastructures feeling the pain of missing these components? If not, why not, since these are critical capabilities within other distributed systems? Is data sharing within Grids a minority use case? Maybe compensation transactions are best handled within the application rather than the infrastructure?

6 comments

Reply

data grids by Cameron Purdy Posted Sep 19, 2007 8:48 AM
Re: data grids by Steve Caughey Posted Sep 20, 2007 8:13 AM
Transaction in highly scalable evironment by Nati Shalom Posted Sep 19, 2007 12:09 PM
Re: Transaction in highly scalable evironment by Brian Oliver Posted Sep 20, 2007 6:08 AM
Re: Transaction in highly scalable evironment by Steve Caughey Posted Sep 20, 2007 8:56 AM
Atomikos=3rd generation TP monitor by Guy Pardon Posted Oct 9, 2007 10:38 AM
  1. Back to top

    data grids

    Sep 19, 2007 8:48 AM by Cameron Purdy

    Oracle's Coherence Data Grids already provides a stable, resilient, single system image with information consistency, once-and-only-once processing guarantees, and maintains that stability and those qualities of service even during / after server failure. Having gone down the path before, viewing a grid as multiple servers requiring 2PC transactions to maintain consistency is an extremely poor model for both scalability and availability. A Data Grid composed of multiple nodes should have its consistency maintained no differently than the RAM in a multi-CPU NUMA server; the requirements, the goals and the conceptual model are the same. Peace, Cameron Purdy Oracle Coherence: The Java Data Grid

  2. Back to top

    Transaction in highly scalable evironment

    Sep 19, 2007 12:09 PM by Nati Shalom

    "Several questions remain: are users of existing Data Grid infrastructures feeling the pain of missing these components? If not, why not, since these are critical capabilities within other distributed systems? Is data sharing within Grids a minority use case? Maybe compensation transactions are best handled within the application rather than the infrastructure?
    Transactions in a highly distributed systems should be dealt completely differently the it is currently dealt with in the J2EE model. I've recently reviewed Amazon approach to this challange in the following post: Lessons from Pat Helland: Life Beyond Distributed Transactions Below is a short summary from that analysis: Distributed Transactions are a common pattern used to ensure ACID properties are met between distributed resources. Since the late 70s, the fIrst generation of this model has been widely used in many large-scale applications that struggle with the difficulties of multiplexing many online terminals. This led to the emergence of the 1st generation TP Monitors (TPMs), such as Tuxedo (Now owned by BEA). The emergence of web based applications in the late 90s drove the creation of 2nd generation TPMs, in the form of JEE application servers, to address similar needs using more open and modern technologies "The increased business demand for greater scalability led many to the realization that the current transaction model is a bottleneck due to its inherit centralized approach" This challenge leads us to the emergence of a third generation of TPMs,  or what Gartner calls Extreme Transaction Processing (XTP). I also gave online presentation on the topic on the last SpringWorld event - the online presentation can be found here Nati S. GigaSpaces Write Once Scale Anywhere

  3. Back to top

    Re: Transaction in highly scalable evironment

    Sep 20, 2007 6:08 AM by Brian Oliver

    I completely agree. If someone wants to break-free of the typical constraints and high-latencies of standard multi-phase transactions with in any environment (not just JEE), they have to do something very different.... that means being transaction-free. aka: Amazon / Ebay style! Unfortunately the linked presentation fails to suggest a third generation TPM, but instead suggests a mechanism for using Gigaspaces to perform processing using a "Space", which as it points out in Slide #15;

    Space provides Messaging and Data using the same underlying technology
    [as SOA on J2EE], which typically means the use of TCP/IP, Transaction Managers, JMS, JDBC and RMI? Hardly "different". Hardly "extreme". Potentially "scalable". Importantly, the presentation completely fails to mention that each use of the Space verbs / methods (Read, Write, Take, Notify) must be "transactional" if one requires any level of data integrity or guarantee. (from the Space spec) ie: No transaction on a Space = No data guarantee = Potential loss of work / trades / money in a financial system. So if a processing unit is performing work (a transaction) against a space and the said space activity is not synchronously written to disk AND/OR synchronously written to a backup space (and probably to disk on the backup space) (all managed by a TM), any changes in the environment (like a really long GC, process death etc), will cause independent client-side fail-over and unrecoverable loss of transactions. This is not theory. Having to perform multi-phase (ie: hand-shaking) transactional commits/rollbacks across many devices (including spaces) to guarantee integrity of data is in no way "extreme". It's not moving away from a "centralized approach" as it is still dependent on a TM / disk. Supporting Spring is great. It's a huge community and Open Spaces looks really cool. People however should be aware of the underlying risks and requirements for ensuring data integrity with the approaches suggested in the presentation, especially with in financial systems. They should also be made aware that they are responsible for handling all of the underlying Space Verb RemoteExceptions for recovery (ie: every Space verb may throw an RMI RemoteException). Simply wrapping a solution with Spring does not absolve the developer of any underlying work. As they say, "beauty is only skin deep" or perhaps "API deep". There's theory and then there's practice. -- Brian Oliver Oracle Coherence: The Data Grid

  4. Back to top

    Re: data grids

    Sep 20, 2007 8:13 AM by Steve Caughey

    Cameron, Arjuna aren’t proposing the use of 2PC across distributed servers – just pointing out that applications have a wide variety of coherency and consistency requirements. Application-controlled distributed 2PC is just one mechanism for satisfying particularly stringent requirements (so long as the user is willing to sacrifice availability under certain circumstances). DSM systems are another solution, as are centralised databases. But all have inherent strengths and weaknesses. I agree that infinitely-sized, cross-internet, shared memory which is completely coherent and consistent is the dream solution! It would be nice to model memory as if differences in latency, availability and reliability don’t exist, but for many applications those differences can be significant and become increasingly important for an agile enterprise which is dynamically reconfiguring its infrastructure and dynamically deploying applications. At Arjuna, we’re particularly interested in how the end-user in an agile enterprise can specify consistency and coherency requirements for their application data (in a technology-independent way) and how the underlying infrastructure can apply a variety of techniques to collectively satisfy those requirements. Regards, Steve Caughey http://www.arjuna.com/

  5. Back to top

    Re: Transaction in highly scalable evironment

    Sep 20, 2007 8:56 AM by Steve Caughey

    Nati and Brian, I (partially) agree! As I’ve pointed out to Cameron (above) Arjuna aren’t advocating the use of distributed 2PC as the solution to all of these problems. Our white paper just talks about the requirements for consistency and coherency without advocating any particular solution. I guess that, as we’ve been utilising distributed 2PC in our products and our research for twenty years now, it’s understandable that everyone assumes it’s our preferred solution. In fact distributed 2PC is just one technology that works if you have strict coherency and consistency requirements but are willing to accept that (as it’s built on a blocking protocol) then your data won’t be available until the protocol completes. This is not suitable for many applications (particularly ones where availability is paramount – including those you mentioned, Brian) but try telling some of IBM and BEA’s biggest customers that they don’t need transactioning! Werner Vogels (Amazon’s CTO) talks about the need for a range of solutions for data management in this excellent presentation: http://www.infoq.com/presentations/availability-consistency;jsessionid=F7BE4A9FAD61156F2FDB521C69488A85 Regards, Steve Caughey http://www.arjuna.com/

  6. Back to top

    Atomikos=3rd generation TP monitor

    Oct 9, 2007 10:38 AM by Guy Pardon

    Interesting discussion. However, the comments seem to repeat the common misconceptions about transaction managers: -they do NOT have to be centralized (as claimed) -they do NOT have to cause unneeded overhead (as claimed) See my blog entry on how our product (ExtremeTransactions™) addresses these issues to yield perfect scalability. Also check the referred Nancy Lynch (MIT) paper - it has interesting implications for those who claim they can do distributed services without transaction manager. Not that I am saying that transactions are the holy grail, it is just that you don't have to discard them based on outdated assumptions. Guy

Exclusive Content

Business Natural Languages Development in Ruby

Jay Fields presents his concept of Business Natural Languages - a type of Domain Specific Languages geared towards being readable by domain experts.

Distributed Version Control Systems: A Not-So-Quick Guide Through

Adoption and interest for Distributed Version Control Systems is constantly rising. We will introduce the concept of DVCS and have a look at 3 actors in the area: git, Mercurial and Bazaar.

Segundo Velasquez and Agile as Seen Through the Customer's Eyes

Deborah Hartmann interviewed Segundo Velasquez about his experience as customer with an Agile team during the initial phase of software design of a product.

Fine Grained Versioning with ClickOnce

David Cooksey shows how to fine grained versioning to a ClickOnce deployment using an HttpHandler written with ASP.NET, making partial rollouts to a test audience much easier.

Implementing Manual Activities in Windows Workflow

Windows workflow (WF) is an excellent framework for implementing business processes, but lacks support for human activities. This article describes a completely generic approach for changing this.

Markus Voelter about Software Architecture Documentation

In this interview taken during OOPSLA 2007, Markus Voelter talks about the importance of documenting the software architecture, and gives some good and also bad examples on how it could be done.

Voca, UK's largest payment processing engine running Spring

William Soo and Meeraj Kunnumpurath discuss the Voca transaction processing system, architectural challenges and requirements, Voca's Spring/J2EE architecture, and the future SEPA architecture.

Patterns for securing architectures

Security is about trade-offs. Only a few have the expertise to design good security. This talk focuses on Security Patterns, such as Role-based Access Control, Single Access Point, and Front Door.