Error Handling Considerations in SOA Analysis & Design

Sep 13, 2010 12 min read


Thorough analysis of error handling requirements during the SOA analysis and design phases is key to getting services designed and implemented correctly. When detailed requirements that identify error handling scenarios are missing, or when the team does not understand how to incorporate them into the analysis and design phases, development efforts tend to design services around happy-path functionality first. Such an approach can lead to significant project cost overruns, because incorporating error handling later requires considerable rework and may require the redesign of some components.

This paper looks at the error handling considerations associated with the design of re-usable services. It outlines which considerations apply during the SOA analysis and design phases and describes some best practices for designing them in, so that services are designed and implemented completely.

Introduction

Unlike in monolithic applications, error handling is a significant step in the design of SOA applications, because SOA applications integrate heterogeneous IT systems across organizational boundaries, including vendor and partner IT assets. Focusing on error handling early in the analysis and design phases ensures that appropriate error handling standards and guidelines are put in place for modules on different platforms. This paper identifies common error handling considerations that architects and designers need to address while working through the SOA solution design. SOA analysis and design tasks are broadly classified into three major phases, Service Identification, Service Specification and Service Realization, as identified in Service Oriented Modeling and Architecture by Ali Arsanjani. The rest of this paper is organized around the error handling considerations that apply to these three phases.

Error Handling during Service Identification

The goal of service identification is to produce a candidate service portfolio from which the re-usable services are identified. This phase involves analyzing the business artifacts package, which includes key requirements, business goals, capability models, the Business Process Analysis Model (BPAM), use cases, etc.

Types of Errors

Errors are broadly classified into two types:

Recoverable Errors - These are errors that client programs can recover from by taking an appropriate alternate execution path. Such errors result from a failure to meet a particular business rule.
Non-Recoverable Errors - These are errors that client programs cannot recover from. Errors of this kind result from unexpected conditions at runtime, such as programming errors (for example, null pointers) or unavailable resources.

Identification of Business Errors

Analyzing the business artifacts package provides many opportunities to discover business errors associated with services. If assets already exist for a business service, their component interfaces can be used to discover additional business errors that top-down analysis would otherwise miss. Business errors are what this paper refers to as recoverable errors. Once the service portfolio is in its final draft stage, evaluate the re-usable services for the following error handling considerations:

Business error scenarios – Detailed description of the condition that flags the business operation as invalid.
Error text – A brief description of the business error, as the service consumer will receive it.
Error code – Code that can be looked up for additional information about the error.
Suggestions – Feedback to the service consumer, such as examples of valid inputs or specific information related to the error.
Service area – Identifies the service area that receives all notifications related to service system errors.

The attributes that define a business error can either go into the service contract or be packaged into the service response, as needed.
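As a rough illustration, the attributes listed above could be captured in a small value type that is packaged into the service response. The sketch below is in Java; the type name `BusinessError`, its field names, the catalog class, and the sample values such as `EFT-1001` are all hypothetical and not taken from any standard or product.

```java
import java.util.List;

/** Hypothetical value type holding the business error attributes listed above. */
record BusinessError(
        String code,               // error code that can be looked up for more detail
        String text,               // brief description returned to the service consumer
        String scenario,           // condition that flags the business operation as invalid
        List<String> suggestions,  // feedback such as examples of valid input
        String serviceArea) {      // area that is notified about errors for this service

    BusinessError {
        if (code == null || code.isBlank()) {
            throw new IllegalArgumentException("error code is required");
        }
        suggestions = List.copyOf(suggestions); // defensive, immutable copy
    }
}

class BusinessErrorCatalog {
    /** Example entry; the code and wording are made up for illustration. */
    static BusinessError insufficientFunds() {
        return new BusinessError(
                "EFT-1001",
                "Insufficient funds in source account",
                "Requested transfer amount exceeds the available balance",
                List.of("Reduce the transfer amount", "Choose a different source account"),
                "payments-support");
    }
}
```

Keeping such entries in one catalog is one way to make the error text and codes discoverable during service identification; whether the catalog lives in the contract or in a shared library is a design choice left open here.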

Process failure recovery scenarios

Identify new operations - Business process flows and micro flows are to be analyzed in light of the business errors that individual services in a process flow could throw. Such analysis can uncover new operations that typical top-down process decomposition tasks would not find.
Updates to process models - Service operation models and dependencies can be updated with the new operations discovered in the previous step.

Error Handling during Service Specification

The Service Specification phase consists of tasks that define input and output messages, service and operation names, schemas, service composition, non-functional requirements and other service characteristics such as sync/async, invocation style, etc. for the services that are marked to be exposed.

Characteristics related to Error Handling

Common service characteristics that are related to error handling are:

Assured Delivery - Determine whether a service requires an assured-delivery quality of service (QoS). Such a requirement helps designers put appropriate asynchronous messaging design patterns in place, or use reliable messaging if the service is implemented as a web service.
Monitoring requirements - Determine whether the service's business-critical errors need to be set up with proactive monitoring.
Error mapping/transformation rules - Establish transformation rules for the error codes and information returned by the service provider, and define how they are to be presented to the service consumer. Standard business error codes make it easier for applications that consume these services to handle service errors.
Updated process flows - Existing process flows are to be updated with the new operations or alternate execution paths discovered in the identification step to handle business errors.
Transaction attributes and boundaries - The nature of an error, such as system vs. application error, influences how different runtime platforms handle automatic rollbacks. Transaction attributes and boundaries in a process are to be analyzed in light of the errors that can be expected from individual service invocations/transactions.
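The error mapping rules mentioned in the list above can be as simple as a lookup table from provider-specific codes to standard enterprise codes. The following Java sketch assumes such a table; the class name `ErrorCodeMapper` and every code in it are invented for illustration.

```java
import java.util.Map;

/** Sketch of an error mapping table: provider-specific codes to standard enterprise codes. */
class ErrorCodeMapper {
    // All codes below are invented for illustration.
    private static final Map<String, String> PROVIDER_TO_ENTERPRISE = Map.of(
            "DUP_KEY", "ENT-DUPLICATE-RECORD",
            "ACCT_404", "ENT-ACCOUNT-NOT-FOUND",
            "LIMIT_EXCEEDED", "ENT-LIMIT-EXCEEDED");

    static final String UNMAPPED = "ENT-UNKNOWN";

    /** Returns the enterprise code for a provider code, or a generic fallback if none is defined. */
    static String toEnterpriseCode(String providerCode) {
        if (providerCode == null) {
            return UNMAPPED; // immutable maps reject null keys, so handle it explicitly
        }
        return PROVIDER_TO_ENTERPRISE.getOrDefault(providerCode, UNMAPPED);
    }
}
```

A fallback code for unmapped provider errors keeps the consumer-facing contract stable even when a provider introduces a code that the specification did not anticipate.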

Propagation of errors

Errors can be propagated to service consumers in many different ways, so it is important to make an architectural design decision on the most appropriate style for the enterprise.

The two most popular choices for returning error information are SOAP faults and custom error payloads. While each approach has its own pros and cons, the choice mostly comes down to the existing development and runtime platforms. Services implemented on web services oriented platforms find SOAP faults the natural choice because a lot of web services based tooling supports them, while custom payloads suit services implemented on more traditional message oriented middleware (MOM) platforms better. For a deeper discussion of this subject, refer to the comparison of the two approaches by Boris Lublinsky. In general, non-recoverable system errors are better suited to being returned as SOAP faults, given the support for faults, to varying degrees, in client tooling and on servers across the various profiles. Recoverable business errors are better described using custom error payloads because of the flexibility and extensibility that custom error schemas provide. However, custom error payloads require the service consumer to do additional client-side handling: it must parse each response message to determine whether the invocation was successful.
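The extra client-side handling that custom error payloads require can be sketched as follows. This is a minimal Java illustration, assuming a hypothetical response envelope named `ServiceResponse` that carries a list of error codes next to the body; it does not represent any particular SOAP or messaging toolkit.

```java
import java.util.List;

/** Hypothetical response envelope that carries business errors as a custom payload. */
record ServiceResponse<T>(T body, List<String> errorCodes) {
    boolean isSuccessful() {
        return errorCodes.isEmpty();
    }
}

class TransferClient {
    /** The consumer must inspect the payload itself; no fault or exception signals failure. */
    static String describe(ServiceResponse<String> response) {
        if (response.isSuccessful()) {
            return "OK: " + response.body();
        }
        return "FAILED: " + String.join(",", response.errorCodes());
    }
}
```

With a SOAP fault, most client toolkits would surface the failure as an exception instead, which is the trade-off the paragraph above describes.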

Common enterprise wide custom schemas

Identify metadata and common schemas that describe errors consistently across the enterprise. This metadata could include common attributes such as date, time, error code, description, severity level, message source, correlation id, etc. Thorough analysis of this metadata is very useful for setting up effective service monitoring.
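One possible shape for such a common error descriptor is shown below as a Java sketch. The names `ErrorInfo` and `Severity`, and the log line layout, are assumptions made for this example; in practice the schema would more likely be defined in XSD or a similar contract language and generated into code.

```java
import java.time.Instant;

enum Severity { INFO, WARNING, ERROR, FATAL }

/** Hypothetical enterprise-wide error descriptor built from the attributes named above. */
record ErrorInfo(
        Instant timestamp,     // date and time of the error
        String code,
        String description,
        Severity severity,
        String source,         // message source, for example the originating service
        String correlationId) {

    /** Single-line rendering that a monitoring tool could index by field. */
    String toLogLine() {
        return String.format("%s severity=%s code=%s source=%s correlationId=%s msg=\"%s\"",
                timestamp, severity, code, source, correlationId, description);
    }
}
```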

Error Handling during Service Realization

The service realization phase is where the service model is mapped to the service component and runtime/deployment models. This step typically involves designing service components, allocating the components to SOA stack layers, choosing component interaction styles and runtime platforms, and making architectural design decisions (ADD). The rest of this section focuses on some best practices for implementing error handling considerations in the three layers of a typical enterprise SOA stack: the business process or choreography layer, the mediation/bus layer and the component layer, as highlighted in the figure below.

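[Figure: the three layers of a typical enterprise SOA stack: business process/choreography, mediation/ESB and component]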

Error handling in the business process/orchestration layer

Components deployed to this layer implement business process flows or choreographies. The following error handling considerations apply here:

Fault Handlers - Fault handlers are the most popular way of handling service errors returned from service invocations initiated within orchestrations. A fault handler is attached either to a specific task in a process flow or to the entire process as a global fault handler. When the process results in errors, the fault handlers are invoked to carry out the corrective tasks. Compensation transactions and manual rollbacks are configured with the fault handlers so that appropriate corrective actions can be applied to handle process errors. Care should be taken not to use fault handlers for alternate execution paths; they should be used only to recover from errors thrown in the process.
Service status info - Choreography scenarios normally involve calling more than one service. These service invocations from within the process can result in errors of different severity, ranging from info and warning to error and fatal. It is good practice to collect the status description from each invocation, such as return codes, into a repeating array and return it to the service consumer. This lets the service consumer determine whether the process completed with warnings or errors from any of the services it invoked.
Threshold error severity levels - Identify threshold error severity levels and design the fault tolerance of the service orchestration around these thresholds. Thresholds can be set on any attribute that defines the error, or on a combination of such attributes, such as error severity levels or custom status codes, as opposed to relying solely on SOAP faults to determine process failures.
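The status collection and threshold ideas from the list above can be combined in a small helper. The Java sketch below is only an illustration: `StatusLevel`, `InvokeStatus` and `ProcessStatus` are hypothetical names, and a real orchestration engine would typically express the same logic in its own process language.

```java
import java.util.ArrayList;
import java.util.List;

enum StatusLevel { INFO, WARNING, ERROR, FATAL } // declared in increasing order of severity

/** Status reported by one service invocation inside the process. */
record InvokeStatus(String service, StatusLevel level, String returnCode) {}

/** Collects the status of every invoke and decides failure against a configurable threshold. */
class ProcessStatus {
    private final List<InvokeStatus> statuses = new ArrayList<>();
    private final StatusLevel failureThreshold;

    ProcessStatus(StatusLevel failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    void add(InvokeStatus status) {
        statuses.add(status);
    }

    /** Statuses are returned to the consumer so that warnings remain visible on success. */
    List<InvokeStatus> statuses() {
        return List.copyOf(statuses);
    }

    /** The process fails when any invoke reached or exceeded the threshold severity. */
    boolean failed() {
        return statuses.stream()
                .anyMatch(s -> s.level().compareTo(failureThreshold) >= 0);
    }
}
```

Because the threshold is a constructor argument, the same process definition can tolerate warnings in one deployment and treat them as failures in another.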

Error handling in the Services/Mediation/ESB layer

The Enterprise Service Bus (ESB) layer is at the core of a typical enterprise SOA stack. This layer supports the transformation and routing capabilities required of the enterprise re-usable services. Components in this layer provide a well-defined interface to the various provider implementations, such as existing underlying assets and partner or vendor services, by applying appropriate message and protocol transformations. Error handling by the mediation components mostly involves transforming the provider error structures into well-defined error structures defined in the context of the business domain. These components can also apply complex transformation and mapping rules to the errors returned from the backend functional components, in order to provide simpler error information to the service consumers within the enterprise.

Transform provider error codes - Different service providers may return service errors using different semantics, ranging from the popular SOAP faults to very proprietary structures. Appropriate transformation rules can be applied here so that re-usable enterprise services return errors in a consistent manner that enterprise applications can easily parse and handle.
Filter sensitive information – When internal service components throw fatal errors, the stack trace often contains sensitive information such as the protocols used, server IP addresses, etc. Appropriate filtering rules are to be established in this layer to remove any sensitive information from the stack trace. This strategy becomes all the more important when service responses are sent outside the trusted network.
Trapping application errors - Any technical errors experienced by the service components, such as resource unavailability or runtime exceptions, are to be transformed into simple technical error messages. If the native components did not log these errors, the mediation layer can pass the full stack trace to logging and return only a generic text message to the service consumer, informing it of temporary service unavailability.
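The filtering and trapping behavior described above can be sketched in a few lines of Java. The example uses only `java.util.logging` from the standard library; the class `TechnicalErrorFilter`, the `ProviderCall` interface and the message text are hypothetical, and an actual ESB product would normally configure this behavior rather than code it by hand.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch of a mediation step that traps technical errors from a provider call. */
class TechnicalErrorFilter {
    private static final Logger LOG = Logger.getLogger(TechnicalErrorFilter.class.getName());
    static final String GENERIC_MESSAGE =
            "The service is temporarily unavailable. Please try again later.";

    /** A provider call that may fail; a hypothetical stand-in for a backend invocation. */
    interface ProviderCall {
        String invoke() throws Exception;
    }

    /** Logs full details internally and returns only a generic message to the consumer. */
    static String mediate(ProviderCall call, String correlationId) {
        try {
            return call.invoke();
        } catch (Exception e) {
            // The stack trace, host names and protocol details stay in the internal log.
            LOG.log(Level.SEVERE, "Provider call failed, correlationId=" + correlationId, e);
            return GENERIC_MESSAGE + " (reference: " + correlationId + ")";
        }
    }
}
```

Returning the correlation id as a reference lets support staff find the full log entry without exposing its contents to the consumer.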

Many of the error handling considerations mentioned for this layer could also be implemented in the component layer. However, a number of ESBs and frameworks on the market do these things in a far more configurable and flexible manner than individual platform developers could achieve in their functional component implementations. Separating these error handling mediation concerns into the ESB layer relieves platform developers of having to satisfy a variety of error handling considerations and lets them focus on implementing the business functionality, resulting in greater developer productivity.

Error handling in the component layer

Error handling by the components in this layer includes handling abnormal execution conditions, such as the non-availability of a resource, a runtime condition that the component is not programmed to handle, or a condition that is considered a violation of its logic. Components are required to handle such events in order to notify client programs, and to do appropriate logging to facilitate troubleshooting and service monitoring. In the Java programming language, such events are thrown as exceptions, and the API provides two different types of exceptions: checked and unchecked. Checked exceptions inherit from the Exception class (but not from RuntimeException) and are used to handle recoverable errors such as business error scenarios. Unchecked exceptions, which are descendants of the RuntimeException class, are the ideal candidates for handling non-recoverable errors such as resource non-availability or null pointers.
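The distinction between the two exception types can be illustrated with a short Java sketch. The class names, the `EFT-1001` code and the account logic are hypothetical and exist only to show a checked exception for a business error next to an unchecked one for a resource failure.

```java
/** Checked exception for a recoverable business error; callers must handle or declare it. */
class InsufficientFundsException extends Exception {
    private final String errorCode;

    InsufficientFundsException(String errorCode, String message) {
        super(message);
        this.errorCode = errorCode;
    }

    String errorCode() { return errorCode; }
}

/** Unchecked exception for a non-recoverable condition such as an unavailable resource. */
class ResourceUnavailableException extends RuntimeException {
    ResourceUnavailableException(String message, Throwable cause) {
        super(message, cause);
    }
}

class AccountComponent {
    private long balanceInCents;

    AccountComponent(long openingBalanceInCents) {
        this.balanceInCents = openingBalanceInCents;
    }

    /** Declares the business error so that the compiler forces callers to deal with it. */
    void withdraw(long amountInCents) throws InsufficientFundsException {
        if (amountInCents > balanceInCents) {
            throw new InsufficientFundsException("EFT-1001", "Requested amount exceeds balance");
        }
        balanceInCents -= amountInCents;
    }

    long balanceInCents() { return balanceInCents; }
}
```

A caller of `withdraw` has to catch `InsufficientFundsException` and can take an alternate path, whereas `ResourceUnavailableException` would normally propagate up to a boundary that logs it and reports a technical failure.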

The second part of component-level error handling is appropriate logging. It is good practice to log as close as possible to the source of the error. When components throw application errors, they can log the exception at the appropriate interface within the component boundary and then throw it. Using correlation ids to identify events, and passing them to the calling applications, greatly enhances error tracking and monitoring by linking logs across different platforms.
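A minimal Java sketch of this logging practice follows. It assumes a hypothetical `PolicyLookupComponent` whose data access fails for an empty key, and a hypothetical `TrackedException` that carries the correlation id back to the caller; only `java.util.logging` and `java.util.UUID` from the standard library are used.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Application exception that carries a correlation id back to the caller. */
class TrackedException extends RuntimeException {
    private final String correlationId;

    TrackedException(String correlationId, String message, Throwable cause) {
        super(message, cause);
        this.correlationId = correlationId;
    }

    String correlationId() { return correlationId; }
}

class PolicyLookupComponent {
    private static final Logger LOG = Logger.getLogger(PolicyLookupComponent.class.getName());

    /** Logs once at the component boundary, then rethrows with the same correlation id. */
    static String lookup(String policyNumber, String correlationId) {
        // Reuse the caller's id when present so that logs on both sides can be linked.
        String id = (correlationId != null) ? correlationId : UUID.randomUUID().toString();
        try {
            return loadFromStore(policyNumber);
        } catch (IllegalStateException e) {
            LOG.log(Level.SEVERE, "lookup failed, correlationId=" + id, e);
            throw new TrackedException(id, "Policy lookup failed", e);
        }
    }

    // Hypothetical data access call that fails for an empty key to simulate an outage.
    private static String loadFromStore(String policyNumber) {
        if (policyNumber.isEmpty()) {
            throw new IllegalStateException("policy store not reachable");
        }
        return "policy:" + policyNumber;
    }
}
```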

Transaction Rollback and Compensation

Designing appropriate fault-tolerant mechanisms to maintain the ACID (Atomicity, Consistency, Isolation and Durability) properties in process flows poses a big challenge in designing SOA solutions. These solutions typically involve business processes that invoke services spanning multiple platforms, interaction styles and resource providers. It is more than likely that not all services that participate in a business process are transactional. If any particular transaction in a process flow fails, appropriate recovery implementations are to be designed to preserve data integrity. Transaction rollbacks and compensation transactions are two approaches aimed at solving this problem.

Transaction rollbacks can be implemented by coordinating the transactions through a transaction monitor, provided that the business process spans a confined domain and the resources are all transactional. If the business process is more complicated, a failure to complete it not only requires rollbacks to bring the data back to a consistent state but might also require the process to invoke compensation transactions, such as sending notifications or invoking reversal actions on some of the previous service invocations. It is beyond the scope of this paper to elaborate on these topics. Readers are encouraged to refer to the upcoming web services standards in this space: Web Services Coordination (WS-C) and Web Services Transaction (WS-T) from OASIS.
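The compensation approach can be sketched in plain Java as a stack of undo actions that is unwound when a later step fails. This is an illustration under simplifying assumptions, not an implementation of WS-C or WS-T: the class `CompensatingProcess`, the journal entries and the transfer steps are all hypothetical, and the sketch runs in a single thread with in-memory state only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Sketch of a process that registers a compensating action for every completed step. */
class CompensatingProcess {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    /** Runs a step; only if it succeeds, remembers how to undo it. */
    void step(Runnable action, Runnable compensation) {
        action.run();
        compensations.push(compensation);
    }

    /** Undoes the completed steps in reverse order of execution. */
    void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run();
        }
    }

    /** Executes all steps; if one fails, compensates the earlier ones and reports failure. */
    static boolean runTransfer(List<String> journal, boolean creditFails) {
        CompensatingProcess process = new CompensatingProcess();
        try {
            process.step(() -> journal.add("debit source"), () -> journal.add("reverse debit"));
            process.step(() -> {
                if (creditFails) {
                    throw new IllegalStateException("credit rejected");
                }
                journal.add("credit target");
            }, () -> journal.add("reverse credit"));
            return true;
        } catch (RuntimeException e) {
            process.compensate();
            journal.add("notify operations"); // compensation may also include notifications
            return false;
        }
    }
}
```

Calling `runTransfer` with `creditFails` set to `true` leaves the journal with the debit, its reversal and the notification, in that order; the failed credit step never registers a compensation because it did not complete.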

Summary

This paper provides SOA architects with techniques for discovering error handling requirements from the business artifacts package and shows how to analyze them while going through the SOA analysis and design phases. It also provides some best practices for implementing error handling in the three layers of SOA, i.e. the orchestration, mediation and component layers.

A thorough upfront analysis of the various error handling considerations helps architects make the right decisions during the design and implementation phases, including the choice of platforms and SOA stack products.

About the Author

Hari Poolla is an SOA practitioner at a large insurance and financial services company in the Midwest, USA. He focuses on application architecture, BPM, solution design, integration and collaboration in the enterprise applications space, and specializes in building enterprise re-usable services. He provides expertise in enterprise SOA adoption roadmaps and in designing custom SOA solution development methodologies. He is an IBM certified SOA solution designer and a Sun certified enterprise architect for the Java 2 platform. He has been part of the architecture and implementation of an SOA-based business solution for electronically moving funds (EFT) between different lines of business. He can be reached at hari_poolla@yahoo.com.
