Schema for Web Services – Part I: Basic Datatypes
XML message exchange is the basis of most varieties of web services, including both SOAP and REST approaches. The use of XML creates some drawbacks, including potential issues with performance, but it also provides a level of abstraction which allows for loose coupling between the parties involved in an exchange. In order for that loose coupling to really work, though, you need to be able to define the structure of XML documents being exchanged in a way which allows verification of correct documents. The W3C's XML Schema definition language (which will be referred to as just "schema" for the rest of this article) is the approach most widely used for these message structure definitions.
Most web service applications don't work with XML documents directly, instead going through a data binding conversion layer within a web service toolkit. This is convenient for application developers, since it means they can work directly with data structures in their programming language of choice. But the data binding step needs to deal with mismatches between schema data types and structures and programming language data types and structures, and these mismatches can create problems for applications. If you want your web services to provide consistent, cross-platform compatibility (which is generally the whole point of using web services in the first place), you need to design your schema definitions to avoid potential problem areas - or at least be aware of the risks involved in using problematic schema features.
In this series of articles we're going to look at various types of problems that arise from the mismatch between schema and web service data bindings. For this first article we'll start at the most basic level, looking at simple data types and the problems they create.
Numeric values are about as basic as you can get when it comes to business data. Given the importance of numbers, you might think that this would be an area where schema worked smoothly and consistently. And in an abstract sense, it really does - but when schema gets applied by web services toolkits you can still run into a multitude of problems.
Part of the issue is the sheer variety of built-in schema numeric datatypes. Figure 1 shows the portions of the schema datatype tree involved in this area. To understand it, think in terms of specialization - the further you move down one of the branches of the upside-down tree, the more specialized the data that is represented by a type. At the top layer, directly under the generic anySimpleType, are the three basic numeric types float, decimal, and double. float and double are terminal types, matching the IEEE standard for floating point numbers, and as such provide excellent interoperability across web services platforms: Every major programming language supports 32-bit floating point numbers matching the float specification and 64-bit floating point numbers matching the double schema specification, so web services toolkits can just map these directly to the native language types. There may be minor differences between the programming language text representations of special values (not-a-number, positive and negative infinity, and positive and negative zero) and those used by schema, but the toolkits can easily handle translation.
Figure 1. Schema numeric types
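As a rough illustration of the special-value translation a toolkit has to do for float and double, here is a minimal Java sketch. The class and method names are hypothetical. The only facts taken from schema are that it writes the special values as INF, -INF, and NaN, while Java's own text forms for the infinities are Infinity and -Infinity.

```java
// Hypothetical helper, not part of any toolkit: translates between the
// xs:double lexical forms for special values and Java's double type.
class SchemaDoubles {

    /** Parses an xs:double lexical value into a Java double. */
    static double parse(String text) {
        String t = text.trim();
        switch (t) {
            case "INF":  return Double.POSITIVE_INFINITY;
            case "-INF": return Double.NEGATIVE_INFINITY;
            case "NaN":  return Double.NaN;
            // Note: Double.parseDouble also accepts some forms schema does not
            // (such as "Infinity"); a strict parser would reject those first.
            default:     return Double.parseDouble(t);
        }
    }

    /** Formats a Java double using the schema spellings for special values. */
    static String format(double value) {
        if (Double.isNaN(value)) return "NaN";
        if (value == Double.POSITIVE_INFINITY) return "INF";
        if (value == Double.NEGATIVE_INFINITY) return "-INF";
        return Double.toString(value);
    }
}
```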
It's when you go down the decimal branch of the tree that you start running into problems. decimal itself is defined as a string of any number of decimal digits, with an optional leading sign and optional decimal point. integer, the direct descendant of decimal, matches a subset of the values corresponding to decimal in that it allows any number of decimal digits, with an optional leading sign, but does not allow a decimal point. The descendants of integer further restrict the allowed values, in the case of nonPositiveInteger and nonNegativeInteger by prohibiting values respectively greater than or less than zero, and in the case of long by limiting the range of values to a 64-bit 2s-complement equivalent. int, short, and byte further restrict the range, to 32-bit, 16-bit, and 8-bit 2s-complement respectively, while the unsigned variations match unsigned values of the same number of bits.
All major programming languages support values matching the long, int, and short schema types along the main branch of the tree, but the other variations create potential problems. Java, for instance, doesn't include primitive types corresponding to unsignedLong or unsignedInt. Java web services frameworks generally work around this lack of language support by using special classes rather than primitives for these types, but this makes the web service interface somewhat awkward and can create performance issues (since primitives are generally much faster than object types when used in calculations).
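To make the unsigned mismatch concrete, here is a small Java sketch of the kind of workaround involved. The class and method names are invented for illustration and are not taken from any particular toolkit. An unsignedInt value needs a Java long to hold its full range, and an unsignedLong value needs an object type such as BigInteger.

```java
import java.math.BigInteger;

// Hypothetical helper showing why the unsigned schema types do not map
// onto the Java primitive of the same bit width.
class UnsignedValues {

    static final long UNSIGNED_INT_MAX = 4294967295L; // 2^32 - 1
    static final BigInteger UNSIGNED_LONG_MAX =
            new BigInteger("18446744073709551615");   // 2^64 - 1

    /** Holds an xs:unsignedInt in a long, since int tops out at 2^31 - 1. */
    static long parseUnsignedInt(String text) {
        long value = Long.parseLong(text.trim());
        if (value < 0 || value > UNSIGNED_INT_MAX) {
            throw new NumberFormatException("out of unsignedInt range: " + text);
        }
        return value;
    }

    /** Holds an xs:unsignedLong in a BigInteger, since long tops out at 2^63 - 1. */
    static BigInteger parseUnsignedLong(String text) {
        BigInteger value = new BigInteger(text.trim());
        if (value.signum() < 0 || value.compareTo(UNSIGNED_LONG_MAX) > 0) {
            throw new NumberFormatException("out of unsignedLong range: " + text);
        }
        return value;
    }
}
```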
Even the decimal and integer types present problems. Most Java toolkits handle these using the standard java.math.BigDecimal and java.math.BigInteger classes, which suffer from poor performance but support values of unlimited size. .Net instead uses a fixed-size 128-bit representation, which limits the possible value range (as allowed by the schema specification) but provides relatively good performance.
The schema numeric types are confusing and inconsistent (why a nonPositiveInteger type, but no nonPositiveDecimal type, for instance?), and generally just represent syntactic sugar in any case (since the ranges can instead be implemented using simpleType restriction). For these reasons it's best to avoid using most of these types in your schema definitions, especially those intended for use with web services. Use specific sized types (double and float for real numbers, and long and int for integers) where possible, since these translate consistently to programming language primitive types. If you need to work with values beyond the range or precision possible with these sized types, understand that decimal and integer will not necessarily give you what you want due to implementation differences, and instead consider using a string and handling the conversion of the value in application code.
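Here is one way the "use a string and convert in application code" advice might look in Java. This is only a sketch: the class name and the 28-digit limit are assumptions standing in for whatever precision the parties to the exchange actually agree on.

```java
import java.math.BigDecimal;

// Hypothetical application-side conversion for a value carried as a
// schema string rather than as xs:decimal.
class AmountField {

    // Assumption: an agreed precision limit, chosen here only for illustration.
    static final int MAX_PRECISION = 28;

    /** Converts the string content to a BigDecimal, enforcing the agreed limit. */
    static BigDecimal parse(String text) {
        BigDecimal value = new BigDecimal(text.trim());
        if (value.precision() > MAX_PRECISION) {
            throw new IllegalArgumentException(
                    "more than " + MAX_PRECISION + " significant digits: " + text);
        }
        return value;
    }
}
```

Because the check lives in application code, both ends of the exchange can apply exactly the same rule regardless of how their toolkits would have mapped decimal.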
The Issues of Time
Time-related values are another common source of problems in working with schema. Nine separate time-related datatypes are defined by schema, all based on a particular version of the Western Gregorian calendar. Unlike the numeric types, the time-related types aren't in any direct form of specialization relationship - instead, they're all considered as derived directly from the generic anySimpleType.
The most widely-used time datatypes are dateTime, date, and time. These three datatypes share a common representation format, with dateTime as the general case. Here's a sample dateTime value, for the current time as I write this article: "2008-09-08T15:38:53". A date value uses the same representation as a dateTime, but strips off the 'T' and the hour-minute-second values that follow (leaving "2008-09-08", in this case); a time value, conversely, strips off everything up to and including the 'T', keeping only the hour-minute-second values ("15:38:53").
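The relationship between the three lexical forms can be sketched with the java.time classes, whose ISO-8601 text formats line up with the zoneless schema forms shown above. The class and method names here are just for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper: splits a zoneless dateTime into its date and time parts.
class LexicalForms {

    /** The portion of a dateTime before the 'T'. */
    static LocalDate datePart(String dateTime) {
        return LocalDateTime.parse(dateTime).toLocalDate();
    }

    /** The portion of a dateTime after the 'T'. */
    static LocalTime timePart(String dateTime) {
        return LocalDateTime.parse(dateTime).toLocalTime();
    }
}
```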
Seems pretty simple so far, right? Where it gets confusing is in the actual interpretation of one of these values. Dates and times vary depending on where you're located, with the variation normally expressed in terms of time zones. For instance, as I write this article in New Zealand I'm 12 hours ahead of Universal time and 19 hours ahead of the Pacific Daylight Time currently in effect for the West coast of the U.S. At the same instant I wrote my sample dateTime value here as "2008-09-08T15:38:53", the time in Seattle was "2008-09-07T20:38:53".
For many applications you need to specify date/times in a manner which permits relating one value to another. Schema supports this requirement by allowing date/time values to use an appended time zone indication. This time zone indication can either take the form of the letter 'Z', used to indicate a Universal time (UTC) value, or an offset from Universal time in hours and minutes. So any of these dateTime values (and many more variations) could all be used to indicate the same instant: "2008-09-08T15:38:53+12:00", "2008-09-07T20:38:53-07:00", or "2008-09-08T03:38:53Z".
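A quick way to check that differently-written zoned values name the same instant is to convert each to an absolute time and compare. The sketch below uses java.time for this, with a hypothetical class name. Pacific Daylight Time is seven hours behind Universal time, so the Seattle value carries a -07:00 offset here.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Hypothetical helper: reduces a zoned dateTime to an absolute instant.
class SameInstant {

    /** Converts a dateTime with a 'Z' or +hh:mm/-hh:mm suffix to an Instant. */
    static Instant toInstant(String zonedDateTime) {
        return OffsetDateTime.parse(zonedDateTime).toInstant();
    }

    public static void main(String[] args) {
        Instant newZealand = toInstant("2008-09-08T15:38:53+12:00");
        Instant seattle = toInstant("2008-09-07T20:38:53-07:00");
        Instant utc = toInstant("2008-09-08T03:38:53Z");
        // All three text forms reduce to the same absolute time.
        System.out.println(newZealand.equals(seattle) && seattle.equals(utc));
    }
}
```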
But schema doesn't require that you specify a time zone indication, and without such an indication a date/time value can only be interpreted as being accurate for some arbitrary location which could be anywhere in the world. For some applications that may be just what you want - a person's birth date, for instance, is usually treated as a particular date without reference to location, and people likewise celebrate the Gregorian New Year as it occurs locally around the world - but for other applications it creates major issues. Consider the case of a conference call, for instance, where all the parties involved need to coordinate the time of the event to their local clocks.
Unfortunately, schema does not allow you to distinguish between the cases where a fully-specified date/time is needed and those where a zoneless value is allowed or even expected (at least not in a way which web services toolkits can interpret - you could do this by using simpleType restriction patterns, but patterns are generally ignored by the toolkits). So the ambiguity of schema on this point means that toolkits need to handle values both with and without time zone indications.
The need to handle both types of values creates some major headaches in terms of interpretation, especially since programming languages generally implement date/time handling based on absolute time values. There's just no way to correctly convert a schema value which is missing a time zone indication to an absolute time. Of course, that doesn't stop toolkits from doing something with such values, anyway. In most cases they convert the value as supplied by assuming it's given in terms of the local time zone, and that's often what you want - but when it's not, the resulting problems can be very difficult to isolate.
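To see how much the "assume the local time zone" default matters, consider the same zoneless text interpreted on two machines. This sketch uses java.time with the standard zone identifiers for New Zealand and the U.S. West coast; the class name is hypothetical, and the code stands in for what a toolkit might do internally rather than reproducing any particular one.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical illustration of a toolkit-style default: treat a zoneless
// dateTime as local time in whatever zone the receiving machine uses.
class ZonelessAmbiguity {

    /** Interprets a zoneless dateTime as local time in the given zone. */
    static Instant interpret(String zoneless, String zoneId) {
        return LocalDateTime.parse(zoneless).atZone(ZoneId.of(zoneId)).toInstant();
    }

    public static void main(String[] args) {
        String value = "2008-09-08T15:38:53";
        Instant inAuckland = interpret(value, "Pacific/Auckland");
        Instant inSeattle = interpret(value, "America/Los_Angeles");
        // Same text, two different absolute times.
        System.out.println(Duration.between(inAuckland, inSeattle));
    }
}
```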
Problems due to time zones are especially messy for the date type. Most often, people treat dates as a fixed slot on the calendar. When you sign a legal document, for instance, you'll generally fill in the date of your signature. If you agree to a new project, there'll usually be a scheduled completion date (fanciful as these scheduled dates may sometimes be). And if you're asked to show your driver's license for proof of age when making a purchase, the clerk will look at your birth date and compare it with an age cutoff. In all these cases the date is treated as having day resolution, and differences between timezones are normally ignored. But the schema date type uses an associated time zone indication, just like the dateTime and time types. This use of a time zone indication creates a disconnect between the schema date type and the common form of a date. Generally this gets handled by converting dates to the 00:00 (midnight, as the start of the day) time representing the start of that day in whatever timezone was specified. But if you then print out that date value using the local timezone, you may find it's different from what was originally specified in the document.
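The date shift is easy to reproduce. The sketch below follows the midnight-conversion approach just described, using java.time; the class and method names are made up for the example, and real toolkits differ in the details.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

// Hypothetical illustration: a zoned schema date is converted to midnight
// in its own zone, then read back as a date in the viewer's zone.
class DateShift {

    static LocalDate dateSeenIn(String zonedDate, String viewerZone) {
        TemporalAccessor parsed = DateTimeFormatter.ISO_OFFSET_DATE.parse(zonedDate);
        LocalDate date = LocalDate.from(parsed);
        ZoneOffset offset = ZoneOffset.from(parsed);
        // Midnight at the start of the day, in the zone given in the document.
        OffsetDateTime startOfDay = date.atStartOfDay().atOffset(offset);
        // The calendar date that instant falls on for the viewer.
        return startOfDay.atZoneSameInstant(ZoneId.of(viewerZone)).toLocalDate();
    }

    public static void main(String[] args) {
        // A date written in New Zealand, viewed on the U.S. West coast.
        System.out.println(dateSeenIn("2008-09-08+12:00", "America/Los_Angeles"));
    }
}
```

Midnight on September 8 at +12:00 is noon Universal time on September 7, which is still early morning on September 7 in Seattle, so the date comes back a day earlier than the one in the document.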
If schema defined separate types for date/time values with time zone specifications and those without, it'd be easy for applications to pick which type they wanted to use. Without this ability, it's difficult for toolkits to work around a basically flawed representation of date/time values in schema. Java's JAXB 2.0 takes what is probably the most comprehensive approach to the problem, handling all the schema date/time types with a special class (javax.xml.datatype.XMLGregorianCalendar) which corresponds directly to schema representations. This approach preserves all the nuances of schema representations of values, but at the cost of passing the interpretation issues on to developers. Other toolkits generally just use defaults, such as assuming the local timezone.
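Here is a minimal sketch of what "passing the interpretation issues on to developers" means in practice with the javax.xml.datatype classes. The calendar object records whether a time zone was present, and it is up to application code to check. The wrapper class and method name are hypothetical.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

// Hypothetical helper: asks the calendar type whether the schema value
// it was built from carried a time zone indication.
class CalendarZoneCheck {

    static boolean hasTimeZone(String lexical) {
        try {
            XMLGregorianCalendar cal =
                    DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);
            // getTimezone() returns the offset in minutes, or FIELD_UNDEFINED
            // when the value had no time zone indication.
            return cal.getTimezone() != DatatypeConstants.FIELD_UNDEFINED;
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException("no DatatypeFactory available", e);
        }
    }
}
```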
Given the nasty issues lurking in this area, the best general approach is probably to only use the schema date/time types for values which should be fully-specified with time zone indications, and to make sure that any documents you generate do include time zone indications. Most web services toolkits will generate the time zone indications for you on output automatically, so this last part is easy. Requiring that your input documents also use time zone indications can be more difficult, especially since documents may be going through several stages of processing. If you want to be certain you don't run into problems caused by mistaken conversion assumptions your best solution is probably to use a string type in the schema representation, so that your web service toolkit will pass the value on to your application code without trying to interpret the value.
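If you do take the string route for values that must be fully specified, the application-side check might look something like the following. This is a sketch with invented names, relying on the fact that java.time's OffsetDateTime refuses to parse text without an offset.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical application-side handling for a dateTime carried as a string:
// values without a time zone indication are rejected rather than guessed at.
class ZonedInput {

    static Instant parseRequiringZone(String text) {
        try {
            return OffsetDateTime.parse(text.trim()).toInstant();
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException(
                    "dateTime must include a time zone indication: " + text, e);
        }
    }

    /** Convenience check that reports acceptance instead of throwing. */
    static boolean accepts(String text) {
        try {
            parseRequiringZone(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```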
If you need zoneless date/time values (as for the birth date example), your best approach may again be to use a string type in the schema representation. That's not very satisfying from the standpoint of providing an accurate representation of the data in the schema, but avoids the issues with web services toolkits interpreting unzoned values as being in the local timezone.
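For the zoneless case, application code can keep the value as a pure calendar date with no instant ever being computed. The sketch below does this for the birth date and age cutoff example, using java.time's LocalDate; the class, method, and parameter names are assumptions made for illustration.

```java
import java.time.LocalDate;

// Hypothetical handling of a birth date carried as a schema string:
// the value stays a calendar date, so no time zone can shift it.
class BirthDateField {

    /** True if someone born on birthDate has reached the given age on 'today'. */
    static boolean isAtLeast(String birthDate, int years, LocalDate today) {
        LocalDate born = LocalDate.parse(birthDate.trim());
        return !born.plusYears(years).isAfter(today);
    }
}
```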
Data structures used internally by applications often contain multiple linkages between components, including cross-references and indirect associations. XML, on the other hand, is inherently tree-structured. It's very easy to represent one-to-many relationships in XML through containment, but any other type of relationship is problematic. Even one-to-many relationships can be inefficient. Consider the case of a document listing a customer's order history, for instance. Each order will have associated billing and shipping addresses, but these addresses are often going to be repeated from one order to the next. If you just embed the addresses inside the information for each order, you'll end up with a lot of redundant information in your documents.
References can be used to get around the limitations of XML's tree structure. The idea of a reference is that you define something once in an XML document, including a unique identifier. Any time other data needs to make use of that definition, you create a reference using the unique identifier.
Schema directly supports two forms of references. The first, using the ID type, defines element identifiers which can be linked from anywhere in the document by using the IDREF or IDREFS types. The nice part of ID/IDREF links is that they're simple - identifiers are just names, and any type of element can define an ID value in the schema. The downside of ID/IDREF links is that they use a global context, so there's no way to say that the value used for a particular IDREF must be defined on a particular element type, and the names used as ID values must be unique within a document (even across types of elements). Some web service toolkits support using ID/IDREF links to represent references within data structures (including JAX-WS/JAXB 2.0, and Apache Axis2 when used with JiBX data binding); other toolkits (such as .Net, and Axis2 used with ADB) do not, instead treating IDREF values as simple text strings.
The second type of reference supported by schema is the key/keyref link. While ID/IDREF links are defined using datatypes, key/keyref links are instead part of the structure of a schema definition. This difference allows key/keyref links to be much more expressive than ID/IDREF links, including defining contexts within which key values are unique. But because key/keyref links are designed more for purposes of document validation than for structuring, they are complex and not generally used by data binding frameworks which convert XML data to and from data structures.
So if you want to embed linkages within your XML documents and have them handled by web services toolkits, your only hope is the ID/IDREF approach. Some toolkits will support these links directly; others will just treat the identifier values as strings, but you can write application code to cross-reference the identifier and reference values and build your own links.
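When the toolkit hands you the identifier values as plain strings, the cross-referencing code can be quite small. The sketch below works directly on a DOM tree to keep it self-contained. The element and attribute names (history, address, order, id, shipTo) are invented for the order history example and do not come from any real schema.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical do-it-yourself ID/IDREF resolution over a sample document.
class ReferenceResolver {

    // One shared address, referenced from two orders.
    static final String SAMPLE =
            "<history>"
          + "<address id='a1'><city>Seattle</city></address>"
          + "<order number='1' shipTo='a1'/>"
          + "<order number='2' shipTo='a1'/>"
          + "</history>";

    /** Indexes every element that carries an id attribute. */
    static Map<String, Element> indexById(Document doc) {
        Map<String, Element> byId = new HashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (element.hasAttribute("id")) {
                String id = element.getAttribute("id");
                // ID values must be unique across the whole document.
                if (byId.put(id, element) != null) {
                    throw new IllegalArgumentException("duplicate id: " + id);
                }
            }
        }
        return byId;
    }

    /** Follows an order's shipTo reference and returns the address city. */
    static String shipCity(String xml, String orderNumber) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        Map<String, Element> byId = indexById(doc);
        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            if (orderNumber.equals(order.getAttribute("number"))) {
                Element address = byId.get(order.getAttribute("shipTo"));
                if (address == null) {
                    throw new IllegalArgumentException("dangling reference in order " + orderNumber);
                }
                return address.getElementsByTagName("city").item(0).getTextContent();
            }
        }
        throw new IllegalArgumentException("no such order: " + orderNumber);
    }
}
```

The same two-pass idea (index the identifiers, then resolve the references) applies equally to bound data objects, with a map keyed by the string identifier taking the place of the DOM lookup.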
In this article we've looked at some of the problems that arise when using the most common schema datatypes in web services. There are many other specialized schema datatypes beyond those mentioned in this article (a total of 44 built-in types!), and some of these present other issues. As a general principle, the best approach to take in your web service schema definitions is to avoid the use of overly-specialized types (except for the numeric types that match common programming language types), and use a string type when you want full control over the interpretation of values.
It's worth pointing out that although some of the issues discussed in this article could be handled better by data binding frameworks, a lot of the problems lie with schema itself. In particular, the date/time family of types are at best cumbersome to work with and at worst invite errors through the lack of distinction between zoned and unzoned value types. It's possible to pass the confusion on to the user, as JAXB does with the XMLGregorianCalendar type, but that's not really a solution.
About the author
Dennis Sosnoski is a consultant and training facilitator specializing in Java-based SOA and web services. His professional software development experience spans over 30 years, with the last 10 years focused on server-side XML and Java technologies. Dennis is the lead developer of the open source JiBX XML data binding tool and the associated JiBX/WS web services framework, as well as a committer on the Apache Axis2 web services framework. He was also one of the expert group members for the JAX-WS 2.0 and JAXB 2.0 specifications. For information on his training and consulting services check his website http://www.sosnoski.co.nz.
Martin Thompson Jul 27, 2014