Generating Avro Schemas from XML Schemas Using JAXB

Introduction

The pursuit of efficient object serialization in Java has recently received a leg up from the Apache Avro project. Avro is a binary marshalling framework that supports both schema-based and introspection-based format specification. Schema files can be written in JSON and compiled into bindings for Java, C, C#, and a growing list of other languages. Avro is similar to Thrift or Google's Protocol Buffers in that the output format is a byte stream. The performance gains from working with binary data make these cross-platform frameworks highly appealing.

To get the most from Avro, a schema should be created to describe each object (or 'datum' in Avro-speak) in your application. While the schema specification uses JSON, there is currently a lack of tools designed to create schemas in Avro's format. On the other hand, many tools already exist for creating and editing XSD schema files [1,2]. Maintaining a single schema for both XML and Avro is therefore quite appealing: it takes less work to maintain one set of XSD files, which are probably already being maintained for other purposes.

Approach

JAXB, the Java Architecture for XML Binding, has been around for some time as well. The XJC tool from the project is the standard way to create Java class bindings from XML schemas. The tool was designed to be somewhat extensible, and includes support for plugins written in Java (however, there aren't many plugins available, and the documentation for the process is somewhat lacking). Plugins are given access to the generated code model and allowed to make changes or otherwise utilize the information. There are several smaller plugins in the wild, addressing use cases ranging from printing an index of generated classes [3], to modifying the generated code to be simpler to use [4], to adding interfaces and methods supporting the visitor pattern [5].

In this article, I use a new plugin which works alongside the JAXB code generation process to create Avro schemas that closely parallel the generated JavaBeans classes. This automates the Avro schema creation process and keeps the Avro bindings looking as close as possible to the JAXB bindings. Because the plugin runs inside JAXB, the Avro schemas are generated from the provided XSD schema files, so a single XSD can be used to create both JAXB bindings and Avro bindings. No handmade Avro schema is required, which means one less mapping to maintain in your application code.

The plugin continues past the schema generation phase to create Java class bindings from the schemas, but the schema files could instead be processed by one of the other compilers for another language currently supported by the Avro project.

First Steps: Building the XJC Plugin

Each plugin starts by extending the Plugin class provided by XJC. However, getting XJC to actually work with its plugins is just not as simple as it should be. There are several issues involving class paths and order of execution which are likely to cause some headache (the JAXB-Basics project includes an Ant task, which is currently the preferred way to execute XJC with plugins [6,7]). Once you are integrated, your plugin will be called by XJC after it has created an outline of the Java code it intends to generate.

The outline object stores basic information about the bean classes and their properties. Sufficient information about the properties and their types is provided, and we are able to inspect the output to create Avro schemas accordingly. There are several levels of models available at runtime, including the Outline (a coarse outline of beans and their fields), and the JCodeModel (a Java representation of Java code). Creating comparable Avro schemas requires more type information than the Outline can provide, and so I decided to use the code model for the majority of the processing.

The two high level constructs JAXB creates are 'enums' and 'beans'. The enums translate very well to Avro's enum type, and the beans can become Record types.

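// Imports used by the plugin excerpts in this article
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.sun.codemodel.JClass;
import com.sun.codemodel.JPackage;
import com.sun.codemodel.JType;
import com.sun.tools.xjc.model.CClassInfo;
import com.sun.tools.xjc.model.CElementInfo;
import com.sun.tools.xjc.model.CEnumConstant;
import com.sun.tools.xjc.model.CEnumLeafInfo;
import com.sun.tools.xjc.model.Model;
import com.sun.tools.xjc.model.nav.NClass;
import com.sun.tools.xjc.model.nav.NType;
import com.sun.tools.xjc.outline.Aspect;
import com.sun.tools.xjc.outline.Outline;

// Walks XJC's model and collects one Avro type per generated enum and bean.
private void inferAvroSchema(Outline outline) {
    Model model = outline.getModel();
    Set<NamedAvroType> avroTypes = new HashSet<NamedAvroType>();

    // enums
    for (Map.Entry<NClass, CEnumLeafInfo> entry : model.enums().entrySet()) {
        CEnumLeafInfo info = entry.getValue();
        NamedAvroType type = avroFromEnum(info);
        avroTypes.add(type);
    }

    // regular classes
    for (Map.Entry<NClass, CClassInfo> entry : model.beans().entrySet()) {
        CClassInfo info = entry.getValue();
        NamedAvroType type = avroFromClass(info);
        avroTypes.add(type);
    }

    // avroTypes now holds one named Avro type per generated enum and bean class
}

// Converts a JAXB enum into an Avro enum with the same constant names.
private NamedAvroType avroFromEnum(CEnumLeafInfo info) {
    List<String> constants = new ArrayList<String>();

    for (CEnumConstant constant : info.getConstants()) {
        constants.add(constant.getName());
    }

    AvroEnum enumType = new AvroEnum(constants);
    enumType.name = info.shortName;
    // the Avro namespace is derived from the Java package of the enum
    enumType.namespace = makePackageName(info.parent.getOwnerPackage());

    return enumType;
}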
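
To make the target of this conversion concrete, here is a minimal sketch of rendering an enum's name, namespace, and constants as an Avro enum schema in JSON. The class and method names (AvroEnumJson, render) are hypothetical and not part of the plugin, and the sketch assumes that none of the names need JSON escaping.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the plugin: renders an Avro enum schema as JSON text. */
final class AvroEnumJson {

    /** Assumes name, namespace, and symbols contain no characters that need JSON escaping. */
    static String render(String name, String namespace, List<String> symbols) {
        StringBuilder json = new StringBuilder();
        json.append("{\n");
        json.append(" \"name\" : \"").append(name).append("\",\n");
        json.append(" \"namespace\" : \"").append(namespace).append("\",\n");
        json.append(" \"type\" : \"enum\",\n");
        json.append(" \"symbols\" : [ ");
        for (int i = 0; i < symbols.size(); i++) {
            if (i > 0) {
                json.append(", ");
            }
            json.append('"').append(symbols.get(i)).append('"');
        }
        json.append(" ]\n}");
        return json.toString();
    }

    public static void main(String[] args) {
        // Prints an enum schema for three illustrative constants.
        System.out.println(render("EnumeratedType", "com.nokia.avro",
                Arrays.asList("ONE", "TWO", "THREE")));
    }
}
```

A real implementation would more likely hand the structure to a JSON library so that escaping is handled correctly.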

Generating the Avro Schemas

Constructing the records for the bean classes is at times straightforward and at other times not so simple. The basic idea is that the top level class becomes a record, with each of its properties as a record field. XJC exposes the types of the properties in a bean. When these bean properties are primitives, we can use Avro's primitive types. In all other cases, careful consideration must be given to ensure a consistent mapping between all classes. The following table summarizes the decisions made by the plugin while creating schemas from bean classes:

| Java Concept | Avro Concept | Comments |
|---|---|---|
| Enum | Enum | |
| Class | Record with Fields | Class properties become record fields. |
| List | Array | Preference of empty array versus null unioned with an array. |
| Inheritance | Child contains a field which references a superclass instance. | '_parent' fields, which contain parent instance data |
| unboxed primitive (required) | normal primitive | |
| boxed primitive | null unioned with the primitive | This is true of all optional properties. |
| byte | int | Avro supports byte arrays or integers, but not bytes. |
| package name | namespace | 1:1 mapping |
| xs:date type | long | Enforce a UTC millisecond timestamp for dates. |

Table: Mapping concepts from Java to Avro
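
The primitive rows of the table can be sketched with plain Java reflection. This is an illustration only: the plugin itself works on XJC's model rather than on Class objects, and the class name PrimitiveMapping and its method are hypothetical. The union order ("type" first, "null" second) follows the generated schemas shown in the example later in this article.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the primitive rows of the mapping table. */
final class PrimitiveMapping {

    private static final Map<Class<?>, String> AVRO_NAMES = new HashMap<>();

    static {
        register(boolean.class, Boolean.class, "boolean");
        register(int.class, Integer.class, "int");
        register(long.class, Long.class, "long");
        register(float.class, Float.class, "float");
        register(double.class, Double.class, "double");
        // Avro has no single-byte type, so byte is widened to int (see the table).
        register(byte.class, Byte.class, "int");
    }

    private static void register(Class<?> primitive, Class<?> boxed, String avroName) {
        AVRO_NAMES.put(primitive, avroName);
        AVRO_NAMES.put(boxed, avroName);
    }

    /**
     * Returns the Avro type as JSON text: a bare type name for an unboxed
     * (required) primitive, or a union with null for a boxed (optional) one.
     */
    static String avroTypeFor(Class<?> javaType) {
        String avroName = AVRO_NAMES.get(javaType);
        if (avroName == null) {
            throw new IllegalArgumentException("not a mapped primitive: " + javaType.getName());
        }
        if (javaType.isPrimitive()) {
            return "\"" + avroName + "\"";
        }
        return "[ \"" + avroName + "\", \"null\" ]";
    }
}
```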

private AvroType avroFromType(NType type, JPackage _package) {
    AvroType returnType;
    JType implType = type.toType(theOutline, Aspect.IMPLEMENTATION);

    // primitives
    if (implType.isPrimitive()) {
        returnType = new AvroPrimitive(implType.name());

    // might be a boxed primitive
    } else if (type.isBoxedType()) {
        returnType = new AvroPrimitive(implType.unboxify().name());

    // might be a String, which is special
    } else if ("java.lang.String".equals(type.fullName())) {
        returnType = new AvroPrimitive("string");

    // might be an Object, in which case it's assumed to be a reference
    } else if ("java.lang.Object".equals(type.fullName())) {
        String pName = makePackageName(_package);
        returnType = new ReferenceAvroType(pName);

    } else {
        if (type instanceof CClassInfo) {
            CClassInfo classInfo = (CClassInfo) type;
            String name = makePackageName(classInfo.getOwnerPackage())
                    + "." + classInfo.getSqueezedName();
            returnType = new DummyAvroType(name);

        } else if (type instanceof CEnumLeafInfo) {
            CEnumLeafInfo enumInfo = (CEnumLeafInfo) type;
            String name = makePackageName(enumInfo.parent.getOwnerPackage())
                    + "." + enumInfo.shortName;
            returnType = new DummyAvroType(name);

        } else if (type instanceof CElementInfo) {
            CElementInfo elemInfo = (CElementInfo) type;

            if ("javax.xml.bind.JAXBElement<java.lang.Object>".equals(elemInfo.fullName())) {
                String pName = makePackageName(_package);
                returnType = new AvroArray(new ReferenceAvroType(pName));
            } else {
                throw new SchemagenException("unknown element type: " + type.fullName());
            }

        } else if (implType instanceof JClass) {
            return avroFromSpecialTypes((JClass) implType, _package);

        } else {
            throw new SchemagenException("can't handle this type! " + type.fullName());
        }
    }

    return returnType;
}

Properties which are collections can be represented as an array type in Avro. Their type can be determined by looking at the parameterized type for the collection. Our Avro types will also require a namespace, which is provided to us by JAXB through the package names it uses. In the end, we aren't truly given 100% of the information we need to find the optimal Avro schema, but except for reference types and a few edge cases, we can create a solid Avro schema which closely follows JAXB's bindings.
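
The idea of reading the element type from a collection's type parameter can be sketched with standard Java reflection. Note that this is a stand-in: the plugin inspects XJC's model, not compiled classes, and ListElementType, SampleBean, and both methods are hypothetical names used only for illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

/** Hypothetical illustration: finds the element type of a List property via reflection. */
final class ListElementType {

    /** Stand-in for a JAXB-generated bean (not real generated code). */
    static class SampleBean {
        List<String> names;
        int count;
    }

    /** Returns X for a field declared with exactly one concrete type argument, such as List<X>. */
    static Class<?> elementTypeOf(Class<?> beanClass, String fieldName) {
        Field field;
        try {
            field = beanClass.getDeclaredField(fieldName);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field: " + fieldName, e);
        }
        Type generic = field.getGenericType();
        if (generic instanceof ParameterizedType) {
            Type[] args = ((ParameterizedType) generic).getActualTypeArguments();
            if (args.length == 1 && args[0] instanceof Class) {
                return (Class<?>) args[0];
            }
        }
        throw new IllegalArgumentException(
                fieldName + " has no single concrete type argument");
    }

    /** Renders an Avro array schema as JSON text for the given item type name. */
    static String arraySchema(String itemTypeName) {
        return "{ \"type\" : \"array\", \"items\" : \"" + itemTypeName + "\" }";
    }
}
```

For the sample bean, elementTypeOf(SampleBean.class, "names") yields String.class, which maps to an Avro array whose items are "string".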

Some smaller issues, such as Avro's lack of support for dates and inheritance, necessitate using a workaround. For xs:date types, we default to a long value. For inheritance, each child receives an instance of the parent object as one of its fields. This means that a subclass will contain an instance of its parent, rather than be directly related by any type hierarchy.
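
As a rough sketch of the date workaround, an xs:date value can be turned into a UTC millisecond timestamp with java.time. The class XsDateToLong is hypothetical and is not how the plugin necessarily performs the conversion; the sketch also assumes the lexical value carries no timezone suffix (for example "2011-05-01" rather than "2011-05-01Z").

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Hypothetical illustration of storing an xs:date as a UTC millisecond timestamp. */
final class XsDateToLong {

    /** Converts a date such as "2011-05-01" to milliseconds since the epoch at UTC midnight. */
    static long toUtcMillis(String xsDate) {
        return LocalDate.parse(xsDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    /** Reverses the conversion, interpreting the timestamp in UTC. */
    static LocalDate fromUtcMillis(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
    }
}
```

Pinning both directions to UTC keeps the stored long independent of the default timezone of the machine that reads or writes it.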

After all of the schemas have been generated, we need to compile them into Java source files. This is accomplished by using the avro-tools code from the Avro project. The compiler reads JSON schemas and outputs Java source code. One of the requirements of using this tool is that any dependencies of a schema are resolved before that particular schema is compiled. For example, if we have class A depending on class B, then B should be fed to the compiler before A. The solution I have chosen simply looks at each schema and its dependencies, and topologically sorts them so that they are processed in the correct order. In the future, using the Avro IDL instead could provide some dependency resolution through the use of 'import schema' statements.
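
The ordering step can be sketched as a depth-first topological sort. This is not the plugin's actual code: SchemaSorter and its input map (schema name to the names it references) are assumptions made for illustration, and the sketch rejects circular references rather than resolving them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: orders schema names so that dependencies come first. */
final class SchemaSorter {

    /**
     * @param dependencies maps each schema name to the schema names it references
     * @return the names ordered so that every dependency precedes its dependents
     */
    static List<String> sort(Map<String, List<String>> dependencies) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : dependencies.keySet()) {
            visit(name, dependencies, done, inProgress, ordered);
        }
        return ordered;
    }

    private static void visit(String name, Map<String, List<String>> dependencies,
                              Set<String> done, Set<String> inProgress,
                              List<String> ordered) {
        if (done.contains(name)) {
            return; // already emitted
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("circular schema dependency at " + name);
        }
        // emit everything this schema depends on before the schema itself
        for (String dep : dependencies.getOrDefault(name, Collections.<String>emptyList())) {
            visit(dep, dependencies, done, inProgress, ordered);
        }
        inProgress.remove(name);
        done.add(name);
        ordered.add(name);
    }
}
```

Feeding the resulting list to the compiler in order satisfies the requirement that B is processed before A whenever A depends on B.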

A Small Example

Given a small XML schema, let's see what it looks like after processing. Our initial input looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema
    version="1.0" elementFormDefault="qualified"
    targetNamespace="http://www.nokia.com"
    xmlns="http://www.nokia.com"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="abstractObject" type="abstractObject"/>

    <xs:complexType name="abstractObject" abstract="true">
        <xs:attribute name="id" type="xs:ID" use="required"/>
        <xs:attribute name="data" type="xs:string"/>
    </xs:complexType>

    <xs:complexType name="concreteObject">
        <xs:complexContent>
            <xs:extension base="abstractObject">
                <xs:sequence>
                    <xs:element name="types" type="enumeratedType" minOccurs="0" maxOccurs="unbounded"/>
                </xs:sequence>
            </xs:extension>
        </xs:complexContent>
    </xs:complexType>

    <xs:complexType name="someElement">
        <xs:sequence>
            <xs:element name="count" type="xs:int" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
        <xs:attribute name="token" type="xs:string"/>
    </xs:complexType>

    <xs:simpleType name="enumeratedType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="ONE"/>
            <xs:enumeration value="TWO"/>
            <xs:enumeration value="THREE"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

As can be seen, there is a normal complex type, a parent and child pair, and an enumerated type. The enumeration is used in the concreteObject type. With no customizations, JAXB will produce 4 classes:

  • AbstractObject
  • ConcreteObject
  • EnumeratedType
  • SomeElement

Likewise, the Avro schema generator will produce 4 schemas, which together look like this:

{
  "name" : "EnumeratedType",
  "namespace" : "com.nokia.avro",
  "type" : "enum",
  "symbols" : [ "ONE", "TWO", "THREE" ]
}

{
  "name" : "SomeElement",
  "namespace" : "com.nokia.avro",
  "type" : "record",
  "fields" : [ {
    "name" : "count",
    "type" : "int"
  }, {
    "name" : "token",
    "type" : [ "string", "null" ]
  } ]
}

{
  "name" : "AbstractObject",
  "namespace" : "com.nokia.avro",
  "type" : "record",
  "fields" : [ {
    "name" : "id",
    "type" : "string"
  }, {
    "name" : "data",
    "type" : [ "string", "null" ]
  } ]
}

{
  "name" : "ConcreteObject",
  "namespace" : "com.nokia.avro",
  "type" : "record",
  "fields" : [ {
    "name" : "types",
    "type" : {
      "type" : "array",
      "items" : "com.nokia.avro.EnumeratedType"
    }
  }, {
    "name" : "_parent",
    "type" : "com.nokia.avro.AbstractObject"
  } ]
}

The enum remains an enum, while the other classes become records with their properties becoming fields. In the ConcreteObject there is a reference to the AbstractObject, which contains the additional properties of the class. Again, processing these schema files in the correct order is important.
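
In Java terms, the '_parent' field turns inheritance into composition. The classes below are hand-written stand-ins, not the code the Avro compiler emits; the wrapper class ParentFieldExample and its sample() method are hypothetical and only illustrate how a caller reaches the parent's properties through the child.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the '_parent' field; not generated Avro code. */
final class ParentFieldExample {

    enum EnumeratedType { ONE, TWO, THREE }

    /** Mirrors the AbstractObject record: a required id and an optional data value. */
    static class AbstractObject {
        String id;
        String data; // optional, so it may be null
    }

    /** Mirrors the ConcreteObject record: no superclass, the parent data is a field. */
    static class ConcreteObject {
        List<EnumeratedType> types = new ArrayList<>();
        AbstractObject _parent = new AbstractObject();
    }

    /** Builds a sample object; the "inherited" id is set through the _parent field. */
    static ConcreteObject sample() {
        ConcreteObject object = new ConcreteObject();
        object._parent.id = "object-1";
        object.types.add(EnumeratedType.TWO);
        return object;
    }
}
```

Code that previously read an inherited property directly from the JAXB class reads it through one extra hop on the Avro side, for example object._parent.id.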

A Plugin for Your Plugin

The three main pieces of the schema generator - XJC, Avro schema plugin, and Avro schema compiler - can be packaged up into a single Maven plugin which takes care of the execution. This helps significantly when dealing with the complexity of getting XJC to run with a custom plugin. Now it is possible to specify all of the execution settings in a POM file and simply run the Maven plugin. This can be run manually or as part of an automatic process. The plugin declaration requires only the basics that XJC needs to run.

<plugin>
      <groupId>com.nokia.util.avro</groupId>
      <artifactId>schemagen-plugin</artifactId>

      <configuration>
             <outputDirectory>
                    ${project.build.directory}/avro
             </outputDirectory>
             <packageName>my.generated</packageName>
             <schemaFiles>
                    <file>
                           ./src/main/resources/schema.xsd
                    </file>
             </schemaFiles>
      </configuration>
</plugin>

Conclusion

Starting with one XSD schema, I was able to create a parallel JSON schema specification for Apache Avro. The added value here is that an Avro schema does not have to be written by hand, and can instead be generated from a common set of XSDs, which may be used for JAXB, web services, and other tools. Incremental changes to your XSD can be immediately represented in your Java code simply by running the schema through the Maven plugin. If you already maintain your XML format, then you're already maintaining your Avro binary format as well! And while my plugin generates Java class bindings, the same principles could be applied to other languages supported by the Apache Avro project. By keeping everything within Java, two sets of bindings are created which look and feel very similar to each other. In my own work, this has made implementing binary data views no more tedious than JAXB XML views.

XJC is a powerful tool for generating Java class bindings from XML schemas, and the plugin architecture provides an opportunity to both inspect and manipulate the code model prior to generation. Getting familiar with the inner workings of JAXB can at times be frustrating, and documentation and resources are lacking; however, the payoff of tapping into the code generation process is worthwhile. The same principles used to create the Avro schemas could conceivably be used for other formats as well, opening up many new possibilities for the XJC tool.

Information about the Apache Avro project can be found on the project's website.

About the Author

Benjamin Fagin is a software developer specializing in Java and related technologies. His interests include music software and language design. When not programming, he is likely to be found outdoors biking or hiking. He currently lives in Chicago, where he is working with Nokia on the next generation of mapping technology.

 

 

References

[1] XML Spy

[2] Smooks

[3,4] JAXB Basics

[5] JAXB Visitor Plugin

[6] JAXB Commons

[7] Kohsuke Kawaguchi, "Writing a plug-in for the JAXB RI is really easy", June 1, 2005
