
Structured Event Streaming with Smooks

Overview

Smooks is an open source Java framework for processing "Data Event Streams".  It is most commonly thought of as a Transformation framework and is used as such in quite a few products and projects, including JBoss ESB (and other ESBs).  At its core, however, Smooks makes no mention of the word "transformation", or any other such terms.  Its scope of application is wider than that!

Smooks works by converting a stream of structured/hierarchical data into a stream of "events" that can then be targeted with "Visitor Logic" for analysis, or to produce a result (optional).

Source -> Structured Event Stream (Visitor Logic) -> Result

So what can we do with this that we can't already do with SAX and DOM etc?  Well since Smooks is built on top of these technologies, the simple answer is "nothing".  What Smooks adds is the ability to consume SAX and DOM more easily (Smooks doesn't currently support a StAX based filter).  It provides a Visitor API, as well as a configuration model, that allows you to easily target the Visitor logic at specific SAX events (if using the SAX filter) or DOM elements (if using the DOM filter).  Smooks also makes it very easy to consume source data formats other than XML (EDI, CSV, JSON, Java etc) in a standard way i.e. the standard event stream generated from the data source effectively becomes a canonical form for all of these different source data formats.  This is key to how Smooks works!
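To make the idea of targeting Visitor Logic at selected events a little more concrete, here is a minimal sketch built only on the JDK's SAX API.  It is not the Smooks Visitor API: the `VisitorDispatcher` class and `ElementVisitor` interface are hypothetical names invented for this illustration, and the "selector" is simply an element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Conceptual sketch only: these types are hypothetical and are NOT part of Smooks.
class VisitorDispatcher extends DefaultHandler {

    /** Hypothetical visitor callback, fired when a selected element ends. */
    interface ElementVisitor {
        void visitAfter(String elementName, String text);
    }

    private final Map<String, List<ElementVisitor>> visitors = new HashMap<>();
    private final StringBuilder text = new StringBuilder();

    /** Target a visitor at every element with the given name (the "selector"). */
    void addVisitor(String selector, ElementVisitor visitor) {
        visitors.computeIfAbsent(selector, k -> new ArrayList<>()).add(visitor);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this (leaf) element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Fire only the visitors that were targeted at this element name.
        for (ElementVisitor v : visitors.getOrDefault(qName, Collections.<ElementVisitor>emptyList())) {
            v.visitAfter(qName, text.toString().trim());
        }
        text.setLength(0);
    }

    /** Parses the XML and returns the text of every leaf element matching the selector. */
    static List<String> collect(String xml, String selector) {
        List<String> found = new ArrayList<>();
        VisitorDispatcher dispatcher = new VisitorDispatcher();
        dispatcher.addVisitor(selector, (name, value) -> found.add(value));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), dispatcher);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse source", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<order><item>apple</item><item>pear</item><note>rush</note></order>";
        System.out.println(collect(xml, "item")); // prints [apple, pear]
    }
}
```

The point of the sketch is the shape of the approach: the handler sees every event in the stream, but only the logic registered against a matching selector is fired.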

Smooks can be used in one, or both, of the following ways:

  • Mode 1: You can get down and dirty with Smooks by writing your own custom Visitor Logic event handlers that can be used to process selected events from a data source event stream.  In this mode, you need to get familiar with the core APIs.
  • Mode 2: You can reuse the growing number of out-of-the-box solutions provided with the Smooks distribution.  In this mode, you are simply reusing components that were created by others, by reconfiguring them to process your data sources e.g. configuring the Java Binding components so as to populate a Java object model from an EDI data source.

In this article, we will do a whistle-stop tour of some of the capabilities provided by the Smooks v1.1 distribution, out of the box.  By this we mean capabilities you can take advantage of without writing any code (ala mode #2 above).  These include:

  1. Multiple Source Data Formats:  "Easily" consume a number of popular data formats including XML, EDI, CSV, JSON and Java (yes, you can perform java to java transforms in a declarative manner). 
  2. Transformation:  Take advantage of numerous options that allow you to consume the event stream produced from a data source, to produce a result of some other format (i.e. to "transform" the source).  This includes the ability to apply FreeMarker and XSL Templates to data models captured by Smooks as it filters the source data stream.  Since these template resources can be applied to events deep in the source data event stream, they can be used to perform "fragment based transforms".  By this we mean Smooks can be used to apply transformations to fragments of a data source, as opposed to only being able to apply transforms to the data source as a whole.
  3. Java Binding:  Create and populate Java Object models/graphs from any of the supported data formats (i.e. not just XML), in a standard way.  The configurations for populating an object model from an EDI data source look exactly the same as that for an XML data source, or a JSON data source.  The only difference will be the "event selector" values used on the binding configurations.
  4. Huge Message Processing:  Through its SAX based Filter, Smooks enables processing of huge messages (GBs) in a number of ways.  By combining fragment based transforms, Java Bindings and the mixing of DOM and SAX models using NodeModels, Smooks can process huge messages with a low memory footprint.  These capabilities allow you to perform straight 1 to 1 transforms, as well as 1 to n splitting, transforming and routing, of huge message data streams.
  5. Message Enrichment:  Enrich messages with data from a database.  This can be done on a fragment basis i.e. you can manage enrichment on a fragment by fragment basis.  An example of where this might be relevant would be a use case in which a message containing a list of customer IDs needs to be enriched with customer addressing and profile data from a database, before the message is forwarded to another process.
  6. Extensible XSD based Configuration Model:  Since Smooks v1.1, you can extend the Smooks XSD configuration namespace with configuration models for your own reusable custom Visitor Logic.  Creating these custom configuration extensions is a simple configuration task that greatly improves the usability of these reusable components.  All existing out of the box Smooks components utilize this facility.

Processing Different Data Formats

One of the key features of Smooks is the ability to easily configure it to process data of different formats (i.e. not just XML) in a standard way.  This means that if you develop some custom Visitor Logic for Smooks, that code will immediately be able to process any of the supported data formats, just as the Smooks out of the box components (Java Binding etc) are able to do.  Allied to this, if you develop a custom Reader implementation for a data format that is not supported out of the box (e.g. YAML), you immediately inherit the ability to use all available out of the box Visitor Logic (e.g. the Java Binding components) to process the data events generated from data streams of that type.  This is possible because Smooks components process a standardized event stream (i.e. a canonical form).

Out of the box, Smooks provides support for processing XML, EDI, CSV, JSON and Java Objects.  By default, Smooks reads the source data stream as XML (unless otherwise configured).  The exception to this is Java Object Sources, which can be automatically recognized.  For all other data format types, a "Reader" must be configured in the Smooks configuration.  The following is an example of configuring the CSV reader:

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:csv="http://www.milyn.org/xsd/smooks/csv-1.1.xsd">

    <csv:reader fields="firstname,lastname,gender,age,country" separator="|" quote="'" skipLines="1" />

</smooks-resource-list>

Readers for EDI, JSON etc are similarly configured via unique configurations namespaces i.e. <edi:reader></edi:reader>, <json:reader></json:reader> etc.  These namespaced configurations are supported via the Extensible XSD based Configuration Model outlined earlier.

The job of the configured Reader is that of translating the source data stream into a structured data event stream (i.e. the canonical form - currently based on SAX2).  Smooks listens to this stream of events, firing configured Visitor Logic (e.g. templating or binding resources) at the appropriate times.
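As a rough illustration of what a Reader does, the following sketch turns CSV text into SAX events and serializes those events as XML using only JDK classes.  It is not the Smooks CSV Reader: the `CsvEventStream` class is a hypothetical name, and quote handling and `skipLines` are deliberately left out.  The `csv-set` and `csv-record` element names mirror those used in the configurations in this article.

```java
import java.io.StringWriter;
import java.util.regex.Pattern;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Conceptual sketch only: a hypothetical stand-in for a Reader, NOT the Smooks CSV Reader.
class CsvEventStream {

    /** Emits one csv-set element, one csv-record per line and one element per field. */
    static void stream(String csv, String[] fields, String separator, ContentHandler handler)
            throws SAXException {
        AttributesImpl noAtts = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "csv-set", "csv-set", noAtts);
        for (String line : csv.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            // Quote the separator so that characters such as '|' are treated literally.
            String[] values = line.split(Pattern.quote(separator), -1);
            handler.startElement("", "csv-record", "csv-record", noAtts);
            for (int i = 0; i < fields.length && i < values.length; i++) {
                handler.startElement("", fields[i], fields[i], noAtts);
                char[] chars = values[i].toCharArray();
                handler.characters(chars, 0, chars.length);
                handler.endElement("", fields[i], fields[i]);
            }
            handler.endElement("", "csv-record", "csv-record");
        }
        handler.endElement("", "csv-set", "csv-set");
        handler.endDocument();
    }

    /** Serializes the generated event stream as XML, which is one way to visualize it. */
    static String toXml(String csv, String[] fields, String separator) {
        try {
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));
            stream(csv, fields, separator, serializer);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to stream CSV", e);
        }
    }

    public static void main(String[] args) {
        String[] fields = {"firstname", "lastname", "gender", "age", "country"};
        System.out.println(toXml("Jane|Doe|Female|30|Ireland\nJohn|Doe|Male|31|Ireland", fields, "|"));
    }
}
```

Because the output of `stream` is just a series of `ContentHandler` calls, any logic written against that event stream works the same whether the events originated from XML or from CSV.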

Executing a Smooks Filter Process

This is straightforward:

private Smooks smooks = new Smooks("/smooks-configs/customer-csv.xml");

public void transCustomerCSV(Reader csvSourceReader, Writer xmlResultWriter) {
    smooks.filter(new StreamSource(csvSourceReader), new StreamResult(xmlResultWriter));
}

The Smooks.filter() method consumes the standard javax.xml.transform.Source and javax.xml.transform.Result types.  The Smooks project also defines a number of new implementations.

Visualizing non XML Structured Data Event Stream

XML is the easiest visualization of the event stream generated by a source data stream.  So for an XML source, there's no real issue.  For a non XML source (e.g. CSV), it's not so easy, since the source typically looks nothing like XML.  To help with this, Smooks provides an Execution Report Generator tool.  One of the uses of this tool is that of helping you visualize the event stream generated by a non XML data source, as XML.  It's also very useful as a debugging tool.

This report generation tool is set as an event listener on the Smooks ExecutionContext:

private Smooks smooks = new Smooks("/smooks-configs/customer-csv.xml");

public void transCustomerCSV(Reader csvSourceReader, Writer xmlResultWriter) {
    ExecutionContext executionContext = smooks.createExecutionContext();

    executionContext.setEventListener(new HtmlReportGenerator("target/report/report.html"));
    smooks.filter(new StreamSource(csvSourceReader), new StreamResult(xmlResultWriter), executionContext);
}

The output of this is an HTML report page (as generated by Smooks v1.1).

JBoss are in the process of building an Eclipse editor for Smooks as part of JBoss Tools.  These tools will further simplify the process of visualizing, and working with, non XML data source event streams.

Split, Transform and Route

This use case demonstrates how a number of Smooks capabilities can be combined to perform a more complex task.

Continuing with the CSV example, we have the following basic requirements:

  1. The CSV stream is potentially huge, so we need to use the SAX filter.
  2. We need to route each CSV record to a JMS endpoint, as XML.  This means we need to split, transform and route the message.

Smooks provides support for applying fragment based transforms using a number of popular templating technologies, including XSL and FreeMarker.  Smooks also provides the ability to capture DOM NodeModels from the source event stream (again the source can be non XML), even when the SAX Filter is in use.  With this, Smooks constructs "mini" DOM models from source data fragments and makes them available to other Smooks resources, such as FreeMarker templating and Groovy scripting resources.  With this approach, you get some of the benefits of the DOM processing model, while still processing in a streamed environment.  For the outlined use case, we will use FreeMarker as the templating technology.

Smooks also provides out of the box support for routing data fragments (generated from source data fragments) to a number of different endpoint types, namely JMS, File and Database.  As with everything else in Smooks, such capabilities can always be built on or replicated to other use cases e.g. plugging in a custom email routing Visitor component would be trivial.  JBoss ESB (and other ESBs) provide custom Smooks Visitor components for performing fragment based ESB endpoint routing from inside a Smooks filtering process running on the ESB.

So configuring Smooks to fulfill the above use case is quite trivial:

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:csv="http://www.milyn.org/xsd/smooks/csv-1.1.xsd"
                      xmlns:jms="http://www.milyn.org/xsd/smooks/jms-routing-1.1.xsd"
                      xmlns:ftl="http://www.milyn.org/xsd/smooks/freemarker-1.1.xsd">

    <params>
        <!-- (1) -->
        <param name="stream.filter.type">SAX</param>
    </params>

    <!-- (2) -->
    <csv:reader fields="firstname,lastname,gender,age,country" separator="|" quote="'" skipLines="1" />

    <!-- (3) -->
    <resource-config selector="csv-record">
        <resource>org.milyn.delivery.DomModelCreator</resource>
    </resource-config>

    <!-- (4) -->
    <jms:router routeOnElement="csv-record" beanId="csv_record_as_xml" destination="xmlRecords.JMS.Queue" />

    <!-- (5) -->
    <ftl:freemarker applyOnElement="csv-record">
        <!-- (5.a) -->
        <ftl:template>/templates/csv_record_as_xml.ftl</ftl:template>
        <ftl:use>
            <!-- (5.b) -->
            <ftl:bindTo id="csv_record_as_xml"/>
        </ftl:use>
    </ftl:freemarker>

</smooks-resource-list>

  1. Configuration (1) instructs Smooks to use the SAX filter.
  2. Configuration (2) instructs Smooks to use the CSV Reader, with the supplied configuration.
  3. Configuration (3) instructs Smooks to create NodeModels for the record fragments (see the earlier Execution Report).  The NodeModel for each record will overwrite the NodeModel generated for the previous fragment, so there's never more than one CSV record in memory (as a NodeModel) at any given time.
  4. Configuration (4) instructs Smooks to route the contents of beanId "csv_record_as_xml", to the specified JMS destination, at the end of every fragment.
  5. Configuration (5) instructs Smooks to apply the specified FreeMarker template (5.a) at the end of every fragment.  The result of the templating operation is to be bound into beanId "csv_record_as_xml" (5.b).
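
The split, transform and route flow configured above can be sketched in plain JDK code as follows.  This is only an approximation of the idea, not how Smooks implements it: `FragmentSplitter` is a hypothetical class, a hand-written string builder stands in for the FreeMarker template, a `List` stands in for the JMS destination, and attributes and XML escaping are ignored for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Conceptual sketch only: hypothetical class, NOT the Smooks DomModelCreator or JMS router.
class FragmentSplitter extends DefaultHandler {

    private final String fragmentName;
    private final Consumer<Element> onFragment;
    private final DocumentBuilder builder;
    private Node current; // null while the stream is outside a fragment

    FragmentSplitter(String fragmentName, Consumer<Element> onFragment) {
        this.fragmentName = fragmentName;
        this.onFragment = onFragment;
        try {
            this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null) {
            if (!qName.equals(fragmentName)) {
                return; // not inside a fragment: nothing is retained in memory
            }
            // A fresh "mini" DOM per fragment; the previous one becomes garbage.
            Document doc = builder.newDocument();
            current = doc.appendChild(doc.createElement(qName));
        } else {
            current = current.appendChild(current.getOwnerDocument().createElement(qName));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(
                    current.getOwnerDocument().createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        Node parent = current.getParentNode();
        if (parent instanceof Document) {
            onFragment.accept((Element) current); // end of fragment: hand the model on
            current = null;
        } else {
            current = parent;
        }
    }

    private static String child(Element record, String name) {
        return record.getElementsByTagName(name).item(0).getTextContent();
    }

    /** Stand-in for the template step (values are not XML-escaped in this sketch). */
    static String toCustomerXml(Element record) {
        return "<customer fname='" + child(record, "firstname")
                + "' lname='" + child(record, "lastname") + "'/>";
    }

    /** Splits the stream on csv-record and "routes" one message per record to a list. */
    static List<String> splitAndRoute(String eventStreamAsXml) {
        List<String> queue = new ArrayList<>(); // stand-in for a JMS destination
        FragmentSplitter splitter =
                new FragmentSplitter("csv-record", record -> queue.add(toCustomerXml(record)));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(eventStreamAsXml)), splitter);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to process stream", e);
        }
        return queue;
    }

    public static void main(String[] args) {
        String xml = "<csv-set>"
                + "<csv-record><firstname>Jane</firstname><lastname>Doe</lastname></csv-record>"
                + "<csv-record><firstname>John</firstname><lastname>Roe</lastname></csv-record>"
                + "</csv-set>";
        splitAndRoute(xml).forEach(System.out::println);
    }
}
```

The design point mirrored here is that only one record model exists at a time, so memory use stays flat regardless of how many records the stream contains.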

The FreeMarker template (5.a) can also be defined inline in the Smooks configuration (inside the <ftl:template></ftl:template> element), but in this case we define it in an external file:

<#assign csvRecord = .vars["csv-record"]> <#-- special assignment because csv-record has a hyphen -->
<customer fname='${csvRecord.firstname}' lname='${csvRecord.lastname}'>
    <gender>${csvRecord.gender}</gender>
    <age>${csvRecord.age}</age>
    <nationality>${csvRecord.country}</nationality>
</customer>

The above FreeMarker template references the fragment NodeModel.

Java Binding

Smooks can be effectively used to populate Java Object models from any supported source data format.  The populated Object model can be used as a result in its own right, or can be used as a model for a templating operation i.e. the populated object models (stored in the bean context) are made available to the templating technologies (just like with the NodeModels).

Continuing with the CSV example, we have a Customer Java class, as well as the Gender enum type (getters/setters omitted):

public class Customer {
    private String firstName;
    private String lastName;
    private Gender gender;
    private int age;
}

public enum Gender {
    Male,
    Female
}

The Smooks configuration for populating a list of this Customer object from the CSV stream would be as follows:

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:csv="http://www.milyn.org/xsd/smooks/csv-1.1.xsd"
                      xmlns:jb="http://www.milyn.org/xsd/smooks/javabean-1.1.xsd">

    <!-- (1) -->
    <csv:reader fields="firstname,lastname,gender,age,country" separator="|" quote="'" skipLines="1" />

    <!-- (2) -->
    <jb:bindings beanId="customerList" class="java.util.ArrayList" createOnElement="csv-set">
        <!-- (2.a) -->
        <jb:wiring beanIdRef="customer" />
    </jb:bindings>

    <!-- (3) -->
    <jb:bindings beanId="customer" class="com.acme.Customer" createOnElement="csv-record">
        <jb:value property="firstName" data="csv-record/firstname" />
        <jb:value property="lastName" data="csv-record/lastname" />
        <jb:value property="gender" data="csv-record/gender" decoder="Enum" >
            <!-- (3.a) -->
            <jb:decodeParam name="enumType">com.acme.Gender</jb:decodeParam>
        </jb:value>
        <jb:value property="age" data="csv-record/age" decoder="Integer" />
    </jb:bindings>

</smooks-resource-list>

  1. Configuration (1) instructs Smooks to use the CSV Reader, with the supplied configuration.
  2. Configuration (2) instructs Smooks to create an instance of an ArrayList when we encounter the start of the message (the <csv-set> element) and bind it into the bean context under the beanId "customerList".  We want to wire in (2.a) instances of the "customer" bean (3) into this ArrayList.
  3. Configuration (3) instructs Smooks to create instances of the Customer class when it encounters the start of every <csv-record> element.  Each of the <jb:value> elements defines a value binding, selecting data from the event stream and binding that data's decoded value into a specific property of the current Customer instance.  Configuration (3.a) tells Smooks to use the Enum decoder for the Gender property.
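
For readers who want to picture what these bindings amount to, here is a hand-coded equivalent written against the JDK's SAX API.  It is a sketch under stated assumptions and not the Smooks Java Binding implementation: `CustomerBinder` is a hypothetical class, the nested `Customer` and `Gender` types are simplified copies of the ones above, and the input is assumed to be the CSV event stream rendered as XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Conceptual sketch only: a hand-written equivalent of the bindings, NOT Smooks code.
class CustomerBinder extends DefaultHandler {

    enum Gender { Male, Female }

    static class Customer {
        String firstName;
        String lastName;
        Gender gender;
        int age;
    }

    private final List<Customer> customerList = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Customer customer; // the bean currently being populated, if any

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("csv-record")) {
            // Like createOnElement="csv-record", plus the wiring into customerList.
            customer = new Customer();
            customerList.add(customer);
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (customer == null) {
            return; // outside a record
        }
        String value = text.toString().trim();
        switch (qName) {
            case "firstname": customer.firstName = value; break;
            case "lastname":  customer.lastName = value; break;
            case "gender":    customer.gender = Gender.valueOf(value); break; // "Enum" decoding
            case "age":       customer.age = Integer.parseInt(value); break;  // "Integer" decoding
            case "csv-record": customer = null; break;
            default: break; // unbound fields (e.g. country) are ignored
        }
        text.setLength(0);
    }

    /** Populates a list of customers from the event stream rendered as XML. */
    static List<Customer> bind(String eventStreamAsXml) {
        CustomerBinder binder = new CustomerBinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(eventStreamAsXml)), binder);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to bind customers", e);
        }
        return binder.customerList;
    }

    public static void main(String[] args) {
        List<Customer> customers = bind("<csv-set><csv-record><firstname>Jane</firstname>"
                + "<lastname>Doe</lastname><gender>Female</gender><age>30</age>"
                + "<country>Ireland</country></csv-record></csv-set>");
        Customer c = customers.get(0);
        System.out.println(c.firstName + " " + c.lastName + ", " + c.gender + ", " + c.age);
    }
}
```

The declarative configuration replaces exactly this kind of boilerplate, and unlike the hand-coded handler it does not need to change when the source format changes, only the selectors do.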

Of course, a twist on the earlier Split, Transform and Route use case might be to route populated Customer objects to the JMS Queue, instead of XML generated by a FreeMarker template:

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:csv="http://www.milyn.org/xsd/smooks/csv-1.1.xsd"
                      xmlns:jms="http://www.milyn.org/xsd/smooks/jms-routing-1.1.xsd"
                      xmlns:jb="http://www.milyn.org/xsd/smooks/javabean-1.1.xsd">

    <params>
        <param name="stream.filter.type">SAX</param>
    </params>

    <csv:reader fields="firstname,lastname,gender,age,country" separator="|" quote="'" skipLines="1" />

    <jms:router routeOnElement="csv-record" beanId="customer" destination="xmlRecords.JMS.Queue" />

    <jb:bindings beanId="customer" class="com.acme.Customer" createOnElement="csv-record">
        <jb:value property="firstName" data="csv-record/firstname" />
        <jb:value property="lastName" data="csv-record/lastname" />
        <jb:value property="gender" data="csv-record/gender" decoder="Enum" >
            <jb:decodeParam name="enumType">com.acme.Gender</jb:decodeParam>
        </jb:value>
        <jb:value property="age" data="csv-record/age" decoder="Integer" />
    </jb:bindings>

</smooks-resource-list>

And getting more complex, one could perform multiple routing operations for each csv-record, routing Customer Objects to the JMS Queue and FreeMarker generated XML messages to file.

Performance

Inevitably, this question arises again and again.  We have performed numerous ad hoc benchmarks on Smooks and our general findings were as follows:

  • Smooks Core Filtering Overhead:  Smooks Core processing of XML via the SAX filter (using Xerces as the XMLReader), with no configured Visitor logic, adds an overhead of approximately 5 to 10 percent on top of straight SAX processing with the same SAX parser.

  • Smooks Templating Overhead:  On earlier versions of Smooks, we performed some benchmarking to establish the overhead incurred when applying XSLT via Smooks versus applying it standalone.  Smooks then (and now) only supports XSLT via the DOM filter.  When comparing DOM based application of XSLT, Smooks adds about 5 to 15 percent overhead, depending on the XSL Processor.

  • Smooks Java Binding Overhead:  Our findings here are based purely on a comparison with one of the main open source XML to Java Binding frameworks.  What we found was that Smooks was marginally slower for smaller messages (i.e. < 10K), but was faster for larger messages.

Smooks is in use in quite a few mission critical production environments today.  Any time we have received queries about performance, the cause has always been a configuration issue (e.g. leaving Execution Report Generation turned on).  Once resolved, users have always been very happy with performance.  This is not very empirical, but does suggest to us that Smooks is not a "dog" in respect of performance.

The bottom line seems to be that Smooks Core is quite efficient, only adding a relatively low overhead on top of standard SAX based processing for XML.  After that, performance depends on the configured Visitor Logic, what it is doing and how efficient it is.

What Next for Smooks?

The primary focus of Smooks v1.2 will be on providing more tools for processing of EDI messages.  We also want to provide out of the box support for some of the more popular EDI message types.

As stated earlier, another important development for Smooks will be the work going on in the JBoss Tools project, where they are building an Eclipse Editor for Smooks.

Conclusion

Hopefully this article has given the reader a better insight into Smooks and its core capabilities.  We hope people will download Smooks, take it for a spin and provide feedback.
