AtomServer – The Power of Publishing for Data Distribution
Consider this – you work for a company with a federation of fully independent web sites, implemented in half a dozen different programming languages, on several different platforms. Each independent website has its own database system and schema, and is managed by teams with varying skill sets, located in eight sites throughout the United States and Europe. And the company is growing.
Your job? Enable these disparate systems to share crucial data conveniently and rapidly amongst themselves.
Your design criteria are:
High Traffic Capability – the service must move approximately 1M pieces of data a day at launch
Transactional Correctness – the service must be accurate as the authoritative source of data for all clients
Resiliency – the service must be easy to upgrade with seamless data republishing when formats change
Loose Coupling – with so many systems, each must be able to manage itself independently
Adoption – the system must have a low barrier to entry for clients implemented in a variety of languages (Java, C#, PHP, Ruby, and ColdFusion)
Adaptability – the system must support many different types of data and be extensible to add new types of data on demand
We were faced with exactly this problem about a year ago at Homeaway.com, where we both work. It didn’t take long to recognize two design tenets – first, that a distributed, publish-subscribe service is a great way to address resiliency and loose coupling of subsystems, and second, that building RESTful services (as opposed to heavyweight protocols like SOAP) is a natural solution for systems that need high scalability, extensibility, and ease of adoption. These two principles led us directly to Atom – a RESTful publishing protocol – and to a new breed of data service called an Atom Store. We’ve spent the last year implementing an Atom Store for Homeaway. From that real-world implementation we have extracted the open source Atom Store framework described in this article, named AtomServer (http://www.atomserver.org).
Atom comprises two specifications – the Atom Syndication Format, which defines an XML-based language to describe web feeds, and the Atom Publishing Protocol, which describes a RESTful HTTP protocol for retrieving and manipulating such feeds.
Atom was conceived as a replacement for RSS (Rich Site Summary), which generally contains human authored text, such as blog entries. Consequently, the internal structure of an Atom entry or feed (the XML elements and attributes) conveys the semantics of publishing such as authors, languages, titles and so on. Don’t let this fool you; Atom entries are well suited for carrying all sorts of data as their payload.
Atom entries are the individual records of data, and Atom feeds are lists of entries. Because Atom is a RESTful protocol, resources are accessed by executing HTTP methods on URIs that identify resources – in this case, entries and feeds. For example, retrieving a feed of blog entries might be accomplished by doing a GET on a URI like http://your-atomserver/entries/myblog, and the response might look like:
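As a rough illustration (a hypothetical sample, not actual AtomServer output), the sketch below embeds a minimal Atom feed with made-up ids, titles, and dates, and reads its entries with Python's standard XML parser. The helper name `read_entries` is an assumption introduced here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical sample feed: every id, title, and date below is made up.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>myblog</title>
  <id>tag:example.org,2008:myblog</id>
  <updated>2008-06-02T09:30:00Z</updated>
  <entry>
    <id>tag:example.org,2008:myblog/1</id>
    <title>First post</title>
    <updated>2008-06-01T12:00:00Z</updated>
    <content type="text">Hello, Atom.</content>
  </entry>
  <entry>
    <id>tag:example.org,2008:myblog/2</id>
    <title>Second post</title>
    <updated>2008-06-02T09:30:00Z</updated>
    <content type="text">Entries can carry any payload.</content>
  </entry>
</feed>
"""


def read_entries(feed_xml):
    """Return one dict per <entry> in an Atom feed document."""
    feed = ET.fromstring(feed_xml)
    entries = []
    for entry in feed.findall(f"{{{ATOM_NS}}}entry"):
        entries.append({
            "id": entry.findtext(f"{{{ATOM_NS}}}id"),
            "title": entry.findtext(f"{{{ATOM_NS}}}title"),
            "content": entry.findtext(f"{{{ATOM_NS}}}content"),
        })
    return entries


for item in read_entries(SAMPLE_FEED):
    print(item["id"], "-", item["title"])
```

The feed element wraps a list of entry elements, and each entry carries its own id, title, update time, and content payload.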
Atom also supports the notion of categories that can be applied to an entry. Categories are arbitrary string tags that can be applied to an entry for the purposes of marking it as part of some group. Feed URIs can then be modified to incorporate filtering – returning only the entries from a feed that have a particular category applied to them.
There are a couple of terms that have special meaning in the context of Atom. Entries are grouped together in a fixed two-level hierarchy of workspaces and collections. Workspaces contain some number of collections, and collections contain some number of entries. The URI to a feed consists of the workspace and collection of the feed
and the URI to an entry consists of the workspace, collection, and EntryId. This identifier for the entry must be unique within the owning collection:
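A minimal sketch of that layout, assuming the simplified workspace/collection/entry scheme used for the examples in this article and the placeholder host `your-atomserver`. The helper names `feed_uri` and `entry_uri` are hypothetical.

```python
from urllib.parse import quote

BASE = "http://your-atomserver"  # placeholder host, not a real server


def feed_uri(workspace, collection, base=BASE):
    """URI of a feed: the workspace, then the collection."""
    return "/".join([base, quote(workspace, safe=""), quote(collection, safe="")])


def entry_uri(workspace, collection, entry_id, base=BASE):
    """URI of an entry: the feed URI plus the entry's id within that collection."""
    return feed_uri(workspace, collection, base) + "/" + quote(entry_id, safe="")


print(feed_uri("widgets", "acme"))
print(entry_uri("widgets", "acme", "123"))
```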
Atom, like RSS, provides the basis for a web syndication framework. There are a large number of existing clients that understand Atom, including browsers, newsreaders, and programmatic clients in practically every popular language. Additionally, because Atom is actually just a small set of conventions on top of HTTP and XML, it is easily accessible by any web-ready programming platform.
Extensions and Layered Protocols
The core Atom protocol describes the basic operations for manipulating feeds and entries, methods for error reporting and handling, and specifically provides for the concept of extensions. Atom extensions are additional XML elements and attributes that can appear in Atom XML documents, and additional HTTP request parameters that can be applied to a URI to modify the behavior of a server that supports Atom. Two important extensions are:
OpenSearch – defines a protocol for searching, including methods to introspect the server to determine what kinds of searches are supported. An OpenSearch-enabled service returns search results as an Atom feed, with individual results represented as entries.
Feed Paging – addresses pagination for time-based data by defining next and previous link types for Atom feeds, which clients can use to page forward and back through a multi-page feed.
Probably the most visible and influential use of Atom is GData, Google’s web API for accessing the data from their many services. GData incorporates the core Atom spec and OpenSearch, as well as a number of custom extensions to cover additional features not addressed by those specifications.
When you no longer limit Atom entries to web content like blogs or news feeds, and extend Atom to the management of general data, you have an Atom Store: a generic data store of inter-linked Atom entries, which you can edit using the Atom Publishing Protocol, and then search over using OpenSearch. AtomServer grew out of a desire to leverage this strategy for distributing access to our own data.
Distributed Control Over Data Access
One of the benefits of our use of the Atom protocol is the inherently distributed nature of the system. In AtomServer, we combined the notion of pagination (limiting the number of entries that we return with each request) with the mechanism for polling the server for feed updates. Clients should start at the beginning of the feed, and continue requesting pages of data until there are no more. Then, the client should periodically check for new data (i.e., whether there is another page of data past the last one that was successfully processed).
AtomServer itself simply marks each entry with an incrementing counter on every change, and clients are required to store the last value of the counter that was processed for each feed they read. On subsequent polls for a feed, the start-index of the next page of data is set to the end-index of the previous page. This method works even when AtomServer handles a high volume of rapidly changing data, where multiple entries could change within a single second.
When you pull a feed with no start-index, you start at 0 by default. For example,
will return an extension tag, named endIndex in the http://atomserver.org/namespaces/1.0/ namespace, as a child of the feed element. This will contain the last index on the retrieved page:
That number should be passed as the start-index query parameter on the next poll:
When the requested page has no new data to offer, the server will return a 304 NOT MODIFIED response. This signals that the client should wait for its configured polling interval before asking for the next page of data.
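The polling cycle described above can be sketched as a small client loop. This is an illustration under stated assumptions, not AtomServer's own client code: `poll_feed` and `make_fake_fetch` are hypothetical names, and the fake server below only mimics the behavior described in the text (pages of entries, an endIndex element in the AtomServer namespace, and a 304 when nothing is new) so the loop can run without a network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

ATOM_NS = "http://www.w3.org/2005/Atom"
ATOMSERVER_NS = "http://atomserver.org/namespaces/1.0/"


def poll_feed(fetch, feed_uri, start_index=0):
    """Request pages until the server answers 304.

    `fetch` is any callable taking a URL and returning (status, body).
    Returns the collected entry elements and the index to store for the next poll.
    """
    entries = []
    while True:
        status, body = fetch(f"{feed_uri}?start-index={start_index}")
        if status == 304:  # nothing new: wait for the polling interval
            return entries, start_index
        feed = ET.fromstring(body)
        entries.extend(feed.findall(f"{{{ATOM_NS}}}entry"))
        # The endIndex of this page becomes the start-index of the next request.
        start_index = int(feed.findtext(f"{{{ATOMSERVER_NS}}}endIndex"))


def make_fake_fetch(change_indexes, page_size=2):
    """Stand-in for the HTTP layer, used only to exercise poll_feed."""
    def fetch(url):
        start = int(parse_qs(urlparse(url).query)["start-index"][0])
        page = [i for i in sorted(change_indexes) if i > start][:page_size]
        if not page:
            return 304, ""
        items = "".join(f"<entry><id>entry-{i}</id></entry>" for i in page)
        body = (f'<feed xmlns="{ATOM_NS}" xmlns:asv="{ATOMSERVER_NS}">'
                f"<asv:endIndex>{page[-1]}</asv:endIndex>{items}</feed>")
        return 200, body
    return fetch


fetch = make_fake_fetch([3, 7, 8, 12, 15])
new_entries, last_index = poll_feed(fetch, "http://your-atomserver/widgets/acme")
print(len(new_entries), last_index)
```

A real client would persist the returned index per feed and pass it back as `start_index` on the next poll, after sleeping for its polling interval.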
Managing Entry Identity with POST and PUT
In Atom, it is essential that each entry have an ID that is unique within its owning workspace and collection – these three components together make the entry’s URI a unique identifier for the entry, differentiable from the other entries in the service. AtomServer supports creation of new entries using two different HTTP methods – POST and PUT.
When an entry is created with a POST, the URI that is used is the URI to the collection in which the new entry is to be inserted. In this case, AtomServer is responsible for assigning an Entry Id to the new Entry, and providing that ID back to the POST caller in the response body.
When an entry is created with a PUT instead, the responsibility for assigning the Entry Id is placed on the client doing the PUT. The URI to which the PUT is made is the URI to the entry that is to be created.
Updating an existing entry is done by making a PUT to the Entry’s URI – in that sense creating an Entry with a PUT is like a “lazy” update. If there is no such Entry, it is created, otherwise the existing entry is updated.
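To make the difference concrete, here is a minimal sketch that builds (but does not send) the two kinds of request with Python's standard library. The function names and the example URIs are assumptions for illustration; only the choice of method and target URI reflects the description above.

```python
from urllib.request import Request

ATOM_ENTRY_TYPE = "application/atom+xml;type=entry"


def create_with_post(collection_uri, entry_xml):
    """POST to the collection: the server assigns the Entry Id."""
    return Request(collection_uri, data=entry_xml.encode("utf-8"), method="POST",
                   headers={"Content-Type": ATOM_ENTRY_TYPE})


def create_or_update_with_put(entry_uri, entry_xml):
    """PUT to the entry URI: the client chooses the Entry Id.

    The same request creates the entry if it is missing and updates it otherwise.
    """
    return Request(entry_uri, data=entry_xml.encode("utf-8"), method="PUT",
                   headers={"Content-Type": ATOM_ENTRY_TYPE})


entry = '<entry xmlns="http://www.w3.org/2005/Atom"><title>Widget</title></entry>'
post_req = create_with_post("http://your-atomserver/widgets/acme", entry)
put_req = create_or_update_with_put("http://your-atomserver/widgets/acme/123", entry)
print(post_req.get_method(), post_req.full_url)
print(put_req.get_method(), put_req.full_url)
```

Either request could then be sent with `urllib.request.urlopen`; that step is left out so the sketch runs without a server.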
Guaranteeing Data Integrity with Optimistic Concurrency
In order to guarantee consistent, predictable data in the highly distributed world of Atom, AtomServer uses Optimistic Concurrency to manage writes to the system. Optimistic concurrency states that a writer to the AtomServer must know the current revision number of the resource he is editing, and that he should write to the resource assuming that he will be able to complete the write operation, but be able to gracefully handle the case where someone has written the resource in the meantime.
For example, assume that systems A and B both want to make changes to the Acme Widgets in a given data feed. A comes along, and asks for the current representation of Acme Widget 123:
A receives the following “edit link” within its feed response:
<link href="/widgets/acme/123.xml/2" rel="edit"/>
And now A sets about making its edits to the representation of 123.xml. While A is doing its work, B comes along and requests the current version of 123.xml, and gets the same response. B’s edits take less time than A’s, so B immediately writes its changes back to the edit link:
This succeeds, returning a 200 OK, letting B know that its edit has been committed successfully to the AtomServer. Now, A finishes its edit and tries to write to the same edit link, but the revision number has been updated due to B’s edit. Consequently, A will receive a 409 CONFLICT HTTP error, indicating that someone has changed the resource he is attempting to update since he last refreshed his view of the resource from the server. In this case, A should GET the resource again, this time getting a new edit link, and repeat the process. Note that this allows A to make his changes to a copy of /widgets/acme/123.xml that already contains B’s changes, so the system prevents A from overwriting B’s changes blindly.
In many systems, there will be only a single, authoritative writer for a given set of data. In those cases, to reduce the overhead of managing Optimistic Concurrency, it is possible to override optimistic concurrency by providing an asterisk (*) as the revision number:
However, it is important to use this feature only when a client knows for certain that it will be the only writer to a given resource.
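The exchange between A and B can be simulated with a small in-memory model. This is only a sketch: `RevisionedStore` and `RevisionConflict` are hypothetical names, and the model assumes the number at the end of an edit link is the revision the server expects the next write to name, which is how the example above reads but is not confirmed server behavior.

```python
class RevisionConflict(Exception):
    """Stands in for an HTTP 409 CONFLICT response."""


class RevisionedStore:
    """In-memory stand-in for a server that checks revisions on write."""

    def __init__(self, initial):
        # path -> (content, revision number the next write must name)
        self._entries = dict(initial)

    def get(self, path):
        """Return the current content and an edit link ending in the expected revision."""
        content, expected = self._entries[path]
        return content, f"{path}/{expected}"

    def put(self, edit_link, content):
        """Write through an edit link; '*' as the revision skips the check."""
        path, _, revision = edit_link.rpartition("/")
        _, expected = self._entries[path]
        if revision != "*" and int(revision) != expected:
            raise RevisionConflict(f"409 CONFLICT: expected revision {expected}")
        self._entries[path] = (content, expected + 1)
        return 200


PATH = "/widgets/acme/123.xml"
store = RevisionedStore({PATH: ("original", 2)})

_, link_a = store.get(PATH)        # A reads: edit link ends in /2
_, link_b = store.get(PATH)        # B reads the same revision
store.put(link_b, "B's edit")      # B writes first and succeeds

try:
    store.put(link_a, "A's edit")  # A's link is now stale
except RevisionConflict:
    latest, link_a = store.get(PATH)            # A refreshes and sees B's edit
    store.put(link_a, latest + " + A's edit")   # A reapplies on top of it

print(store.get(PATH))
```

Because A has to re-read before retrying, its second write is built on a copy that already contains B's changes.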
In Atom, Categories are specified on entries as a pair of values – the Scheme and the Term. The Scheme is essentially a “namespace” of categories, and the Term is the specific value within that namespace.
Borrowing from GData’s extensions to Atom, AtomServer allows for a special feed syntax that lets clients filter a feed by the categories applied to entries. For example, to get a feed of only the Acme Widgets that had been tagged as having the color “red,” the request could be:
If multiple categories are specified, only entries that have ALL of the given categories (a Boolean AND) will be received. For example,
will return all of the big, red Acme Widgets. An arbitrary combination of categories using AND and OR, in prefix notation can also be specified with
This will return all of the Acme Widgets that are either red, or are both big and blue. These feeds should be treated by the client exactly the same as any other feed – they can be polled and paginated just like a feed without categories.
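The filtering semantics can be sketched independently of the URL syntax. The code below evaluates an already-parsed category expression against each entry's categories; the scheme names (`urn:color`, `urn:size`), the widget data, and the function names are all made up for illustration, and this is not how AtomServer itself implements the query.

```python
def matches(expression, categories):
    """Evaluate a category expression against a set of (scheme, term) pairs.

    `expression` is a list of tokens: "AND", "OR", or (scheme, term) tuples.
    A list with no operators is an implicit AND of every category in it;
    otherwise the tokens are read in prefix notation with binary operators.
    """
    if not any(token in ("AND", "OR") for token in expression):
        return all(token in categories for token in expression)

    def evaluate(pos):
        token = expression[pos]
        if token in ("AND", "OR"):
            left, pos = evaluate(pos + 1)
            right, pos = evaluate(pos)
            result = (left and right) if token == "AND" else (left or right)
            return result, pos
        return token in categories, pos + 1

    result, _ = evaluate(0)
    return result


def filter_feed(entries, expression):
    """Return the ids of the entries whose categories satisfy the expression."""
    return [entry_id for entry_id, cats in entries.items() if matches(expression, cats)]


RED, BLUE = ("urn:color", "red"), ("urn:color", "blue")
BIG, SMALL = ("urn:size", "big"), ("urn:size", "small")

widgets = {
    "w1": {RED, SMALL},
    "w2": {BLUE, BIG},
    "w3": {RED, BIG},
    "w4": {BLUE, SMALL},
}

print(filter_feed(widgets, [BIG, RED]))                    # big AND red
print(filter_feed(widgets, ["OR", RED, "AND", BIG, BLUE]))  # red OR (big AND blue)
```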
AtomServer is an off-the-shelf implementation of an Atom Store. It is implemented as a Java web application, and should deploy into any J2EE Servlet Container. Under the covers, AtomServer uses the Apache Project’s open-source implementation of the Atom Protocol, called Abdera, to process the RESTful verbs and XML vocabulary of Atom.
Abdera is an excellent library for adding an Atom front-end to an existing application. AtomServer, by contrast, is a full-fledged Atom Store. Out of the box, it provides all of the components needed to store and interact with the Atom metadata, as well as the contents of the Atom entries themselves.
AtomServer’s protocol borrows from GData’s design wherever appropriate. In some cases we’ve made slightly different decisions to improve URL readability, to simplify query structures, or to implement features not covered by GData’s specification.
AtomServer manages all of the Atom metadata associated with an entry in a relational database, and stores the actual entry content either in a relational database or on a file system, depending on your particular needs. AtomServer automatically handles all of the aspects of the Atom protocol (URI interpretation, parsing of the Atom elements and extensions, update timestamps, entry categorization), so you only need to publish changes to the server and poll at intervals for feed changes.
AtomServer is easy to use. It deploys either as a simple WAR file, or alternatively, as a standalone server, running within an embedded Jetty Server. Most applications should be able to use AtomServer by simply providing a very small amount of configuration – a few Spring Beans that configure that application's Atom workspaces and the content storage.
AtomServer has several important, advanced features that we have not yet covered. We’ve run out of space for those features this time, but in an upcoming article we will dive into:
An auto-tagger for Atom categories – an easily configured mechanism to "auto tag" entries when they are either created or updated. An XPathAutoTagger is built in, which allows you to XPath into your content and conditionally associate Atom categories with it.
Batch operations – full support for operating on a mixed "batch" of create, update, or delete requests.
Aggregate feeds – the ability to join together disparate entries, from different collections or workspaces, into an aggregate entry using Atom categories, and to request feeds of these aggregates. So instead of forcing you to deal with several feeds and tie the information back together yourself, you can listen to a single aggregate feed, which will reflect the changes to any of its parts.
AtomServer is real. It is in live production use at our company, handling more than a million requests a day, with several million entries in our store. Building on a RESTful specification such as Atom while leveraging the design of existing services like GData has ensured a solid foundation on which to build. We hope that you will pick up a copy and tell us what you think. You can get AtomServer from http://www.atomserver.org, and be up and running in minutes.
Atom and Abdera are taking over!
Re: Atom and Abdera are taking over!
I cannot really speak to the strengths or weaknesses of WSO2 Registry, but I know for certain that AtomServer is built for transactionally-correct, industrial strength applications. And because it was not built in a vacuum -- AtomServer was extracted from a real world project -- it is particularly client-centric.
There are three irritating technical flaws in this article.
- The article seems to suggest that URI paths like your-atom-server/workspace/collection/entryid are mandated by the Atom specs. This is not true.
- The article states "When an entry is created with a PUT instead, the responsibility for assigning the Entry Id is placed on the client doing the PUT." which may be true of AtomServer, but is not part of Atompub; if you do this, you're stepping outside.
- The notion of working revision numbers into your entry URIs is first of all specific to AtomServer, and more important, is entirely unnecessary in a good Atompub implementation; Atompub includes a rather fully-thought-through version of optimistic concurrency based on HTTP ETags.
I guess the real gripe is that the article fails to distinguish between what is generally true of Atompub and the specific quirks of the AtomServer implementation.
- The intent wasn't to imply that workspace/collection/entryid is a mandated part of the Atom spec. Nor is it a mandated part of AtomServer, for that matter -- we were simply going with a boiled down URI scheme so we could talk about some concrete examples easily.
- Regarding the comment about the PUT, I think it's clear in the context of the article that we are talking about AtomServer, and not AtomPub in general - the section in question was describing the two different ways that AtomServer supports "creating" entries, either via a POST to a collection's URI, or a PUT directly to an entry's URI.
- The revision numbers in the URI scheme are, as you say, specific to AtomServer. We are aware of the ETags-based optimistic concurrency in Atompub, but implementing it in AtomServer hasn't reached the top of the priority stack yet. If people start adopting AtomServer, and there's an outcry, then it will move up - otherwise, we'll eventually get to it anyway - we agree it's the right thing to do. In any case, the URI-based scheme for optimistic concurrency is simpler for many clients to implement, and we will probably always support both, even after we implement the ETags-based solution.
I appreciate the comments -- they'll help us to be clearer next time. I guess my response to your "real gripe" is that I think the title, the organization, and the content of the article DO distinguish between Atompub and the specific "quirks" of AtomServer -- the entire article is about AtomServer specifically, except where we specifically outline some basics of Atompub to make the article readable for people who are relatively uninitiated to the spec.
We've been developing AtomServer as a real-world solution to a problem for a while now, and after getting a lot of interest from other Atom enthusiasts we open-sourced the solution that we've built. We use Abdera because they've already written a really good implementation of the full Atompub protocol, and there was no reason to reinvent that wheel, but for other parts of the problem we have traded full flexibility for ease of configuration.
If someone is looking for a library to use in building Atompub into an existing service, they should definitely use Abdera directly. AtomServer, by contrast, is a full Java web application that can be up and running in a few minutes by configuring a database and a few XML configuration files. It addresses all of the metadata and content management pieces that Abdera doesn't, and it's undergone a lot of battle-hardening to make it rock-solid and performant. Our goal moving forward is to make AtomServer easily interoperable with any spec-compliant Atom Client, while making the deployment of a server as easy as possible, with as little coding as possible.
Again - thanks for the comments. Hopefully it's helped clarify for other readers what the true scope of the article is!
leading the way..
I'm looking forward to some more how-to's on extending with your own collections. I'm sure it's simple, however, to be honest it's not really that easy given the existing docs. My "outcry" would be to ditch the heavy XML configurations. Familiarize yourselves with Rails, Grails, or Guice, and consider making it work more like that (simply creating a model file called "Users" creates a user table, CRUD, etc.). Use annotations in Java beans, but please no more large, tedious XML config files.
Re: leading the way..
Your comments are well taken. We have done some work already on creating a custom Spring dialect to make the XML configuration at least much smaller and more concise - we should be rolling that in before too long...
Your analogies to rails/grails/etc. are something we hadn't really considered, but we should have. I could definitely see us using a simpler format than XML altogether (maybe YAML?) for declaring workspaces/collections - which is obviously the most common configuration step. In most cases, you should be able to use AtomServer without actually writing any java code, so Java Annotations wouldn't solve the problem generally, but it would be interesting to see how they could make extending AtomServer easier when custom code IS required.