# Resource-Oriented Architecture: Information, Not Containers

## Introduction

The Web is known primarily as a Web of Documents because that has been our main experience with it. Documents represent an organizing principle that we have collectively used for thousands of years to manage our information. This series is largely about building Webs of Data, but in doing so, we need not ignore the idea of documents as a data source. In the early days of HTML, we made a go at scraping content from pages, but it proved such a painful and fragile process that most of us gave it up as a Sisyphean futility. It was particularly fruitless because there were no standards for how information was represented in pages. New technologies are emerging to make it easier to encode extractable content on the Web. In a future article, we will look at technologies that allow us to extract structure where there is none. For now, we will focus on how producers can increase the machine-processability of the documents they produce.

## Microformats

While the rest of us threw in the towel on screen scraping, a determined community emerged that was not satisfied to ignore the prospect of encoding semantic content in documents. Microformats were created as small, focused, domain-specific naming conventions for weaving certain kinds of data into HTML. Calendars, reviews, people, organizations and outlines are all examples of the pragmatic and useful information that can be expressed behind the scenes. Advocates were clear that these efforts were for "humans first and machines second." They did not want to require any special tools, new languages or any of the perceived trappings (read: impedances) of the larger Semantic Web. This was a lower-case semantic web effort.

Here we see an example of the hReview microformat expressed around a movie review. The microformat-specific parts are the class attribute values (hreview, reviewer, dtreviewed, item, rating and description). Note that other than following some naming conventions, this looks exactly like the kind of markup designers and Web developers were already used to using.

<div class="hreview">
<span class="reviewer vcard">
<span class="fn">O. Dinn</span>,
<abbr class="dtreviewed" title="20100412">April 12th, 2010</abbr>
</span>
<div class="item">
<a lang="en" class="url fn" href="http://www.imdb.com/title/tt0800369/">Thor</a>
</div>
<div>Rating: <span class="rating">5</span> out of 5</div>
<div class="description">
<p>Better than "Iron Man"!</p>
</div>
</div>


Smaller goals are often more achievable, but they also usually yield smaller gains. Microformats have been modestly successful within specific areas. However, their constraints have left the larger community wanting more. There is no unifying data model behind microformats. Separate software libraries emerged to parse and extract each of the individual formats, but there was nothing to intrinsically connect the different formats together. There is no real capacity to make global references to entities and relationships; everything is focused on document-scope content. There are no namespaces to protect against accidental term clashes.
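To make the "one library per format" point concrete, here is a minimal sketch of a format-specific extractor for the review above, using only Python's standard `html.parser` module. The class name `HReviewParser` and the output keys are illustrative assumptions, not part of any microformats library, and the code knows nothing about any format other than hReview.

```python
from html.parser import HTMLParser

HREVIEW = """<div class="hreview">
<span class="reviewer vcard">
<span class="fn">O. Dinn</span>,
<abbr class="dtreviewed" title="20100412">April 12th, 2010</abbr>
</span>
<div class="item">
<a lang="en" class="url fn" href="http://www.imdb.com/title/tt0800369/">Thor</a>
</div>
<div>Rating: <span class="rating">5</span> out of 5</div>
</div>"""


class HReviewParser(HTMLParser):
    """Collects a few hReview fields by class name (hypothetical helper).

    Assumes every element has a closing tag, which holds for the sample;
    unclosed void elements such as <br> would confuse the stack.
    """

    def __init__(self):
        super().__init__()
        self.review = {}
        self._stack = []  # class names of each currently open element

    def _inside(self, name):
        return any(name in classes for classes in self._stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self._stack.append(classes)
        if "dtreviewed" in classes:
            self.review["dtreviewed"] = attrs.get("title")
        if "url" in classes and self._inside("item"):
            self.review["item_url"] = attrs.get("href")

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        current = self._stack[-1]
        if "rating" in current:
            self.review["rating"] = int(text)
        elif "fn" in current and self._inside("reviewer"):
            self.review["reviewer"] = text
        elif "fn" in current and self._inside("item"):
            self.review["item"] = text


parser = HReviewParser()
parser.feed(HREVIEW)
print(parser.review)
```

Every name the parser recognizes is hard-coded and scoped to this one document, which is exactly the limitation described above: nothing here connects the review to data expressed in another format or on another page.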

Most microformat developers simply are not interested in solving these larger problems and prefer immediate applicability to grander visions. The group remains active and a well-received new book was recently published on the subject. However, the success they have had has fueled the appetite for a broader solution that could benefit from data that had been encoded this way. The good news is, it is easier than ever to support the larger vision and we can bring in all of the richness the microformats community has contributed to the Web with us.

## GRDDL

One of the first problems to solve is how to consistently extract the semantic content from pages that have been encoded with it. One way to manage this is with a technology called Gleaning Resource Descriptions from Dialects of Languages (GRDDL). While this approach has wider applicability to different document formats, as far as (X)HTML is concerned, it works as follows. To indicate to a client that a page has content that can be extracted, the page identifies itself as having a GRDDL metadata profile:

<head profile="http://www.w3.org/2003/g/data-view">


Metadata profiles were introduced in HTML 4 to allow an author to provide clients with one or more ways of categorizing or viewing a document. What this means here is that when a client finds this profile, it can expect to find one or more <link> elements in the <head> element with a rel type of "transformation". These links indicate the location of XSLT style sheets that can be applied to the document to extract specific metadata:

<link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />


What we see here are links to extract Dublin Core publication metadata, Creative Commons licensing information, and geolocation information. Clients can decide which transformations they wish to apply either by looking for known stylesheets or simply using them all. To see this in action, we will look at the canonical GRDDL page, Joe Lambda's Homepage. We have a human-readable version seen here:

But we also notice that there is metadata behind the scenes:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<link rel="transformation"
href="http://www.w3.org/2000/06/webdata/xslt?xslfile=
http%3A%2F%2Fwww.w3.org%2F2003%2F11%2Frdf-in-xhtml-processor;xml
file=http%3A%2F%2Fwww.w3.org%2F2003%2F12%2Frdf-in-xhtml-xslts%2Fcomplete-example.html"
/>
<meta name="DC.Title" xml:lang="en" lang="en" content="Joe Lambda Home page as example of RDF in XHTML" />
<meta name="DC.Creator" content="Joe Lambda" />
<meta name="DC.Description" xml:lang="en" lang="en" content="home page for the infamous Joe Lambda, with the purpose of demonstrating how to mix several RDF vocabularies in XHTML" />
<meta name="ICBM" content="39.2975, -94.71388888888889" />
</head>

<body>

<div class="foaf-person">
<h1>Joe Lambda's homepage</h1>
.
.
.


To see what gets extracted from this page, click here. That is a link to a service that takes a reference to this document and applies the XSLT stylesheets it finds. Note: Some browsers (I'm talking to you, Safari) have lame handling of this content. If it doesn't look like proper RDF/XML, try Firefox or another browser.

This approach requires the document producer to follow the naming conventions expected by the XSLT stylesheets. Once that is done, however, any client can discover the links and apply the transformations when the page is fetched. GRDDL works with both RDF-encoded data and microformat data, which can be converted to RDF on the fly. We can use RDF's open, extensible data model to accumulate the facts from different encodings into the same output model. HTML 5 has dropped the profile attribute. A new mechanism will be devised to signal the presence of GRDDLable content. The rest of the mechanism will continue to work.
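The discovery step a client performs can be sketched in a few lines of Python. This is only an illustration of the mechanism described above, using the standard `html.parser` module; the names `GrddlLinkFinder` and `find_transformations` are made up for this example, and applying the discovered XSLT stylesheets is left out entirely.

```python
from html.parser import HTMLParser

GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head profile="http://www.w3.org/2003/g/data-view">'
    '<link rel="transformation" '
    'href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />'
    "</head><body></body></html>"
)


class GrddlLinkFinder(HTMLParser):
    """Records whether the GRDDL profile is present and which links it enables."""

    def __init__(self):
        super().__init__()
        self.has_profile = False
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # The profile attribute may list several space-separated URIs.
            profiles = (attrs.get("profile") or "").split()
            self.has_profile = GRDDL_PROFILE in profiles
        elif tag == "link":
            rels = (attrs.get("rel") or "").lower().split()
            if "transformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def find_transformations(markup):
    """Return stylesheet URLs, or an empty list if the page is not GRDDL-enabled."""
    finder = GrddlLinkFinder()
    finder.feed(markup)
    return finder.transformations if finder.has_profile else []


print(find_transformations(PAGE))
```

A client could then fetch each returned stylesheet and apply it with whatever XSLT processor it has available, or filter the list down to the stylesheets it already knows.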

The problem remains, however, that every stylesheet has its own convention for expressing this information. For this and other reasons, the Semantic Web community sought a common approach for encoding rich metadata in our documents and RDFa was developed.

HTML 5 includes a more generic semantic markup approach than microformats called microdata. It does not have the same scope RDFa does and is still in development so we will not focus on it now.

## RDFa

RDFa is a standard way of encoding the Resource Description Framework (RDF) model into structured documents. While we will focus only on XHTML in this article, there are proposed specifications and guidance for HTML 5, SVG and even the Open Document Format (ODF).

RDFa was originally proposed by Mark Birbeck as a W3C note, but it has since become a formal W3C Recommendation and is gaining steam rapidly. It is not without its critics, but newer versions of the specification are attempting to address these concerns while maintaining backward compatibility.

RDFa specifies the following attribute types to encode RDF content:

| Attribute | Description |
| --- | --- |
| @rel | Relationships between resources |
| @rev | Reverse relationships |
| @href | URI (resource object) |
| @src | URI (embedded object) |
| @property | Subject-Value relationships |
| @resource | Non-clickable resource relationships |
| @datatype | Datatype relationships |
| @typeof | RDF type/instance relationships |

Let's see how we can start to encode some information into a sample XHTML file, in this case a page documenting a conference. It will have information about when and where the conference will occur as well as who will be there and why we should care.

The human-readable page (with its admittedly spartan sense of style) allows us to get the information we need visually, but if we would like to extract the information through software, we will need to encode it somehow. To start off with, we need to define the document type and any XML namespace prefixes that we expect to use throughout the document. Other namespaces that apply to only certain portions of the document can be localized to a specific section as we will see.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:cal="http://www.w3.org/2002/12/cal/icaltzd#">
<title>QCon Tokyo 2010</title>
.
.
.

We have introduced a series of standard prefix mappings (XHTML, RDF and RDFS) as well as mappings for vocabularies that will be useful for describing our event in machine-processable ways. In addition to the vocabularies we have already encountered (e.g. Dublin Core), we have one for Friend-of-a-Friend (FOAF) and one for calendar information. We use the prefixes because full URIs and URLs can be pretty unwieldy. We do not want to return long, redundant data to clients or force developers and designers to type them over and over again. The original RDFa specification chose a compact notation for URIs called CURIEs.
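As a rough illustration of what a CURIE buys us, the following Python sketch expands a prefixed name into its full URI using the prefix mappings declared on the page. The function name `expand_curie` is an assumption for this example, and the sketch ignores the finer points of the CURIE syntax (safe CURIEs in square brackets, default prefixes and so on).

```python
# Prefix mappings copied from the xmlns declarations in the page header.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/terms/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "cal": "http://www.w3.org/2002/12/cal/icaltzd#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Turn 'prefix:reference' into a full URI by simple concatenation."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"no prefix mapping for {curie!r}")
    return prefixes[prefix] + reference


print(expand_curie("cal:Vevent"))
print(expand_curie("foaf:name"))
```

The expansion is plain string concatenation, which is why the namespace URIs end in `#` or `/`.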

The next thing to do is to indicate a subject for discussion. There can be many subjects described in a document so you should be careful where you introduce them. They are hierarchically related and you may end up stepping on an existing topic if you do not exercise caution. To announce a new subject, we will use the about attribute:

<body>
<div about="http://qcontokyo.com#qcontokyo2010" typeof="cal:Vevent">
<p>
<a property="cal:summary" href="http://qcontokyo.com">QCon Tokyo 2010</a> to be held at
<span property="cal:location">Tokyo Midtown Conference Center</span>
from <span property="cal:dstart" datatype="xsd:date">2010-04-19</span> to
<span property="cal:dtend" datatype="xsd:date">2010-04-20</span>.
</p>
</div>
.
.
.


We have a mixture of regular XHTML and RDF metadata. The first thing we did was to identify http://qcontokyo.com#qcontokyo2010 as the subject with the about attribute. We indicate the type of the subject with the typeof="cal:Vevent" attribute. This is equivalent to the RDF statement:

<http://qcontokyo.com#qcontokyo2010> rdf:type cal:Vevent .

In other words, the URL (for this document) refers to something that is a Vevent. We have expressed one RDF fact. What is this business at the end of the URL, though? We have used a fragment identifier to indicate the "concept of the 2010 instance of QCon Tokyo". The same URL might be used next year and the facts expressed about QCon Tokyo 2011 will not be the same as for this one. Additionally, the web page is not an event. The web page describes an event. The document's URL would be an inappropriate subject choice. There are other ways of handling this distinction that we will address in a future article. For now, we are going to stick with fragment identifiers.
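The distinction between the document and the thing it describes can be seen by splitting the identifier. The short Python sketch below uses the standard library's `urllib.parse.urldefrag`; the 2011 identifier is a hypothetical one, invented only to show that two distinct subjects can share one document URL.

```python
from urllib.parse import urldefrag

EVENT_2010 = "http://qcontokyo.com#qcontokyo2010"
EVENT_2011 = "http://qcontokyo.com#qcontokyo2011"  # hypothetical future identifier


def document_for(uri):
    """Return the URL a client would actually fetch for this identifier."""
    return urldefrag(uri).url


# Both events are described by the same page, yet remain distinct subjects.
print(document_for(EVENT_2010))
print(document_for(EVENT_2010) == document_for(EVENT_2011))
print(EVENT_2010 == EVENT_2011)
```

Because clients strip the fragment before making an HTTP request, the server only ever sees the document URL, while RDF statements keep the full identifier as their subject.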

The next fact uses the property attribute to indicate a subject-value relationship for the subject we have already defined. If we had to duplicate data for both humans and machines, that would be a wasteful and silly approach, so we try to express values only once for both consumers. RDFa parsers will retrieve the value from the text node of the element to which the property attribute is attached. We now have a second fact:

<http://qcontokyo.com#qcontokyo2010> cal:summary "QCon Tokyo 2010" .


The fact expressed using the cal:location predicate works basically the same way. Now we want to do something interesting by issuing facts that involve datatypes. In this case, we want to indicate the start and end dates for the event. Not only do we use the cal:dstart and cal:dtend relationships, we also indicate that the text values for the objects should be interpreted as being of type "xsd:date". When parsed, an RDFa parser will produce results such as:

<http://qcontokyo.com#qcontokyo2010> <http://www.w3.org/2002/12/cal/icaltzd#dstart>
"2010-04-19"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://qcontokyo.com#qcontokyo2010> <http://www.w3.org/2002/12/cal/icaltzd#dtend>
"2010-04-20"^^<http://www.w3.org/2001/XMLSchema#date> .

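To show how facts like these fall out of the markup, here is a deliberately tiny extractor written with Python's standard `html.parser`. It is a sketch under loud assumptions: the class `TinyRdfaParser` is invented for this article, it understands only the about, typeof, property and datatype attributes, and it ignores href, rel, resource and the rest of the RDFa processing rules. Its output therefore follows the explanation in the text rather than what a conforming RDFa parser would necessarily produce. The sample markup assumes the about and typeof attributes sit on an enclosing div.

```python
from html.parser import HTMLParser

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cal": "http://www.w3.org/2002/12/cal/icaltzd#",
}
RDF_TYPE = PREFIXES["rdf"] + "type"

EVENT = """<div about="http://qcontokyo.com#qcontokyo2010" typeof="cal:Vevent">
<p>
<a property="cal:summary" href="http://qcontokyo.com">QCon Tokyo 2010</a> to be held at
<span property="cal:location">Tokyo Midtown Conference Center</span>
from <span property="cal:dstart" datatype="xsd:date">2010-04-19</span> to
<span property="cal:dtend" datatype="xsd:date">2010-04-20</span>.
</p>
</div>"""


def expand(curie):
    prefix, _, reference = curie.partition(":")
    return PREFIXES[prefix] + reference


class TinyRdfaParser(HTMLParser):
    """Simplified illustration only; not a conforming RDFa processor."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subjects = []   # subject in scope for each open element
        self._pending = None  # (subject, predicate, datatype, depth)
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self._subjects[-1] if self._subjects else None
        subject = attrs.get("about", inherited)  # about starts a new scope
        self._subjects.append(subject)
        if "typeof" in attrs:
            self.triples.append(
                f"<{subject}> <{RDF_TYPE}> <{expand(attrs['typeof'])}> .")
        if "property" in attrs:
            self._pending = (subject, expand(attrs["property"]),
                             attrs.get("datatype"), len(self._subjects))
            self._text = []

    def handle_data(self, data):
        if self._pending:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending and self._pending[3] == len(self._subjects):
            subject, predicate, datatype, _ = self._pending
            value = "".join(self._text).strip()
            value = value.replace("\\", "\\\\").replace('"', '\\"')
            literal = f'"{value}"'
            if datatype:
                literal += f"^^<{expand(datatype)}>"
            self.triples.append(f"<{subject}> <{predicate}> {literal} .")
            self._pending = None
        if self._subjects:
            self._subjects.pop()


parser = TinyRdfaParser()
parser.feed(EVENT)
print("\n".join(parser.triples))
```

The stack of subjects is the important part: each element inherits the subject of its parent unless it declares its own, which is the hierarchical scoping the text warns about.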

We certainly would expect to express more information about the event itself, but we would do so mostly with variations on what we have already learned. If you wanted to add in other vocabularies, it would be easy to do so as long as you set up the appropriate prefix mappings. Let's introduce some new subjects and structures. In order to get people to come to a conference we need to get them excited about the speakers. If they have written compelling books, we should highlight those facts.

<div>
<p>You will hear from:
<ul>
<li resource="urn:ISBN:0596517742">
<span rel="dc:creator">
<a href="http://qcontokyo.com/speaker_DouglasCrockford.html#doug"
typeof="foaf:Person" property="foaf:name">Douglas Crockford</a>
</span>, author of <span property="dc:title">"JavaScript: The Good Parts"</span>
</li>
<li resource="urn:ISBN:0201633612">
<span rel="dc:creator">
<a href="http://qcontokyo.com/speaker_ErichGamma.html#erich"
typeof="foaf:Person" property="foaf:name">Erich Gamma</a>
</span>, author of <span property="dc:title">"Design Patterns : Elements of Reusable Object-Oriented Software"</span>
</li>
<li resource="urn:ISBN:0978739213">
<span rel="dc:creator">
<a href="http://qcontokyo.com/speaker_MichaelNygard.html#mike"
typeof="foaf:Person" property="foaf:name">Mike Nygard</a>
</span>, author of <span property="dc:title">"Release It!"</span>
</li>
</ul>
</p>
</div>


Using the resource attribute, we introduce three new subjects. Because we chose to identify the books by URN, we don't expect anyone to click through them, but we have a valid URI to uniquely identify each book. Another approach would have been to link to an online book reseller or the publisher's page. Note that each <li> element has one thing as its scope. We have chosen the book, not the author, as the subject, and the subject applies to the hierarchical scope below the point where we define it. Any properties expressed within that scope apply not to the event above but to the newly introduced subject. We indicate authorship with the dc:creator relationship.

We have also chosen to use the author's individual conference pages to identify them. Every speaker has one and they are unique per speaker. For the same reasons as above, however, a web page is not a speaker. So, we introduce fragment identifiers for the speakers as well.

The final point of discussion here is how we indicate Creative Commons license information for the document itself. We are no longer talking about the event or its prestigious authors, but the actual document in question. Here, we use the following block to indicate who maintains the copyright, what kind of license is associated with the page and how someone needs to indicate attribution if they use sections of the page:

    <div id="foot">
src="http://i.creativecommons.org/l/by-nc/3.0/us/88x31.png" />
</a>
<br/>
This document, <span xmlns:dc="http://purl.org/dc/elements/1.1/"
</div>
</body>
</html>


Note that we can introduce new namespace prefixes in an arbitrary block such as the footer here. If there is no Creative Commons content elsewhere in the document, it is not necessary to add the prefix to the entire document. One of the goals of RDFa is to allow portions of a document to be self-contained with respect to metadata. Future tool support will allow metadata associated with a section to be extracted as part of copy and paste operations.

## Extraction

Now that we have marked up all of this content, how do we go about extracting it? We will need to employ an RDFa parser. There are several available in different languages, but we will use Damian Steer's Java RDFa parser because it is so easy to use. First, save the conference page locally. Then download the parser and you should be able to execute it like this:

java -jar java-rdfa-0.4-SNAPSHOT.jar qcon.html

using whatever version you actually downloaded and pointing it at whatever you named the file. After doing so, you should see all of our RDF facts printed out (with each subject explicitly called out):

<http://qcontokyo.com/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/2002/12/cal/icaltzd#Vevent> .
<http://qcontokyo.com> <http://www.w3.org/2002/12/cal/icaltzd#summary>
"QCon Tokyo 2010" .
<http://qcontokyo.com/> <http://www.w3.org/2002/12/cal/icaltzd#location>
"Tokyo Midtown Conference Center" .
<http://qcontokyo.com/> <http://www.w3.org/2002/12/cal/icaltzd#dstart>
"2010-04-19"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://qcontokyo.com/> <http://www.w3.org/2002/12/cal/icaltzd#dtend>
"2010-04-20"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://qcontokyo.com/speaker_DouglasCrockford.html#doug> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<urn:ISBN:0596517742> <http://purl.org/dc/terms/creator>
<http://qcontokyo.com/speaker_DouglasCrockford.html#doug> .
<http://qcontokyo.com/speaker_DouglasCrockford.html#doug> <http://xmlns.com/foaf/0.1/name>
"Douglas Crockford" .
<urn:ISBN:0596517742> <http://purl.org/dc/terms/title>
"\"JavaScript: The Good Parts\"" .
<http://qcontokyo.com/speaker_ErichGamma.html#erich> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<urn:ISBN:0201633612> <http://purl.org/dc/terms/creator>
<http://qcontokyo.com/speaker_ErichGamma.html#erich> .
<http://qcontokyo.com/speaker_ErichGamma.html#erich> <http://xmlns.com/foaf/0.1/name>
"Erich Gamma" .
<urn:ISBN:0201633612> <http://purl.org/dc/terms/title>
"\"Design Patterns : Elements of Reusable Object-Oriented Software\"" .
<http://qcontokyo.com/speaker_MichaelNygard.html#mike> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<urn:ISBN:0978739213> <http://purl.org/dc/terms/creator>
<http://qcontokyo.com/speaker_MichaelNygard.html#mike> .
<http://qcontokyo.com/speaker_MichaelNygard.html#mike> <http://xmlns.com/foaf/0.1/name>
"Mike Nygard" .
<urn:ISBN:0978739213> <http://purl.org/dc/terms/title>
"\"Release It!\"" .
<file:///Users/brian/Documents/Writing/InfoQ/ROA-3/qcon.html> <http://purl.org/dc/elements/1.1/title>
"QCon Tokyo 2010" .
<http://bosatsu.net> .
"Brian Sletten" .


We have just trivially pulled a lot of useful information from the page. It is the kind of thing a browser, a proxy or other HTTP-savvy agent could easily do if it wanted to. However, we do not need to wait for this support to be added before we can start to imagine using this information. We can wrap the RDFa extraction code as a service. The W3C does this with the RDFa Distiller, some Python code that you can download and run locally if you are going to use it a lot. For now, we can simply point the service at a Web-accessible version of our document:

http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fbosatsu.net%2Fqcon.html&format=pretty-xml&warnings=false&parser=lax&space-preserve=true

This is a URL to a service that takes a URL as a data source and returns the extracted information as an RDF document. This generated RDF document is web addressable at the URL above.
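Since the service URL is just another URL, it can be composed programmatically. The following Python sketch builds it with the standard `urllib.parse.urlencode`; the function name `distiller_url` is an assumption, while the endpoint and parameter names are copied from the URL shown above.

```python
from urllib.parse import urlencode

DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"


def distiller_url(document_url, output_format="pretty-xml"):
    """Compose a URL that asks the distiller to extract RDF from a document."""
    parameters = {
        "uri": document_url,  # percent-encoded by urlencode
        "format": output_format,
        "warnings": "false",
        "parser": "lax",
        "space-preserve": "true",
    }
    return DISTILLER + "?" + urlencode(parameters)


print(distiller_url("http://bosatsu.net/qcon.html"))
```

Any tool that accepts a URL as a data source can be handed the result, which is what makes the extracted RDF "web addressable".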

To understand why this is important, we must recall what we have learned so far from this series. In the first article we learned about how REST's URL naming schemes and content negotiation allow us to unify access to our documents, data and services. In the second article, we learned about RDF terms/concepts and how to query RDF data sources. Tying it all together (including our new URL to a service extracting content from a document), we can imagine issuing a SPARQL query against a document as a database. The RDF model allows us to accumulate facts from a variety of data sources, and now we can include our documents in our workflows! Imagine wanting to find the name of every author mentioned on the page and references to any books he or she may have written. When applied to our compound URL, the following SPARQL will do just that:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/terms/>

select ?name ?title where {
?book dc:creator ?author ;
dc:title ?title .
?author foaf:name ?name .
}

Viewed in Leigh Dodds' Twinkle SPARQL tool, we see the answer to the question:

It may take a while to fully comprehend what this new capability entails, but you are encouraged to think on it for a bit.
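For readers who want to see what the query is doing, the join it expresses can be imitated in plain Python over the triples extracted earlier. This is only a sketch of the pattern matching, not a SPARQL engine: the function name `authors_and_titles` is invented, and it assumes each author has one name and each book one title.

```python
DC = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
DOUG = "http://qcontokyo.com/speaker_DouglasCrockford.html#doug"
ERICH = "http://qcontokyo.com/speaker_ErichGamma.html#erich"
MIKE = "http://qcontokyo.com/speaker_MichaelNygard.html#mike"

# (subject, predicate, object) facts taken from the extraction output.
TRIPLES = [
    ("urn:ISBN:0596517742", DC + "creator", DOUG),
    (DOUG, FOAF + "name", "Douglas Crockford"),
    ("urn:ISBN:0596517742", DC + "title", '"JavaScript: The Good Parts"'),
    ("urn:ISBN:0201633612", DC + "creator", ERICH),
    (ERICH, FOAF + "name", "Erich Gamma"),
    ("urn:ISBN:0201633612", DC + "title",
     '"Design Patterns : Elements of Reusable Object-Oriented Software"'),
    ("urn:ISBN:0978739213", DC + "creator", MIKE),
    (MIKE, FOAF + "name", "Mike Nygard"),
    ("urn:ISBN:0978739213", DC + "title", '"Release It!"'),
]


def authors_and_titles(triples):
    """Mimic: ?book dc:creator ?author ; dc:title ?title . ?author foaf:name ?name ."""
    names = {s: o for s, p, o in triples if p == FOAF + "name"}
    titles = {s: o for s, p, o in triples if p == DC + "title"}
    return [
        (names[author], titles[book])
        for book, predicate, author in triples
        if predicate == DC + "creator" and author in names and book in titles
    ]


for name, title in authors_and_titles(TRIPLES):
    print(name, "wrote", title)
```

The shared variables in the SPARQL pattern (?book and ?author) become the dictionary lookups here; a real engine performs the same kind of join, but across any source that can be presented as RDF.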

For years, critics have mocked the idea of people being willing to mark their content up this way, but as we will see in the next section, they were sorely mistaken. Beyond that, however, if you are producing pages programmatically, it should hopefully be obvious how straightforward it would be to emit XHTML with these properties woven in to encourage machine-processability.

Imagine if the scientific or financial data in a report were directly extractable. "Given the top-producing companies in our market segment this quarter expressed in this report, query our existing databases for each to see how they have done in the past." The more of a commitment your organization makes to REST, RDF and RDFa, the more powerful and easy these kinds of queries will be.

## On the Web

If this were just a fancy new tool in the Semantic Web toolbox, it would be powerful, but perhaps not as urgently relevant. If we look at how RDFa is being adopted and early analysis of the benefits from doing so, we will get a sense of why this is so important.

• First of all, mainstream open source projects like Drupal and MediaWiki have both started supporting RDFa directly.
• The U.S. and U.K. governments are both publishing more information in this way. (The U.K. is winning!)
• O'Reilly has started marking up its entire catalog with RDFa and the GoodRelations vocabulary.
• Large retailers such as Tesco and Best Buy are also using GoodRelations and RDFa on pages for their products and their stores. Best Buy in particular has seen dramatic increases in traffic after engaging these technologies. It is premature to point solely to the RDFa metadata, but it appears to be at least a contributing factor and others are starting to see jumps as well.
• Google has started to index a handful of vocabularies around people, reviews and video content using both microformats and RDFa.
• Yahoo!'s SearchMonkey effort is pushing the envelope on how to expand the support for other metadata types.
• The Creative Commons is indicating license information for songs and other content on the Web.

To understand how Creative Commons is innovating in this space, let us look back to our conference page and how we indicated the license in the footer. If you click directly on the license link in this current page, you will get a generic view of the license definition.

If, however, you click through the link on a Web-accessible version of the conference page, you will see something unique:

What happened?!?

The Creative Commons server receives the HTTP Referer header when you click through the link. It fetches the content from that referrer, parses it, and extracts whatever it finds related to licensing. When it finds related information, it generates a specific response based on it, in this case using the attribution name and URL we provided.

Effectively, we provided metadata on the link and someone else's server responded accordingly. A link is no longer just a link. The response you get can depend on which "door" you came through. Query parameters have historically given us this capability in some respects, but in this pull-based view, we can respond to whatever is passed in, not just what we tell them they can pass in. We are accepting arbitrary, globally-resolvable metadata.
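The server-side behavior described above might be sketched as follows. This is not the Creative Commons implementation; it is a guess at the shape of the idea. The names `AttributionFinder` and `license_page_context`, the injected `fetch` callable, and the sample footer markup are all assumptions made for illustration, although cc:attributionName and cc:attributionURL are terms from the Creative Commons vocabulary.

```python
from html.parser import HTMLParser

# Hypothetical footer markup standing in for the referring page.
SAMPLE_PAGE = (
    '<div id="foot">This document by '
    '<a href="http://bosatsu.net" property="cc:attributionName" '
    'rel="cc:attributionURL">Brian Sletten</a> is licensed.</div>'
)


class AttributionFinder(HTMLParser):
    """Looks for attribution metadata by attribute value (no real RDFa parsing)."""

    def __init__(self):
        super().__init__()
        self.name = None
        self.url = None
        self._capture = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("property") == "cc:attributionName":
            self._capture = True
        if "cc:attributionURL" in (attrs.get("rel") or "").split():
            self.url = attrs.get("href")

    def handle_data(self, data):
        if self._capture and data.strip():
            self.name = data.strip()
            self._capture = False


def license_page_context(headers, fetch):
    """Return page-specific attribution details, or None for the generic view.

    `fetch` is any callable that takes a URL and returns its markup.
    """
    referrer = headers.get("Referer")
    if not referrer:
        return None
    finder = AttributionFinder()
    finder.feed(fetch(referrer))
    if finder.name is None and finder.url is None:
        return None
    return {"source": referrer,
            "attribution_name": finder.name,
            "attribution_url": finder.url}


context = license_page_context(
    {"Referer": "http://bosatsu.net/qcon.html"}, lambda url: SAMPLE_PAGE)
print(context)
```

Passing the fetch function in keeps the sketch testable without network access; a real service would also need caching, timeouts and a proper RDFa parser.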

## The Future

RDFa 1.0 has been greeted mostly with enthusiasm, but there has been some criticism. Some people are concerned about the hierarchical scoping of namespaces and the use of CURIEs. Elements that resolve to a term in one namespace might resolve differently (or fail to resolve at all) if they are moved and another namespace takes precedence. Better tooling will help with this, but RDFa 1.1 hopes to address some of the concerns as well by allowing fully-qualified term references in attributes.

The RDFa Working Group is breaking RDFa 1.1 into several pieces including Core, API and usage documents. They have an aggressive schedule over the next year, but are moving quickly. On April 15, 2010, the WG approved the RDFa Core 1.1 and XHTML+RDFa 1.1 modifications for publication. This includes a variety of ways to express prefix mappings, deprecation of the xmlns prefix, the idea of RDFa Profiles and more. There are new attributes to support these capabilities:

| Attribute | Description |
| --- | --- |
| @vocab | For establishing a default vocabulary |
| @prefix | For establishing prefix mappings |
| @profile | For specifying RDFa profiles |
There is a lot to be excited about from these technologies as both a producer and a consumer. Links indicate relevance and the more linkage you give your content, the more relevant it can be. We are starting to see benefits as far as search placement, but we can now start to aggregate information across sites as well. Rating information no longer has to be maintained on one site (which can lead to charges of undue influence) but can be culled from multiple locations.

The general advancement here, though, is that documents become just another data source. The Web allowed us to forget about computers and focus on documents. The Semantic Web is allowing us to focus on information and forget about information containers. The same SPARQL queries can include content contributed by relational databases, REST services, native RDF triplestores and now documents. Information that we encode to share with other humans no longer needs to be inaccessible to the rest of our software. We will need better tools and support to fully take advantage of these technologies, but it has become much more of a question of when it will happen, not if it will.

## Epilogue

Shortly after I submitted this to InfoQ for publication, the F8 Conference occurred where Facebook dropped the news that they are embracing Semantic Web-like metadata encoded in HTML "based on RDFa". Clearly, they are intending to own all of these relationships, but overall it will be a good proof of the concepts. Ben Adida had a thoughtful response of why this idea of distributed innovation is so cool.

Additionally, the W3C has released the RDFa 1.1 for XHTML Modularization and is seeking feedback. Ivan Herman has posted a useful discussion of what is new and different.

This all underscores the currency of the topics of this article and the sense of inevitability of my original final sentence: "[it] has become much more of a question of when it will happen, not if it will."
