
Choosing Patterns over Abstractions: Streaming XML

by Jonathan Allen on Apr 17, 2007

Due to its structure, XML does not naturally stream well. Microsoft’s XML Team researched several different APIs in an attempt to abstract away the complexity. In the end, they chose to give up on abstract APIs and instead demonstrate some coding patterns that accomplish the same goal. To support this, they included a sample that uses LINQ to XML to extract information from a 10 MB Wikipedia abstract file.

Compared to other formats such as comma-separated values, XML is very expensive to parse. Because a document is generally loaded all at once, a large input file can consume an inordinate amount of time and memory. The solution is to stream the XML file, which keeps memory consumption down by never holding the complete file in memory at one time.

The goals of the XML Team were:

  1. Developers should be able to use essentially the same concepts, classes, and methods to process large or even infinite XML documents as are used to process small documents in memory.
  2. It should be possible to stream output as well as input.
  3. It should be more declarative than imperative.
  4. It only needs to support simple XML formats that rely on regularly repeated elements.
  5. It should support loading some context prior to streaming over elements.
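
Goal 2, streaming output, can be illustrated with a rough analogue in Python rather than .NET. This is only a sketch under stated assumptions: the function name `write_docs` and the `<feed>`/`<doc>`/`<title>` layout are hypothetical, and the standard-library `XMLGenerator` stands in for whatever the XML Team used. Each element is written as soon as its data arrives, so the full output tree is never built in memory.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_docs(out, titles):
    """Write one <doc> element per title, as each title arrives.

    Hypothetical helper: `titles` can be any iterable, including a lazy
    generator, so the whole output document never has to exist in memory.
    """
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("feed", {})
    for title in titles:
        gen.startElement("doc", {})
        gen.startElement("title", {})
        gen.characters(title)  # escapes &, < and > in the text
        gen.endElement("title")
        gen.endElement("doc")
    gen.endElement("feed")
    gen.endDocument()


# Usage: a generator expression feeds the writer one title at a time.
out = io.StringIO()
write_docs(out, (t for t in ["A & B", "C"]))
result = out.getvalue()
```

In practice `out` would be a file or network stream rather than an in-memory buffer; the buffer is used here only so the result can be inspected.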

Unfortunately, the XML Team was not able to meet all of these design goals in an out-of-the-box solution.

One thing we noted in all the feedback on proposed streaming input designs is that experienced developers’ intuitions of how this feature would work was at odds with designs that we knew would actually work. The realities of streaming (e.g., you can only traverse the stream once, and you can’t backtrack/consume it out of order) lead to counter-intuitive behavior that even a designer of Anders' caliber was not able to figure out how to hide behind a simple API. Likewise, many people intuitively expected this functionality to be exposed on the XStreamingElement class, but doing so leads to essentially all the XElement functionality being duplicated in XStreamingElement (or else there are all sorts of kludgy clones and casts exposed to the user). 

We believe it will be easier to show people how to write specialized code for their use cases than to explain the various abstractions we need to invent to cover a wider range of plausible corner cases. As sample code, we give a very compelling illustration of the power of the underlying LINQ and System.Xml APIs and offer the bare bones of a toolkit that does indeed solve the majority of streaming scenarios that we know of, with code written by those who know the API better than anyone.

To that end, they show how to write a custom axis method that works over streamed XML. The design pattern they demonstrate is based on the XmlReader class and C#’s yield keyword. VB users can do this as well, but because VB lacks the yield keyword, a bit more code is needed.
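
The shape of that pattern can be sketched in Python as an analogue, not as the XML Team's actual C# sample. Here `xml.etree.ElementTree.iterparse` plays the role of the forward-only reader and a Python generator plays the role of C#'s yield. The name `stream_elements` and the sample `<feed>`/`<doc>` document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, name):
    """Lazily yield each <name> element that is a direct child of the root.

    Hypothetical helper, loosely analogous to a custom streaming axis method:
    the caller iterates over complete small elements while the document as a
    whole is read forward-only and never fully held in memory.
    """
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root
    for event, elem in events:
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:  # a direct child of the root has just been closed
                if elem.tag == name:
                    yield elem
                root.clear()  # release children that were already processed


# Usage: the query reads like one over an in-memory document.
sample = io.BytesIO(
    b"<feed><title>Demo</title>"
    b"<doc><title>A</title></doc>"
    b"<doc><title>B</title></doc></feed>"
)
titles = [doc.findtext("title") for doc in stream_elements(sample, "doc")]
print(titles)  # prints ['A', 'B']
```

As the article notes for streaming in general, the result can be traversed only once and only in document order; iterating a second time requires reopening the source.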

 
