Choosing Patterns over Abstractions: Streaming XML

Due to its structure, XML does not naturally stream well. Microsoft’s XML Team researched several different APIs in an attempt to abstract away the complexity. In the end, they chose to give up on abstract APIs and instead demonstrate some coding patterns that accomplish the same goal. To support this, they included a sample that uses LINQ to XML to extract information from a 10 MB Wikipedia abstract file.

Compared to other formats such as comma-separated values, XML is very expensive to parse. And because it is generally loaded all at once, a large input file can consume an inordinate amount of time and memory. The solution to this is to stream the XML file, which keeps memory consumption down by never having the complete file in memory all at once.

The goals of the XML Team were:

  1. Developers should be able to use essentially the same concepts, classes, and methods to process large or even infinite XML documents as are used to process small documents in memory.
  2. It should be possible to stream output as well as input.
  3. It should be more declarative than imperative.
  4. It only needs to support simple XML formats that rely on regularly repeated elements.
  5. It should support loading some context prior to streaming over elements.

Unfortunately, the XML Team was not able to meet all of these design goals in an out-of-the-box solution. As they explain:

One thing we noted in all the feedback on proposed streaming input designs is that experienced developers’ intuitions of how this feature would work was at odds with designs that we knew would actually work. The realities of streaming (e.g., you can only traverse the stream once, and you can’t backtrack/consume it out of order) lead to counter-intuitive behavior that even a designer of Anders' caliber was not able to figure out how to hide behind a simple API. Likewise, many people intuitively expected this functionality to be exposed on the XStreamingElement class, but doing so leads to essentially all the XElement functionality being duplicated in XStreamingElement (or else there are all sorts of kludgy clones and casts exposed to the user). 

We believe it will be easier to show people how to write specialized code for their use cases than to explain the various abstractions we need to invent to cover a wider range of plausible corner cases. As sample code, we give a very compelling illustration of the power of the underlying LINQ and System.Xml APIs and offer the bare bones of a toolkit that does indeed solve the majority of streaming scenarios that we know of, with code written by those who know the API better than anyone.

To this end, they show how to write a custom axis method that works over streamed XML. The design pattern they demonstrate is based on the XmlReader class and C#’s yield keyword. VB users can do this as well, but since VB lacks the yield keyword, a bit more code will be needed.
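
As a rough illustration of that pattern, here is a minimal sketch of a streaming axis method. It is not the XML Team's actual sample: the class name StreamingAxis, the method name StreamElements, and the tiny inline document (standing in for the large Wikipedia abstract file) are all assumptions made for this example. Only XmlReader, XNode.ReadFrom, and the LINQ to XML types are real framework APIs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamingAxis
{
    // Hypothetical helper: lazily yields every element with the given local
    // name, so only one matching element (and its subtree) is materialized
    // in memory at a time.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == name)
            {
                // XNode.ReadFrom builds the element and leaves the reader on
                // the node that follows it, so Read() must not be called here;
                // doing so could skip an adjacent matching element.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    public static void Main()
    {
        // Made-up sample data; a real input would be opened from a file.
        const string xml =
            "<feed><doc><title>Alpha</title></doc><doc><title>Beta</title></doc></feed>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // The streamed elements can be queried like any other LINQ source.
            IEnumerable<string> titles =
                from doc in StreamElements(reader, "doc")
                select (string)doc.Element("title");

            foreach (string title in titles)
            {
                Console.WriteLine(title);
            }
        }
    }
}
```

Because the sequence is backed by a forward-only reader, it can be enumerated only once, and the query must be consumed before the reader is disposed. These are the same constraints the XML Team describes as hard to hide behind a general-purpose API.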

