Ruby XML Roundup: Hpricot 0.7, Stable Libxml-ruby and Nokogiri
Ruby's XML story has improved lately thanks to a small arms race among the XML libraries Nokogiri, Hpricot, and libxml-ruby. Nokogiri was released last fall and is based on the native libxml2 and libxslt libraries:
Since Nokogiri leverages libxml2, consumers get (among other things) fast parsing, i18n support, fast searching, standards based XPath support, namespace support, and mature HTML correction algorithms.
Nokogiri also lets you search documents with both XPath expressions and CSS selectors, and it runs on Ruby 1.9.1.
After some benchmarks showed Nokogiri in the lead on performance, Hpricot's maintainer _why put effort into improving the library and recently released Hpricot 0.7:
Please enjoy a succulent, new Hpricot. A bit faster, some Ruby 1.9 support, and assorted fixes. [..]
I'm sure you're wondering what's the reason for Hpricot updates, in the face of heated competition from the Nokogiri and LibXML libraries. Remember that Hpricot has no dependencies and is smaller than either of those libs. Hpricot uses its own Ragel-based parser, so you have the freedom to hack the parser itself, the code is dwarven by comparison.
Best of all, Hpricot has run on JRuby in the past. And I am in the process of merging some IronRuby code and porting 0.7 to JRuby. This means your code will run on a variety of Ruby platforms without alteration. That alone makes it worthwhile, wouldn't you agree?
Finally, libxml-ruby was released as version 1.0 with:
* Ruby 1.9.1 support
* Out of the box support for OS X 10.5 and MacPorts [..]
* A nice, clean API that makes it easy to do simple things, but provides all the power of libxml2 if you need it
The latest version is 1.1.3, which was released with a crucial improvement:
Working through the options one-by-one, I finally found the culprit, an obscure field in the structure:
int dictNames : Use dictionary names for the tree
What this setting controls is whether libxml2 uses a dictionary to cache strings it has previously parsed. Caching strings makes a big difference, so by default it should be enabled. That is now the case with libxml-ruby 1.1.3 and higher.
With this change, libxml-ruby now performs roughly on par with Nokogiri.
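The idea behind a string dictionary can be sketched in a few lines of plain Ruby. This is only a conceptual illustration, not libxml2's implementation: the `NameDictionary` class and the tag list are hypothetical. It shows how repeated element names can share one cached object instead of being stored anew each time they are parsed.

```ruby
# Hypothetical illustration of a name dictionary; not libxml2's actual code.
class NameDictionary
  def initialize
    @names = {}
  end

  # Return the cached copy of +name+, storing a frozen copy the first time.
  def intern(name)
    @names[name] ||= name.dup.freeze
  end

  def size
    @names.size
  end
end

dict = NameDictionary.new

# String#split hands back a fresh String object for every token,
# much like a parser reading the same tag name again and again.
tags = "item title item link item title".split

interned = tags.map { |tag| dict.intern(tag) }

raw_objects    = tags.map(&:object_id).uniq.size      # one object per token
shared_objects = interned.map(&:object_id).uniq.size  # one object per distinct name

puts "tokens: #{tags.size}, raw objects: #{raw_objects}, shared objects: #{shared_objects}"
```

With the dictionary in place, the six tokens collapse to three shared strings, which is the kind of saving that adds up on large documents with many repeated element names.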
Martin Thompson Jul 27, 2014