Ruby XML Roundup: Hpricot 0.7, Stable Libxml-ruby and Nokogiri

by Werner Schuster on Mar 24, 2009. Estimated reading time: 1 minute

Ruby's XML story has improved lately thanks to a small arms race between the XML libraries Nokogiri, Hpricot and libxml-ruby. Nokogiri was released last fall and is based on the native libraries libxml2 and libxslt:

Since Nokogiri leverages libxml2, consumers get (among other things) fast parsing, i18n support, fast searching, standards based XPath support, namespace support, and mature HTML correction algorithms.

Nokogiri also supports searching with both XPath and CSS selectors, and it runs on Ruby 1.9.1.

After some benchmarks showed Nokogiri in the lead on performance, Hpricot's maintainer _why put effort into improving the library and recently released Hpricot 0.7:

Please enjoy a succulent, new Hpricot. A bit faster, some Ruby 1.9 support, and assorted fixes. [..]

I'm sure you're wondering what's the reason for Hpricot updates, in the face of heated competition from the Nokogiri and LibXML libraries. Remember that Hpricot has no dependencies and is smaller than either of those libs. Hpricot uses its own Ragel-based parser, so you have the freedom to hack the parser itself, the code is dwarven by comparison.

Best of all, Hpricot has run on JRuby in the past. And I am in the process of merging some IronRuby code[1] and porting 0.7 to JRuby. This means your code will run on a variety of Ruby platforms without alteration. That alone makes it worthwhile, wouldn't you agree?

Finally, libxml-ruby was released as version 1.0 with:

* Ruby 1.9.1 support
* Out of the box support for OS X 10.5 and MacPorts [..]
* A nice, clean API that makes it easy to do simple things, but provides all the power of libxml2 if you need it

The latest version is 1.1.3, which was released with a crucial improvement:

Working through the options one-by-one, I finally found the culprit, an obscure field in the structure:
int	dictNames	: Use dictionary names for the tree
What this setting controls is whether libxml2 uses a dictionary to cache strings it has previously parsed. Caching strings makes a big difference, so by default it should be enabled. That is now the case with libxml-ruby 1.1.3 and higher.

With this change, libxml-ruby now performs roughly on par with Nokogiri.
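The string dictionary described in the quote above can be illustrated in plain Ruby. This is only a sketch of the general idea (store each distinct name once and hand back the shared copy on every later occurrence), not libxml2's actual implementation; the `NameDictionary` class and the sample element names are hypothetical.

```ruby
# Illustrative only: a tiny string dictionary in plain Ruby.
# NameDictionary is a hypothetical class, not part of libxml2 or libxml-ruby.
class NameDictionary
  attr_reader :hits, :misses

  def initialize
    @entries = {}
    @hits = 0
    @misses = 0
  end

  # Return the cached copy of +name+, storing a frozen copy on first sight.
  def intern(name)
    if @entries.key?(name)
      @hits += 1
    else
      @misses += 1
      @entries[name] = name.dup.freeze
    end
    @entries[name]
  end

  # Number of distinct strings held in the dictionary.
  def size
    @entries.size
  end
end

dict = NameDictionary.new

# Element names as a parser might encounter them while walking a document.
names = %w[item title item title item price]
interned = names.map { |n| dict.intern(n) }

puts "distinct strings stored: #{dict.size}"
puts "lookups served from cache: #{dict.hits}"
puts "same object reused: #{interned[0].equal?(interned[2])}"
```

XML documents tend to repeat a small set of element and attribute names many times, which is why reusing one stored copy per name can avoid a large number of allocations.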
