Google Open Sources Gumbo, An HTML5 Parsing Library

by Abel Avram on Aug 14, 2013

Google has open sourced Gumbo, an HTML parsing library written in C. Gumbo adheres to the HTML5 parsing algorithm, passing all html5lib-0.95 tests, and has been tested on 2.5 billion pages indexed by Google.

According to the project’s description page, the purpose of releasing Gumbo is to provide developers with a lightweight HTML parsing library that has no outside dependencies and can be called from the majority of languages. The library could be included in web page validators, static analyzers, templating languages, refactoring tools, etc.

Google describes Gumbo as “robust and resilient to bad input”, but it does not recommend maintaining pointers to some of its internal data structures because the ABI is likely to change in the future. The API, however, is considered to be fairly stable; the team is waiting on comments from users before releasing it as 1.0, which is expected in the near future.

Some of the features to be added in the future are:

  • Support for recent HTML5 spec changes to support the template tag.
  • Support for fragment parsing.
  • Full-featured error reporting.
  • Bindings in other languages.

Prior to the standardization of the HTML5 parsing algorithm, each browser chose how to tokenize input pages and how to render them. And while HTML 4 specified what constituted valid markup, it gave no guidance on what a browser should do when the input was not valid, and 95% of the world’s web pages did not pass the W3C reference validator. Validating HTML pages with a tool like Gumbo helps ensure pages will be parsed and rendered properly in all major browsers.
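To make the idea of deterministic error recovery concrete, here is a minimal sketch in C. It is not Gumbo's API and not the HTML5 algorithm. It is a toy with made-up rules, and the names repair_markup, append and emit_end_tag are hypothetical. It assumes tags carry no attributes and that every element needs an end tag. The point is only that once the recovery rules are written down, every implementation produces the same result for the same broken input.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 32   /* deepest nesting the toy tracks */
#define MAX_NAME  16   /* longest tag name (including terminator) */

/* Append n bytes of s to out, truncating if the buffer is full.
 * Assumes cap >= 1 so the terminator always fits. */
static size_t append(char *out, size_t cap, size_t len, const char *s, size_t n)
{
    for (size_t i = 0; i < n && len + 1 < cap; i++)
        out[len++] = s[i];
    out[len] = '\0';
    return len;
}

static size_t emit_end_tag(char *out, size_t cap, size_t len, const char *name)
{
    len = append(out, cap, len, "</", 2);
    len = append(out, cap, len, name, strlen(name));
    return append(out, cap, len, ">", 1);
}

/* Toy recovery rules (assumptions for illustration only):
 *   1. An end tag first closes every element opened after its match.
 *   2. An end tag with no matching open element is dropped.
 *   3. Elements still open at end of input are closed, innermost first.
 * Writes the repaired markup to out and returns the number of end tags
 * that had to be implied, or -1 if cap is 0. */
int repair_markup(const char *in, char *out, size_t cap)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0, implied = 0;
    size_t len = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (const char *p = in; *p; ) {
        const char *gt = (*p == '<') ? strchr(p, '>') : NULL;
        if (gt == NULL) {                 /* plain text, or '<' never closed */
            len = append(out, cap, len, p, 1);
            p++;
            continue;
        }
        int closing = (p[1] == '/');
        const char *name = p + 1 + closing;
        size_t n = (size_t)(gt - name);
        if (n == 0 || n >= MAX_NAME) {    /* not a simple tag: copy through */
            len = append(out, cap, len, p, (size_t)(gt - p) + 1);
        } else if (!closing) {
            if (depth < MAX_DEPTH) {      /* deeper elements are not tracked */
                memcpy(stack[depth], name, n);
                stack[depth][n] = '\0';
                depth++;
            }
            len = append(out, cap, len, p, (size_t)(gt - p) + 1);
        } else {
            int m = depth - 1;            /* search for the matching open tag */
            while (m >= 0 && !(strlen(stack[m]) == n &&
                               memcmp(stack[m], name, n) == 0))
                m--;
            if (m >= 0) {
                while (depth - 1 > m) {   /* rule 1 */
                    len = emit_end_tag(out, cap, len, stack[--depth]);
                    implied++;
                }
                len = append(out, cap, len, p, (size_t)(gt - p) + 1);
                depth--;
            }                             /* else rule 2: drop stray end tag */
        }
        p = gt + 1;
    }
    while (depth > 0) {                   /* rule 3 */
        len = emit_end_tag(out, cap, len, stack[--depth]);
        implied++;
    }
    return implied;
}
```

With these rules, the input `<div><b>hi` becomes `<div><b>hi</b></div>` (two implied end tags), and `<p><b>x</p>` becomes `<p><b>x</b></p>` (one). The real HTML5 algorithm handles many more cases, including void elements, attributes, and misnested formatting elements, and its results differ from this toy for much real-world markup.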
