Google Open Sources Gumbo, An HTML5 Parsing Library

by Abel Avram on Aug 14, 2013

Google has open sourced Gumbo, an HTML parsing library written in C. Gumbo adheres to the HTML5 parsing algorithm, passing all html5lib-0.95 tests, and has been tested on 2.5 billion pages indexed by Google.

According to the project’s description page, the purpose of releasing Gumbo is to provide developers with a lightweight HTML parsing library that has no outside dependencies and can be called from the majority of languages. The library could be included in web page validators, static analyzers, templating languages, refactoring tools, etc.

Google describes Gumbo as “robust and resilient to bad input”, but it does not recommend maintaining pointers to some of its internal data structures because the ABI is likely to change in the future. The API, however, is considered fairly stable; the team is waiting for comments from users before releasing version 1.0, which is expected in the near future.

Some of the features to be added in the future are:

  • Support for the recent HTML5 spec changes that introduce the template tag.
  • Support for fragment parsing.
  • Full-featured error reporting.
  • Bindings in other languages.

Prior to the standardization of the HTML5 parsing algorithm, each browser chose how to tokenize input pages and how to render them. And while HTML 4 specified what valid markup looks like, it gave no guidance on what a browser should do when the input was not valid, and 95% of the world’s web pages did not pass the W3C reference validator. Validating HTML pages with a tool like Gumbo helps ensure that pages will be parsed and rendered properly in all major browsers.
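To make the idea of specified error recovery concrete, here is a minimal toy sketch in C. It is not Gumbo's code and not the HTML5 parsing algorithm, which is far more involved. The names `ToyReport`, `toy_scan` and `read_name` are hypothetical and exist only for this illustration. The sketch assumes two simple rules: an end tag closes every element opened after its nearest match, and anything still open at the end of input is closed implicitly. It ignores attributes containing `<`, comments, and void elements such as `br`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64   /* deepest nesting the toy tracks */
#define MAX_NAME  16   /* longest tag name kept, including the terminator */

/* Hypothetical result type for this sketch; not part of Gumbo. */
typedef struct {
    int implied_closes;  /* open elements closed by the recovery rules */
    int stray_closes;    /* end tags that matched no open element */
} ToyReport;

/* Copies a lowercased tag name starting at p into out and returns the
 * position just past the name. Over-long names are truncated. */
static const char *read_name(const char *p, char out[MAX_NAME]) {
    size_t n = 0;
    while (isalnum((unsigned char)*p)) {
        if (n < MAX_NAME - 1) out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    out[n] = '\0';
    return p;
}

/* Scans html and reports how often the two toy recovery rules fired. */
ToyReport toy_scan(const char *html) {
    assert(html != NULL);
    char stack[MAX_DEPTH][MAX_NAME];  /* names of currently open elements */
    int depth = 0;
    ToyReport r = {0, 0};

    for (const char *p = html; *p != '\0'; ) {
        if (*p != '<') { p++; continue; }
        p++;
        int closing = (*p == '/');
        if (closing) p++;

        char name[MAX_NAME];
        p = read_name(p, name);
        if (name[0] == '\0') continue;  /* not a tag, e.g. "a < b" */

        if (!closing) {
            /* Start tag: remember it (silently dropped if too deep). */
            if (depth < MAX_DEPTH) strcpy(stack[depth++], name);
            continue;
        }

        /* End tag: look for the nearest open element with the same name. */
        int i = depth - 1;
        while (i >= 0 && strcmp(stack[i], name) != 0) i--;
        if (i < 0) { r.stray_closes++; continue; }  /* ignore stray end tag */

        r.implied_closes += depth - 1 - i;  /* elements opened after the match */
        depth = i;                          /* pop the match and everything above */
    }

    r.implied_closes += depth;  /* whatever is still open at end of input */
    return r;
}
```

With these rules, `toy_scan("<b><i>x</b></i>")` reports one implied close (the `i` element, closed when `</b>` arrives) and one stray end tag (the trailing `</i>`). The point is only that a fixed rule set gives every consumer the same answer for the same broken input, which is what the standardized algorithm provides on a much larger scale.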
