InfoQ

Google Introduces Binary Encoding Format: Protocol Buffers

Posted by Werner Schuster on Jul 21, 2008

Sections
Architecture & Design,
Development,
Enterprise Architecture
Topics
Architecture ,
Ruby ,
Java ,
SOA ,
Web Services ,
Performance & Scalability ,
.NET
Tags
Google ,
CORBA ,
Distributed Programming ,
XML Schema
Google recently open sourced Protocol Buffers - a Data Interchange Format. Behind the somewhat nondescript name hide:
  • an IDL to describe data formats
  • a binary encoding scheme to encode formats described in the IDL
  • data binding support using code generators, with Google providing C++, Python, Java implementations

The IDL is used to describe data formats. Here is an example from the Protocol Buffers project page:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
The numbers ("tags") assigned to the fields must be specified explicitly so that formats can evolve. If they were assigned automatically, a change to the format - say, inserting a new field - would cause trouble. Why? Because in the binary format, tags identify which field (in the protocol description) a particular chunk of bytes belongs to. Together with the rule that unknown tags are ignored, explicitly assigned tag numbers make it possible to add new fields as the format evolves while retaining compatibility.
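The role of the tags can be illustrated with a small, self-contained Python sketch. It is not Google's implementation: it hand-rolls only two of the wire types (varint integers and length-delimited strings), handles non-negative integers only, and the `phone` field with tag 4 is a hypothetical addition used to show how an older reader skips a tag it does not know.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types this sketch supports


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos):
    """Return (value, new_position) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    """Prefix the value with a key that combines the tag and the wire type."""
    if isinstance(value, int):
        return encode_varint(tag << 3 | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    key = encode_varint(tag << 3 | LENGTH_DELIMITED)
    return key + encode_varint(len(data)) + data


def decode_message(buf, known):
    """Decode fields listed in `known` (tag -> name); skip unknown tags."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 7
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known:  # unknown tags are read past and ignored
            fields[known[tag]] = value
    return fields


OLD_SCHEMA = {1: "id", 2: "name", 3: "email"}
NEW_SCHEMA = {**OLD_SCHEMA, 4: "phone"}  # hypothetical later revision

# A newer writer emits the extra field; an older reader still copes.
new_bytes = (encode_field(1, 42) + encode_field(2, "Ada")
             + encode_field(4, "555-0100"))
old_view = decode_message(new_bytes, OLD_SCHEMA)
```

Because every value is preceded by its tag and wire type, the older reader knows how many bytes to skip for tag 4 without knowing what the field means. Had the numbers been assigned by position instead, inserting a field would have silently shifted the meaning of every later one.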

The format descriptions, stored in .proto files, are compiled into source code before use. Google's release comes with support for C++, Python and Java. Support for other languages is also becoming available, e.g. Ruby, Erlang, Perl, Haskell, and others. Anyone interested in adding support for another language will appreciate the reverse-engineered grammar of the .proto files as EBNF.

Language support means that .proto files are turned into code in the target language, consisting of classes that map to the formats defined in the .proto files. With these classes, it's possible to parse binary data into an object, modify the object's fields and serialize the state back to the binary format.

As is usual with new Google projects, the release of Protocol Buffers caused quite a stir, with a lot of blog posts devoted to it. The release post on Google's blog explained the reason for Protocol Buffers, and mentioned that XML would be very inefficient as an encoding format. This caused a storm of blog posts - either arguing that Protocol Buffers would mean the end of XML or arguing that Protocol Buffers were inferior to XML. Ted Neward gives an explanation of the situation, with this conclusion:
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
With all the comparisons to XML or JSON, it's easy to miss that Protocol Buffers are a reimplementation of existing technologies. Besides the ones already mentioned, a widely used competing technology is ASN.1, which seems to be somewhat obscure and little known despite being several decades old. This is peculiar if you look at a small sample of the formats that are described in ASN.1:
  • X.509 certificates (used for PKI in many systems, including SSL)
  • LDAP
  • Cryptographic Message Syntax (CMS) for email cryptography
  • PKCS#1 for RSA keys
  • 3G phone networks
ASN.1 has many uses; for example, data encoded using ASN.1 is used by everyone using telecommunication nowadays. ASN.1 is based on concepts similar to those of Protocol Buffers - it uses an IDL to describe formats and a compiler to generate the necessary code for a target language. A key difference, however, is that ASN.1 has multiple encodings, which makes it possible to choose an encoding method to suit the purpose. The list of encodings includes, e.g., Canonical Encoding Rules (CER), which enforce strict rules for the encoding - crucial for anything concerning digital signatures, which react badly to subtle differences - as well as Packed Encoding Rules (PER) and more. The XML Encoding Rules (XER) allow the data to be encoded as XML, which basically makes ASN.1 an alternative to XML Schema. Fast Web Services is a technology that maps XML Schemas to ASN.1 and then uses ASN.1's more efficient encodings between endpoints that support them.
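To give a feel for what a strict ASN.1 encoding looks like on the wire, here is a minimal Python sketch. Note the assumptions: it uses the Distinguished Encoding Rules (DER), a sibling of the encodings named above and the one used for X.509 certificates, and it hand-encodes a hypothetical two-field record (an INTEGER and a UTF8String inside a SEQUENCE) rather than using an ASN.1 compiler.

```python
def der_length(n):
    """Encode a length: short form below 128, minimal long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag, content):
    """Every value is a tag byte, a length, then the content bytes."""
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(value):
    """INTEGER (tag 0x02): shortest possible two's complement, big endian."""
    magnitude = value if value >= 0 else ~value
    size = magnitude.bit_length() // 8 + 1
    return der_tlv(0x02, value.to_bytes(size, "big", signed=True))


def der_utf8string(text):
    """UTF8String (tag 0x0C)."""
    return der_tlv(0x0C, text.encode("utf-8"))


def der_sequence(*members):
    """SEQUENCE (tag 0x30): the concatenation of its encoded members."""
    return der_tlv(0x30, b"".join(members))


# Hypothetical record: SEQUENCE { id INTEGER, name UTF8String }
encoded = der_sequence(der_integer(42), der_utf8string("Ada"))
print(encoded.hex())  # 300802012a0c03416461
```

The rules leave exactly one valid byte string per value (minimal lengths, minimal integers), which is why signatures computed over such data can be verified reliably. Unlike the numeric tags of Protocol Buffers, the tags here identify the type of a value, and fields are matched by position within the SEQUENCE.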

Another technology similar to Google's Protocol Buffers is Facebook's Thrift, which works in a similar way (see side by side comparison of Protocol Buffers and Thrift). A less successful technology is Binary XML, which has been pondered in the XML scene for a very long time but hasn't really arrived yet. In response to questions about Protocol Buffers, Erlang's creator Joe Armstrong mentioned UBF as a binary format for programs that doesn't require parsing.

The common goal of these technologies is to improve efficiency. It's possible to argue that the amount of data sent over the wire doesn't matter because compression can reduce its size. However, compression and decompression are extra steps that have to be performed before sending and after receiving the data, and the actual parsing process still works on the larger, uncompressed data. In the case of XML this means reading the same element tags over and over - compare this to the numeric tags of, say, Protocol Buffers. Of course, the improvement depends on the actual format. A format that consists mostly of strings will not benefit as much as a format made up mostly of numeric data.
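The size argument can be tried out with a rough Python sketch. Everything in it is an assumption made for illustration: the record layout is invented, and the binary side uses a fixed-width `struct` layout as a stand-in rather than the actual Protocol Buffers encoding. No measured results are claimed here; the numbers depend entirely on the data.

```python
import struct
import zlib


def to_xml(records):
    """Repeat the same element names for every record, as XML must."""
    rows = "".join(
        f"<person><id>{ident}</id><age>{age}</age></person>"
        for ident, age in records
    )
    return f"<people>{rows}</people>".encode("utf-8")


def to_binary(records):
    """Stand-in binary layout: 4-byte id plus 1-byte age per record."""
    return b"".join(struct.pack("<IB", ident, age) for ident, age in records)


# Hypothetical, purely numeric sample data
records = [(i, 20 + i % 50) for i in range(1000)]

xml = to_xml(records)
binary = to_binary(records)

sizes = {
    "xml": len(xml),
    "xml_zlib": len(zlib.compress(xml)),
    "binary": len(binary),
    "binary_zlib": len(zlib.compress(binary)),
}

for name, size in sizes.items():
    print(f"{name:12} {size:7} bytes")

# Compression helps on the wire, but the receiver must decompress first
# and then parse the full-size text again.
assert zlib.decompress(zlib.compress(xml)) == xml
```

Swapping the numeric fields for long strings should narrow the gap between the two uncompressed sizes, in line with the point that string-heavy formats benefit less.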

Mark Pilgrim also put together a list of reactions to Protocol Buffers. Another aspect of Protocol Buffers was mentioned in a comment by a Google employee on Steve Vinoski's blog, although it's supposedly in heavy use inside Google.

Have you been in a situation where you considered a binary format for efficiency reasons? If yes - did you roll your own or did you use an existing technology?
  • This article is part of a featured topic series on SOA
  1. Should have adopted SDO

    by mani doraisamy

    With ChangeSummary and XML support, SDO should have been a better choice.

  2. Two more !!!

    by siva prasanna kumar P

    There already seems to be a huge debate going on about JSON vs XML, and now two more (Thrift and PB) have popped up.

    The three most important characteristics that any good data format must have are data structure, data types and data constraints.

    As far as I know, currently only XML has all three. I am not aware of any other format that has all these characteristics and is widely accepted.

  3. All big shops have are sick with "Not invented here"

    by Slava Imeshev

    www.omg.org/gettingstarted/omg_idl.htm

    hessian.caucho.com/

    Or, maybe, this is a key to innovation? Reinvent 100 wheels and the 101st will be another big thing?

    Regards,

    Slava Imeshev
    Cacheonix: Clustered Java Cache

  4. Re: All big shops have are sick with

    by Slava Imeshev

    You can also under-invent the wheel by providing a message editor that does not parse line breaks and links :)

    www.omg.org/gettingstarted/omg_idl.htm

    hessian.caucho.com

    Regards,

    Slava Imeshev
    Cacheonix: Clustered Java Cache

  5. Adobe's AMF

    by Jim Greer

    It would be interesting (to me, anyway) to also see Adobe's AMF in this comparison, another binary message format that has also been open-sourced.

  6. Re: All big shops have are sick with

    by Nikita Ivanov

    Agree here w/Slava. WTF is wrong with Caucho if one needs this data portability? Or SDO? I don't understand... Can someone from Google team provide a sensible reasoning to use theirs vs. others?


    Thanks,

    Nikita Ivanov.

    GridGain - Grid Computing Made Simple

  7. Performance doubts

    by Jimmy zhang

    It is actually not a foregone conclusion that Protocol Buffers will outperform XML; see this article for further analysis:
    soa.sys-con.com/node/250512
