Google Introduces Binary Encoding Format: Protocol Buffers

by Werner Schuster on Jul 21, 2008
Google recently open sourced Protocol Buffers - a Data Interchange Format. Behind the somewhat nondescript name hide:
  • an IDL to describe data formats
  • a binary encoding scheme to encode formats described in the IDL
  • data binding support using code generators, with Google providing C++, Python, Java implementations

The IDL is used to describe data formats; here is an example from the Protocol Buffers project page:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
The numbers ("tags") assigned to the field names must be specified explicitly so that formats can evolve. If they were assigned automatically, a change to the format, say inserting a new field, would cause trouble. Why? Because in the binary format, the tag identifies which field (in the protocol description) a particular chunk of bytes belongs to. Together with the rule that unknown tags are ignored, explicitly assigned tag numbers make it possible to add new fields as the format evolves while retaining compatibility.
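The role of the tags can be illustrated with a minimal Python sketch of the wire encoding. It is not Google's implementation: the helper names (encode_field, decode_fields, OLD_SCHEMA) are made up for this example, only two wire types (varint and length-delimited) are handled, and the "phone" field with tag 4 is a hypothetical later addition to the Person message.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types this sketch handles


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    """Prefix the value with a key that combines the tag and the wire type."""
    if isinstance(value, int):
        return encode_varint((tag << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    key = encode_varint((tag << 3) | LENGTH_DELIMITED)
    return key + encode_varint(len(data)) + data


def decode_fields(buf, schema):
    """Decode a message; schema maps tag numbers to field names."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if tag in schema:  # unknown tags are read past and ignored
            fields[schema[tag]] = value
    return fields


# A newer writer adds a hypothetical field 4 ("phone") ...
new_message = (encode_field(1, 1234)
               + encode_field(2, "John Doe")
               + encode_field(4, "555-4321"))

# ... and an older reader that only knows tags 1 to 3 still decodes the rest.
OLD_SCHEMA = {1: "id", 2: "name", 3: "email"}
print(decode_fields(new_message, OLD_SCHEMA))  # {'id': 1234, 'name': 'John Doe'}
```

Because each chunk carries its own tag and wire type, the older reader knows how many bytes to skip for field 4 without knowing what the field means.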

Format descriptions are stored in .proto files and compiled into source code before use. Google's release comes with support for C++, Python and Java. Support for other languages is also becoming available, e.g. Ruby, Erlang, Perl, Haskell, and others. Anyone interested in adding support for another language will appreciate the reverse-engineered grammar of the .proto files as EBNF.

Language support means that .proto files are turned into code in the target language, consisting of classes that map to the formats defined in the .proto files. With these classes, it's possible to parse binary data into an object, modify the object's fields, and serialize the state back to the binary format.

As is usual with new Google projects, the release of Protocol Buffers caused quite a stir, with a lot of blog posts devoted to it. The release post on Google's blog explained the reason for Protocol Buffers, and mentioned that XML would be very inefficient as an encoding format. This caused a storm of blog posts - either arguing that Protocol Buffers would mean the end of XML or arguing that Protocol Buffers were inferior to XML. Ted Neward gives an explanation of the situation, with this conclusion:
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
With all the comparisons to XML or JSON, it's easy to miss that Protocol Buffers are a reimplementation of existing technologies. Besides those already mentioned, a widely used competing technology is ASN.1, which seems to be somewhat obscure and little known despite being several decades old. This is peculiar if you look at a small sample of the formats that are described in ASN.1:
  • X.509 certificates (used for PKI in many systems, including SSL)
  • LDAP
  • Cryptographic Message Syntax (CMS) for email cryptography
  • PKCS#1 for RSA keys
  • 3G phone networks
ASN.1 has many uses; for example, practically everyone using telecommunications today relies on ASN.1-encoded data. ASN.1 is based on concepts similar to those of Protocol Buffers: it uses an IDL to describe formats and a compiler to generate the necessary code for a target language. A key difference, however, is that ASN.1 has multiple encodings, which makes it possible to choose an encoding method to suit the purpose. The list includes, e.g., the Canonical Encoding Rules (CER), which enforce strict rules for the encoding (crucial for anything involving digital signatures, which react badly to subtle differences), the Packed Encoding Rules (PER), and more. The XML Encoding Rules (XER) allow the data to be encoded as XML, which basically makes ASN.1 an alternative to XML Schema. Fast Web Services is a technology that maps XML Schemas to ASN.1 and then uses ASN.1's more efficient encodings between endpoints that support them.
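To give a feel for what "strict rules for the encoding" means, here is a minimal Python sketch of a tag-length-value encoding. As an assumption of this example, it uses the Distinguished Encoding Rules (DER), a sibling of CER that the article does not mention, and it covers only non-negative INTEGERs, UTF8Strings and SEQUENCEs. The helper names are made up for illustration, and the two-field record is a hypothetical ASN.1 counterpart of the Person message.

```python
def der_length(n):
    """Encode a length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag, content):
    """Assemble one tag-length-value element."""
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(value):
    """INTEGER (tag 0x02) in the fewest possible two's-complement bytes."""
    if value < 0:
        raise ValueError("negative integers are not handled in this sketch")
    # One extra bit keeps the sign bit clear for non-negative values.
    size = (value.bit_length() + 8) // 8
    return der_tlv(0x02, value.to_bytes(size, "big"))


def der_utf8_string(text):
    """UTF8String (tag 0x0C)."""
    return der_tlv(0x0C, text.encode("utf-8"))


def der_sequence(*elements):
    """SEQUENCE (tag 0x30) wrapping already encoded elements."""
    return der_tlv(0x30, b"".join(elements))


person = der_sequence(der_integer(1234), der_utf8_string("Ann"))
print(person.hex())  # 3009020204d20c03416e6e
```

Since lengths and integers always take the shortest possible form, a given value has exactly one byte representation, which is the property a digital signature over the encoded bytes depends on.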

Another technology similar to Google's Protocol Buffers is Facebook's Thrift, which works in a similar way (see the side-by-side comparison of Protocol Buffers and Thrift). A less successful technology is Binary XML, which has been pondered in the XML scene for a very long time but hasn't really arrived yet. In response to questions about Protocol Buffers, Erlang's creator Joe Armstrong mentioned UBF as a binary format for programs that doesn't require parsing.

The common goal of these technologies is to improve efficiency. It's possible to argue that the amount of data sent over the wire doesn't matter because compression can help with data size. However, compression and decompression are an extra step that has to be performed after producing or before using the data, and the actual parsing process still works on the larger amount of data. In the case of XML, this means reading the same element tags over and over; compare this to the numeric tags of, say, Protocol Buffers. Of course, the improvement depends on the actual format: a format that consists mostly of strings will not benefit as much as one made up mostly of numeric data.
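A rough way to see the effect is to serialize the same records both ways and compare sizes before and after compression. The following Python sketch makes several assumptions: the XML layout is invented for the example, the binary records follow the tag scheme of the Person message shown earlier and are simply concatenated (a real stream would need framing), and zlib stands in for whatever compression a transport might apply. The numbers it prints depend entirely on the sample data.

```python
import zlib


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def binary_record(person_id, name, email):
    """Tag-based encoding: one key byte per field instead of element names."""
    name_b, email_b = name.encode("utf-8"), email.encode("utf-8")
    return (b"\x08" + encode_varint(person_id)                   # field 1, varint
            + b"\x12" + encode_varint(len(name_b)) + name_b      # field 2, bytes
            + b"\x1a" + encode_varint(len(email_b)) + email_b)   # field 3, bytes


def xml_record(person_id, name, email):
    """Hypothetical XML layout carrying the same three fields."""
    return ("<person><id>%d</id><name>%s</name><email>%s</email></person>"
            % (person_id, name, email)).encode("utf-8")


records = [(i, "user%d" % i, "user%d@example.com" % i) for i in range(1000)]
xml_blob = b"".join(xml_record(*r) for r in records)
binary_blob = b"".join(binary_record(*r) for r in records)

print("xml:   ", len(xml_blob), "bytes,", len(zlib.compress(xml_blob)), "compressed")
print("binary:", len(binary_blob), "bytes,", len(zlib.compress(binary_blob)), "compressed")

# Compression shrinks what goes over the wire, but the receiver still has to
# inflate the data and parse the full-size XML text afterwards.
assert zlib.decompress(zlib.compress(xml_blob)) == xml_blob
```

In this sample most of the payload is strings, so the saving comes mainly from dropping the repeated element names; a record made up mostly of numbers would shrink further.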

Mark Pilgrim also put together a list of reactions to Protocol Buffers. Another aspect of Protocol Buffers is mentioned in a comment by a Google employee on Steve Vinoski's blog, although it's supposedly in heavy use inside Google.

Have you been in a situation where you considered a binary format for efficiency reasons? If yes - did you roll your own or did you use an existing technology?


Should have adopted SDO by mani doraisamy

With ChangeSummary and XML support, SDO should have been a better choice.

Two more !!! by siva prasanna kumar P

There already seems to be a huge debate going on about JSON vs XML, and now two more (Thrift and PB) have popped up.

The three most important characteristics that are a must for any good data format are data structure, data types and data constraints.

In my view, currently only XML has all three. I am not aware of any other format that has all these characteristics and is widely accepted.

All big shops are sick with "Not invented here" by Slava Imeshev

www.omg.org/gettingstarted/omg_idl.htm

hessian.caucho.com/

Or, maybe, this is a key to innovation? Reinvent 100 wheels and the 101st will be another big thing?

Regards,

Slava Imeshev
Cacheonix: Clustered Java Cache

Re: All big shops are sick with by Slava Imeshev

You can also under-invent the wheel by providing a message editor that does not parse line breaks and links :)

www.omg.org/gettingstarted/omg_idl.htm

hessian.caucho.com

Regards,

Slava Imeshev
Cacheonix: Clustered Java Cache

Adobe's AMF by Jim Greer

It would be interesting (to me, anyway) to also see Adobe's AMF in this comparison, another binary message format that has also been open-sourced.

Re: All big shops are sick with by Nikita Ivanov

Agree here w/Slava. WTF is wrong with Caucho if one needs this data portability? Or SDO? I don't understand... Can someone from the Google team provide sensible reasoning for using theirs vs. the others?

Thanks,

Nikita Ivanov.

GridGain - Grid Computing Made Simple

Performance doubts by Jimmy zhang

It is actually not a foregone conclusion that Protocol Buffers will outperform XML; see this article for further analysis:
soa.sys-con.com/node/250512
