Google Introduces Binary Encoding Format: Protocol Buffers
Protocol Buffers consist of three parts:
- an IDL to describe data formats
- a binary encoding scheme to encode formats described in the IDL
- data binding support using code generators, with Google providing C++, Python, and Java implementations
The IDL describes data formats. Here is an example from the Protocol Buffers project page:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
The numbers ("tags") assigned to the fields must be specified explicitly so that formats can evolve. If they were assigned automatically, a change to the format, such as inserting a new field, could shift the numbers of the fields after it and cause trouble. The reason is that, in the binary format, tags identify which field (in the protocol description) a particular chunk of bytes belongs to. Together with the rule that unknown tags are ignored, explicitly assigned tag numbers make it possible to add new fields as the format evolves while retaining compatibility.
The format descriptions are stored in .proto files and compiled into source code before use. Google's release comes with support for C++, Python, and Java. Support for other languages is also becoming available, e.g. Ruby, Erlang, Perl, Haskell, and others. Anyone interested in adding support for another language will appreciate the reverse-engineered grammar of the .proto files as EBNF.
Language support means that .proto files are turned into code in the target language, consisting of classes that map to the formats defined in the .proto files. With this, it's possible to parse binary data into an object, modify the object's fields, and serialize the state back to the binary format.
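To make the role of tags concrete, here is a minimal Python sketch of tag-based encoding. It is not the official library or its generated code: the helper names and the dictionary-based decoder are illustrative assumptions, and the `phone` field with tag 4 is hypothetical. The key layout (tag number shifted left by three bits, combined with a wire type) and the varint integers follow the publicly documented Protocol Buffers encoding, but only non-negative integers and strings are handled.

```python
# Illustrative sketch only; not the official Protocol Buffers library.

VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types handled here


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    """Encode one field as key + payload; the key carries the tag number."""
    if isinstance(value, int):
        return encode_varint((tag << 3) | VARINT) + encode_varint(value)
    payload = value.encode("utf-8")
    return (encode_varint((tag << 3) | LENGTH_DELIMITED)
            + encode_varint(len(payload)) + payload)


def decode_message(data, known_fields):
    """Decode fields whose tags are in known_fields and skip all others."""
    message, pos = {}, 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(data, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in known_fields:  # unknown tags are silently ignored
            message[known_fields[tag]] = value
    return message


# A reader built against the original Person definition (tags 1 to 3).
PERSON_V1 = {1: "id", 2: "name", 3: "email"}

# A newer writer adds a hypothetical field "phone" with tag 4.
data = (encode_field(1, 42) + encode_field(2, "Ada")
        + encode_field(4, "555-0100") + encode_field(3, "ada@example.com"))

person = decode_message(data, PERSON_V1)
print(person)
```

The old reader recovers `id`, `name`, and `email` and skips the bytes belonging to tag 4, because every field announces its own tag and length. Had tags been assigned by position, inserting `phone` before `email` would have changed what tag 3 means.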
As is usual with new Google projects, the release of Protocol Buffers caused quite a stir, with a lot of blog posts devoted to it. The release post on Google's blog explained the reason for Protocol Buffers, and mentioned that XML would be very inefficient as an encoding format. This caused a storm of blog posts - either arguing that Protocol Buffers would mean the end of XML or arguing that Protocol Buffers were inferior to XML. Ted Neward gives an explanation of the situation, with this conclusion:
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
With all the comparisons to XML or JSON, it's easy to miss that Protocol Buffers are a reimplementation of existing technologies. Besides the ones already mentioned, a widely used competing technology is ASN.1, which seems to be somewhat obscure and little known despite being several decades old. This is peculiar given even a small sample of the formats that are described in ASN.1:
- X.509 certificates (used for PKI in many systems, including SSL)
- LDAP
- Cryptographic Message Syntax (CMS) for email cryptography
- PKCS#1 for RSA keys
- 3G phone networks
Another technology similar to Google's Protocol Buffers is Facebook's Thrift, which works in a similar way (see the side-by-side comparison of Protocol Buffers and Thrift). A less successful technology is Binary XML, which has been pondered in the XML scene for a very long time but hasn't really arrived yet. In response to questions about Protocol Buffers, Erlang's creator Joe Armstrong mentioned UBF as a binary format for programs that doesn't require parsing.
The common goal of these technologies is to improve efficiency. It's possible to argue that the amount of data sent over the wire doesn't matter because compression can reduce it. However, compression and decompression are extra steps that have to be performed after producing and before consuming the data, and the actual parsing process still works on the larger, uncompressed data. In the case of XML, this means reading the same element tags over and over; compare this to the numeric tags of, say, Protocol Buffers. Of course, the improvement depends on the actual format: a format that consists mostly of strings will not benefit as much as one made up mostly of numeric data.
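The size argument can be tried out with a rough Python sketch. The record layout (two integer fields, `id` and `score`), the XML element names, and the generated values are all assumptions made for this illustration. The binary side uses a hand-written tag-plus-varint encoding in the style of Protocol Buffers, not the official library. No particular numbers are claimed here, since the results depend on the data.

```python
import zlib


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_record(record_id, score):
    """Encode two integer fields (tags 1 and 2) as key + varint pairs."""
    return (encode_varint((1 << 3) | 0) + encode_varint(record_id)
            + encode_varint((2 << 3) | 0) + encode_varint(score))


# Hypothetical, mostly numeric data set.
records = [(i, (i * 7) % 1000) for i in range(1000)]

xml = "".join(
    "<record><id>%d</id><score>%d</score></record>" % r for r in records
).encode("utf-8")
binary = b"".join(encode_record(*r) for r in records)

xml_compressed = zlib.compress(xml)
binary_compressed = zlib.compress(binary)

print("XML bytes:              ", len(xml))
print("binary bytes:           ", len(binary))
print("XML bytes (zlib):       ", len(xml_compressed))
print("binary bytes (zlib):    ", len(binary_compressed))

# Compression only helps on the wire: the XML parser still has to
# process the full uncompressed document after decompression.
print("XML bytes to parse:     ", len(zlib.decompress(xml_compressed)))
```

Replacing the integer fields with long strings in both encodings should narrow the gap, which matches the point that string-heavy formats benefit less.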
Mark Pilgrim also put together a list of reactions to Protocol Buffers. Another aspect of Protocol Buffers is mentioned in a comment by a Google employee on Steve Vinoski's blog, although it's supposedly in heavy use inside Google.
Have you been in a situation where you considered a binary format for efficiency reasons? If yes - did you roll your own or did you use an existing technology?
Should have adopted SDO
by
mani doraisamy
Two more !!!
by
siva prasanna kumar P
The three most important characteristics that are a must for any good data format are data structure, data types, and data constraints.
In my view, currently only XML has all three. I am not aware of any other format that has all these characteristics and is widely accepted.
All big shops have are sick with "Not invented here"
by
Slava Imeshev
hessian.caucho.com/
Or, maybe, this is a key to innovation? Reinvent 100 wheels and the 101st will be another big thing?
Regards,
Slava Imeshev
Cacheonix: Clustered Java Cache
Re: All big shops have are sick with
by
Slava Imeshev
www.omg.org/gettingstarted/omg_idl.htm
hessian.caucho.com
Regards,
Slava Imeshev
Cacheonix: Clustered Java Cache
Adobe's AMF
by
Jim Greer
Re: All big shops have are sick with
by
Nikita Ivanov
Thanks,
Nikita Ivanov.
GridGain - Grid Computing Made Simple
Performance doubts
by
Jimmy zhang
soa.sys-con.com/node/250512