Google Releases New Version Of Protocol Buffers

Google released a new version of protocol buffers – a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. The changes in this release are outlined in the change notes.

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

According to the released documentation, the commonly available techniques for serializing objects across process or machine boundaries are:

Native serialization, where objects are serialized using the built-in mechanism of the language in use, e.g. Java or C++
Serialization using a custom format
Serialization of the data to XML

Each of these approaches has its own set of problems. Native serialization requires the platforms at both ends of the serialization pipe to be the same in order to materialize the serialized objects. XML is known to be a verbose and inefficient serialization format. Custom serialization formats increase development cost, since each one needs its own one-off parser.

Protocol buffers aim to be a flexible, efficient, automated solution to exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports extending the format over time in such a way that the code can still read data encoded with the old format.
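To illustrate how a format can be extended without breaking existing readers, here is a minimal sketch in Python. It is not the generated code or the official library; it is a hand-written reader that assumes the publicly documented wire layout, in which every field starts with a key holding the field number and a wire type, so a reader can skip fields it does not recognize. The function name read_known_fields and the extra field number 5 in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def read_known_fields(data: bytes, known: Dict[int, str]) -> dict:
    """Read top-level fields, keeping known ones and skipping the rest.

    Illustrative only: no bounds checking, and values are returned raw
    (integers for varints, bytes for everything else).
    """
    pos = 0
    fields = {}
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if field_number in known:
            fields[known[field_number]] = value
        # Unknown field numbers are skipped; their size is known from the wire type.
    return fields
```

For example, a reader that only knows field 1 as a name and field 2 as an id can be handed a message that also carries a hypothetical newer field 5: read_known_fields(bytes.fromhex("0a02416c1096012a0178"), {1: "name", 2: "id"}) yields the name and id and ignores the rest.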

Protocol buffers support the following primitive data types, grouped here by how they are represented in the encoded message:

Base 128 varint representations - int32, int64, uint32, uint64, sint32, sint64, bool, enum (Varints are a method of serializing integers using one or more bytes. Smaller numbers take fewer bytes.)
Fixed size 64 bit representations - fixed64, sfixed64, double
Length-delimited representations - string, bytes, embedded messages, packed repeated fields
Fixed size 32 bit representations - fixed32, sfixed32, float
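The varint scheme mentioned above can be sketched in a few lines of Python. This is an illustrative implementation rather than the library's own code: each byte carries seven bits of the number, least significant group first, and the high bit signals that more bytes follow. The helper zigzag_encode shows the mapping the encoding documentation describes for sint32 and sint64, which keeps small negative numbers short. All function names here are chosen for the example.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch encodes non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # set the high bit: more bytes follow
        else:
            out.append(low7)         # final byte has the high bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit value to unsigned: 0->0, -1->1, 1->2, -2->3."""
    return (n << 1) ^ (n >> 63)


print(encode_varint(1).hex())    # 01
print(encode_varint(300).hex())  # ac02
```

The value 1 fits in a single byte, while 300 needs two, which is the "smaller numbers take fewer bytes" property in practice.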

The unit of serialization is a message, which can contain fields composed of the primitive data types or embedded messages. Protocol buffers support optional, required and repeated fields. An address book message definition using protocol buffers looks like this:

package tutorial;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
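To give a feel for what the compact binary form of such a message looks like, the following Python sketch encodes a few Person fields by hand. It is not what the protocol buffer compiler generates; it assumes the documented wire layout, in which each field is written as a key, computed as (field_number << 3) | wire_type, followed by the value. The helper names are hypothetical, only the name, id and email fields are covered, and the id is assumed to be non-negative.

```python
VARINT = 0            # wire type for int32 and similar
LENGTH_DELIMITED = 2  # wire type for string, bytes, embedded messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key packs the field number and wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return (encode_key(field_number, LENGTH_DELIMITED)
            + encode_varint(len(payload))
            + payload)


def encode_int_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, VARINT) + encode_varint(value)


def encode_person(name: str, person_id: int, email: str = None) -> bytes:
    """Encode name = 1, id = 2 and, if present, email = 3."""
    out = encode_string_field(1, name) + encode_int_field(2, person_id)
    if email is not None:
        out += encode_string_field(3, email)  # optional field: omitted when unset
    return out


print(encode_person("Al", 150).hex())  # 0a02416c109601
```

A Person with a two-character name and a small id takes seven bytes here, and the field names themselves never appear in the output, only the field numbers from the .proto definition.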

The features of the message definition language are described in the language guide. When a definition is compiled with the protocol buffer compiler, the generated encoders and parsers use an efficient binary serialization format specific to Protocol Buffers. The current release includes compilers and APIs for C++, Java, and Python. In addition, there are community projects that add further language implementations, including Perl, C#, and Ruby.
