Introducing Microsoft Avro
Microsoft has announced its implementation of the Apache Avro wire protocol. Avro is described as a “compact binary data serialization format similar to Thrift or Protocol Buffers” with additional features needed for distributed processing environments such as Hadoop.
To make serialization as fast as possible, the Microsoft Avro Library uses expression trees to build and compile a custom serializer at run time. After the initial cost of compiling the serializer into IL code, this should provide significantly better performance than reflection-based algorithms.
Unlike Protocol Buffers, the Avro protocol is self-describing. When the connection is made between client and server, the schema is transmitted, usually just once. As a result, neither side has to hard-code the binary format, and you do not pay the price of transmitting the schema with each message.
Because of this, the Microsoft Avro Library can support three modes:
- Reflection mode. The IL code for the serializer is built based on the schema of .NET types to achieve maximum performance.
- Generic record mode. The JSON schema of the data can be specified at run time, which makes it possible to handle dynamic data with an arbitrary schema.
- Container mode. The library can generate portable files with an embedded schema. The file format is compatible with the Avro container file specification and can be used across platforms.
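To make the "schema sent once" idea concrete, here is a minimal sketch of how Avro's binary encoding lays out a record. It is written in Python for brevity and is not the Microsoft Avro Library's API. It follows the Avro specification's rules for `long` (zig-zag, variable-length) and `string` (length prefix plus UTF-8 bytes); the `FIELDS` list, the `encode_record` helper, and the sample record are hypothetical names used only for illustration.

```python
# Minimal sketch of Avro binary encoding for a flat record.
# FIELDS, encode_record, and the sample data are illustrative assumptions.

def encode_long(n: int) -> bytes:
    """Encode an integer as an Avro int/long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# The "schema": field names and types, agreed on once by both sides.
FIELDS = [("name", "string"), ("age", "long")]
ENCODERS = {"string": encode_string, "long": encode_long}


def encode_record(record: dict) -> bytes:
    """Concatenate the encoded field values in schema order."""
    return b"".join(ENCODERS[ftype](record[fname]) for fname, ftype in FIELDS)


payload = encode_record({"name": "Ann", "age": 30})
print(payload.hex())
```

The resulting payload is five bytes long and contains no field names or type tags, which is why both sides need the schema before they can interpret it.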
When used in reflection mode, Avro uses the same DataContract/DataMember attributes that WCF developers are familiar with.
In generic record mode it is assumed that you don’t have a .NET class predefined to store the data. Instead you use the AvroRecord class in conjunction with a JSON document that describes the format of the data. AvroRecord objects need to be accessed in a late bound manner (C# dynamic, VB Option Strict Off).
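The idea behind generic record mode can be sketched independently of .NET. The Python below is not the library's AvroRecord API; it only illustrates the principle of parsing a JSON schema at run time and using it to decode a binary payload into a late-bound, dictionary-style record. Only flat records with `string` and `long` fields are handled, and the `Person` schema, `decode_record`, and the sample bytes are hypothetical.

```python
import io
import json

# A JSON schema supplied at run time (hypothetical example schema).
SCHEMA_JSON = """
{"type": "record", "name": "Person",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "long"}]}
"""


def decode_long(buf: io.BytesIO) -> int:
    """Read a base-128 varint and undo the zig-zag mapping."""
    shift = 0
    z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # continuation bit clear: last byte of this value
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def decode_string(buf: io.BytesIO) -> str:
    """Read a length prefix, then that many UTF-8 bytes."""
    length = decode_long(buf)
    return buf.read(length).decode("utf-8")


DECODERS = {"long": decode_long, "string": decode_string}


def decode_record(schema: dict, payload: bytes) -> dict:
    """Decode field values in schema order into a name -> value mapping."""
    buf = io.BytesIO(payload)
    return {f["name"]: DECODERS[f["type"]](buf) for f in schema["fields"]}


schema = json.loads(SCHEMA_JSON)
record = decode_record(schema, b"\x06Ann\x3c")
print(record["name"], record["age"])
```

Because the field names come from the schema rather than from a compiled type, the record has to be accessed by name at run time, which mirrors the late-bound access described above.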
Container mode can be used in conjunction with either reflection or generic record mode. Since you are creating files in this mode instead of sending messages over the wire, you can compress and/or encrypt the data using whatever means you prefer. Out of the box, the only choices are no compression or deflate, but instructions for building your own codec are included.
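For readers curious what a container file looks like, the following Python sketch writes one in memory, following the layout in the Avro container file specification: a magic header, a metadata map carrying the schema and codec name, a random 16-byte sync marker, and a data block. It is a simplified illustration (a single block, pre-encoded records) and not the Microsoft library's API; `write_container`, `raw_deflate`, and the sample schema and records are hypothetical names.

```python
import io
import json
import os
import zlib

MAGIC = b"Obj\x01"  # four-byte header defined by the container file spec


def encode_long(n: int) -> bytes:
    """Avro long: zig-zag followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Avro bytes/string: length prefix followed by the raw bytes."""
    return encode_long(len(b)) + b


def raw_deflate(data: bytes) -> bytes:
    """Raw deflate (RFC 1951): negative wbits omits the zlib header and checksum."""
    c = zlib.compressobj(wbits=-15)
    return c.compress(data) + c.flush()


def write_container(schema: dict, encoded_records: list, codec: str = "deflate") -> bytes:
    sync = os.urandom(16)  # marker repeated after the header and after each block
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"),
            "avro.codec": codec.encode("ascii")}
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(encode_long(len(meta)))  # metadata map: one block of entries...
    for key, value in meta.items():
        out.write(encode_bytes(key.encode("utf-8")))
        out.write(encode_bytes(value))
    out.write(encode_long(0))          # ...terminated by a zero count
    out.write(sync)

    body = b"".join(encoded_records)
    if codec == "deflate":
        body = raw_deflate(body)
    elif codec != "null":
        raise ValueError("unsupported codec: " + codec)
    out.write(encode_long(len(encoded_records)))  # number of objects in the block
    out.write(encode_long(len(body)))             # block size after the codec is applied
    out.write(body)
    out.write(sync)
    return out.getvalue()


SCHEMA = {"type": "record", "name": "Person",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "age", "type": "long"}]}
container = write_container(SCHEMA, [b"\x06Ann\x3c", b"\x06Bob\x50"])
```

Because the schema and the codec name are stored in the file header, a reader on another platform can decode the blocks without any out-of-band agreement, provided it supports the named codec.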
What about compatibility?
Am I reading the article right? Is only the last mode of the Microsoft implementation (container mode) usable across platforms, since it produces files according to the container spec? Are data streams from the fast (reflection) and the JSON (generic) modes compatible with upstream Avro, or is this implementation just an MS-to-MS solution?
Re: What about compatibility?
For "container" I'm assuming that everyone has to agree on whatever compression/encryption codec you are layering on top.
Re: What about compatibility?
- Reflection mode seems to be the preferred .NET way of working with data objects. It would be analogous to beans, which seem to be the popular way of handling data objects in Java (though the two are different otherwise).
- Generic mode uses a Map interface.
The Java implementation also has a JSON view implementation, but given reflection and generic mode, either Microsoft or someone else can provide the JSON-ish way of using Avro, provided the use case is there. Just to add: they say they have published the code under the Apache 2 license on CodePlex, which is nice.