SuperPack, a New Serialization Format with a Smaller Payload

Shape Security has open sourced a new schemaless binary serialization format called SuperPack.

SuperPack is designed to reduce payload size. According to Shape Security, it produces the smallest uncompressed payload among several schemaless formats for a given 4.48 KB sample message:

 

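|              | Original Message | YAML | BSON | JSON | Sereal | SuperPack |
|--------------|------------------|------|------|------|--------|-----------|
| Uncompressed | 4.769 B          | 134% | 111% | 69%  | 40%    | 28%       |
| Compressed   | 4.769 B          | 14%  | 20%  | 12%  | 16%    | 13%       |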

YAML and BSON are verbose, producing payloads larger than the original message (134% and 111%, respectively). JSON is considerably smaller than YAML at 69%, but still much larger than SuperPack's 28% because of its text encoding. After gzip compression the picture changes: YAML, JSON and SuperPack end up close to one another at 12-14% of the original message.
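As a rough illustration of how such a before-and-after comparison can be made for a single format, the Python sketch below measures a JSON-encoded payload and its gzip-compressed size. The sample message and the helper name `size_report` are assumptions made for this example; they are not Shape Security's benchmark data or tooling, and the ratio is computed against the encoded JSON rather than against the original message used in the table.

```python
import gzip
import json

def size_report(message) -> dict:
    """Return the JSON-encoded size and its gzip-compressed size in bytes."""
    # Compact separators avoid counting optional whitespace in the payload.
    encoded = json.dumps(message, separators=(",", ":")).encode("utf-8")
    compressed = gzip.compress(encoded)
    return {
        "uncompressed": len(encoded),
        "compressed": len(compressed),
        "ratio_percent": round(100 * len(compressed) / len(encoded)),
    }

# Hypothetical sample: many records sharing the same keys, the kind of
# repetition a general-purpose compressor exploits well.
sample = [
    {"name": f"cat{i}", "favoriteFood": "tuna", "whiskers": 24}
    for i in range(100)
]
report = size_report(sample)
print(report)
```

Highly repetitive messages like this one compress well regardless of the encoding, which is one plausible reason the compressed figures in the table sit so close together.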

One of the main advantages of using an encoding format such as SuperPack is the ability to communicate with clients without exchanging a message schema beforehand, because data type information is included in the payload. SuperPack has 36 predefined data types, including the usual true, false, uint16, uint32 and float32, as well as less common ones such as uint6, nint4 and array5. The latter are meant to represent values that have a high probability of being encountered in messages.

SuperPack also includes types for arrays, strings, and maps. One of the types, extension, lets users add new types. SuperPack includes two optional optimizations that can reduce the payload in certain cases: the repeated string optimization and the repeated keyset optimization.
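The specification defines how these two optimizations are laid out on the wire. The Python sketch below only illustrates the general idea, and its function names and output structures are hypothetical rather than SuperPack's actual format: repeated strings are stored once in a table and referenced by index, and repeated keysets are stored once so that each record carries only its values.

```python
def dedupe_strings(strings):
    """Store each distinct string once; replace occurrences with table indexes."""
    table, index_of, refs = [], {}, []
    for s in strings:
        if s not in index_of:
            index_of[s] = len(table)
            table.append(s)
        refs.append(index_of[s])
    return table, refs

def dedupe_keysets(records):
    """Store each distinct set of keys once; each row keeps only its values."""
    keysets, index_of, rows = [], {}, []
    for record in records:
        keys = tuple(sorted(record))  # sort so key order does not matter
        if keys not in index_of:
            index_of[keys] = len(keysets)
            keysets.append(keys)
        rows.append((index_of[keys], [record[k] for k in keys]))
    return keysets, rows

def restore_records(keysets, rows):
    """Inverse of dedupe_keysets: rebuild the original list of maps."""
    return [dict(zip(keysets[i], values)) for i, values in rows]

# Hypothetical data: every record shares the same three keys.
cats = [
    {"name": "Tom", "birthday": "2014-03-01", "favoriteFood": "tuna"},
    {"name": "Felix", "birthday": "2012-11-20", "favoriteFood": "tuna"},
]
keysets, rows = dedupe_keysets(cats)
print(keysets)  # the key names appear once, however many records there are
print(rows)
```

Both optimizations only pay off when values or keysets actually repeat, which is consistent with them being optional.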

We asked Michael Ficarra, a Research Engineer and FOSS Coordinator at Shape Security, to provide us with more details about SuperPack.

InfoQ: What exactly did you do differently than other schemaless formats to get a smaller encoded size?

Michael Ficarra: The philosophy behind SuperPack is that, even if you cannot predict your data’s schema in advance, the data likely has structures or values that are repeated many times. As an example, say you have a data structure named "cats" which maps people to their cats. Instead of encoding, for every cat, the fact that it has a name and a birthday and favorite food, we can encode this information once and reference it later for very efficient protobuf-style packing of the values.

Also, some values are just more common than others and should have efficient representations. If you look into the details of the format, you'll see that all values start with a one-byte indicator of the value's type that we call the "type tag". We've reserved a number of regions of the type tag domain so that all or part of the value may be encoded within the tag itself. As a simple example, there are two boolean type tags: one for the value true and one for the value false. Similarly, there are 64 "uint6" type tags, allowing us to represent each of the numbers 0 to 63 in a single byte, and arrays (which must encode both their length and their entries) with fewer than 32 entries may have their length encoded in the tag. Relating back to the earlier example, cats usually have fewer than 64 whiskers, and most cat owners own fewer than 32 cats, so these values will be stored very efficiently.
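To illustrate the type-tag idea described in this answer, here is a minimal Python sketch of an encoder that packs small integers, booleans and short array lengths into a single tag byte. The tag values and the names `encode`, `UINT6_BASE`, `ARRAY5_BASE`, `TAG_FALSE`, `TAG_TRUE` and `TAG_UINT16` are assumptions chosen for this example; the real SuperPack specification assigns its own tag values and covers many more types.

```python
import struct

# Hypothetical tag layout for illustration only; not the SuperPack spec.
UINT6_BASE = 0x00   # 0x00-0x3F: integers 0..63 stored in the tag itself
ARRAY5_BASE = 0x40  # 0x40-0x5F: arrays with 0..31 entries, length in the tag
TAG_FALSE = 0x60    # one tag per boolean value, no payload bytes
TAG_TRUE = 0x61
TAG_UINT16 = 0x62   # followed by two big-endian payload bytes

def encode(value) -> bytes:
    # Booleans are checked first because bool is a subclass of int in Python.
    if value is True:
        return bytes([TAG_TRUE])
    if value is False:
        return bytes([TAG_FALSE])
    if isinstance(value, int):
        if 0 <= value < 64:
            return bytes([UINT6_BASE + value])  # value lives in the tag
        if 64 <= value < 65536:
            return bytes([TAG_UINT16]) + struct.pack(">H", value)
        raise ValueError("integer out of range for this sketch")
    if isinstance(value, list):
        if len(value) >= 32:
            raise ValueError("array too long for this sketch")
        out = bytearray([ARRAY5_BASE + len(value)])  # length lives in the tag
        for item in value:
            out += encode(item)
        return bytes(out)
    raise TypeError(f"unsupported type: {type(value).__name__}")

# 24 whiskers fits in one byte; 300 needs a tag plus two payload bytes.
print(encode(24).hex())
print(encode([24, True, 300]).hex())
```

In this sketch a three-element array of common values costs six bytes in total, with the array length, the small integer and the boolean each taking a single byte.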

InfoQ: Have you compared SuperPack against a schema-driven binary format such as Protocol Buffers? Are the Protobuf payloads significantly smaller?

Ficarra: We have not done this kind of comparison. I imagine the Protobuf payloads would be smaller in most cases except when SuperPack's string deduplication is particularly effective. Whenever your needs allow you to use a schema-driven format, especially when you can pair it with a lossless data compression algorithm such as LZW or Deflate, you should take advantage of that.

InfoQ: What are the values in terms of the time needed to encode/decode the messages?

Ficarra: The encoding time will vary based on whether the encoder decides (or is instructed) to use the optional keyset and string deduplication optimizations. The implementation language may also have some inherent performance difficulties, such as JavaScript using IEEE 754 doubles for all numbers.

InfoQ: Are there any plans to add support for other languages?

Ficarra: Of course! We have a Java implementation that we use internally here at Shape Security. It's not quite ready to be open-sourced yet, but if we hear there's a demand, we'll expedite that process. And I'm more than willing to help out if the community wants to start an implementation for another ecosystem. I think a Rust implementation would be exciting!

I'd also like to add that SuperPack is still very young, and if readers have suggestions for how it can be improved, we would love to hear them. Simply open an issue on the specification's issue tracker. We hope to have future versions of SuperPack that are even better!

Currently, SuperPack comes with a JavaScript transcoder, but others can be created starting from this one. SuperPack has been open sourced with a very permissive license.

Community comments

  • Light on details

    by Cameron Purdy,

    Comparing to BSON isn't fair at all, as BSON is one of the stupidest designs in the history of computer science. It's like comparing race times against a person in a coma.

    What is this new format for? Is it for JSON data? You should be more specific about what the problem target is, i.e. what this is a solution for. There are a zillion binary serialization formats, but each (BSON excepted) has its sweet spot.

    Peace,

    Cameron.

  • Re: Light on details

    by Pete Haidinyak,

    :-)

  • Re: Light on details

    by Abel Avram,

    Sorry for being late on answering this. I was away.

    I did not notice any specific application for SuperPack. It can be used anywhere one wants a schemaless binary format: services, microservices, client-server, etc.

    And, yes, there are many other formatting options, but I wrote about it due to the smaller size of the payload, which is important.

  • Re: Light on details

    by Michael Ficarra,

    Cameron, if you would like to know more about the specific goals of the SuperPack format to compare them to the goals of the other comparable formats, you can see the goals listed in the introduction of the SuperPack specification. I have copied them below for you:

    SuperPack is designed to


    • achieve a small encoded payload size
    • have a reasonably small transcoder
    • encode data of an (a priori) unknown schema
    • transcode arbitrary data types without consulting the SuperPack authors through extension
    • operate in an environment without access to a lossless data compression algorithm

    • Comparison to MessagePack

      by Oisín Mac Fhearaí,

      This scheme seems close to MessagePack, which has similar goals, so I was surprised to see it wasn't mentioned here. However, I see on the SuperPack page that a much broader comparison was done, and that it actually beats MessagePack on the test file by a large margin on the *uncompressed* output. Strangely, it loses all of that advantage when the output is gzipped, and it seems that gzipped JSON easily beats both MessagePack and SuperPack.
