Shape Security has open sourced a new schemaless binary serialization format called SuperPack.
SuperPack is designed to reduce payload size. Compared with several other schemaless formats, SuperPack produces the smallest uncompressed payload for a given 4.48 KB sample message, according to Shape Security:
| | Original Message | YAML | BSON | JSON | Sereal | SuperPack |
|---|---|---|---|---|---|---|
| Uncompressed | 4.769 B | 134% | 111% | 69% | 40% | 28% |
| Compressed | 4.769 B | 14% | 20% | 12% | 16% | 13% |
YAML and BSON are verbose and produce payloads larger than the original message (134% and 111%, respectively). JSON is much smaller than YAML but still considerably larger than SuperPack because of its text encoding. After gzip compression the picture changes: YAML, JSON and SuperPack end up close to one another at 12-14% of the original message.
One of the main advantages of an encoding format such as SuperPack is the ability to communicate with clients without exchanging a message schema beforehand, because data type information is included in the payload. SuperPack has 36 predefined data types, including the usual true, false, uint16, uint32 and float32, as well as less common ones such as uint6, nint4 and array5. The latter are meant to represent values that have a high probability of being encountered in messages.
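To make the idea of small-value types concrete, here is a minimal Python sketch of an encoder that folds common values into a one-byte tag. The tag numbers and the `encode` function are assumptions made for illustration; they are not the tag assignments of the SuperPack specification, and the sketch covers only booleans, non-negative integers, and short arrays.

```python
import struct

# Hypothetical tag layout for illustration only; the real SuperPack
# specification assigns its own tag values.
UINT6_BASE = 0x00    # tags 0x00-0x3F carry the integers 0..63 directly
ARRAY5_BASE = 0x40   # tags 0x40-0x5F carry an array length of 0..31
TAG_FALSE = 0x60
TAG_TRUE = 0x61
TAG_UINT16 = 0x62    # followed by 2 bytes, big-endian
TAG_UINT32 = 0x63    # followed by 4 bytes, big-endian


def encode(value):
    """Encode booleans, non-negative ints, and short lists into bytes."""
    # Booleans are checked first because bool is a subclass of int.
    if value is True:
        return bytes([TAG_TRUE])
    if value is False:
        return bytes([TAG_FALSE])
    if isinstance(value, int):
        if 0 <= value < 64:
            # The value itself lives inside the tag byte.
            return bytes([UINT6_BASE + value])
        if 0 <= value < 2**16:
            return bytes([TAG_UINT16]) + struct.pack(">H", value)
        if 0 <= value < 2**32:
            return bytes([TAG_UINT32]) + struct.pack(">I", value)
        raise ValueError("integer out of range for this sketch")
    if isinstance(value, list):
        if len(value) >= 32:
            raise ValueError("long arrays are not covered by this sketch")
        # The length is stored in the tag; the entries follow.
        body = b"".join(encode(item) for item in value)
        return bytes([ARRAY5_BASE + len(value)]) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

With this layout, `encode([12, 24, True])` takes four bytes: one tag byte that also holds the array length, plus one byte per entry.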
SuperPack also includes types for arrays, strings, and maps. One of the types, extension, enables users to add new types. SuperPack includes two optional optimizations that can reduce the payload in certain cases: the repeated string optimization and the repeated keyset optimization.
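The repeated string optimization can be pictured as moving strings that occur more than once into a table and replacing each occurrence with a small reference. The Python sketch below shows that idea only at the data-structure level; `StrRef` and `dedupe_strings` are hypothetical names, and the actual SuperPack wire format for the optimization is not reproduced here.

```python
from collections import Counter


class StrRef:
    """Hypothetical marker: a reference to an entry in the string table."""

    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return isinstance(other, StrRef) and other.index == self.index

    def __repr__(self):
        return f"StrRef({self.index})"


def _count_strings(value, counts):
    """Count every string value in nested lists and dicts (keys excluded)."""
    if isinstance(value, str):
        counts[value] += 1
    elif isinstance(value, list):
        for item in value:
            _count_strings(item, counts)
    elif isinstance(value, dict):
        for item in value.values():
            _count_strings(item, counts)


def _replace(value, index_of):
    """Swap strings found in the table for StrRef markers."""
    if isinstance(value, str):
        return StrRef(index_of[value]) if value in index_of else value
    if isinstance(value, list):
        return [_replace(item, index_of) for item in value]
    if isinstance(value, dict):
        return {key: _replace(item, index_of) for key, item in value.items()}
    return value


def dedupe_strings(value):
    """Return (string_table, value with repeated strings replaced by StrRef)."""
    counts = Counter()
    _count_strings(value, counts)
    # Only strings seen more than once are worth moving into the table.
    table = [s for s, n in counts.items() if n > 1]
    index_of = {s: i for i, s in enumerate(table)}
    return table, _replace(value, index_of)
```

A string that appears once stays inline, since a table entry plus a reference would cost more than the string itself.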
We asked Michael Ficarra, a Research Engineer and FOSS Coordinator at Shape Security, to provide us with more details about SuperPack.
InfoQ: What exactly did you do differently than other schemaless formats to get a smaller encoded size?
Michael Ficarra: The philosophy behind SuperPack is that, even if you cannot predict your data’s schema in advance, the data likely has structures or values that are repeated many times. As an example, say you have a data structure named "cats" which maps people to their cats. Instead of encoding, for every cat, the fact that it has a name and a birthday and favorite food, we can encode this information once and reference it later for very efficient protobuf-style packing of the values.
Also, some values are just more common than others and should have efficient representations. If you look into the details of the format, you'll see that all values start with a one-byte indicator of the value's type that we call the "type tag". We've reserved a number of regions of the type tag domain so that all or part of the value may be encoded within the tag itself. As a simple example, there are two boolean type tags: one for the value true and one for the value false. Similarly, there are 64 "uint6" type tags, allowing us to represent each of the numbers 0 to 63 in a single byte, and arrays (which must encode both their length and their entries) with fewer than 32 entries may have their length encoded in the tag. Relating back to the earlier example, cats usually have fewer than 64 whiskers, and most cat owners own fewer than 32 cats, so these values will be stored very efficiently.
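The keyset idea from the cats example can be sketched in a few lines of Python: each distinct set of keys is stored once, and every record then carries only an index into that list plus its values. The function names and the in-memory representation are assumptions for illustration and do not reflect how SuperPack lays out keysets in its binary format.

```python
def pack_with_keysets(records):
    """Return (keysets, rows) where each row is (keyset_index, values).

    Each distinct tuple of keys is stored once in `keysets`; every record
    then carries only an index into that list plus its values in key order.
    Records with the same keys in a different order get separate keysets.
    """
    keysets = []     # list of key tuples, in order of first appearance
    index_of = {}    # key tuple -> position in `keysets`
    rows = []
    for record in records:
        keys = tuple(record)  # dict keys in insertion order
        if keys not in index_of:
            index_of[keys] = len(keysets)
            keysets.append(keys)
        rows.append((index_of[keys], [record[k] for k in keys]))
    return keysets, rows


def unpack_with_keysets(keysets, rows):
    """Inverse of pack_with_keysets: rebuild the list of dicts."""
    return [dict(zip(keysets[i], values)) for i, values in rows]
```

For a list of cats that all share the keys name, birthday, and food, `keysets` holds a single tuple and each row carries one index plus three values, so the key names are written once rather than once per cat.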
InfoQ: Have you compared SuperPack against a schema-driven binary format such as Protocol Buffers? Are the Protobuf payloads significantly smaller?
Ficarra: We have not done this kind of comparison. I imagine the Protobuf payloads would be smaller in most cases except when SuperPack's string deduplication is particularly effective. Whenever your needs allow you to use a schema-driven format, especially when you can pair it with a lossless data compression algorithm such as LZW or Deflate, you should take advantage of that.
InfoQ: How does SuperPack fare in terms of the time needed to encode and decode messages?
Ficarra: The encoding time will vary based on whether the encoder decides (or is instructed) to use the optional keyset and string deduplication optimizations. The implementation language may also have some inherent performance difficulties, such as JavaScript using IEEE 754 doubles for all numbers.
InfoQ: Are there any plans to add support for other languages?
Ficarra: Of course! We have a Java implementation that we use internally here at Shape Security. It's not quite ready to be open-sourced yet, but if we hear there's a demand, we'll expedite that process. And I'm more than willing to help out if the community wants to start an implementation for another ecosystem. I think a Rust implementation would be exciting!
I'd also like to add that SuperPack is still very young, and if readers have suggestions for how it can be improved, we would love to hear them. Simply open an issue on the specification's issue tracker. We hope to have future versions of SuperPack that are even better!
Currently, SuperPack comes with a JavaScript transcoder, but others can be created starting from this one. SuperPack has been open sourced with a very permissive license.