Serde-Brief Binary Format

The format is modeled on JSON, but it is binary and adapted to fit serde's data model.

Stability

The format is not considered stable as of yet.

Self-Describing Format

The format includes information on the structure of the data.

Advantages over non-self-describing formats:

  • There is no need for a schema to parse any given data.
  • Easy to provide backwards/forwards compatibility of data formats, as it is possible to add new fields.
  • Type compatibility can be checked.

Disadvantages over non-self-describing formats:

  • Larger binary representation.
  • Additional parsing / overhead.

Defined Types

Every value in Serde-Brief is prepended with a byte detailing its type. The Serde-Brief format currently contains these types:

| Type | Description | Byte value |
| --- | --- | --- |
| Null | No value. | 0 |
| BooleanFalse | Boolean with value false. | 1 |
| BooleanTrue | Boolean with value true. | 2 |
| UnsignedInt | Unsigned integer. The following bytes are the value in "VarInt" encoding (see below). | 3 |
| SignedInt | Signed integer. The following bytes are the value in "VarInt" encoding (see below). | 4 |
| Float16 | Float with 16-bit precision (not yet used/supported). | 5 |
| Float32 | Float with 32-bit precision. The next 4 bytes are the value (little-endian). | 6 |
| Float64 | Float with 64-bit precision. The next 8 bytes are the value (little-endian). | 7 |
| Float128 | Float with 128-bit precision (not yet used/supported). | 8 |
| Bytes | Raw bytes. The following bytes are the length of the byte sequence in VarInt encoding (must fit into usize). After that come the raw bytes of the given length. | 10 |
| String | UTF-8 string. The following bytes are the length of the byte sequence in VarInt encoding (must fit into usize). After that come the string's raw bytes of the given length. | 11 |
| SeqStart | A sequence of any number of values of any type. There is no specified length. The following bytes are the sequence's values. The end of the sequence is recognized by the SeqEnd type. | 15 |
| SeqEnd | The end of a sequence. | 16 |
| MapStart | A map of any number of key-value pairs of any types. There is no specified length. The following bytes are the map's keys and values. The end of the map is recognized by the MapEnd type. | 17 |
| MapEnd | The end of a map. | 18 |
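The float rows of the table can be illustrated with a minimal sketch that uses only the standard library. The function and constant names are hypothetical and are not taken from the crate; only the type bytes (6 and 7) and the little-endian layout come from the table.

```rust
// Type bytes from the table above; the names are illustrative only.
const FLOAT32: u8 = 6;
const FLOAT64: u8 = 7;

/// Encodes an f32 as the type byte followed by 4 little-endian bytes.
fn encode_f32(value: f32) -> Vec<u8> {
    let mut out = vec![FLOAT32];
    out.extend_from_slice(&value.to_le_bytes());
    out
}

/// Encodes an f64 as the type byte followed by 8 little-endian bytes.
fn encode_f64(value: f64) -> Vec<u8> {
    let mut out = vec![FLOAT64];
    out.extend_from_slice(&value.to_le_bytes());
    out
}

/// Decodes a Float32 value from the start of `input`.
/// Returns None if the type byte does not match or the input is too short.
fn decode_f32(input: &[u8]) -> Option<f32> {
    if input.len() < 5 || input[0] != FLOAT32 {
        return None;
    }
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&input[1..5]);
    Some(f32::from_le_bytes(raw))
}
```

For instance, under these assumptions `encode_f32(1.0)` yields the type byte 6 followed by the four bytes of the IEEE 754 bit pattern 0x3F80_0000 in little-endian order.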

Examples

  • [0]: null value
  • [1]: false
  • [2]: true
  • [3, 0]: 0
  • [4, 1]: -1
  • [10, 0]: byte sequence of length 0
  • [10, 1, 5]: byte sequence of length 1 containing a byte with value 5
  • [15, 16]: empty sequence
  • [15, 0, 1, 16]: sequence with 2 values: null and false
  • [17, 18]: empty map
  • [17, 3, 0, 2, 18]: map with 1 key-value pair: 0 -> true
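The examples above can be reproduced with a small hand-written encoder. This is a sketch under stated assumptions: the `Value` enum and the `encode`/`to_bytes` functions are hypothetical helpers for illustration, not the crate's API. Only the type bytes and the byte layouts come from this document.

```rust
// Type bytes from the "Defined Types" table.
const NULL: u8 = 0;
const BOOLEAN_FALSE: u8 = 1;
const BOOLEAN_TRUE: u8 = 2;
const UNSIGNED_INT: u8 = 3;
const SIGNED_INT: u8 = 4;
const BYTES: u8 = 10;
const SEQ_START: u8 = 15;
const SEQ_END: u8 = 16;
const MAP_START: u8 = 17;
const MAP_END: u8 = 18;

/// Hypothetical value tree used only for this illustration.
enum Value {
    Null,
    Bool(bool),
    UInt(u64),
    Int(i64),
    Bytes(Vec<u8>),
    Seq(Vec<Value>),
    Map(Vec<(Value, Value)>),
}

/// VarInt: 7 value bits per byte, least-significant group first,
/// most significant bit set on every byte except the last.
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    while n >= 0x80 {
        out.push(((n & 0x7F) as u8) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

fn encode(value: &Value, out: &mut Vec<u8>) {
    match value {
        Value::Null => out.push(NULL),
        Value::Bool(false) => out.push(BOOLEAN_FALSE),
        Value::Bool(true) => out.push(BOOLEAN_TRUE),
        Value::UInt(n) => {
            out.push(UNSIGNED_INT);
            write_varint(*n, out);
        }
        Value::Int(n) => {
            out.push(SIGNED_INT);
            // ZigZag first, so small negative numbers stay small.
            write_varint(((*n << 1) ^ (*n >> 63)) as u64, out);
        }
        Value::Bytes(bytes) => {
            out.push(BYTES);
            write_varint(bytes.len() as u64, out);
            out.extend_from_slice(bytes);
        }
        Value::Seq(items) => {
            out.push(SEQ_START);
            for item in items {
                encode(item, out);
            }
            out.push(SEQ_END);
        }
        Value::Map(entries) => {
            out.push(MAP_START);
            for (key, val) in entries {
                encode(key, out);
                encode(val, out);
            }
            out.push(MAP_END);
        }
    }
}

fn to_bytes(value: &Value) -> Vec<u8> {
    let mut out = Vec::new();
    encode(value, &mut out);
    out
}
```

With these helpers, `to_bytes(&Value::Seq(vec![Value::Null, Value::Bool(false)]))` produces `[15, 0, 1, 16]`, matching the sequence example in the list.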

VarInt Encoding

All integers are encoded in this format. This allows the same format to be used for all integers, regardless of size, and saves space for small integers. The format is identical to postcard's VarInt encoding. Also see Wikipedia's article on VLQ.

For every byte, the most significant bit determines whether this is the last byte of the number. For example, 0x83/0b1000_0011 will result in another byte being read for the current number. 0x73/0b0111_0011 will be considered the last byte. Every byte's lower 7 bits are used to store the actual value.

Unsigned integers are encoded least-significant bits first. For example, 0x017F/0b0000_0001_0111_1111 is encoded as 0xFF/0b1111_1111, 0x02/0b0000_0010. Further explanation: The least significant 7 bits are 111_1111. Since another byte is needed to store the rest of the number's bits, the 8th bit is 1, too. Therefore, the first byte is 0xFF. The next 7 bits of the number are 000_0010. No more bytes are needed after this one, as the value fits into 14 bits, so the 8th bit is 0. The encoded byte is 0x02.
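The unsigned encoding can be sketched in a few lines. The function name `encode_varint` is an assumption for illustration and does not refer to the crate's internals.

```rust
/// Encodes an unsigned integer as a VarInt: 7 value bits per byte,
/// least-significant group first. The most significant bit of a byte
/// is set when another byte follows.
fn encode_varint(mut value: u128) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let low = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            // Last byte: continuation bit stays clear.
            out.push(low);
            return out;
        }
        // More bits remain: set the continuation bit.
        out.push(low | 0x80);
    }
}
```

Calling `encode_varint(0x017F)` walks through exactly the steps described above and returns `[0xFF, 0x02]`.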

Signed integers would blow up this encoding, since -1 is 0xFFFF_FFFF_FFFF_FFFF in two's complement as a 64-bit integer and would therefore always need the maximum number of bytes. Therefore, signed integers are ZigZag-encoded first. The sign ends up in the lowest bit of the first byte. -1 is encoded as 0b0000_0001, and 1 is encoded as 0b0000_0010.
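A minimal sketch of the ZigZag mapping for 64-bit integers follows. The function names are illustrative; the bit manipulation is the commonly used ZigZag formulation and is assumed to match what the format does before VarInt encoding.

```rust
/// Maps a signed integer to an unsigned one so that values close to zero
/// stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fn zigzag_encode(n: i64) -> u64 {
    // `n >> 63` is an arithmetic shift: all ones for negatives, zero otherwise.
    ((n << 1) ^ (n >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}
```

The resulting unsigned value is then VarInt-encoded like any other unsigned integer, so -1 occupies a single byte.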

Maximum Length

There is no length limit on a number's encoding in the format itself. In practice, however, serde supports integers of up to 128 bits, and deserialization fails on any number larger than the expected type. So reading a u8 fails when there are more than 2 bytes or more than 8 value bits. A 128-bit value never exceeds 19 bytes. Other implementations would, in theory, be allowed to encode arbitrarily large numbers in any number of bytes.
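The two failure conditions for a u8 can be sketched as follows. `VarIntError`, `decode_u8` and `max_varint_bytes` are hypothetical names introduced for this example; the crate's actual error types and internals may differ.

```rust
#[derive(Debug, PartialEq)]
enum VarIntError {
    /// Input ended before a byte without the continuation bit was seen.
    UnexpectedEnd,
    /// More bytes than the target type can ever need.
    TooLong,
    /// The value bits do not fit into the target type.
    Overflow,
}

/// Maximum number of VarInt bytes for a type of the given bit width: ceil(bits / 7).
fn max_varint_bytes(bits: u32) -> u32 {
    (bits + 6) / 7
}

/// Decodes a u8 and returns the value together with the number of bytes read.
fn decode_u8(input: &[u8]) -> Result<(u8, usize), VarIntError> {
    const MAX_BYTES: usize = 2; // ceil(8 / 7)
    let mut value: u16 = 0;
    for (i, &byte) in input.iter().enumerate() {
        if i >= MAX_BYTES {
            return Err(VarIntError::TooLong);
        }
        value |= u16::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            if value > u16::from(u8::MAX) {
                return Err(VarIntError::Overflow);
            }
            return Ok((value as u8, i + 1));
        }
    }
    Err(VarIntError::UnexpectedEnd)
}
```

Here `[0xFF, 0x01]` decodes to 255, `[0x80, 0x02]` carries 9 value bits and is rejected, and a third byte is rejected regardless of its content.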

Canonicalization

The encoding allows numbers to be padded with any number of zero bits, e.g. by a chain of 0x80 bytes. The number 0 could be represented as 0x80, 0x80, 0x80, 0x00: four bytes, despite the value being 0. The serializer always outputs numbers with the lowest number of bytes. However, the deserializer accepts representations with additional padding up to the maximum number of bits of the expected type.
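A sketch of this asymmetry for a u32 follows: the decoder tolerates padding up to 5 bytes, while the encoder always emits the shortest form. All function names are assumptions for illustration, and `is_canonical` is a hypothetical helper that the document does not describe.

```rust
/// Decodes a u32, accepting padded encodings of up to 5 bytes (ceil(32 / 7)).
/// Returns the value and the number of bytes read.
fn decode_u32(input: &[u8]) -> Option<(u32, usize)> {
    const MAX_BYTES: usize = 5;
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate().take(MAX_BYTES) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            if value > u64::from(u32::MAX) {
                return None; // more than 32 value bits
            }
            return Some((value as u32, i + 1));
        }
    }
    None // truncated, or longer than MAX_BYTES
}

/// Encodes a u32 with the minimal number of bytes.
fn encode_u32(mut value: u32) -> Vec<u8> {
    let mut out = Vec::new();
    while value >= 0x80 {
        out.push((value & 0x7F) as u8 | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
    out
}

/// Checks whether the VarInt at the start of `input` is the shortest
/// possible encoding of its value.
fn is_canonical(input: &[u8]) -> bool {
    match decode_u32(input) {
        Some((value, len)) => encode_u32(value).as_slice() == &input[..len],
        None => false,
    }
}
```

Under these assumptions, `[0x80, 0x80, 0x80, 0x00]` decodes to 0 but is not canonical, because re-encoding 0 gives the single byte `0x00`.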

Architecture-Specific Sizes

The isize and usize types are as wide as pointers on the specific system. This means that the maximum/minimum value can differ across systems. The VarInt encoding works the same way everywhere, so different systems can communicate without any issues, as long as the value fits into the pointer width of the smaller architecture. Otherwise, parsing fails on the smaller architecture.
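The failure mode on the smaller architecture amounts to a checked narrowing conversion after decoding. The sketch below assumes the value was already decoded into a u64; `to_usize` and `to_usize_on_32_bit` are hypothetical helpers, and the second one uses u32 as a stand-in for usize on a 32-bit target so the behavior can be observed on any machine.

```rust
use std::convert::TryFrom;

/// Narrows a decoded integer to the local pointer width.
/// Returns None if the value does not fit on this system.
fn to_usize(decoded: u64) -> Option<usize> {
    usize::try_from(decoded).ok()
}

/// Same check with u32 standing in for usize on a 32-bit target.
fn to_usize_on_32_bit(decoded: u64) -> Option<u32> {
    u32::try_from(decoded).ok()
}
```

A value such as 2^32 is accepted by a 64-bit reader but rejected by the 32-bit stand-in, even though both read the same bytes.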

Sequences and Maps

Sequences/arrays and maps do not specify their length, so any number of values can follow. Their end is denoted by a value of a special end type.

Values can have any type, so even maps can consist of arbitrarily complex keys and values. The key itself could be a structure 2 layers deep. The type of every value can differ.
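Because containers carry no length, a reader has to walk the nested values until it meets the matching end type. The following sketch computes how many bytes one complete value occupies. It is not the crate's parser: the function names are hypothetical, integers and lengths are capped at 64 bits for simplicity, and string contents are not checked for valid UTF-8.

```rust
use std::convert::TryFrom;

/// Reads a VarInt of at most 64 bits; returns the value and the bytes consumed.
fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(10) {
        let bits = u64::from(byte & 0x7F);
        if i == 9 && bits > 1 {
            return None; // would overflow 64 bits
        }
        value |= bits << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Length of the container body including its end byte.
/// `pairs` is true for maps, where every key must be followed by a value.
fn container_len(input: &[u8], end: u8, pairs: bool) -> Option<usize> {
    let mut pos = 0;
    loop {
        if *input.get(pos)? == end {
            return Some(pos + 1);
        }
        pos += value_len(&input[pos..])?;
        if pairs {
            pos += value_len(&input[pos..])?;
        }
    }
}

/// Number of bytes occupied by the first complete value in `input`.
fn value_len(input: &[u8]) -> Option<usize> {
    let (&tag, rest) = input.split_first()?;
    let body = match tag {
        0 | 1 | 2 => 0,                 // Null, BooleanFalse, BooleanTrue
        3 | 4 => read_varint(rest)?.1,  // UnsignedInt, SignedInt
        6 | 7 => {
            // Float32, Float64
            let width = if tag == 6 { 4 } else { 8 };
            if rest.len() < width {
                return None;
            }
            width
        }
        10 | 11 => {
            // Bytes, String: VarInt length, then the payload
            let (len, used) = read_varint(rest)?;
            let total = used.checked_add(usize::try_from(len).ok()?)?;
            if total > rest.len() {
                return None;
            }
            total
        }
        15 => container_len(rest, 16, false)?, // SeqStart .. SeqEnd
        17 => container_len(rest, 18, true)?,  // MapStart .. MapEnd
        _ => return None, // end types out of place, or unsupported types
    };
    Some(1 + body)
}
```

For example, `value_len(&[17, 15, 0, 16, 2, 18])` returns `Some(6)`: a map whose single key is itself a sequence containing null, mapped to true.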

Mapping of Rust Types to Encoded Data

Encoding/serialization and decoding/deserialization happen via serde, so the format follows the serde data model. Please familiarize yourself with its concepts to fully understand the following. The rest of this section describes how Rust types are mapped to Serde-Brief format types.

There are two modes of the format. The first and default encodes structs as maps with keys being strings of the fields' names. The second encodes structs as maps with keys being unsigned integers, where the value denotes the index/position in the struct. Similarly, the same happens for enums. Variants are encoded either as string or as unsigned integer denoting their index (NOT discriminant).

Note that (at least currently) the deserializer can parse data regardless of which encoding was used, unless it relies on features that do not work with index representation mode (e.g. internally tagged enums). The serializer however needs to know which format it needs to serialize to.

Advantages of the default (string representation):

  • Compatibility and robustness: adding or re-ordering fields works without issues.
  • Support of #[serde(rename)], internally tagged enums and any other serde feature. The index representation does NOT support renaming fields. It also cannot deserialize internally tagged enums. This is due to the way serde handles internally tagged enums. Externally or adjacently tagged enums DO work, as well as untagged enums. Please note however, that untagged enum variants can more easily be differentiated with named fields.
  • External parties can understand the data more easily with named fields.

Advantages of the index representation:

  • Smaller footprint: strings need more space in the encoding.
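To make the difference between the two representations concrete, the sketch below hand-encodes a small struct both ways. `Point`, `encode_named` and `encode_indexed` are hypothetical and only mimic the byte layout described in this document; in practice the crate's serializer produces these bytes via serde.

```rust
const UNSIGNED_INT: u8 = 3;
const STRING: u8 = 11;
const MAP_START: u8 = 17;
const MAP_END: u8 = 18;

/// Hypothetical example struct.
struct Point {
    x: u32,
    y: u32,
}

fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    while n >= 0x80 {
        out.push(((n & 0x7F) as u8) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

fn write_uint(n: u32, out: &mut Vec<u8>) {
    out.push(UNSIGNED_INT);
    write_varint(u64::from(n), out);
}

fn write_string(s: &str, out: &mut Vec<u8>) {
    out.push(STRING);
    write_varint(s.len() as u64, out);
    out.extend_from_slice(s.as_bytes());
}

/// String representation: keys are the field names.
fn encode_named(p: &Point) -> Vec<u8> {
    let mut out = vec![MAP_START];
    write_string("x", &mut out);
    write_uint(p.x, &mut out);
    write_string("y", &mut out);
    write_uint(p.y, &mut out);
    out.push(MAP_END);
    out
}

/// Index representation: keys are the field positions.
fn encode_indexed(p: &Point) -> Vec<u8> {
    let mut out = vec![MAP_START];
    write_uint(0, &mut out);
    write_uint(p.x, &mut out);
    write_uint(1, &mut out);
    write_uint(p.y, &mut out);
    out.push(MAP_END);
    out
}
```

With single-character field names the named form is 12 bytes and the indexed form 10 bytes. The gap grows with the length of the field names, since every name is repeated in each encoded struct.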

Serde Datatypes in Serde-Brief (String Representation)

The list of serde's types can be found in serde's data model documentation, along with how Rust types are mapped to serde's types.

| Serde Type | Brief Type | Description |
| --- | --- | --- |
| bool | BooleanFalse or BooleanTrue | Value is saved within the type. No additional value. |
| u8, u16, u32, u64, u128 | UnsignedInt | VarInt encoded. |
| i8, i16, i32, i64, i128 | SignedInt | ZigZag encoded and then VarInt encoded. |
| f32 | Float32 | 4 bytes containing the raw value (little-endian). |
| f64 | Float64 | 8 bytes containing the raw value (little-endian). |
| char | String | UTF-8 encoded and serialized as string. |
| string | String | First, the length (bytes, not chars) in VarInt encoding is given (unsigned). Then the raw bytes follow. The bytes must be a UTF-8 encoded string. |
| byte array | Bytes | First, the length in VarInt encoding is given (unsigned). Then the raw bytes follow. |
| sequence | SeqStart .. SeqEnd | SeqStart is the type for starting a sequence. Any number of values follow. A SeqEnd at the correct position will end the sequence. |
| map | MapStart .. MapEnd | MapStart is the type for starting a map. Any number of key-value pairs follow. The keys and values are not separated, they are differentiated by position. A MapEnd at the correct position will end the map. |
| option | Null or any other type. | None becomes the Null type. Any other value is directly encoded as its type. Note that Option<()> will always be Null and decoded as None. |
| tuple | SeqStart .. SeqEnd | Encoded as sequence. Information that the length is fixed is unused and not saved. |
| unit | Null | Always Null. |
| unit struct | Null | Struct names are not used. There is no value, similar to the unit type. |
| newtype struct | Any | Struct names are not used. Newtype structs (only one field) are encoded as their inner value (transparent encoding). |
| tuple struct | SeqStart .. SeqEnd | Struct names are not used. Therefore encoded just as a tuple (so as a sequence). |
| struct | MapStart .. MapEnd | Struct names are not used. Encoded as a map with keys being the field names and values being their encoded values. |
| unit variant | String | Enum names are not used. Variants without data are just the variant name as string. |
| newtype variant | MapStart, String, Any, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant name as string. The value is the encoded value. |
| tuple variant | MapStart, String, SeqStart .. SeqEnd, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant name as string. The value is a sequence of the encoded values. |
| struct variant | MapStart, String, MapStart .. MapEnd, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant name as string. The value is a map of the field names to their values. |
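The variant rows of the table can be illustrated with a hand-written encoder for a small enum. `Shape` and `encode_shape` are hypothetical; the sketch only mirrors the layout described in the table for the default (externally tagged) enum representation and assumes small values that fit into one VarInt byte.

```rust
const UNSIGNED_INT: u8 = 3;
const STRING: u8 = 11;
const MAP_START: u8 = 17;
const MAP_END: u8 = 18;

/// Hypothetical example enum with a unit, a newtype and a struct variant.
enum Shape {
    Empty,
    Circle(u8),
    Rect { w: u8, h: u8 },
}

/// Writes a string shorter than 128 bytes (length fits one VarInt byte).
fn write_string(s: &str, out: &mut Vec<u8>) {
    assert!(s.len() < 128, "sketch only handles short strings");
    out.push(STRING);
    out.push(s.len() as u8);
    out.extend_from_slice(s.as_bytes());
}

/// Writes an unsigned integer below 128 (fits one VarInt byte).
fn write_small_uint(n: u8, out: &mut Vec<u8>) {
    assert!(n < 128, "sketch only handles single-byte VarInts");
    out.push(UNSIGNED_INT);
    out.push(n);
}

fn encode_shape(shape: &Shape) -> Vec<u8> {
    let mut out = Vec::new();
    match shape {
        // Unit variant: just the variant name.
        Shape::Empty => write_string("Empty", &mut out),
        // Newtype variant: single-entry map from variant name to value.
        Shape::Circle(radius) => {
            out.push(MAP_START);
            write_string("Circle", &mut out);
            write_small_uint(*radius, &mut out);
            out.push(MAP_END);
        }
        // Struct variant: single-entry map from variant name to a map of fields.
        Shape::Rect { w, h } => {
            out.push(MAP_START);
            write_string("Rect", &mut out);
            out.push(MAP_START);
            write_string("w", &mut out);
            write_small_uint(*w, &mut out);
            write_string("h", &mut out);
            write_small_uint(*h, &mut out);
            out.push(MAP_END);
            out.push(MAP_END);
        }
    }
    out
}
```

Note the two consecutive MapEnd bytes at the end of a struct variant: the first closes the field map, the second closes the single-entry variant map.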

Serde Datatypes in Serde-Brief (Index Representation)

The list of serde's types can be found in serde's data model documentation, along with how Rust types are mapped to serde's types.

The index representation does not work with internally tagged enums (#[serde(tag = "t")]). Externally or adjacently tagged enums do work (nothing or #[serde(tag = "type", content = "c")]).

| Serde Type | Brief Type | Description |
| --- | --- | --- |
| bool | BooleanFalse or BooleanTrue | Value is saved within the type. No additional value. |
| u8, u16, u32, u64, u128 | UnsignedInt | VarInt encoded. |
| i8, i16, i32, i64, i128 | SignedInt | ZigZag encoded and then VarInt encoded. |
| f32 | Float32 | 4 bytes containing the raw value (little-endian). |
| f64 | Float64 | 8 bytes containing the raw value (little-endian). |
| char | String | UTF-8 encoded and serialized as string. |
| string | String | First, the length (bytes, not chars) in VarInt encoding is given (unsigned). Then the raw bytes follow. The bytes must be a UTF-8 encoded string. |
| byte array | Bytes | First, the length in VarInt encoding is given (unsigned). Then the raw bytes follow. |
| sequence | SeqStart .. SeqEnd | SeqStart is the type for starting a sequence. Any number of values follow. A SeqEnd at the correct position will end the sequence. |
| map | MapStart .. MapEnd | MapStart is the type for starting a map. Any number of key-value pairs follow. The keys and values are not separated, they are differentiated by position. A MapEnd at the correct position will end the map. |
| option | Null or any other type. | None becomes the Null type. Any other value is directly encoded as its type. Note that Option<()> will always be Null and decoded as None. |
| tuple | SeqStart .. SeqEnd | Encoded as sequence. Information that the length is fixed is unused and not saved. |
| unit | Null | Always Null. |
| unit struct | Null | Struct names are not used. There is no value, similar to the unit type. |
| newtype struct | Any | Struct names are not used. Newtype structs (only one field) are encoded as their inner value (transparent encoding). |
| tuple struct | SeqStart .. SeqEnd | Struct names are not used. Therefore encoded just as a tuple (so as a sequence). |
| struct | MapStart .. MapEnd | Struct names are not used. Encoded as a map with keys being the field indices (u32) and values being their encoded values. |
| unit variant | UnsignedInt | Enum names are not used. Variants without data are just the variant index as unsigned integer (u32). |
| newtype variant | MapStart, UnsignedInt, Any, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant index as unsigned integer (u32). The value is the encoded value. |
| tuple variant | MapStart, UnsignedInt, SeqStart .. SeqEnd, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant index as unsigned integer (u32). The value is a sequence of the encoded values. |
| struct variant | MapStart, UnsignedInt, MapStart .. MapEnd, MapEnd | Enum names are not used. Variants with values are a map with a single key-value pair. The key is the variant index as unsigned integer (u32). The value is a map of the field indices (u32) to their values. |
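The index representation of enum variants can be sketched the same way. `Shape` and `encode_shape_indexed` are hypothetical names; variant and field indices follow declaration order (not discriminants), and the sketch assumes all numbers fit into one VarInt byte.

```rust
const UNSIGNED_INT: u8 = 3;
const MAP_START: u8 = 17;
const MAP_END: u8 = 18;

/// Hypothetical example enum; variant indices are 0, 1 and 2 in declaration order.
enum Shape {
    Empty,
    Circle(u8),
    Rect { w: u8, h: u8 },
}

/// Writes an unsigned integer below 128 (fits one VarInt byte).
fn write_small_uint(n: u8, out: &mut Vec<u8>) {
    assert!(n < 128, "sketch only handles single-byte VarInts");
    out.push(UNSIGNED_INT);
    out.push(n);
}

fn encode_shape_indexed(shape: &Shape) -> Vec<u8> {
    let mut out = Vec::new();
    match shape {
        // Unit variant: just the variant index.
        Shape::Empty => write_small_uint(0, &mut out),
        // Newtype variant: single-entry map from variant index to value.
        Shape::Circle(radius) => {
            out.push(MAP_START);
            write_small_uint(1, &mut out);
            write_small_uint(*radius, &mut out);
            out.push(MAP_END);
        }
        // Struct variant: variant index mapped to a map of field index -> value.
        Shape::Rect { w, h } => {
            out.push(MAP_START);
            write_small_uint(2, &mut out);
            out.push(MAP_START);
            write_small_uint(0, &mut out);
            write_small_uint(*w, &mut out);
            write_small_uint(1, &mut out);
            write_small_uint(*h, &mut out);
            out.push(MAP_END);
            out.push(MAP_END);
        }
    }
    out
}
```

Under these assumptions `Shape::Rect { w: 4, h: 5 }` takes 14 bytes, since no variant or field names are stored.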