DPack

DPack is a very compact binary format for serializing data structures, designed for efficient, high-performance serialization/parsing, and optimized for web use. For common large data structures in applications, a dpack file is often much smaller than JSON (and much smaller than MsgPack as well), and can be parsed faster than JSON and other formats. dpack leverages structural reuse to reduce size, and uses a binary format for performance. dpack has several key features:

Uses internal referencing and reuse of structures, properties, values, and objects for remarkably compact serialization and fast parsing.
Defined as a valid Unicode character string, which allows for single-pass text decoding for faster and simpler decoding (particularly in the browser), support across older browsers, and ease of manipulation as a character string. It can also be encoded in UTF-8, UTF-16, or any ASCII-compatible encoding.
Supports a wide range of types including strings, decimal-based numbers, booleans, objects, arrays, dates, maps, sets, and user-provided classes/types.
Supports positionally mapped object properties for lazy evaluation of paths for faster access to data without parsing entire data structures (useful for storing, querying, and indexing data in databases).
Supports referencing of objects which can be used to reorder serialization and reuse objects.
Optimized to compress well with Huffman/Gzip encoding schemes.

This repository is for the specification of dpack. For dpack libraries:

JavaScript implementation

Specification

DPack is designed for ease of creating high-performance implementations, making it easy to use fast bitwise operators and memory-limited structures. While the description of dpack messages may appear somewhat complicated, the format is intended to be easily implemented, particularly with typed languages, to be flexible, and to provide protection against excessive memory consumption.

DPack can be described as four layers of well-defined parsing and transformation, to simplify the definition and implementation of DPack. The layers of reading a DPack file are:

Character decoding from binary data to character-based text.
Lexing of characters into tokens that define token type and accompanying numbers.
Parsing of tokens into rudimentary values and sequences.
Structuring of rudimentary values/structures into final data structures.

DPack is a binary data format, in the sense that structures are defined through byte-level tokens for machine parsing. But it is primarily specified as a character-based format; blocks of data can be entirely decoded using a character set decoding as the first layer of reading, and then the parsing rules operate on the decoded characters. dpack is optimized for, and should default to, UTF-8 encoding, but could be encoded in any character set that can encode Unicode characters 0-127. In addition, it can be encoded using additional characters for more efficient space usage when encoded in UTF-16. A DPack message should consist entirely of tokens and strings that are described by the tokens.
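As a rough illustration of the first layer, the sketch below decodes a byte buffer into characters with the standard TextDecoder API before any token lexing happens. The helper name decodeDpackBytes is hypothetical and not part of any dpack library; UTF-8 is assumed because the text above names it as the default.

```javascript
// Layer 1 sketch: turn raw bytes into a character string.
// `decodeDpackBytes` is a hypothetical helper name, not a dpack API.
function decodeDpackBytes(bytes) {
	// UTF-8 is assumed here, since the specification names it as the default
	return new TextDecoder('utf-8').decode(bytes);
}

// Bytes 0x21 0x44 are the ASCII characters "!D"
const text = decodeDpackBytes(new Uint8Array([0x21, 0x44]));
console.log(text); // "!D"
```

All later layers can then work on the resulting string rather than on the raw bytes.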

Token Lexing

The basic entity in a dpack message is a token. A token may consist of 1 to 8 characters. The meaning of the characters in a token is determined by their Unicode code point numbers, and further defined by bit positions. The token character or characters are used to determine a type number from 0 to 3 (or 7, in one special case described below), and an accompanying number, which is an unsigned integer below 2^46. A dpack serialized data structure consists entirely of tokens and strings whose lengths are specified by tokens, based on the parsing rules.

By default, every token byte has an initial 0 bit (when the 0 - 127 byte range is used, which keeps it compatible with most character sets). The second bit is always a "stop" bit: a stop bit of one means this is the last byte, and a zero means additional bytes are part of the token. In the first byte, the next two bits (3rd and 4th) give the type of the token, 0 - 3. The remaining bits, namely the last four bits of the first byte and the last six bits of each following byte (up to and including a byte with a stop bit), serialize the accompanying number. The number is big endian: the first bits are most significant, and later bytes/bits are less significant. However, if the type is 3 and the stop bit is 0, this is a special case where the type is changed to 7 and no further bytes are read (only the last four bits from the first byte are used to compute the number, as if the stop bit were set). For the sake of this specification, we represent tokens as <type, number>.

Examples

Character "R": Code point 82. Binary representation:

0 1 0 1 0 0 1 0 - Stop bit is set (1), type (0 1) is 1, and accompanying number (0 0 1 0) is 2, producing a token <1, 2>.

Characters "!D": Code points 33 and 68. Binary representation:

0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 - Stop bit is not set (0), type (1 0) is 2, bits (0 0 0 1) will be combined with next byte. Next byte, stop bit is set. Combined bits (0 0 0 1 0 0 0 1 0 0) make 68 the accompanying number, producing token <2, 68>.

Character "4": Code point 52. Binary representation:

0 0 1 1 0 1 0 0 - Stop bit is not set (0), type (1 1) is 3. However, this is the special case of type 3 and no stop bit so it is treated as type 7 (stopped), and accompanying number (0 1 0 0) is 4, producing token <7, 4>.

Limits

There may be up to 8 bytes, which accommodates up to 46 bits for the accompanying number (4 bits from the first byte plus 6 bits from each of the 7 following bytes); therefore the accompanying number must be an unsigned integer under 2^46.
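To make the bit layout and the size limit concrete, here is a minimal encoder sketch that goes the other way, from a <type, number> pair to characters. The function name encodeToken is hypothetical. The sketch also assumes that type 3 can only carry numbers 0-15, which is how it reads the special-case rule (a first byte with type bits 1 1 and no stop bit is always a type 7 token). It uses arithmetic rather than shifts so that numbers above 32 bits stay exact in JavaScript.

```javascript
// Hypothetical encoder sketch: turn a <type, number> token into characters.
function encodeToken(type, number) {
	if (!Number.isInteger(number) || number < 0 || number >= 2 ** 46) {
		throw new RangeError('number must be an unsigned integer below 2^46');
	}
	if (type === 3 || type === 7) {
		// assumption: both share the type bits 1 1, so only 4 number bits are available
		if (number > 15) throw new RangeError('types 3 and 7 only carry 4 bits');
		// type 3 sets the stop bit (64); type 7 is "type 3 without the stop bit"
		return String.fromCharCode((type === 3 ? 64 : 0) | 48 | number);
	}
	if (!Number.isInteger(type) || type < 0 || type > 2) {
		throw new RangeError('unknown type');
	}
	if (number < 16) {
		// single character: stop bit, two type bits, four number bits
		return String.fromCharCode(64 | (type << 4) | number);
	}
	// count the 6-bit continuation characters needed after the first 4 bits
	let extra = 1;
	while (extra < 7 && number >= 2 ** (4 + 6 * extra)) extra++;
	// first character: no stop bit, type bits, most significant 4 bits
	let out = String.fromCharCode((type << 4) | Math.floor(number / 2 ** (6 * extra)));
	for (let i = extra - 1; i >= 0; i--) {
		const chunk = Math.floor(number / 2 ** (6 * i)) % 64;
		// the stop bit (64) marks the final character
		out += String.fromCharCode(i === 0 ? 64 | chunk : chunk);
	}
	return out;
}

console.log(encodeToken(1, 2)); // "R"
console.log(encodeToken(2, 68)); // "!D"
console.log(encodeToken(7, 4)); // "4"
```

The three calls reproduce the tokens from the Examples section, and the largest allowed number, 2^46 - 1, comes out as exactly 8 characters.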

Some types use the accompanying number to specify the length of a string immediately following the token. When a string is to be read, the number specifies the number of characters to be included in the string, after which the next block can be immediately read. The length of the string is measured not in bytes but in Basic Multilingual Plane characters. Any supplementary-plane character should be counted as two characters (a surrogate pair). In other words, a string length is defined by its UTF-16 encoding (though it may be serialized in UTF-8 in dpack).
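The difference between the character count and the byte count can be seen in a short sketch. JavaScript strings already report their length in UTF-16 code units, so the length rule maps directly onto String.prototype.length; the helper name stringTokenLength is hypothetical.

```javascript
// Sketch: DPack string lengths count UTF-16 code units, not bytes.
// `stringTokenLength` is a hypothetical helper name.
function stringTokenLength(str) {
	// JavaScript's String.prototype.length already counts UTF-16 code units
	return str.length;
}

const sample = 'a\u{1F600}'; // "a" plus one supplementary-plane character
console.log(stringTokenLength(sample)); // 3: one BMP character + one surrogate pair
console.log(new TextEncoder().encode(sample).length); // 5 bytes in UTF-8
```

In languages whose strings are not UTF-16 internally, the count would need to be computed explicitly, adding two for every code point above U+FFFF.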

Example Code

This format is designed to be easily parsed with very simple and efficient code. Using standard C/C++/Java/C#/JS/TS syntax, type and number pairs can be lexed efficiently from characters/bytes with just a few lines of code:

token = dpackSource[position++]; // read first character/byte
if (token >= 48) { // single byte token (48-63 yields the special type 7)
	type = (token >>> 4) ^ 4; // shift and xor gives us the type
	number = token & 15; // last 4 bits is the number
} else { // multiple byte token
	type = (token >>> 4) & 11; // shift and omit the stop bit (bit 3)
	number = token & 15; // last 4 bits is high bits of number
	do {
		token = dpackSource[position++]; // get next byte
		number = (number << 6) + (token & 63); // progressively shift the number for big endian numbers
	} while (token < 64); // until a stop bit
}
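For a runnable check, the lexing fragment above can be wrapped into a function and tried against the three tokens from the Examples section. This is only a sketch: the name lexToken and its return shape are hypothetical, it reads from a character string rather than a byte array, and it only handles code points 0-127. It multiplies by 64 instead of shifting, because the << operator in JavaScript truncates to 32 bits and the accompanying number can need up to 46.

```javascript
// Sketch: the lexing fragment as a function over a character string.
// `lexToken` and its return shape are hypothetical, not a dpack library API.
function lexToken(source, position) {
	let token = source.charCodeAt(position++); // read first character
	let type, number;
	if (token >= 48) { // single-character token
		type = (token >>> 4) ^ 4; // 64..127 gives types 0-3, 48..63 gives type 7
		number = token & 15; // last 4 bits are the number
	} else { // multi-character token
		type = token >>> 4; // stop bit is 0 here, so this is already 0-2
		number = token & 15; // last 4 bits are the high bits of the number
		do {
			token = source.charCodeAt(position++); // next character
			// multiply instead of shifting: << would truncate to 32 bits
			number = number * 64 + (token & 63);
		} while (token < 64); // continue until a stop bit
	}
	return { type, number, position };
}

console.log(lexToken('R', 0)); // { type: 1, number: 2, position: 1 }
console.log(lexToken('!D', 0)); // { type: 2, number: 68, position: 2 }
console.log(lexToken('4', 0)); // { type: 7, number: 4, position: 1 }
```

The returned position lets a caller continue lexing at the character after the token.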

Alternate Encodings

Tokens may also consist of higher character codes, and a compliant parser should also be able to parse characters that extend beyond Unicode 127. For characters with code points 128 and above, the character code point should be interpreted as a 16-bit unsigned integer, with the first bit always as 0, the second bit as a stop bit, the third and fourth bits as type bits, and the remaining 12 bits for the accompanying number. These 16-bit character encodings can be used for greater efficiency where UTF-16 encoding is preferred (which can be faster in languages that internally represent strings with UTF-16, and where bandwidth is relatively unlimited, such as with interprocess pipes).
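The 16-bit layout can be sketched as a simple field split. The function name split16BitToken is hypothetical, and the sketch looks at a single character only: how multi-character tokens and the type 3 special case apply to 16-bit characters is not stated in this section, so neither is handled here.

```javascript
// Sketch: split one code point >= 128 into the 16-bit fields described above.
// Only a single character is examined; `split16BitToken` is a hypothetical name.
function split16BitToken(codePoint) {
	if (codePoint < 128 || codePoint > 0x7fff) {
		throw new RangeError('expected a code point from 128 to 0x7FFF (first bit 0)');
	}
	return {
		stop: (codePoint >>> 14) & 1, // second bit
		type: (codePoint >>> 12) & 3, // third and fourth bits
		number: codePoint & 0xfff, // remaining 12 bits
	};
}

console.log(split16BitToken(0x5123)); // { stop: 1, type: 1, number: 291 }
```

Because the first bit must be 0, surrogate code units (0xD800 to 0xDFFF) can never be mistaken for token characters in this layout.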

Parsing Types

The type value indicates how the token (and any accompanying string) should be parsed into a rudimentary value. The type is two bits so there are four basic parse types:

1 - Number

This type code means the accompanying number should be directly interpreted as the value, no further parsing is needed for this value.

2 - String

The accompanying number indicates the number of following characters that compose a string. The string, following this token, with the length defined by the accompanying number, is the rudimentary value that is produced.
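The Number and String parse types can be sketched as a small function that takes an already-lexed token and produces the rudimentary value. The token shape { type, number }, the function name readValue, and the returned position field are all hypothetical choices for this illustration.

```javascript
// Sketch: turn an already-lexed token into a rudimentary value for types 1 and 2.
// The token shape { type, number } and `readValue` are hypothetical.
function readValue(token, source, position) {
	if (token.type === 1) {
		// Number: the accompanying number is the value itself
		return { value: token.number, position };
	}
	if (token.type === 2) {
		// String: take `number` UTF-16 code units following the token
		const end = position + token.number;
		if (end > source.length) throw new RangeError('string runs past end of input');
		return { value: source.slice(position, end), position: end };
	}
	throw new Error('type ' + token.type + ' is not covered by this sketch');
}

// "e" (code point 101) is the token <2, 5>, so the five characters after it form the string
console.log(readValue({ type: 2, number: 5 }, 'ehelloR', 1)); // { value: 'hello', position: 6 }
```

Checking the length against the remaining input before slicing is one simple way to guard against malformed messages.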

3 - Type Definition

This type code is used to define properties and special values. The accompanying number determines the value or property to create, and is defined as follows:

The first six accompanying number codes are for defining special constant values, and should be converted to these values:

0 ("p") - null (or NIL or NULL depending on language)
1 ("q") - Reserved
2 ("r") - Reserved
3 ("s") - false
4 ("t") - true
5 ("u") - undefined (used as value to indicate a property should be omitted)
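The constant codes listed above can be sketched as a lookup table. The names SPECIAL_VALUES, RESERVED, and specialValue are hypothetical, and treating the reserved codes as errors is a choice made for this sketch rather than something the specification states.

```javascript
// Sketch: map type 3 accompanying numbers 0-5 to JavaScript values.
// `RESERVED` is a hypothetical sentinel for the two reserved codes.
const RESERVED = Symbol('reserved');
const SPECIAL_VALUES = [null, RESERVED, RESERVED, false, true, undefined];

function specialValue(number) {
	if (!Number.isInteger(number) || number < 0 || number > 5) {
		throw new RangeError('codes 6 and above are not constants');
	}
	const value = SPECIAL_VALUES[number];
	if (value === RESERVED) throw new Error('code ' + number + ' is reserved');
	return value;
}

// "t" is code point 116; its last four bits are the accompanying number 4
console.log(specialValue('t'.charCodeAt(0) & 15)); // true
```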

The next 10 codes are property definition codes. The parser needs to read the next value after this token as the parameter for the property that is being created or modified. The first property definition codes are property creation definitions, which define a new property for the current slot. The property definition code can be followed by the value/parameter that defines the key associated with this property. This value/parameter can be elided; if the property code is followed by a sequence or another property definition token, the key is not specified and defaults to null. The codes below indicate which property type is to be created (and each type defines how the rudimentary values are converted to final values):

6 ("v") - Default property type
7 ("w") - Array property type
8 ("x") - Referencing property type
9 ("y") - Numeric property type
10 ("z") - Binary property type

The next three codes are u
