Multibase

Self-identifying base encodings

Multibase is a protocol for disambiguating the "base encoding" used to express binary data in text formats (e.g., base32, base36, base64, base58, etc.) from the expression alone.

When text is encoded as bytes, we can usually use a one-size-fits-all encoding (UTF-8) because we're always encoding to the same set of 256 bytes (+/- the NUL byte). When that doesn't work, typically for historical or performance reasons, we can often infer the encoding from the context.

However, when bytes are encoded as text (using a base encoding), the choice of base encoding (and alphabet, and other factors) is often restricted by the context. Worse, these restrictions can change based on where the data appears in the text. In some cases, we can only use [a-z0-9]; in others, we can use a larger set of characters but need a compact encoding. This has led to a large set of "base encodings", almost one for every use case. Unlike the case of encoding text to bytes, it is impractical to standardize widely around a single base encoding because there is no optimal encoding for all cases.

As data travels beyond its context, it becomes quite hard to ascertain which of the many possible base encodings was used; that's where multibase comes in. Where the data has been prefixed before leaving its context behind, it answers the question:

Given binary data d encoded into text s, what base b was used to encode it?

To answer this question, a single code point is prepended to s at time of encoding, which signals in that new context which b can be used to reconstruct d.

Format
- Multibase Table
Specifications
Status
- Reserved Terms
Multibase By Example
FAQ
Implementations:
Disclaimers
Contribute
License

Format

The format is:

<base-encoding-code-point><base-encoded-data>

Where <base-encoding-code-point> is a code representing an entry in the multibase table.
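
As a rough illustration of the format, the sketch below prepends and strips the prefix for a handful of table entries using only Python's standard library. The names `CODECS`, `multibase_encode`, and `multibase_decode` are hypothetical and not taken from any official implementation, and only the base16 and unpadded base32 entries are covered.

```python
import base64


def _b32_encode(data: bytes) -> str:
    # RFC4648 base32 (uppercase alphabet) with the "=" padding stripped.
    return base64.b32encode(data).decode("ascii").rstrip("=")


def _b32_decode(text: str) -> bytes:
    # Restore the padding that the unpadded variants omit, then decode.
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


# Hypothetical subset of the multibase table: prefix -> (encoder, decoder).
CODECS = {
    "f": (lambda d: d.hex(), bytes.fromhex),             # base16
    "F": (lambda d: d.hex().upper(), bytes.fromhex),     # base16upper
    "b": (lambda d: _b32_encode(d).lower(), _b32_decode),  # base32
    "B": (_b32_encode, _b32_decode),                     # base32upper
}


def multibase_encode(prefix: str, data: bytes) -> str:
    """Return <base-encoding-code-point><base-encoded-data>."""
    encode, _ = CODECS[prefix]
    return prefix + encode(data)


def multibase_decode(text: str) -> bytes:
    """Read the first code point, look up its base, and decode the rest."""
    if not text:
        raise ValueError("empty string has no multibase prefix")
    prefix, payload = text[0], text[1:]
    if prefix not in CODECS:
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    _, decode = CODECS[prefix]
    return decode(payload)


print(multibase_encode("f", b"hi"))  # f6869
```

Because Python strings are indexed by code point, `text[0]` also picks up a multi-byte prefix such as the base256emoji rocket as a single unit.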

Multibase Table

The current multibase table is here:

Unicode,    character,  encoding,           description,                                                    status
U+0000,     NUL,        none,               (No base encoding),                                             reserved
U+0030,     0,          base2,              Binary (01010101),                                              experimental
U+0031,     1,          none,               (No base encoding),                                             reserved
U+0037,     7,          base8,              Octal,                                                          draft
U+0039,     9,          base10,             Decimal,                                                        draft
U+0066,     f,          base16,             Hexadecimal (lowercase),                                        final
U+0046,     F,          base16upper,        Hexadecimal (uppercase),                                        final
U+0076,     v,          base32hex,          RFC4648 case-insensitive - no padding - highest char,           experimental
U+0056,     V,          base32hexupper,     RFC4648 case-insensitive - no padding - highest char,           experimental
U+0074,     t,          base32hexpad,       RFC4648 case-insensitive - with padding,                        experimental
U+0054,     T,          base32hexpadupper,  RFC4648 case-insensitive - with padding,                        experimental
U+0062,     b,          base32,             RFC4648 case-insensitive - no padding,                          final
U+0042,     B,          base32upper,        RFC4648 case-insensitive - no padding,                          final
U+0063,     c,          base32pad,          RFC4648 case-insensitive - with padding,                        draft
U+0043,     C,          base32padupper,     RFC4648 case-insensitive - with padding,                        draft
U+0068,     h,          base32z,            z-base-32 (used by Tahoe-LAFS),                                 draft
U+006b,     k,          base36,             Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+004b,     K,          base36upper,        Base36 [0-9a-z] case-insensitive - no padding,                  draft
U+0052,     R,          base45,             Base45 RFC9285,                                                 draft
U+007a,     z,          base58btc,          Base58 Bitcoin,                                                 final
U+005a,     Z,          base58flickr,       Base58 Flickr,                                                  experimental
U+006d,     m,          base64,             RFC4648 no padding,                                             final
U+004d,     M,          base64pad,          RFC4648 with padding - MIME encoding,                           experimental
U+0075,     u,          base64url,          RFC4648 no padding,                                             final
U+0055,     U,          base64urlpad,       RFC4648 with padding,                                           final
U+0070,     p,          proquint,           Proquint (https://arxiv.org/html/0901.4016),                    experimental
U+0051,     Q,          none,               (no base encoding),                                             reserved
U+002F,     /,          none,               (no base encoding),                                             reserved
U+1F680,    🚀,         base256emoji,       base256 with custom alphabet using variable-sized-codepoints,   experimental

NOTE: Multibase prefixes are encoding agnostic. "z" is the character "z", not the byte 0x7a ("z" encoded as ASCII/UTF-8). In little-endian UTF-32, for example, that same "z" would be [0x7a, 0x00, 0x00, 0x00], not [0x7a], so detecting and dropping an initial 0x7a byte would not suffice to confirm that the rest is base58btc-encoded data; all four bytes that make up the z code point in that encoding would need to be detected and dropped. Also note the difference between NUL (code point 0, or 0x00) and the character 0 (code point 48, or 0x30).
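
The point of the note can be checked with a short sketch: decode the text first, then split off the first code point, rather than dropping a fixed number of bytes. The helper name `split_prefix` and the sample string are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def split_prefix(raw: bytes, text_encoding: str) -> Tuple[str, str]:
    """Decode raw text bytes, then split off the first code point."""
    text = raw.decode(text_encoding)
    if not text:
        raise ValueError("empty input has no multibase prefix")
    return text[0], text[1:]


# Hypothetical multibase string; "abc" merely stands in for a payload.
sample = "zabc"

# The same prefix character occupies a different number of bytes
# depending on how the surrounding text is encoded.
print(sample.encode("utf-8"))      # prefix is the single byte 0x7a
print(sample.encode("utf-32-le"))  # prefix is the four bytes 7a 00 00 00

# Splitting on code points works the same way for both encodings.
print(split_prefix(sample.encode("utf-8"), "utf-8"))
print(split_prefix(sample.encode("utf-32-le"), "utf-32-le"))

# NUL (U+0000) and the digit "0" (U+0030) are different prefixes.
print(ord("\x00"), ord("0"))
```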

Specifications

Below is a list of specs for the underlying base encodings:

base2 Base2 RFC
base8 Base8 RFC, similar to RFC4648
base10 Base10 RFC
base36 Base36 RFC
base16* RFC4648
base32* (except for base32z) RFC4648
base32z Human-oriented base32 spec
base45 RFC9285
base64* RFC4648
base58btc https://datatracker.ietf.org/doc/html/draft-msporny-base58-02
base58flickr https://datatracker.ietf.org/doc/html/draft-msporny-base58-02, but using a different alphabet
proquint Proquint RFC, which is the original spec with an added prefix for legibility
base256emoji Base256Emoji RFC

Status

Each multibase encoding has a status:

reserved - for functional reasons or to avoid collisions with other multi-* registries, this registry cannot accept registrations at this code point, and implementing an unregistered encoding there is discouraged for interoperability reasons.
experimental - these encodings have been proposed but are not widely implemented and may be removed.
draft - these encodings are mature and widely implemented but may not be implemented by all implementations.
final - these encodings should be implemented by all implementations and are widely used.
deprecated - this entry will likely be removed and reassigned in the future, and it is unlikely to become a final registration.

Reserved Terms

The following codes are reserved and cannot be registered in the multibase table. Note that all three of the Unicode entries, when the UTF-8 encoding of the code point is read as an [unsigned varint], correspond to widely used entries in the [multiformats registry group]. This could cause confusion for some legacy systems that handle both binary and multibase-encoded structures from other multiformats. While the multibase registry is technically not part of the [multiformats registry group], these reservations minimize the risk of confusion when composing multiple multiformats in one data system.

NUL (n/a) - Legacy data may be found with null-byte-prefixed binary structures mixed in among multibase-encoded ones in arrays of data, although conformant implementations are no longer required to support this.
/ (U+002F) - Separator used by [multiaddr].
1 (U+0031) - Base58-encoded identity multihashes used by libp2p peer IDs.
Q (U+0051) - Base58-encoded sha2-256 multihashes used by libp2p/ipfs for peer IDs and CIDv0.
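
One way a decoder might act on these reservations is to check the first character before attempting a multibase lookup. The following is only a sketch: the names `RESERVED_PREFIXES` and `reserved_reason` are hypothetical, and the sample inputs are made up for illustration.

```python
from typing import Optional

# The reserved code points listed above, with a short reason for each.
RESERVED_PREFIXES = {
    "\x00": "legacy NUL-prefixed binary structure",
    "/": "separator used by multiaddr",
    "1": "base58-encoded identity multihash (libp2p peer ID)",
    "Q": "base58-encoded sha2-256 multihash (peer ID or CIDv0)",
}


def reserved_reason(text: str) -> Optional[str]:
    """Return why the first character is reserved, or None if it is not."""
    if not text:
        return None
    return RESERVED_PREFIXES.get(text[0])


# Made-up inputs: a multiaddr-style string and a base16 multibase string.
print(reserved_reason("/ip4/127.0.0.1"))  # separator used by multiaddr
print(reserved_reason("f6869"))           # None
```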

Multibase By Example

Consider the following encodings of the same binary string:

4D756C74696261736520697320617765736F6D6521205C6F2F # base16 (hex)
JV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32
3IY8QKL64VUGCX009XWUHKF6GBBT
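
The base16 and base32 lines above can be reproduced with Python's standard library. The hex line decodes to the ASCII text `Multibase is awesome! \o/`; the sketch below re-encodes those bytes and then prepends the prefixes the table assigns to base16upper (`F`) and base32upper (`B`). The variable names are chosen here for illustration only.

```python
import base64

# The 25 bytes behind the example encodings above.
data = b"Multibase is awesome! \\o/"

# Plain base encodings, matching the first two example lines.
base16upper = data.hex().upper()
base32upper = base64.b32encode(data).decode("ascii").rstrip("=")

# Multibase form: one prefix character from the table, then the encoded data.
multibase_hex = "F" + base16upper
multibase_b32 = "B" + base32upper

print(multibase_hex)  # F4D756C74696261736520697320617765736F6D6521205C6F2F
print(multibase_b32)  # BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP
```

Since 25 bytes is a whole number of 5-byte groups, the base32 encoding of this input has no padding to strip.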
