EntropyString for Erlang

Efficiently generate cryptographically strong random strings of specified entropy from various character sets.

<a name="TOC"></a>TOC

Installation
Usage
Overview
Real Need
Character Sets
Custom Characters
Efficiency
Custom Bytes
Entropy Bits
Take Away

Installation

Add to rebar.config

{deps, [
...
     {entropy_string, {git, "https://github.com/EntropyString/Erlang.git", {tag, "1.0.0"}}}
 ]}.

To build and run tests

> rebar3 compile
> rebar3 eunit

<a name="Usage"></a>Usage

To run code snippets in the Erlang shell

> rebar3 compile
> erl -pa _build/default/lib/entropy_string/ebin
Erlang/OTP ...
1> l(entropy_string).
{module,entropy_string}
2>

Generate a potential of 1 million random strings with 1 in a billion chance of repeat:

2> Bits = entropy_string:bits(1.0e6, 1.0e9).
68.7604899926346
3> entropy_string:random_string(Bits).
<<"GhrB6fJbD6gTpT">>

There are six predefined character sets. By default, random_string/1 uses charset32, a character set with 32 characters. To get a random hexadecimal string with the same entropy Bits as above (see Real Need for description of what entropy Bits represents):

2> Bits = entropy_string:bits(1.0e6, 1.0e9).
68.7604899926346
3> entropy_string:random_string(Bits, charset16).
<<"99d535fbcac884e875">>

Custom characters are also supported. Using uppercase hexadecimal characters:

2> Bits = entropy_string:bits(1.0e6, 1.0e9).
68.7604899926346
3> entropy_string:random_string(Bits, <<"0123456789ABCDEF">>).
<<"6099EA0B59F9813D5F">>

Convenience functions are provided for common scenarios. For example, an OWASP session ID using charset32:

2> entropy_string:session_id().
<<"bQQjJrbJQ44j76hMPqrTtqGFrq">>

Session ID using RFC 4648 file system and URL safe characters:

2> entropy_string:session_id(charset64).
<<"4VMiJmD23Aq2Px7vGnd8Fi">>

<a name="Overview"></a>Overview

entropy_string provides easy creation of randomly generated strings of specific entropy using various character sets. Such strings are useful as unique identifiers, for example as random IDs when a GUID would be overkill.

A key concern when generating such strings is that they be unique. Guaranteed uniqueness, however, requires either deterministic generation (e.g., a counter) that is not random, or that each newly created random string be compared against all existing strings. When randomness is required, the overhead of storing and comparing strings is often too onerous and a different tack is chosen.

A common strategy is to replace the guarantee of uniqueness with a weaker but often sufficient probabilistic uniqueness. Specifically, rather than being absolutely sure of uniqueness, we settle for a statement such as "there is less than a 1 in a billion chance that two of my strings are the same". This strategy requires much less overhead, but it does require some way of quantifying what we mean by "there is less than a 1 in a billion chance that 1 million strings of this form will have a repeat".

Understanding probabilistic uniqueness of random strings requires an understanding of entropy and of estimating the probability of a collision (i.e., the probability that two strings in a set of randomly generated strings might be the same). The blog posting Hash Collision Probabilities provides an excellent overview of deriving an expression for calculating the probability of a collision in some number of hashes using a perfect hash with an N-bit output. The Entropy Bits section below describes how entropy_string takes this idea a step further to address a common need in generating unique identifiers.
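To make the collision estimate concrete, here is a minimal sketch of the usual birthday-bound approximation. It is not taken from entropy_string itself: the module and function names (collision_sketch, collision_probability/2, bits_needed/2) are hypothetical, and the formula is only a good approximation while the resulting probability is small.

```erlang
-module(collision_sketch).
-export([collision_probability/2, bits_needed/2]).

%% Birthday-bound approximation (an assumption, not the library's code):
%% the chance that at least two of Total uniformly random Bits-bit values
%% coincide is roughly Total^2 / 2^(Bits + 1) when that value is small.
collision_probability(Total, Bits) ->
    (Total * Total) / math:pow(2, Bits + 1).

%% Inverse of the approximation above: the number of bits needed so that
%% Total values have roughly a 1 in Risk chance of any repeat.
bits_needed(Total, Risk) ->
    2 * math:log2(Total) + math:log2(Risk) - 1.
```

For 10,000 strings and a 1 in a million risk, bits_needed/2 gives about 45.51 bits, which agrees with the entropy_string:bits/2 output shown in the Real Need section.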

We'll begin investigating entropy_string by considering our Real Need when generating random strings.

<a name="RealNeed"></a>Real Need

Let's start by reflecting on a common statement of need for developers, who might say:

I need random strings 16 characters long.

Okay. There are libraries available that address that exact need. But first, there are some questions that arise from the need as stated, such as:

1. What characters do you want to use?
2. How many of these strings do you need?
3. Why do you need these strings?

The available libraries often let you specify the characters to use. So we can assume for now that question 1 is answered with:

Hexadecimal will do fine.

As for question 2, the developer might respond:

I need 10,000 of these things.

Ah, now we're getting somewhere. The answer to question 3 might lead to the further qualification:

I need to generate 10,000 random, unique IDs.

And the cat's out of the bag. We're getting at the real need, and it's not the same as the original statement. The developer needs uniqueness across some potential number of strings. The length of the string is a by-product of the uniqueness, not the goal, and should not be the primary specification for the random string.

As noted in the Overview, guaranteeing uniqueness is difficult, so we'll replace that declaration with one of probabilistic uniqueness by asking:

What risk of a repeat are you willing to accept?

Probabilistic uniqueness contains risk. That's the price we pay for giving up on the stronger declaration of strict uniqueness. But the developer can quantify an appropriate risk for a particular scenario with a statement like:

I guess I can live with a 1 in a million chance of a repeat.

So now we've gotten to the developer's real need:

I need 10,000 random hexadecimal IDs with less than 1 in a million chance of any repeats.

Not only is this statement more specific, there is no mention of string length. The developer needs probabilistic uniqueness, and strings are to be used to capture randomness for this purpose. As such, the length of the string is simply a by-product of the encoding used to represent the required uniqueness as a string.

How do you address this need using a library designed to generate strings of specified length? Well, you don't directly, because that library was designed to answer the originally stated need, not the real need we've uncovered. We need a library that deals with probabilistic uniqueness of a total number of some strings. And that's exactly what entropy_string does.

Let's use entropy_string to help this developer by generating 5 IDs:

2> Bits = entropy_string:bits(10000, 1000000).
45.50699332842307
3> lists:map(fun(_) -> entropy_string:random_string(Bits, charset16) end, lists:seq(1,5)).
[<<"9fd4090d336f">>,<<"692c599701c9">>,<<"175a5f34bb89">>,<<"144cc6119460">>,<<"dd61e0a66605">>]

To generate the IDs, we first use

Bits = entropy_string:bits(10000, 1000000).

to determine how much entropy is needed to generate a potential of 10000 strings while satisfying the probabilistic uniqueness of a 1 in a million risk of repeat. We can see from the output of the Erlang shell that it's about 45.51 bits. Inside the fun passed to lists:map/2 we used

entropy_string:random_string(Bits, charset16)

to actually generate a random string of the specified entropy using hexadecimal (charset16) characters. Looking at the IDs, we can see each is 12 characters long. Again, the string length is a by-product of the characters used to represent the entropy we needed. And it seems the developer didn't really need 16 characters after all.

Finally, given that the strings are 12 hexadecimal characters long, each string actually has an information carrying capacity of 12 * 4 = 48 bits of entropy (a hexadecimal character carries 4 bits). That's fine. Assuming all characters are equally probable, a string can only carry entropy equal to a multiple of the amount of entropy represented per character. entropy_string produces the smallest strings that exceed the specified entropy.
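The relationship between requested entropy and string length can be sketched as follows. This is an illustration under the assumption that all characters are equally probable, not the library's actual implementation; length_sketch and string_length/2 are hypothetical names.

```erlang
-module(length_sketch).
-export([string_length/2]).

%% Smallest number of characters whose combined capacity is at least Bits,
%% for a character set of CharCount equally likely characters.
%% Each character carries log2(CharCount) bits.
%% Note: the ceil/1 BIF requires OTP 20 or later.
string_length(Bits, CharCount) ->
    BitsPerChar = math:log2(CharCount),
    ceil(Bits / BitsPerChar).
```

For example, length_sketch:string_length(45.50699332842307, 16) evaluates to 12, matching the 12-character hexadecimal IDs generated above.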

<a name="CharacterSets"></a>Character Sets

As we've seen in the previous sections, entropy_string provides predefined characters for each of the supported character set lengths. Let's see what's under the hood. The predefined character sets are charset64, charset32, charset16, charset8, charset4 and charset2. The characters for each were chosen as follows:

CharSet 64: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
- The file system and URL safe char set from RFC 4648.
CharSet 32: 2346789bdfghjmnpqrtBDFGHJLMNPQRT
- Remove all upper and lower case vowels (including y)
- Remove all numbers that look like letters
- Remove all letters that look like numbers
- Remove all letters that have poor distinction between upper and lower case values. The resulting strings don't look like English words and are easy to parse visually.
CharSet 16: 0123456789abcdef
- Hexadecimal
CharSet 8: 01234567
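
The sets listed above each contain a power-of-two number of characters, so every character carries a whole number of bits. A minimal sketch for checking that property on a character set given as a binary; it is not part of entropy_string, and the names charset_sketch, bits_per_char/1 and all_unique/1 are hypothetical.

```erlang
-module(charset_sketch).
-export([bits_per_char/1, all_unique/1]).

%% Entropy carried by one character, assuming every character in Charset
%% (a binary of single-byte characters) is equally likely.
bits_per_char(Charset) when is_binary(Charset) ->
    math:log2(byte_size(Charset)).

%% True when no character appears twice in Charset.
all_unique(Charset) when is_binary(Charset) ->
    Chars = binary_to_list(Charset),
    length(lists:usort(Chars)) =:= length(Chars).
```

Applied to the CharSet 32 characters shown above, bits_per_char/1 gives 5 bits per character and all_unique/1 confirms there are no repeated characters.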
