Reed-Solomon

Reed-Solomon Erasure Coding in Go, with speeds exceeding 1GB/s/cpu core implemented in pure Go.

This is a Go port of the JavaReedSolomon library released by Backblaze, with some additional optimizations.

For an introduction on erasure coding, see the post on the Backblaze blog.

For encoding high shard counts (>256) a Leopard implementation is used. For most platforms this performs close to the original Leopard implementation in terms of speed.

Package home: https://github.com/klauspost/reedsolomon

Godoc: https://pkg.go.dev/github.com/klauspost/reedsolomon

Installation

To get the package use the standard:

go get -u github.com/klauspost/reedsolomon

Using Go modules is recommended.

Usage

This section assumes you know the basics of Reed-Solomon encoding. A good start is this Backblaze blog post.

This package performs the calculation of the parity sets. The usage is therefore relatively simple.

First of all, you need to choose your distribution of data and parity shards. A 'good' distribution is very subjective, and will depend a lot on your usage scenario.

To create an encoder with 10 data shards (where your data goes) and 3 parity shards (calculated):

    enc, err := reedsolomon.New(10, 3)

This encoder will work for all parity sets with this distribution of data and parity shards.

If you will primarily be using it with one shard size it is recommended to use WithAutoGoroutines(shardSize) as an additional parameter. This will attempt to calculate the optimal number of goroutines to use for the best speed. It is not required that all shards are this size.

Then you send and receive data that is a simple slice of byte slices; [][]byte. In the example above, the top slice must have a length of 13.

    data := make([][]byte, 13)

You should then fill the 10 first slices with equally sized data, and create parity shards that will be populated with parity data. In this case we create the data in memory, but you could for instance also use mmap to map files.

    // Create all shards, size them at 50000 each
    for i := range input {
      data[i] := make([]byte, 50000)
    }
    
    // The above allocations can also be done by the encoder:
    // data := enc.(reedsolomon.Extended).AllocAligned(50000)
    
    // Fill some data into the data shards
    for i, in := range data[:10] {
      for j:= range in {
         in[j] = byte((i+j)&0xff)
      }
    }

To populate the parity shards, you simply call Encode() with your data.

    err = enc.Encode(data)

The only cases where you should get an error is, if the data shards aren't of equal size. The last 3 shards now contain parity data. You can verify this by calling Verify():

    ok, err = enc.Verify(data)

The final (and important) part is to be able to reconstruct missing shards. For this to work, you need to know which parts of your data is missing. The encoder does not know which parts are invalid, so if data corruption is a likely scenario, you need to implement a hash check for each shard.

If a byte has changed in your set, and you don't know which it is, there is no way to reconstruct the data set.

To indicate missing data, you set the shard to nil before calling Reconstruct():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct the missing shards
    err := enc.Reconstruct(data)

The missing data and parity shards will be recreated. If more than 3 shards are missing, the reconstruction will fail.

If you are only interested in the data shards (for reading purposes) you can call ReconstructData():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct just the missing data shards
    err := enc.ReconstructData(data)

If you don't need all data shards you can use ReconstructSome():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct just the shard 3
    err := enc.ReconstructSome(data, []bool{false, false, false, true, false, false, false, false})

So to sum up reconstruction:

The number of data/parity shards must match the numbers used for encoding.
The order of shards must be the same as used when encoding.
You may only supply data you know is valid.
Invalid shards should be set to nil.

For complete examples of an encoder and decoder see the examples folder.

Splitting/Joining Data

You might have a large slice of data. To help you split this, there are some helper functions that can split and join a single byte slice.

   bigfile, _ := os.Readfile("myfile.data")
   
   // Split the file
   split, err := enc.Split(bigfile)

This will split the file into the number of data shards set when creating the encoder and create empty parity shards.

An important thing to note is that you have to keep track of the exact input size. If the size of the input isn't divisible by the number of data shards, extra zeros will be inserted in the last shard.

To join a data set, use the Join() function, which will join the shards and write it to the io.Writer you supply:

   // Join a data set and write it to io.Discard.
   err = enc.Join(io.Discard, data, len(bigfile))

Aligned Allocations

For AMD64 aligned inputs can make a big speed difference.

This is an example of the speed difference when inputs are unaligned/aligned:

BenchmarkEncode100x20x10000-32    	    7058	    172648 ns/op	6950.57 MB/s
BenchmarkEncode100x20x10000-32    	    8406	    137911 ns/op	8701.24 MB/s

This is mostly the case when dealing with odd-sized shards.

To facilitate this the package provides an AllocAligned(shards, each int) [][]byte. This will allocate a number of shards, each with the size each. Each shard will then be aligned to a 64 byte boundary.

Each encoder also has a AllocAligned(each int) [][]byte as an extended interface which will return the same, but with the shard count configured in the encoder.

It is not possible to re-aligned already allocated slices, for example when using Split. When it is not possible to write to aligned shards, you should not copy to them.

Progressive encoding

It is possible to encode individual shards using EncodeIdx:

	// EncodeIdx will add parity for a single data shard.
	// Parity shards should start out as 0. The caller must zero them.
	// Data shards must be delivered exactly once. There is no check for this.
	// The parity shards will always be updated and the data shards will remain the same.
	EncodeIdx(dataShard []byte, idx int, parity [][]byte) error

This allows progressively encoding the parity by sending individual data shards. There is no requirement on shards being delivered in order, but when sent in order it allows encoding shards one at the time, effectively allowing the operation to be streaming.

The result will be the same as encoding all shards at once. There is a minor speed penalty using this method, so send shards at once if they are available.

Example

func test() {
    // Create an encoder with 7 data and 3 parity slices.
    enc, _ := reedsolomon.New(7, 3)

    // This will be our output parity.
    parity := make([][]byte, 3)
    for i := range parity {
        parity[i] = make([]byte, 10000)
    }

    for i := 0; i < 7; i++ {
        // Send data shards one at the time.
        _ = enc.EncodeIdx(make([]byte, 10000), i, parity)
    }

    // parity now contains parity, as if all data was sent in one call.
}

Progressive decoding

For advanced use cases, you can progressively decode missing shards (data or parity) using DecodeIdx. This allows you to:

Reconstruct shards from multiple sources arriving at different times
Merge partial reconstructions from different nodes in a distributed system
Incrementally add input shards as they become available

To access progressive decoding, cast your encoder to the Extensions interface:

// Cast to Extensions interface
ext := enc.(reedsolomon.Extensions)

Basic Usage

DecodeIdx works with three parameters:

dst [][]byte - Destination slices for reconstructed data (pre-allocated, initially zeroed)
expectInput []bool - Which shards you expect to receive (true = expected)
input [][]byte - The actual input shards for this call

<details> <summary>Click to see example</summary>

func doReconstruct() {
	// Create encoder:
	enc, _ := reedsolomon.New(5, 3)
	ext := enc.(reedsolomon.Extensions)

	// Set up reconstruction - we want to reconstruct shards 1 and 4
	dst := make([][]byte, 5+3)
	dst[1] = make([]byte, shardSize) // Will reconstruct shard 1
	dst[4] = make([]byte, shardSize) // Will reconstruct shard 4

	// Mark which shards we expect to receive
	expectInput := []bool{
		0: true,
		1: false, // Reconstructing this shard.
		2: true,
		3: true,
		4: false, // Reconstructing this shard. 
		5: true,
		6: true,
		7: false, // We only need to supply 6 shards, so we skip this.
	}

	// First call - provide some shards
	input1 := make([][]byte, 5+3)
	input1[0] = shard0data
	input1[2] = shard2data

	err := ext.DecodeIdx(dst, expectInput, input1)

	// Second call - provide remaining shards
	input2 := make([][]byte, 5+3)
	input2[3] = shard3data
	input2[5] = shard5data
	input2[

Reedsolomon

Install / Use

README