This project has moved to https://github.com/minio/minio/tree/master/pkg/dsync
# dsync

A distributed locking and syncing package for Go.
## Introduction

dsync is a package for doing distributed locks over a network of `n` nodes. It is designed with simplicity in mind and hence offers limited scalability (`n <= 32`). Each node is connected to all other nodes, and lock requests from any node are broadcast to all connected nodes. A node succeeds in getting the lock if `n/2 + 1` nodes (whether or not including itself) respond positively. Once acquired, the lock can be held for as long as the client desires and must be released afterwards. The release is then broadcast to all nodes, after which the lock becomes available again.
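As an illustration of the quorum arithmetic, the sketch below shows how many positive responses are required for various cluster sizes (`quorum` is a hypothetical helper for illustration, not part of the dsync API):

```go
package main

import "fmt"

// quorum returns the minimum number of positive responses needed
// to grant a lock among n nodes: n/2 + 1 (integer division).
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	// For an 8-node setup, 5 positive responses grant the lock,
	// so up to 3 nodes may be down or unreachable.
	fmt.Println(quorum(8))  // 5
	fmt.Println(quorum(32)) // 17
}
```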
## Motivation
This package was developed for the distributed server version of Minio Object Storage. For this we needed a distributed locking mechanism for up to 32 servers that each would be running minio server. The locking mechanism itself should be a reader/writer mutual exclusion lock meaning that it can be held by a single writer or an arbitrary number of readers.
For minio the distributed version is started as follows (for a 6-server system):

```
$ minio server http://server1/disk http://server2/disk http://server3/disk http://server4/disk http://server5/disk http://server6/disk
```

(Note that the same command should be run on each of the servers server1 through server6.)
## Design goals

- Simple design: by keeping the design simple, many tricky edge cases can be avoided.
- No master node: there is no concept of a master node which, if it were used and went down, would cause locking to come to a complete stop. (Unless the design also includes a slave node, which adds yet more complexity.)
- Resilient: if one or more nodes go down, the other nodes are not affected and can continue to acquire locks (provided no more than `n/2 - 1` nodes are down).
- Drop-in replacement for `sync.RWMutex`; supports the `sync.Locker` interface.
- Automatically reconnects to (restarted) nodes.
## Restrictions

- Limited scalability: up to 32 nodes.
- Fixed configuration: changes to the number of nodes and/or their network names/IP addresses require a restart of all nodes in order to take effect.
- If a down node comes up, it will not try to (re)acquire any locks that it may have held.
- Not designed for high performance applications such as key/value stores.
## Performance

- Supports up to 7,500 locks/second in total for 16 nodes (at 10% CPU usage per server) on moderately powerful server hardware.
- Successful lock requests should not take longer than 1 ms (provided a decent network connection of 1 Gbit or more between the nodes).
The tables below show detailed performance numbers.
### Performance with varying number of nodes

This table shows test performance on the same (EC2) instance type but with a varying number of nodes:

| EC2 Instance Type | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage |
| ----------------- | -----:| --------------------:| ---------------:| ---------:|
| c3.2xlarge        |     4 | (min=3110, max=3376) |           12972 |       25% |
| c3.2xlarge        |     8 | (min=1884, max=2096) |           15920 |       25% |
| c3.2xlarge        |    12 | (min=1239, max=1558) |           16782 |       25% |
| c3.2xlarge        |    16 | (min=996, max=1391)  |           19096 |       25% |
The minimum and maximum locks/server/sec decline gradually, but due to the larger number of nodes the overall total number of locks/sec rises steadily (at the same CPU usage level).
### Performance with different instance types

This table shows test performance for a fixed number of 8 nodes on different EC2 instance types:

| EC2 Instance Type    | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage |
| -------------------- | -----:| --------------------:| ---------------:| ---------:|
| c3.large (2 vCPU)    |     8 | (min=823, max=896)   |            6876 |       75% |
| c3.2xlarge (8 vCPU)  |     8 | (min=1884, max=2096) |           15920 |       25% |
| c3.8xlarge (32 vCPU) |     8 | (min=2601, max=2898) |           21996 |       10% |
With the rise in the number of cores the CPU load decreases and overall performance increases.
### Stress test

Stress test on a c3.8xlarge (32 vCPU) instance type:

| EC2 Instance Type | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage |
| ----------------- | -----:| --------------------:| ---------------:| ---------:|
| c3.8xlarge        |     8 | (min=2601, max=2898) |           21996 |       10% |
| c3.8xlarge        |     8 | (min=4756, max=5227) |           39932 |       20% |
| c3.8xlarge        |     8 | (min=7979, max=8517) |           65984 |       40% |
| c3.8xlarge        |     8 | (min=9267, max=9469) |           74944 |       50% |
The system can be pushed to 75K locks/sec at 50% CPU load.
## Usage

NOTE: The initialization API has changed: `dsync.Init([]NetLocker, nodeIndex)` has been replaced by `dsync.New([]NetLocker, nodeIndex)`, which returns a `*Dsync` object to be used in every instance of `NewDRWMutex("test", *Dsync)`.
### Exclusive lock

Here is a simple example showing how to protect a single resource (drop-in replacement for `sync.Mutex`):
```go
import (
	"context"
	"log"
	"time"

	"github.com/minio/dsync/v3"
)

func lockSameResource() {
	// Create distributed mutex to protect resource 'test'
	// (ds is a *dsync.Dsync created beforehand via dsync.New)
	dm := dsync.NewDRWMutex(context.Background(), "test", ds)

	dm.Lock("lock-1", "example.go:505:lockSameResource()")
	log.Println("first lock granted")

	// Release 1st lock after 5 seconds
	go func() {
		time.Sleep(5 * time.Second)
		log.Println("first lock unlocked")
		dm.Unlock()
	}()

	// Try to acquire lock again, will block until initial lock is released
	log.Println("about to lock same resource again...")
	dm.Lock("lock-1", "example.go:515:lockSameResource()")
	log.Println("second lock granted")

	time.Sleep(2 * time.Second)
	dm.Unlock()
}
```
which gives the following output:

```
2016/09/02 14:50:00 first lock granted
2016/09/02 14:50:00 about to lock same resource again...
2016/09/02 14:50:05 first lock unlocked
2016/09/02 14:50:05 second lock granted
```
### Read locks

`DRWMutex` also supports multiple simultaneous read locks, as shown below (analogous to `sync.RWMutex`):
```go
func twoReadLocksAndSingleWriteLock() {
	drwm := dsync.NewDRWMutex(context.Background(), "resource", ds)

	drwm.RLock("RLock-1", "example.go:416:twoReadLocksAndSingleWriteLock()")
	log.Println("1st read lock acquired, waiting...")

	drwm.RLock("RLock-2", "example.go:420:twoReadLocksAndSingleWriteLock()")
	log.Println("2nd read lock acquired, waiting...")

	go func() {
		time.Sleep(1 * time.Second)
		drwm.RUnlock()
		log.Println("1st read lock released, waiting...")
	}()

	go func() {
		time.Sleep(2 * time.Second)
		drwm.RUnlock()
		log.Println("2nd read lock released, waiting...")
	}()

	log.Println("Trying to acquire write lock, waiting...")
	drwm.Lock("Lock-1", "example.go:445:twoReadLocksAndSingleWriteLock()")
	log.Println("Write lock acquired, waiting...")

	time.Sleep(3 * time.Second)
	drwm.Unlock()
}
```
which gives the following output:

```
2016/09/02 15:05:20 1st read lock acquired, waiting...
2016/09/02 15:05:20 2nd read lock acquired, waiting...
2016/09/02 15:05:20 Trying to acquire write lock, waiting...
2016/09/02 15:05:22 1st read lock released, waiting...
2016/09/02 15:05:24 2nd read lock released, waiting...
2016/09/02 15:05:24 Write lock acquired, waiting...
```
## Basic architecture

### Lock process

The basic steps in the lock process are as follows:

- broadcast the lock message to all `n` nodes
- collect all responses within a certain time-out window
- if quorum is met (minimally `n/2 + 1` responded positively), grant the lock
- otherwise release all underlying locks and try again after a (semi-)random delay
- release any locks that (still) come in after the time-out window
### Unlock process

The unlock process is really simple:

- broadcast the unlock message to all nodes that granted the lock
- if a destination is not available, retry with a gradually longer back-off window to still deliver the message
- ignore the 'result' (this covers cases where the destination node went down and came back up)
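A minimal sketch of this process, assuming a hypothetical per-node `unlocker` call (not the dsync API):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// unlocker stands in for a single node's unlock RPC; it reports
// whether the message could be delivered. (Hypothetical type.)
type unlocker func(resource string) bool

// broadcastUnlock sends the unlock message to every node that granted
// the lock, retrying unreachable nodes with a gradually longer
// back-off window. The per-node result itself is ignored, which covers
// the case where a node went down and came back up.
func broadcastUnlock(granters []unlocker, resource string) {
	var wg sync.WaitGroup
	for _, node := range granters {
		wg.Add(1)
		go func(u unlocker) {
			defer wg.Done()
			backoff := 10 * time.Millisecond
			for !u(resource) { // destination not available: retry later
				time.Sleep(backoff)
				backoff *= 2 // gradually longer back-off window
			}
		}(node)
	}
	wg.Wait()
}

func main() {
	attempts := 0
	granters := []unlocker{
		func(string) bool { return true },                    // reachable immediately
		func(string) bool { attempts++; return attempts > 2 }, // down for two attempts
	}
	broadcastUnlock(granters, "test")
	fmt.Println("unlock delivered to all granters")
}
```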
## Dealing with Stale Locks

A 'stale' lock is a lock that is left behind at a node when the client that originally acquired the lock either:

- never released the lock (due to e.g. a crash), or
- is disconnected from the network and hence not able to deliver the unlock message.

Too many stale locks can prevent a new lock on a resource from being acquired, namely when the sum of the stale locks and the number of down nodes is greater than `n/2 - 1`. dsync implements a recovery mechanism to remove stale locks (see here for the details).
## Known deficiencies

Known deficiencies can be divided into two categories, namely a) more than one write lock being granted and b) a lock not becoming available anymore.

### More than one write lock
So far we have identified one case during which this can happen (example for 8 node system):
- 3 nodes are down (say 6, 7, and 8)
- node 1 acquires a lock on "test" (nodes 1 through to 5 giving quorum)
- node 4 and 5 crash (dropping the lock)
- nodes 4 through to 8 restart
- node 4 acquires a lock on "test" (nodes 4 through to 8 giving quorum)
Now both node 1 and node 4 hold a write lock on the same resource.
