DatenLord

DatenLord is a next-generation cloud-native distributed storage platform that aims to meet the performance-critical storage needs of next-generation cloud-native applications, such as microservices, serverless, and AI. On one hand, DatenLord is designed as a cloud-native storage system: it is distributed, fault-tolerant, and supports graceful upgrades. These cloud-native features make DatenLord easy to use and easy to maintain. On the other hand, DatenLord is designed as an application-oriented storage system, optimized for many performance-critical scenarios such as databases, AI and machine learning, and big data. DatenLord also provides a high-performance storage service for containers, which facilitates running stateful applications on top of Kubernetes (K8S). DatenLord achieves its high performance by leveraging recent advances in hardware and software, such as NVMe, non-volatile memory, asynchronous programming, and native Linux asynchronous IO support.

Why DatenLord?

Why do we build DatenLord? The reason is two-fold:

Firstly, the recent revolution in computer hardware architecture calls for refactoring storage software. The storage-related functionality in the Linux kernel has not changed much over the past 10 years, dating back to when the hard-disk drive (HDD) was the main storage device. Nowadays, the solid-state drive (SSD) has become mainstream, not to mention more advanced devices such as NVMe and non-volatile memory. SSDs are tens to hundreds of times faster than HDDs: HDD latency is around 1~10 ms, whereas SSD latency is around 50–150 μs, NVMe latency is around 25 μs, and non-volatile memory latency is around 350 ns. With this leap in storage device performance, traditional blocking (synchronous) IO in the Linux kernel becomes very inefficient, and non-blocking (asynchronous) IO is a much better fit. The Linux kernel community has already recognized this and has recently introduced a native asynchronous IO mechanism, io_uring, to improve IO performance. Besides blocking IO, the context switch overhead in the Linux kernel is no longer negligible relative to SSD latency. Many modern programming languages offer asynchronous programming, green threads, or coroutines to manage asynchronous IO tasks in user space and thereby avoid the context switch overhead introduced by blocking IO. We therefore think it is time to build a next-generation storage system that takes full advantage of the storage performance revolution by leveraging non-blocking/asynchronous IO, asynchronous programming, NVMe, and even non-volatile memory.

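
To make the latency argument concrete, the following minimal sketch works through the figures quoted above. The dictionary layout and the helper names `speedup_range` and `switch_overhead` are illustrative, and the context switch cost of 5 μs is an assumed order-of-magnitude value, not a figure from this document.

```python
# Device latencies quoted in the text, in seconds, stored as (low, high).
# Single figures use the same value for both ends of the range.
LATENCY_S = {
    "HDD": (1e-3, 10e-3),
    "SSD": (50e-6, 150e-6),
    "NVMe": (25e-6, 25e-6),
    "NVM": (350e-9, 350e-9),
}

# ASSUMPTION: illustrative context switch cost, not taken from the text.
CONTEXT_SWITCH_S = 5e-6


def speedup_range(slow, fast):
    """Return the (min, max) latency ratio of `slow` over `fast`."""
    slow_lo, slow_hi = LATENCY_S[slow]
    fast_lo, fast_hi = LATENCY_S[fast]
    return slow_lo / fast_hi, slow_hi / fast_lo


def switch_overhead(device):
    """Context switch cost as a fraction of the device's best-case latency."""
    return CONTEXT_SWITCH_S / LATENCY_S[device][0]


for fast in ("SSD", "NVMe", "NVM"):
    lo, hi = speedup_range("HDD", fast)
    print(f"HDD -> {fast}: {lo:,.0f}x to {hi:,.0f}x lower latency")

for device in LATENCY_S:
    share = switch_overhead(device)
    print(f"{device}: one context switch costs {share:.1%} of best-case latency")
```

Under the assumed switch cost, the overhead is a small fraction of an HDD access but a significant share of an NVMe access, which is the reason blocking IO stops being a reasonable default on fast devices.
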
Secondly, most distributed/cloud-native systems separate computing from storage: computing tasks/applications and storage systems each run on dedicated clusters. This isolated architecture is well suited to reducing maintenance effort, because it decouples the maintenance tasks of computing clusters and storage clusters, such as upgrade, expansion, and migration, which is much simpler than maintaining coupled clusters. Nowadays, however, applications deal with much larger datasets than ever before. One notorious example is an AI training job that takes one hour to load data while the training itself finishes in only 45 minutes. Isolating computing and storage therefore makes IO very inefficient, as transferring data between applications and storage systems over the network takes a lot of time. Further, with the isolated architecture, applications have to be aware of where data is located and of the varying access cost caused by differences in data location, network distance, etc. DatenLord tackles the IO performance issue of the isolated architecture in a novel way: it abstracts away the heterogeneous storage details and makes differences in data location, access cost, etc., transparent to applications. With DatenLord, applications can assume that all the data they access is local, and DatenLord accesses the data on their behalf. In addition, DatenLord can help K8S schedule jobs close to cached data, since DatenLord knows the exact location of all cached data. As a result, applications are greatly simplified with respect to data access, and DatenLord can leverage local cache, neighbor cache, and remote cache to speed up data access and boost performance.
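
The lookup order described above (local cache, then neighbor cache, then remote storage) can be sketched as follows. This is a simplified illustration, not DatenLord's implementation: the class name `TieredReader`, the use of plain dictionaries as stand-ins for peer cache nodes, and the `remote_fetch` callback are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


class TieredReader:
    """Serve reads from local cache, then neighbor caches, then remote storage."""

    def __init__(
        self,
        neighbors: List[Dict[str, bytes]],
        remote_fetch: Callable[[str], Optional[bytes]],
    ):
        self.local: Dict[str, bytes] = {}
        self.neighbors = neighbors        # stand-ins for peer cache nodes
        self.remote_fetch = remote_fetch  # stand-in for persistent storage

    def read(self, key: str) -> Tuple[Optional[bytes], str]:
        """Return (data, tier), where tier names the layer that served the read."""
        if key in self.local:
            return self.local[key], "local"
        for peer in self.neighbors:
            if key in peer:
                self.local[key] = peer[key]  # keep a local copy for next time
                return peer[key], "neighbor"
        data = self.remote_fetch(key)
        if data is not None:
            self.local[key] = data
            return data, "remote"
        return None, "miss"


remote_store = {"a": b"1", "b": b"2"}
reader = TieredReader(neighbors=[{"b": b"2"}], remote_fetch=remote_store.get)
print(reader.read("a"))  # first read has to go to remote storage
print(reader.read("a"))  # repeated read is served from the local cache
```

The caller only ever invokes `read`, which reflects the point made above: the application does not need to know which tier holds the data.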

Target scenarios

The main scenario for DatenLord is to facilitate high availability across multi-cloud, hybrid-cloud, and multi-data-center deployments. Concretely, many online businesses are too important to afford any downtime. To achieve high availability, such service providers deploy applications and services across multiple clouds or data centers so that no single cloud or data center becomes a single point of failure. Deploying applications and services to multiple clouds and data centers is relatively easy, but duplicating all data to every cloud or data center in a timely manner is much harder because of the huge data size. If data is not equally available across the clouds or data centers, the online business may still suffer from a single point of failure, because the failure of one cloud or data center makes its data unavailable.

DatenLord can alleviate the data unavailability caused by a cloud or data center failure by caching data in multiple layers, such as local cache, neighbor cache, and remote cache. Although the total data size is huge, the hot data involved in online business is usually of limited size, a property known as data locality. DatenLord leverages data locality and builds a set of large-scale, distributed, and automatic cache layers to buffer hot data in a smart manner. The benefit of DatenLord is two-fold:

DatenLord is transparent to applications: it does not require any modification to applications;
DatenLord is high performance: it automatically caches data according to data hotness, and its performance is achieved by applying different caching strategies to different target applications. For example, a least recently used (LRU) caching strategy for some kinds of random access, a most recently used (MRU) caching strategy for some kinds of sequential access, etc.
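
To illustrate why the caching strategy should match the access pattern, here is a minimal sketch of an LRU and an MRU eviction policy run against a repeated sequential scan. The `EvictingCache` class and `hit_count` helper are hypothetical teaching code and do not reflect DatenLord's actual cache implementation.

```python
from collections import OrderedDict


class EvictingCache:
    """Fixed-capacity cache that evicts the least or most recently used key."""

    def __init__(self, capacity, policy="lru"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("lru", "mru"):
            raise ValueError("policy must be 'lru' or 'mru'")
        self.capacity = capacity
        self.policy = policy
        self.items = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def access(self, key):
        """Touch `key`, loading it on a miss; return True on a hit."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.items) >= self.capacity:
            # last=False pops the oldest entry (LRU), last=True the newest (MRU).
            self.items.popitem(last=(self.policy == "mru"))
        self.items[key] = None
        return False


def hit_count(policy, capacity, trace):
    """Replay `trace` through a fresh cache and return the number of hits."""
    cache = EvictingCache(capacity, policy)
    for key in trace:
        cache.access(key)
    return cache.hits


# Three sequential passes over 4 blocks with room for only 3 of them.
scan = [0, 1, 2, 3] * 3
print("LRU hits:", hit_count("lru", 3, scan))
print("MRU hits:", hit_count("mru", 3, scan))
```

On this trace LRU always evicts the block that is needed next and scores no hits, while MRU keeps part of the scanned range resident. For workloads that revisit recently used data, the relationship is typically reversed.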

Architecture

Single Data Center

[Figure: DatenLord Single Data Center]

Multiple Data Centers and Hybrid Cloud

[Figure: DatenLord Multiple Data Centers and Hybrid Cloud]

DatenLord provides three kinds of user interfaces: a KV interface, an S3 interface, and a file interface. The backend storage is supported by the underlying distributed cache layer, which is strongly consistent. Strong consistency is guaranteed by the metadata management module, which is built on a high-performance consensus protocol. The persistent storage layer can be local disk or S3 storage. For networking, RDMA is used to provide high throughput and low latency; if RDMA is not supported, TCP is the alternative. In the multiple data center and hybrid cloud scenario, there is a dedicated metadata server which serves metadata requests within the same data center. In the single data center scenario, the metadata module can run on the same machine as the cache node. The network between data centers and public clouds is managed as a private network to guarantee high-quality data transfer.
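
The deployment rules in the paragraph above can be summarized as a small decision sketch. Everything here is hypothetical: the `Deployment` type, its field names, and the returned strings are illustrative only and are not DatenLord configuration options.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a deployment; field names are illustrative."""
    data_centers: int
    rdma_supported: bool


def choose_transport(deployment: Deployment) -> str:
    # Prefer RDMA for throughput and latency; fall back to TCP otherwise.
    return "rdma" if deployment.rdma_supported else "tcp"


def metadata_placement(deployment: Deployment) -> str:
    # Single data center: the metadata module can share a machine with a cache node.
    # Multiple data centers or hybrid cloud: a dedicated metadata server is used.
    return "colocated" if deployment.data_centers == 1 else "dedicated"


single = Deployment(data_centers=1, rdma_supported=True)
hybrid = Deployment(data_centers=3, rdma_supported=False)
for d in (single, hybrid):
    print(d, choose_transport(d), metadata_placement(d))
```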

<!--- DatenLord is of master-slave architecture. To achieve better storage performance, DatenLord has a coupled architecture with K8S, that DatenLord can be deployed within a K8S cluster, in order to leverage data locality to speed up data access. The above figure is the overall DatenLord architecture, the green parts are DatenLord components, the blue parts are K8S components, the yellow part represents containerized applications. There are several major components of DatenLord: master node (marked as DatenLord), slave node (marked as Daten Sklavin), and K8S plugins. The master node has three parts: S3 compatible interface (S3I), Lord, and Meta Storage Engine (MSE). S3I provides a convenient way to read and write data in DatenLord via S3 protocol, especially for bulk upload and download scenarios, e.g. uploading large amounts of data for big data batch jobs or AI machine learning training jobs. Lord is the overall controller of DatenLord, which controls all the internal behaviors of DatenLord, such as where and how to write data, synchronize data, etc. MSE stores all the meta information of DatenLord, such as the file paths of all the data stored in each slave node, the user-defined labels of each data file, etc. MSE is similar to HDFS namenode. The slave node has four parts: Data Storage Engine (DSE), Sklavin, Meta Storage Engine (MSE), S3/P2P interface. DSE is the distributed cache layer, which is in charge of local IO and network IO, that it not only reads/writes data from/to memory or local disks, but also queries neighbor nodes to read neighbor cached data, further if local and neighbor cache missed, it reads data from remote persistent storage, and it can write data back to remote storage if necessary. More specifically, DatenLord sets up a filesystem in userspace (FUSE) in a slave node. DSE implements the FUSE API's, executing all the underlying FUSE operations, such
