V6d

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)

.. raw:: html

    <h1 align="center" style="clear: both;">
        <img src="https://v6d.io/_static/vineyard-logo-rect.png" width="397" alt="vineyard">
    </h1>
    <p align="center">
        an in-memory immutable data manager
    </p>

Vineyard (v6d) is an innovative in-memory immutable data manager that offers out-of-the-box high-level abstractions and zero-copy in-memory sharing for distributed data in various big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

.. image:: https://v6d.io/_static/cncf-color.svg
  :width: 400
  :alt: Vineyard is a CNCF sandbox project

Vineyard is a `CNCF sandbox project`_ and is made successful by its community.

* `Overview <#what-is-vineyard>`_
* `Features of vineyard <#features>`_

  - `Efficient sharing for in-memory immutable data <#in-memory-immutable-data-sharing>`_
  - `Out-of-the-box high level data structures <#out-of-the-box-high-level-data-abstraction>`_
  - `Pipelining using stream <#stream-pipelining>`_
  - `I/O Drivers <#drivers>`_

* `Getting started with Vineyard <#try-vineyard>`_
* `Deploying on Kubernetes <#deploying-on-kubernetes>`_
* `Frequently asked questions <#faq>`_
* `Getting involved in our community <#getting-involved>`_
* `Third-party dependencies <#acknowledgements>`_

What is vineyard
----------------

Vineyard is specifically designed to facilitate zero-copy data sharing among big data systems. To illustrate this, let's consider a typical machine learning task of time series prediction with LSTM_. This task can be broken down into several steps:

1. First, we read the data from the file system as a ``pandas.DataFrame``.
2. Next, we apply various preprocessing tasks, such as eliminating null values, to the dataframe.
3. Once the data is preprocessed, we define the model and train it on the processed dataframe using PyTorch.
4. Finally, we evaluate the performance of the model.

In a single-machine environment, pandas and PyTorch, despite being two distinct systems designed for different tasks, can efficiently share data with minimal overhead. This is achieved through an end-to-end process within a single Python script.

.. image:: https://v6d.io/_static/vineyard_compare.png
  :alt: Comparing the workflow with and without vineyard

What if the input data is too large to be processed on a single machine?

As depicted on the left side of the figure, a common approach is to store the data as tables in a distributed file system (e.g., HDFS) and replace pandas with ETL processes using SQL over a big data system such as Hive and Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. However, this can introduce challenges for developers.

1. For the same task, users must program for multiple systems (SQL and Python).
2. Data can be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope_), are becoming increasingly common. Tables and SQL may not be the most efficient way to store, exchange, or process them. Transforming the data from/to "tables" between different systems can result in significant overhead.
3. Saving/loading the data to/from external storage incurs substantial memory-copy and I/O costs.

Vineyard addresses these issues by providing:

1. In-memory distributed data sharing in a zero-copy fashion to avoid introducing additional I/O costs by leveraging a shared memory manager derived from plasma.
2. Built-in out-of-the-box high-level abstractions to share distributed data with complex structures (e.g., distributed graphs) with minimal extra development cost, while eliminating transformation costs.

As depicted on the right side of the above figure, we demonstrate how to integrate vineyard to address the task in a big data context.

First, we utilize Mars_ (a tensor-based unified framework for large-scale data computation that scales Numpy, Pandas, and Scikit-learn) to preprocess the raw data, similar to the single-machine solution, and store the preprocessed dataframe in vineyard.

+-------------+-----------------------------------------------------------------------------+
|             | .. code-block:: python                                                      |
|   single    |                                                                             |
|             |     data_csv = pd.read_csv('./data.csv', usecols=[1])                       |
+-------------+-----------------------------------------------------------------------------+
|             | .. code-block:: python                                                      |
|             |                                                                             |
|             |     import mars.dataframe as md                                             |
| distributed |     dataset = md.read_csv('hdfs://server/data_full', usecols=[1])           |
|             |     # after preprocessing, save the dataset to vineyard                     |
|             |     vineyard_distributed_tensor_id = dataset.to_vineyard()                  |
+-------------+-----------------------------------------------------------------------------+

Then, we modify the training phase to get the preprocessed data from vineyard. Here, vineyard makes sharing distributed data between Mars_ and PyTorch feel like passing a local variable in the single-machine solution.

+-------------+-----------------------------------------------------------------------------+
|             | .. code-block:: python                                                      |
|   single    |                                                                             |
|             |     data_X, data_Y = create_dataset(dataset)                                |
+-------------+-----------------------------------------------------------------------------+
|             | .. code-block:: python                                                      |
|             |                                                                             |
|             |     client = vineyard.connect(vineyard_ipc_socket)                          |
| distributed |     dataset = client.get(vineyard_distributed_tensor_id).local_partition()  |
|             |     data_X, data_Y = create_dataset(dataset)                                |
+-------------+-----------------------------------------------------------------------------+

Finally, we execute the training phase in a distributed manner across the cluster.

From this example, it is evident that with vineyard, the task in the big data context can be addressed with only minor adjustments to the single-machine solution. Compared to exchanging intermediate results through external storage, vineyard avoids the associated I/O and transformation overheads.

Features
--------

Efficient In-Memory Immutable Data Sharing ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Vineyard serves as an in-memory immutable data manager, enabling efficient data sharing across different systems via shared memory. Because objects are exchanged in place, vineyard eliminates the serialization/deserialization and I/O costs that normally arise when data moves between systems.
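To make the idea concrete, the following is a minimal sketch of zero-copy sharing through shared memory. It uses Python's standard ``multiprocessing.shared_memory`` module rather than vineyard itself, so it only illustrates the general technique and not vineyard's implementation. The function name ``share_and_read`` is hypothetical, and both handles live in one process for brevity; in a real pipeline the consumer would attach from a different process on the same host.

```python
from multiprocessing import shared_memory


def share_and_read(payload: bytes) -> bytes:
    """Write `payload` into a shared-memory segment once, then read it back
    through a second handle without any serialization step."""
    # "Producer": allocate a segment and copy the payload in exactly once.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        producer.buf[:len(payload)] = payload

        # "Consumer": attach to the same segment by name (assumption: in
        # practice this happens in another process on the same machine).
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            # The memoryview slice refers to the shared pages directly.
            view = consumer.buf[:len(payload)]
            # Copy only here, so the result outlives the segment.
            result = bytes(view)
            view.release()
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()  # free the segment once nobody needs it
    return result
```

Calling ``share_and_read(b"vineyard")`` returns the same bytes that were written. The consumer side works on a ``memoryview`` of the shared pages, which is where the "no extra copy" property comes from.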

Out-of-the-Box High-Level Data Abstractions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Computation frameworks often have their own data abstractions for high-level concepts. For example, tensors can be represented as ``torch.tensor``, ``tf.Tensor``, ``mxnet.ndarray``, etc. Moreover, every `graph processing engine <https://github.com/alibaba/GraphScope>`_ has its unique graph structure representation.

The diversity of data abstractions complicates data sharing. Vineyard addresses this issue by providing out-of-the-box high-level data abstractions over in-memory blobs, using hierarchical metadata to describe objects. Various computation systems can leverage these built-in high-level data abstractions to exchange data with other systems in a computation pipeline concisely and efficiently.
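One way to picture "hierarchical metadata over in-memory blobs" is a small tree whose leaves are raw byte buffers and whose inner nodes carry a type name and attributes. The sketch below is an illustration only: the class names, the ``example::`` type names, and the helper functions are all hypothetical and do not reflect vineyard's actual object model or API.

```python
from array import array
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Blob:
    """Leaf node: an immutable chunk of raw bytes."""
    data: bytes


@dataclass
class ObjectMeta:
    """Inner node: a type name, plain attributes, and named member objects."""
    typename: str
    attrs: Dict[str, object] = field(default_factory=dict)
    members: Dict[str, Union["ObjectMeta", Blob]] = field(default_factory=dict)


def make_tensor(typecode: str, values: List) -> ObjectMeta:
    """Describe a 1-D tensor as metadata plus one blob holding its payload."""
    payload = array(typecode, values).tobytes()
    return ObjectMeta(
        typename="example::Tensor",  # hypothetical type name
        attrs={"typecode": typecode, "shape": (len(values),)},
        members={"buffer": Blob(payload)},
    )


def build_example_frame() -> ObjectMeta:
    """A two-column dataframe is metadata pointing at two tensors."""
    return ObjectMeta(
        typename="example::DataFrame",  # hypothetical type name
        attrs={"columns": ("id", "price")},
        members={
            "id": make_tensor("q", [1, 2, 3]),
            "price": make_tensor("d", [1.5, 2.5, 3.5]),
        },
    )


def total_bytes(node: Union[ObjectMeta, Blob]) -> int:
    """Walk the metadata tree and add up the payload size of every blob."""
    if isinstance(node, Blob):
        return len(node.data)
    return sum(total_bytes(member) for member in node.members.values())


def read_tensor(tensor: ObjectMeta) -> List:
    """Reinterpret a tensor's blob using the typecode stored in its metadata."""
    values = array(tensor.attrs["typecode"])
    values.frombytes(tensor.members["buffer"].data)
    return values.tolist()
```

A consumer that understands the metadata can rebuild its own native structure around the same blobs, for example ``read_tensor(build_example_frame().members["price"])``, without the producer having to export to an intermediate file format.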

Stream Pipelining for Enhanced Performance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A computation doesn't need to wait for all preceding results to arrive before starting its work. Vineyard provides a stream as a special kind of immutable data for pipelining scenarios. The preceding job can write immutable data chunk by chunk to Vineyard while maintaining data structure semantics. The successor job reads shared-memory chunks from Vineyard's stream without extra copy costs and triggers its work. This overlapping reduces the overall processing time and memory consumption.
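The overlap described above can be sketched with a bounded queue standing in for the stream. This is a generic producer/consumer illustration using only the Python standard library, not vineyard's stream API; ``run_pipeline``, ``producer``, ``consumer``, and the ``capacity`` parameter are hypothetical names chosen for the example.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def producer(stream: queue.Queue, chunks) -> None:
    """Hand over each immutable chunk as soon as it is ready."""
    for chunk in chunks:
        stream.put(chunk)  # blocks while the stream is full
    stream.put(_END)


def consumer(stream: queue.Queue, results: list) -> None:
    """Process chunks as they arrive instead of waiting for the full input."""
    while True:
        chunk = stream.get()
        if chunk is _END:
            break
        results.append(sum(chunk))


def run_pipeline(chunks, capacity: int = 2) -> list:
    """Run producer and consumer concurrently over a bounded stream."""
    # At most `capacity` chunks are buffered at any time, which is what
    # keeps memory consumption below that of materializing everything.
    stream = queue.Queue(maxsize=capacity)
    results = []
    worker = threading.Thread(target=consumer, args=(stream, results))
    worker.start()
    producer(stream, chunks)
    worker.join()
    return results
```

For instance, ``run_pipeline([(1, 2), (3, 4), (5,)])`` processes each tuple while later ones are still being produced. Chunks are tuples here to mirror the immutability of the data placed in the stream.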

Versatile Drivers for Common Tasks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many big data analytical tasks involve numerous boilerplate routines that are unrelated to the computation itself, such as various I/O adapters, data partition strategies, and migration jobs. Since data structure abstractions usually differ between systems, these routines cannot be easily reused.

Vineyard provides common manipulation routines for immutable data as drivers. In addition to sharing high-level data abstractions, Vineyard extends the capability of data structures with drivers, enabling out-of-the-box reusable routines for the boilerplate parts in computation jobs.
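As a rough mental model, drivers can be thought of as reusable routines registered against a data type, so that any job holding an object of that type can invoke them. The registry below is a hypothetical sketch of that pattern in plain Python; the names ``register_driver``, ``call_driver``, and the ``example::Table`` type are assumptions made for illustration and are not part of vineyard's API.

```python
import csv
import io
from typing import Callable, Dict, Sequence, Tuple

# Registry mapping (type name, driver name) to a reusable routine.
_DRIVERS: Dict[Tuple[str, str], Callable] = {}


def register_driver(typename: str, name: str) -> Callable:
    """Decorator that attaches a routine to a data type under `name`."""
    def decorator(func: Callable) -> Callable:
        _DRIVERS[(typename, name)] = func
        return func
    return decorator


def call_driver(typename: str, name: str, obj, *args, **kwargs):
    """Look up the routine registered for the type and apply it to `obj`."""
    return _DRIVERS[(typename, name)](obj, *args, **kwargs)


@register_driver("example::Table", "to_csv")
def table_to_csv(rows: Sequence[Sequence]) -> str:
    """I/O adapter: render the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


@register_driver("example::Table", "partition")
def table_partition(rows: Sequence[Sequence], parts: int) -> list:
    """Partition strategy: split the rows round-robin into `parts` tuples."""
    return [tuple(rows[i::parts]) for i in range(parts)]
```

With this in place, a job only needs the type name and the routine name, for example ``call_driver("example::Table", "partition", table, 2)``, and the I/O or partitioning logic is written once per data type instead of once per system.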

Try Vineyard
------------

Vineyard is available as a `python package`_ and can be installed using pip:

.. code:: shell

    pip3 install vineyard

For comprehensive and up-to-date documentation, please visit https://v6d.io.

If you wish to build vineyard from source, please consult the Installation_ guide. For instructions on building and running unit tests locally, refer to the Contributing_ section.

After installation, you can initiate a vineyard inst