BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Zero-Copy In-Memory Sharing of Large Distributed Data: V6d

Zero-Copy In-Memory Sharing of Large Distributed Data: V6d

The team behind the zero-copy and in-memory data manager Vineyard (v6d) have recently released version 0.13.2, which brought improved features for Python/C++ development, and Kubernetes deployment. It is maintained as a CNCF sandbox project and provides distributed operators that can be utilized to share immutable data within or across cluster nodes. V6d is of particular interest for deep network training (e.g. large language and graph models) on big (sharded) datasets. An Alibaba engineering team currently leads its development.

Zero-copy and in-memory data distribution is a central problem for many real-time applications. From image processing pipelines to deep learning models such as LLM and graph mining algorithms, many data-crunching applications require ingesting large data from many independent processes. In machine learning engineering, this bottleneck has become more evident as deep networks are getting larger and the distribution of model parameters mandates access to shared state and data. As an early-stage project, V6d aims to bring a high-level API for such use cases.

Architectures of real-time applications generally exploit in-memory key-value storages/caches (e.g. etcd, Memcached, Redis) for storing and interchanging frequently reached data. According to service type, engineering teams must consider related trade-offs with these tools. V6d consists of two main components: Apache Arrow Plasma-derived shared-memory data manager (within a node) and a metadata server backed by etcd (between different nodes). While the Plasma-derived service allows zero-copy data transfer, etcd service handles the global distribution of (possibly partitioned) data's properties.

V6d places itself within the Python community. In a way, it can be considered to scale Python's native multiprocess shared_memory to multiple machines for immutable blobs. V6d offers two different Python client interfaces IPCClient and RPCClient for manipulating local and remote objects, respectively. Both client APIs permit uniform data insertion and retrieval patterns based on object IDs. However, v6d does not automatically move data between cluster nodes unless instructed due to the high network cost of such operations.

We could present a simple example that can be run on a local machine. Let's start with creating a local v6d instance:

python -m vineyard --socket /tmp/vineyard.sock --size 16733650944

First, let's show how we can utilize Python's native API. For this purpose, we will create a dummy 10k resolution RGB image using NumPy and share it quickly using the shared_memory() interface:

import numpy as np
from multiprocessing import shared_memory

shape_, dtype_ = (3, 10000, 10000), np.uint8
array_to_share = np.random.randint(0, high=255, size=shape_, dtype=dtype_)

# Create shared memory
shm = shared_memory.SharedMemory(create=True, size=array_to_share.nbytes)
array_shm = np.ndarray(shape_, dtype=array_to_share.dtype, buffer=shm.buf)
array_shm[:] = array_to_share[:] # Here we need to copy as we use existing array

# Use the shared memory name, size and type info to retrieve data in another process
existing_shm = shared_memory.SharedMemory(name=shm.name)
array_retrieved = np.ndarray(shape=shape_, dtype=dtype_, buffer=existing_shm.buf)

Here, we could carry out the same operation using v6d:

import vineyard

client = vineyard.connect('/tmp/vineyard.sock')
array_id = client.put(array_to_share)

# Retrieve the previous array_to_share in another process
array_retrieved = client.get(array_id)

As shown above, the API is quite easy to use and propagates the dtype and array shape to the retrieved object. Because of the common array protocol (aka buffer protocol), the NumPy interface also accepts zero-copy operations on PyTorch, TensorFlow, and MxNet tensors. In addition to that, v6d enables the same operations on Pandas/Arrow dataframes. Further information on library integrations can be reached in the related documentation page. An example machine learning training tutorial can also be found in the webpage.

For multi-node settings, V6d allows the deployment of Vineyard operators on Kubernetes clusters via the Python API and Helm charts. The official documentation provides a more detailed overview of the Vineyard architecture.

About the Author

Rate this Article

Adoption
Style

BT