Streaming

A Data Streaming Library for Efficient Neural Network Training

<br />
<p align="center">
<a href="https://github.com/mosaicml/streaming#gh-light-mode-only" class="only-light"> <img src="./docs/source/_static/images/streaming-logo-light-mode.png" width="50%"/> </a>
<a href="https://github.com/mosaicml/streaming#gh-dark-mode-only" class="only-dark"> <img src="./docs/source/_static/images/streaming-logo-dark-mode.png" width="50%"/> </a>
</p>
<h2><p align="center">Fast, accurate streaming of training data from cloud storage</p></h2>
<h4><p align='center'> <a href="https://www.mosaicml.com">[Website]</a> - <a href="https://docs.mosaicml.com/projects/streaming/en/latest/getting_started/quick_start.html">[Quick Start]</a> - <a href="https://streaming.docs.mosaicml.com/">[Docs]</a> - <a href="https://www.databricks.com/company/careers/open-positions?department=Mosaic%20AI&location=all">[We're Hiring!]</a> </p></h4>
<p align="center">
<a href="https://pypi.org/project/mosaicml-streaming/"> <img alt="PyPi Version" src="https://img.shields.io/pypi/pyversions/mosaicml-streaming"> </a>
<a href="https://pypi.org/project/mosaicml-streaming/"> <img alt="PyPi Package Version" src="https://img.shields.io/pypi/v/mosaicml-streaming"> </a>
<a href="https://github.com/mosaicml/streaming/actions?query=workflow%3ATest"> <img alt="Unit test" src="https://github.com/mosaicml/streaming/actions/workflows/pytest.yaml/badge.svg"> </a>
<a href="https://pepy.tech/project/mosaicml-streaming/"> <img alt="PyPi Downloads" src="https://static.pepy.tech/personalized-badge/mosaicml-streaming?period=month&units=international_system&left_color=grey&right_color=blue&left_text=Downloads/month"> </a>
<a href="https://streaming.docs.mosaicml.com"> <img alt="Documentation" src="https://readthedocs.org/projects/streaming/badge/?version=stable"> </a>
<a href="https://dub.sh/mcomm"> <img alt="Chat @ Slack" src="https://img.shields.io/badge/slack-chat-2eb67d.svg?logo=slack"> </a>
<a href="https://github.com/mosaicml/streaming/blob/main/LICENSE"> <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green.svg?logo=slack"> </a>
<a href="https://gurubase.io/g/streaming"> <img alt="Gurubase" src="https://img.shields.io/badge/Gurubase-Ask%20Streaming%20Guru-006BFF"> </a>
</p>
<br />

👋 Welcome

We built StreamingDataset to make training on large datasets from cloud storage as fast, cheap, and scalable as possible.

It’s designed specifically for multi-node, distributed training of large models, maximizing correctness guarantees, performance, and ease of use. You can now train efficiently anywhere, regardless of where your training data lives: just stream in the data you need, when you need it. To learn more about why we built StreamingDataset, read our announcement blog.

StreamingDataset is compatible with any data type, including images, text, video, and multimodal data.

StreamingDataset supports the major cloud storage providers (AWS, OCI, GCS, Azure, Databricks, and any S3-compatible object store such as Cloudflare R2, CoreWeave, or Backblaze B2). It is designed as a drop-in replacement for your PyTorch IterableDataset class, so it integrates seamlessly into your existing training workflows.

[Figure: The flow of samples from shards in the cloud to devices in your cluster]

🚀 Getting Started

💾 Installation

Streaming can be installed with pip:

pip install mosaicml-streaming

🏁 Quick Start

1. Prepare Your Data

Convert your raw dataset into one of our supported streaming formats:

MDS (Mosaic Data Shard) format which can encode and decode any Python object
CSV / TSV
JSONL

import numpy as np
from PIL import Image
from streaming import MDSWriter

# Local or remote directory in which to store the compressed output files
data_dir = 'path-to-dataset'

# A dictionary mapping input fields to their data types
columns = {
    'image': 'jpeg',
    'class': 'int'
}

# Shard compression, if any
compression = 'zstd'

# Save the samples as shards using MDSWriter
with MDSWriter(out=data_dir, columns=columns, compression=compression) as out:
    for i in range(10000):
        sample = {
            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            'class': np.random.randint(10),
        }
        out.write(sample)
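
Before uploading, it can be worth checking that the output directory holds as many samples as you wrote. The sketch below is not part of the library. It assumes the writer leaves an `index.json` file in the output directory containing a `shards` list whose entries each record a `samples` count; check that layout against your own output first. The `count_samples` helper is a hypothetical name, and the demo runs on a hand-written stand-in index so that it works without the library installed.

```python
import json
import tempfile
from pathlib import Path


def count_samples(dataset_dir: str) -> int:
    """Sum the per-shard sample counts recorded in index.json.

    Assumes index.json has a top-level 'shards' list and that each
    entry carries an integer 'samples' field.
    """
    index = json.loads((Path(dataset_dir) / 'index.json').read_text())
    return sum(shard['samples'] for shard in index['shards'])


# Demo on a stand-in index file (hypothetical contents, not real writer output).
with tempfile.TemporaryDirectory() as tmp:
    fake_index = {'version': 2, 'shards': [{'samples': 6000}, {'samples': 4000}]}
    (Path(tmp) / 'index.json').write_text(json.dumps(fake_index))
    total = count_samples(tmp)

print(total)
```

On a real dataset you would call `count_samples('path-to-dataset')` and compare the result with the number of samples you wrote (10000 in the example above).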

2. Upload Your Data to Cloud Storage

Upload your streaming dataset to the cloud storage of your choice (for example AWS, OCI, or GCP). The example below uploads a directory to an S3 bucket using the AWS CLI.

$ aws s3 cp --recursive path-to-dataset s3://my-bucket/path-to-dataset

3. Build a StreamingDataset and DataLoader

from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Remote path where full dataset is persistently stored
remote = 's3://my-bucket/path-to-dataset'

# Local working dir where dataset is cached during operation
local = '/tmp/path-to-dataset'

# Create streaming dataset
dataset = StreamingDataset(local=local, remote=remote, shuffle=True)

# Let's see what is in sample #1337...
sample = dataset[1337]
img = sample['image']
cls = sample['class']

# Create PyTorch DataLoader
dataloader = DataLoader(dataset)

📚 What next?

Getting started guides, examples, API references, and other useful information can be found in our docs.

We have end-to-end tutorials for training a model on:

We also have starter code for the following popular datasets, which can be found in the streaming directory:

| Dataset | Task | Read | Write |
| --- | --- | --- | --- |
| LAION-400M | Text and image | Read | Write |
| WebVid | Text and video | Read | Write |
| C4 | Text | Read | Write |
| EnWiki | Text | Read | Write |
| Pile | Text | Read | Write |
| ADE20K | Image segmentation | Read | Write |
| CIFAR10 | Image classification | Read | Write |
| COCO | Image classification | Read | Write |
| ImageNet | Image classification | Read | Write |

To start training on these datasets:

1. Convert the raw data into .mds format using the corresponding script from the convert directory.

For example:

$ python -m streaming.multimodal.convert.webvid --in <CSV file> --out <MDS output directory>

2. Import the dataset class to start training the model.

from streaming.multimodal import StreamingInsideWebVid

# `local` and `remote` are the cache directory and dataset URL, as in the Quick Start
dataset = StreamingInsideWebVid(local=local, remote=remote, shuffle=True)

🔑 Key Features

Seamless data mixing

Easily experiment with dataset mixtures using Stream. Dataset sampling can be controlled in relative (proportion) or absolute (repeat or samples) terms. During streaming, the different datasets are streamed, shuffled, and mixed seamlessly just-in-time.

# mix C4, github code, a
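
To make the relative versus absolute distinction concrete, here is a rough sketch in plain Python. It does not use the library and is not how the library implements mixing. The function names, stream sizes, and weights are all hypothetical, and it assumes that a relative weight divides up a fixed epoch size while a repeat factor scales each stream's own size.

```python
from typing import Dict


def relative_mix(weights: Dict[str, float], epoch_size: int) -> Dict[str, int]:
    """Split a fixed epoch size across streams in proportion to their weights."""
    total = sum(weights.values())
    return {name: round(epoch_size * w / total) for name, w in weights.items()}


def absolute_mix(sizes: Dict[str, int], repeats: Dict[str, float]) -> Dict[str, int]:
    """Each stream contributes its own size times its repeat factor (default 1)."""
    return {name: round(size * repeats.get(name, 1.0)) for name, size in sizes.items()}


# Relative: 3 parts C4 to 1 part GitHub code, for a 1000-sample epoch.
relative = relative_mix({'c4': 3, 'github': 1}, epoch_size=1000)

# Absolute: subsample C4 to half its size, see GitHub code twice per epoch.
absolute = absolute_mix({'c4': 800, 'github': 200}, {'c4': 0.5, 'github': 2.0})

print(relative)
print(absolute)
```

In the relative case the epoch size is fixed and the streams share it. In the absolute case the epoch size is simply the sum of the per-stream counts.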
